FST error every 10 seconds: "cannot send report broadcast"

We seem to get these errors on FSTs pretty regularly every 10 seconds, running EOS 5.4.1:


260408 02:05:50 time=1775606750.768946 func=Report                   level=ERROR logid=FstOfsStorage unit=fst@eos-fst-2.eos-fst.eos.svc.kermes-dev.local:1095 tid=00007f2a825fb640 source=Report:56                      tident=<service> sec=      uid=0 gid=0 name= geo="" xt="" ob="" msg="cannot send report broadcast"
260408 02:06:00 time=1775606760.769443 func=Report                   level=ERROR logid=FstOfsStorage unit=fst@eos-fst-2.eos-fst.eos.svc.kermes-dev.local:1095 tid=00007f2a825fb640 source=Report:56                      tident=<service> sec=      uid=0 gid=0 name= geo="" xt="" ob="" msg="cannot send report broadcast"
260408 02:06:10 time=1775606770.770140 func=Report                   level=ERROR logid=FstOfsStorage unit=fst@eos-fst-2.eos-fst.eos.svc.kermes-dev.local:1095 tid=00007f2a825fb640 source=Report:56                      tident=<service> sec=      uid=0 gid=0 name= geo="" xt="" ob="" msg="cannot send report broadcast"

I’m not sure which report this is, where it is being broadcast to, or why it fails.
How can I fix it?

Thanks!

It seems to have been related to an SSS authentication problem from the FSTs to the MGMs.

Actually this still happens sometimes, but not always. Restarting FSTs consistently fixes it, but I can’t trigger it by e.g. restarting QDBs or MGMs. It just seems to eventually start happening after a few days …

Hi Ryan,

I’m not an expert on EOS internals, but I wanted to share a similar experience we had that might be worth checking, just in case.

In our environment, we noticed that while most FSTs had heartbeat times within a few milliseconds, one specific FST would consistently show a much higher, fixed lag. It eventually turned out to be a time synchronization (clock skew) issue on that node.

It might not be the definitive cause of your problem—especially since a daemon restart wouldn’t fix the underlying OS clock drift—but it’s possible that a fresh restart temporarily resets the session’s drift tolerance or clears accumulated timing errors in the reporting logic.

Also, if you haven’t already, I highly recommend switching to the MQ on QDB configuration. In our case, moving the Message Queue to QuarkDB significantly improved the stability of the heartbeats and overall communication between FSTs and MGMs. If the MQ is struggling to handle reports, it could definitely trigger those “cannot send report broadcast” errors intermittently.

Hi Geonmo,

Hmm interesting, thanks for the tip, I’ll keep an eye on that and check next time it happens.

Agreed, the no-MQ mode is much better; we have been using it.

Thanks!

Hi Ryan,

Just a quick follow-up after looking into the EOS/QDB telemetry architecture.

Actually, the 10-second interval of the error is a big clue. Under normal conditions, the FST is supposed to broadcast this report every second. Seeing the error every 10 seconds suggests that the previous transmission failed and the FST is hitting a 10-second retry back-off timer, with each retry then failing as well. This indicates a persistent connection failure rather than a sporadic one.
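
To confirm the cadence rather than eyeballing it, here is a minimal Python sketch that extracts the `time=` epoch field from the error lines and prints the gap between consecutive occurrences. It assumes the log format quoted earlier in this thread; the function name `error_gaps` is just illustrative, and the sample lines are abbreviated copies of the ones posted above.

```python
import re

# Epoch timestamp field as it appears in the FST log lines quoted in this thread.
TIME_RE = re.compile(r"\btime=(\d+\.\d+)")
ERROR_MSG = 'msg="cannot send report broadcast"'


def error_gaps(lines):
    """Return the gaps (in seconds) between consecutive matching error lines."""
    stamps = []
    for line in lines:
        if ERROR_MSG not in line:
            continue  # ignore unrelated log lines
        match = TIME_RE.search(line)
        if match:
            stamps.append(float(match.group(1)))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# Abbreviated copies of the three lines from the original post.
sample = [
    '260408 02:05:50 time=1775606750.768946 func=Report level=ERROR msg="cannot send report broadcast"',
    '260408 02:06:00 time=1775606760.769443 func=Report level=ERROR msg="cannot send report broadcast"',
    '260408 02:06:10 time=1775606770.770140 func=Report level=ERROR msg="cannot send report broadcast"',
]

for gap in error_gaps(sample):
    print(f"{gap:.4f} s since previous error")
```

If you feed it the lines of a real FST log file instead of `sample`, a run of gaps sitting very close to 10 seconds would be consistent with the retry loop described above, while irregular gaps would point to something more sporadic.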

Since you are running in no-MQ mode, this continuous failure strongly points to the FST’s TCP/application session to the QuarkDB-backed MQ getting stuck in a “half-open” state.

Since we are managing the infrastructure rather than modifying the source code, here are a few server-side and architectural angles worth checking:

  1. Keepalive & Session Timeout Settings: Since only an FST daemon restart fixes the issue, the FST is likely failing to detect that its socket has gone stale. It might be worth checking the OS-level TCP keepalive settings on the FST nodes, or seeing if there are any timeout/eviction policies on the QuarkDB side that are dropping the session without the FST realizing it.

  2. QuarkDB Leader Flaps: Since the FST talks directly to QDB, intermittent network hiccups or frequent QuarkDB leader elections could trigger the initial drop. If a leader change occurs and the FST fails to cleanly migrate its publisher session to the new active leader, it will get trapped in this 10-second failure loop.

  3. MGM Listener Drops: Due to architectural differences between the legacy MQ and the QDB-Redis messaging layer, if the MGM’s subscriber/listener count drops to 0 momentarily on the server side, it can cause the messaging channel to misbehave.
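
For the keepalive angle in point 1, a rough Python sketch like the one below reads the kernel-wide TCP keepalive settings on an FST node and works out how long an idle connection to a vanished peer can linger before the kernel gives up on it. Two caveats: this is Linux-specific, and these sysctls only apply to sockets that have `SO_KEEPALIVE` enabled. Whether the FST's connection to QuarkDB enables it is an assumption I have not verified. The function names are just illustrative.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/ipv4")
NAMES = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive(proc_dir=PROC_DIR):
    """Read the kernel-wide TCP keepalive settings (Linux only)."""
    return {name: int((proc_dir / name).read_text()) for name in NAMES}


def detection_seconds(settings):
    """Idle seconds before the kernel declares a silent peer dead.

    The first probe goes out after tcp_keepalive_time seconds of idleness,
    then tcp_keepalive_probes probes are sent tcp_keepalive_intvl apart.
    """
    return (settings["tcp_keepalive_time"]
            + settings["tcp_keepalive_intvl"] * settings["tcp_keepalive_probes"])


# Worked example with the stock Linux defaults (7200 s, 75 s, 9 probes).
linux_defaults = {
    "tcp_keepalive_time": 7200,
    "tcp_keepalive_intvl": 75,
    "tcp_keepalive_probes": 9,
}
print(detection_seconds(linux_defaults))  # 7200 + 75 * 9 = 7875

# On an FST node: print(detection_seconds(read_keepalive()))
```

With the stock defaults that is more than two hours, so even where keepalive is enabled, a stale socket could go unnoticed for a long time unless the values are lowered or the application has its own timeout.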

Tracing the MGM listener state in real time can be tricky, so as a practical first step I recommend focusing on the QuarkDB cluster status. Checking the QDB logs for leader re-elections or cluster flaps around the time the errors start might give you a clear answer.