Actually this still happens sometimes, but not always. Restarting FSTs consistently fixes it, but I can’t trigger it by e.g. restarting QDBs or MGMs. It just seems to eventually start happening after a few days …
I’m not an expert on EOS internals, but I wanted to share a similar experience we had that might be worth checking, just in case.
In our environment, we noticed that while most FSTs had heartbeat times within a few milliseconds, one specific FST consistently showed a much higher, fixed lag. It eventually turned out to be a time synchronization (clock skew) issue on that node.
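In case it helps, here is a rough sketch of how one could spot that pattern (a large but nearly constant lag) from a set of heartbeat lag samples. The input format, the host names, and the thresholds are all made up for illustration; how you collect the per-FST lag values depends on your own monitoring.

```python
from statistics import median, pstdev


def find_fixed_lag_nodes(samples, lag_threshold_ms=100.0, jitter_ms=20.0):
    """Return hosts whose heartbeat lag is both large and nearly constant.

    samples: dict mapping host name -> list of heartbeat lag values in ms.
    This is a hypothetical input format, not something EOS produces as-is.
    """
    suspects = {}
    for host, lags in samples.items():
        if len(lags) < 3:
            continue  # too few samples to say anything
        mid = median(lags)
        spread = pstdev(lags)
        # A big offset with little jitter looks like clock skew, whereas
        # load or network trouble usually gives a noisy, varying lag.
        if abs(mid) >= lag_threshold_ms and spread <= jitter_ms:
            suspects[host] = (mid, spread)
    return suspects


if __name__ == "__main__":
    example = {
        "fst01": [3, 5, 4, 6, 2],                  # healthy
        "fst02": [1502, 1499, 1505, 1501, 1498],   # large and constant
        "fst03": [4, 900, 7, 3, 1200],             # noisy spikes
    }
    for host, (mid, spread) in find_fixed_lag_nodes(example).items():
        print(f"{host}: median lag {mid:.0f} ms, stddev {spread:.1f} ms")
```

Only the node with a steady offset gets flagged here; the one with occasional spikes is left alone, since that looks more like load than skew.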
It might not be the definitive cause of your problem—especially since a daemon restart wouldn’t fix the underlying OS clock drift—but it’s possible that a fresh restart temporarily resets the session’s drift tolerance or clears accumulated timing errors in the reporting logic.
Also, if you haven’t already, I highly recommend switching to the MQ on QDB configuration. In our case, moving the message queue to QuarkDB significantly improved the stability of the heartbeats and the overall communication between FSTs and MGMs. If the MQ is struggling to handle reports, that could well trigger those “cannot send report broadcast” errors intermittently.
Just a quick follow-up after looking into the EOS/QDB telemetry architecture.
Actually, the 10-second interval of the error is a big clue. Under normal conditions, the FST is supposed to broadcast this report every 1 second. The fact that you are seeing it exactly every 10 seconds means the previous transmission failed, and the FST is hitting a 10-second retry back-off timer—which then continuously fails as well. This indicates a persistent connection failure rather than a sporadic one.
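If you want to double-check the spacing before reading too much into it, a small script along these lines could confirm whether the errors really recur at a fixed interval. It only takes a list of timestamps that you have already pulled out of the FST log yourself; I am not assuming any particular EOS log format, and the function names and tolerances are just my own choices.

```python
from datetime import datetime


def error_intervals(timestamps):
    """Seconds between consecutive error occurrences (ISO-style timestamps)."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def looks_like_fixed_retry(timestamps, expected_s=10.0, tolerance_s=1.0, min_events=4):
    """True if every gap between errors is within tolerance of expected_s."""
    gaps = error_intervals(timestamps)
    if len(gaps) < min_events - 1:
        return False  # not enough events to call it a pattern
    return all(abs(gap - expected_s) <= tolerance_s for gap in gaps)


if __name__ == "__main__":
    # Hypothetical timestamps copied by hand from the error lines.
    seen = [
        "2024-05-01 12:00:00",
        "2024-05-01 12:00:10",
        "2024-05-01 12:00:20",
        "2024-05-01 12:00:30",
    ]
    print(error_intervals(seen))
    print(looks_like_fixed_retry(seen))
```

A steady run of roughly 10-second gaps would fit the retry-loop picture, while irregular gaps would point more towards sporadic failures.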
Since you are running in no-MQ mode, this continuous failure strongly points to the FST’s TCP/application session to the QuarkDB-backed MQ getting stuck in a “half-open” state.
Since we are managing the infrastructure rather than modifying the source code, here are a few server-side and architectural angles worth checking:
Keepalive & Session Timeout Settings: Since only an FST daemon restart fixes the issue, the FST is likely failing to detect that its socket has gone stale. It might be worth checking the OS-level TCP keepalive settings on the FST nodes, or seeing if there are any timeout/eviction policies on the QuarkDB side that are dropping the session without the FST realizing it.
QuarkDB Leader Flaps: Since the FST talks directly to QDB, intermittent network hiccups or frequent QuarkDB leader elections could trigger the initial drop. If a leader change occurs and the FST fails to cleanly migrate its publisher session to the new active leader, it will get trapped in this 10-second failure loop.
MGM Listener Drops: Due to architectural differences between the legacy MQ and the QDB-Redis messaging layer, if the MGM’s subscriber/listener count drops to 0 momentarily on the server side, it can cause the messaging channel to misbehave.
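For the keepalive point above, this is a minimal sketch of how to read the kernel's TCP keepalive settings on an FST node and work out how long an idle, dead connection could go unnoticed. The three sysctl files are the standard Linux ones; the helper names are mine. One caveat: these values only apply to sockets that have SO_KEEPALIVE enabled, and I don't know whether the FST's QuarkDB client does that, so treat the result as an upper bound to compare against rather than a diagnosis.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/ipv4")  # standard Linux sysctl location


def read_keepalive_settings(proc_dir=PROC_DIR):
    """Read the three TCP keepalive sysctls as integers."""
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    return {name: int((proc_dir / name).read_text().strip()) for name in names}


def dead_peer_detection_seconds(settings):
    """Worst-case time before the kernel gives up on an idle, silent peer.

    Idle time before the first probe, plus one interval per unanswered probe.
    """
    return (
        settings["tcp_keepalive_time"]
        + settings["tcp_keepalive_intvl"] * settings["tcp_keepalive_probes"]
    )


if __name__ == "__main__":
    try:
        current = read_keepalive_settings()
    except OSError:
        print("keepalive sysctls are not readable on this host")
    else:
        print(current, "->", dead_peer_detection_seconds(current), "seconds")
```

With the usual Linux defaults (7200, 75, 9) this comes to 7875 seconds, i.e. more than two hours, which is far too slow to help with a connection that silently went stale.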
Tracing the MGM listener state in real time can be tricky, so as a practical first step I recommend focusing on the QuarkDB cluster status. Checking the QDB logs for leader re-elections or cluster flaps around the exact timestamp at which the errors start might give you a clear answer.
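To make that comparison less manual, something like the following could line up the two sets of timestamps. It assumes you have already extracted the leader-change times from the QDB logs and the first error time from the FST log by hand; the function name and the 60-second window are arbitrary choices of mine, not anything EOS or QuarkDB defines.

```python
from datetime import datetime, timedelta


def leader_changes_near(error_start, leader_change_times, window_s=60):
    """Return leader changes within window_s seconds of the first error.

    Each hit is (timestamp, offset_seconds); a negative offset means the
    leader change happened before the errors started.
    """
    start = datetime.fromisoformat(error_start)
    window = timedelta(seconds=window_s)
    hits = []
    for ts in leader_change_times:
        moment = datetime.fromisoformat(ts)
        if abs(moment - start) <= window:
            hits.append((ts, (moment - start).total_seconds()))
    return hits


if __name__ == "__main__":
    # Hypothetical values, copied by hand from the respective logs.
    first_error = "2024-05-01 12:00:30"
    elections = ["2024-05-01 03:14:00", "2024-05-01 12:00:05"]
    for ts, offset in leader_changes_near(first_error, elections):
        print(f"leader change at {ts} ({offset:+.0f} s relative to first error)")
```

A leader change a few seconds before the first error would support the failed-migration theory; no match at all would suggest looking at the keepalive or listener angles instead.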