FST error every 10 seconds: "cannot send report broadcast"

rptaylor · April 8, 2026, 12:19am

We seem to get these errors on FSTs pretty regularly every 10 seconds, running EOS 5.4.1:


260408 02:05:50 time=1775606750.768946 func=Report                   level=ERROR logid=FstOfsStorage unit=fst@eos-fst-2.eos-fst.eos.svc.kermes-dev.local:1095 tid=00007f2a825fb640 source=Report:56                      tident=<service> sec=      uid=0 gid=0 name= geo="" xt="" ob="" msg="cannot send report broadcast"
260408 02:06:00 time=1775606760.769443 func=Report                   level=ERROR logid=FstOfsStorage unit=fst@eos-fst-2.eos-fst.eos.svc.kermes-dev.local:1095 tid=00007f2a825fb640 source=Report:56                      tident=<service> sec=      uid=0 gid=0 name= geo="" xt="" ob="" msg="cannot send report broadcast"
260408 02:06:10 time=1775606770.770140 func=Report                   level=ERROR logid=FstOfsStorage unit=fst@eos-fst-2.eos-fst.eos.svc.kermes-dev.local:1095 tid=00007f2a825fb640 source=Report:56                      tident=<service> sec=      uid=0 gid=0 name= geo="" xt="" ob="" msg="cannot send report broadcast"

Not sure what report, where it is broadcasting to, or why it failed.
How can I fix it?

Thanks!

rptaylor · April 9, 2026, 5:43pm

It seems to have been related to an SSS authentication problem from the FSTs to the MGMs.

rptaylor · April 28, 2026, 10:29pm

Actually this still happens sometimes, but not always. Restarting FSTs consistently fixes it, but I can’t trigger it by e.g. restarting QDBs or MGMs. It just seems to eventually start happening after a few days …

GeonmoRyu · May 6, 2026, 10:09am

Hi Ryan,

I’m not an expert on EOS internals, but I wanted to share a similar experience we had that might be worth checking, just in case.

In our environment, we noticed that while most FSTs had heartbeat times within a few milliseconds, a specific FST would consistently show a much higher, fixed lag. It eventually turned out to be a time synchronization (clock skew) issue on that node.

It might not be the definitive cause of your problem—especially since a daemon restart wouldn’t fix the underlying OS clock drift—but it’s possible that a fresh restart temporarily resets the session’s drift tolerance or clears accumulated timing errors in the reporting logic.

Also, if you haven’t already, I highly recommend switching to the MQ on QDB configuration. In our case, moving the Message Queue to QuarkDB significantly improved the stability of the heartbeats and overall communication between FSTs and MGMs. If the MQ is struggling to handle reports, it could definitely trigger those “cannot send report broadcast” errors intermittently.

rptaylor · May 6, 2026, 9:51pm

Hi Geonmo,

Hmm interesting, thanks for the tip, I’ll keep an eye on that and check next time it happens.

Agreed, the no-MQ mode is much better, we have been using that.

Thanks!

GeonmoRyu · May 21, 2026, 9:47am

Hi Ryan,

Just a quick follow-up after looking into the EOS/QDB telemetry architecture.

Actually, the 10-second interval of the error is a big clue. Under normal conditions, the FST is supposed to broadcast this report every 1 second. The fact that you are seeing it exactly every 10 seconds means the previous transmission failed, and the FST is hitting a 10-second retry back-off timer—which then continuously fails as well. This indicates a persistent connection failure rather than a sporadic one.
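
To confirm the interval from the FST log itself, here is a rough sketch that prints the gap between consecutive error lines using the `time=` epoch field shown in your excerpt. The function name and the log path in the usage comment are my assumptions, not something taken from EOS documentation.

```shell
# Print the gap in seconds between consecutive "cannot send report broadcast"
# lines read from stdin, based on the time=<epoch> field of each log line.
report_error_gaps() {
  awk '/cannot send report broadcast/ {
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^time=/) {
        t = substr($i, 6) + 0          # strip the "time=" prefix
        if (seen) printf "%.1f\n", t - prev
        prev = t; seen = 1
      }
    }
  }'
}

# Usage (the log path is an assumption; adjust to your deployment):
# report_error_gaps < /var/log/eos/fst/xrdlog.fst
```

A steady stream of values close to 10 would match what you posted; irregular gaps would suggest the failures are more sporadic than they look.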

Since you are running in no-MQ mode, this continuous failure strongly points to the FST’s TCP/application session to the QuarkDB-backed MQ getting stuck in a “half-open” state.

Since we are managing the infrastructure rather than modifying the source code, here are a few server-side and architectural angles worth checking:

Keepalive & Session Timeout Settings: Since only an FST daemon restart fixes the issue, the FST is likely failing to detect that its socket has gone stale. It might be worth checking the OS-level TCP keepalive settings on the FST nodes, or seeing if there are any timeout/eviction policies on the QuarkDB side that are dropping the session without the FST realizing it.
QuarkDB Leader Flaps: Since the FST talks directly to QDB, intermittent network hiccups or frequent QuarkDB leader elections could trigger the initial drop. If a leader change occurs and the FST fails to cleanly migrate its publisher session to the new active leader, it will get trapped in this 10-second failure loop.
MGM Listener Drops: Due to architectural differences between the legacy MQ and the QDB-Redis messaging layer, if the MGM’s subscriber/listener count drops to 0 momentarily on the server side, it can cause the messaging channel to misbehave.
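
For the keepalive point, a minimal sketch for reading the kernel TCP keepalive settings on an FST node. The function name is hypothetical, and the optional directory argument exists only so the function can be pointed at a copy of the files; by default it reads the standard Linux location.

```shell
# Print the kernel TCP keepalive settings.
# $1 (optional): directory holding the sysctl files, default /proc/sys/net/ipv4.
show_tcp_keepalive() {
  dir="${1:-/proc/sys/net/ipv4}"
  for key in tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes; do
    if [ -r "$dir/$key" ]; then
      printf '%s = %s\n' "$key" "$(cat "$dir/$key")"
    else
      printf '%s = (not readable)\n' "$key"
    fi
  done
}

# Usage on an FST node:
# show_tcp_keepalive
# ss -tno state established '( dport = :7777 )'   # -o adds timer info per socket
```

Keep in mind that these kernel values only apply to sockets on which the application has enabled SO_KEEPALIVE. Whether the FST does so for its QDB connections is something I have not verified.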

Tracing the MGM listener state in real-time can be tricky, so as a practical first step, I highly recommend focusing on the QuarkDB cluster status first. Checking the QDB logs for any leader re-elections or cluster flaps around the exact timestamp the errors start might give you a clear answer.

rptaylor · June 16, 2026, 7:40pm

Thanks Geonmo for the tips!

We configure the FSTs with

fstofs.qdbcluster eos-qdb.eos.svc.cluster.local:7777

which has A records for each QDB member:

$ host eos-qdb.eos.svc.kermes-dev.local
eos-qdb.eos.svc.cluster.local has address 10.224.9.70
eos-qdb.eos.svc.cluster.local has address 10.224.8.221
eos-qdb.eos.svc.cluster.local has address 10.224.3.102

Actually I’m not 100% sure if this is a supported way for the FST to discover all 3 individual QDB members … ? Or how to query a MGM or FST to see what it knows about the QDB cluster.

I triggered a QDB failover as a test, but it did not seem to cause this FST error to occur, at least not within several minutes.

A while ago I did notice that the QDB members reported “yellow” health status because the storage volume was 50% full. I switched to larger volumes and have not noticed the issue since then, but I’m not sure if it’s actually related, and there were other changes since then as well.

GeonmoRyu · June 17, 2026, 2:34am

Hello, Ryan.

Thanks for sharing the details! This is very interesting.

Looking at your configuration (eos-qdb.eos.svc.cluster.local:7777), it heavily implies that you might be running this inside a Kubernetes (k8s) environment. Is that correct?

If so, this single DNS mapping with multiple A records could indeed be the root cause. Here are a few thoughts and architectural recommendations regarding QuarkDB in k8s:

1. QuarkDB Must Be Stateful (StatefulSet + Headless Service)

QuarkDB is a stateful service based on the Raft consensus algorithm. It should never be deployed as a stateless Deployment/replica. In fact, within the entire EOS ecosystem, QuarkDB is practically the only component that must be treated as strictly stateful.

Because k8s rotates Pod IPs naturally, using a single service endpoint causes the FST to lose track of individual members during a failover or IP rotation. To fix this, you should configure QuarkDB using a StatefulSet and a Headless Service so that each individual QDB member obtains a fixed, independent hostname (e.g., qdb-0.eos..., qdb-1.eos...). Then, you should explicitly list all three hostnames in the FST configuration, separated by spaces:

fstofs.qdbcluster qdb-0.eos.svc...:7777 qdb-1.eos.svc...:7777 qdb-2.eos.svc...:7777
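
As a sketch of how that member list could be generated, assuming the usual StatefulSet pod DNS pattern `<set>-<ordinal>.<service>.<namespace>.svc.<domain>`. The function name and the values in the usage comment are hypothetical; substitute the names from your own deployment.

```shell
# Build a space-separated member list for fstofs.qdbcluster from StatefulSet
# pod DNS names: <set>-<ordinal>.<service>.<namespace>.svc.<domain>:<port>
# Args: set_name service namespace domain replicas [port, default 7777]
qdb_members() {
  set_name="$1"; service="$2"; namespace="$3"; domain="$4"; replicas="$5"
  port="${6:-7777}"
  i=0
  members=""
  while [ "$i" -lt "$replicas" ]; do
    # Prepend a space only when the list already has an entry.
    members="${members:+$members }${set_name}-${i}.${service}.${namespace}.svc.${domain}:${port}"
    i=$((i + 1))
  done
  printf '%s\n' "$members"
}

# Hypothetical values; replace with your StatefulSet, service, namespace, domain:
# echo "fstofs.qdbcluster $(qdb_members eos-qdb eos-qdb eos cluster.local 3)"
```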

2. How to Check What QDB Knows (Cluster Status)

You mentioned you weren’t sure how to query what the cluster knows. Since QuarkDB speaks the Redis protocol, you can inspect the Raft cluster topology directly.

Find the current QuarkDB Leader node, log into it, and run the following command:

redis-cli -p 7777 raft-info

Under normal and healthy conditions, you should look for these key indicators:

NODE-HEALTH: Should be GREEN
QUORUM-SIZE: Should be 2 (for a 3-node cluster)
REPLICA: You should see the clear list of active follower nodes.
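
To pick out just those three indicators, a small filter like the following may help. It is only a sketch: the function name is hypothetical, and it assumes each raft-info field is printed as `KEY value`, optionally prefixed with a redis-cli list index such as `14) ` when the output was copied from a terminal.

```shell
# Keep only the raft-info lines discussed above (NODE-HEALTH, QUORUM-SIZE,
# REPLICA). Reads raft-info output on stdin.
raft_summary() {
  # Drop an optional 'N) ' list index and surrounding double quotes,
  # then keep the lines that start with one of the three keys.
  sed -E 's/^ *[0-9]+\) *//; s/^"//; s/"$//' |
    grep -E '^(NODE-HEALTH|QUORUM-SIZE|REPLICA) '
}

# Usage on a QDB node:
# redis-cli -p 7777 raft-info | raft_summary
```

Running it against every QDB member, not just the leader, makes it easier to spot a node whose view of the cluster differs from the others.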

3. Regarding the “Yellow” Health Status

You noted that the QDB members previously reported a “yellow” health status and suspected it was due to the storage volume being 50% full.

From our experience, NODE-HEALTH usually turns YELLOW when the Leader loses communication with one of its followers. In a 3-node cluster, if the Leader cannot talk to one of the followers, the cluster reaches its minimum required quorum size and shows a YELLOW warning. As soon as they reconnect, it goes back to GREEN.

Given this, the yellow status you saw might not have been a disk capacity issue, but rather an indicator of inter-node communication issues or network flaps between the QDB instances themselves. It’s definitely worth checking the network stability and firewall/routing between those QDB nodes.

Regards,

-- Geonmo
