Help troubleshooting EOS

kmwf · January 27, 2026, 7:49pm

Hi All, We’re half of the ALICE-USA team. Our EOS has been down for the past four days and we are seeking assistance in troubleshooting it.

Restarting the mgm will resurrect it only for an hour or so.

One of our three mgm nodes is reporting a NODE-HEALTH of YELLOW in redis raft-info while the other two are green. The command “fs ls” appears to hang in the eos shell, and we are seeing many errors in the /var/log/eos/mgm/xrdlog.mgm like this:

....lbl.gov:1094 tid=00007f06e83f6700 source=XrdMgmOfs:858 tident= sec= uid=0 gid=0 name= geo="" Unable to No such file or directory ; No such file or directory

We are using EOS version 5.2.31-1.el8.x86_64.

What are things to check here? Please let us know if more information would be helpful.

Thanks,

-k

esindril · January 28, 2026, 7:44am

Hi Karen,

It’s hard to say what the problem might be only from this information. Could you please send us the output of the following commands?
redis-cli -p 7777 raft-info

eos ns

eos fs ls

Did you start any new activities recently that might explain the change in behavior? Like massive draining or balancing campaigns?
If needed you can contact me directly and we can have a look at your instance.

Cheers,
Elvin

kmwf · January 28, 2026, 10:38pm

Karen Fernsler kmfernsler@lbl.gov

Hi Elvin,

Thanks so much for your reply! We have not initiated any draining or balancing campaign. Our Cyber people noticed that two of our FSTs did not have IPv6, so we brought up those interfaces, and that is when EOS first started falling offline. I brought them back down and things stabilized for a while, but now we’re having issues again. It doesn’t make sense to me, but that was the only recent activity that occurred.

Here is the requested output:

[root@alicemgm0.lbl.gov ~]# redis-cli -p 7000 raft-info

TERM 141027
LOG-START 2683200000
LOG-SIZE 2733276333
LEADER alicemgm1.lbl.gov:7000
CLUSTER-ID LBL_HPCS
COMMIT-INDEX 2733276332
LAST-APPLIED 2733276332
BLOCKED-WRITES 0
LAST-STATE-CHANGE 1281 (21 minutes, 21 seconds)
MYSELF alicemgm0.lbl.gov:7000
VERSION 5.2.31.1
STATUS FOLLOWER
NODE-HEALTH YELLOW
JOURNAL-FSYNC-POLICY sync-important-updates
MEMBERSHIP-EPOCH 0
NODES alicemgm0.lbl.gov:7000,alicemgm1.lbl.gov:7000,alicemgm2.lbl.gov:7000
OBSERVERS
QUORUM-SIZE 2
[root@alicemgm0.lbl.gov ~]# eos ns
error: MGM root://localhost not online/reachable
[root@alicemgm0.lbl.gov ~]# eos fs ls
error: MGM root://localhost not online/reachable
[root@alicemgm0.lbl.gov ~]#
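
To compare the three nodes at a glance, the raft-info fields can be condensed to one line per node. This is only a sketch: `raft_summary` is a hypothetical helper, not an EOS or QuarkDB tool, and `apply_lag` is simply COMMIT-INDEX minus LAST-APPLIED as they appear in the output above.

```shell
#!/bin/sh
# Hypothetical helper: condense one node's raft-info output (read from stdin)
# into a single line. Field names are taken from the output above;
# apply_lag is a derived value (COMMIT-INDEX - LAST-APPLIED), not a raft-info field.
# Assumed usage on each MGM node:  redis-cli -p 7000 raft-info | raft_summary
raft_summary() {
    awk '
        $1 == "MYSELF"       { me = $2 }
        $1 == "STATUS"       { status = $2 }
        $1 == "NODE-HEALTH"  { health = $2 }
        $1 == "LEADER"       { leader = $2 }
        $1 == "COMMIT-INDEX" { commit = $2 }
        $1 == "LAST-APPLIED" { applied = $2 }
        END {
            printf "%s status=%s health=%s leader=%s apply_lag=%d\n",
                   me, status, health, leader, commit - applied
        }
    '
}
```

For the output above this gives `alicemgm0.lbl.gov:7000 status=FOLLOWER health=YELLOW leader=alicemgm1.lbl.gov:7000 apply_lag=0`, so the journal on this node appears fully applied and the YELLOW state would need another explanation.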

Thanks,

-k

esindril · January 29, 2026, 1:45pm

Hi Karen,

There is not much that I can say about the instance from this output. Clearly there is no MGM service running on that machine, but I can’t say anything about the reason. It may be worth looking at the MGM logs to see if there is a crash or a clear error message.

Otherwise, you can also contact me directly and we could arrange that I connect to your instance and investigate the issue. I think I’ve done this in the past with LBL.
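
As a first pass over the MGM log, it can help to see which code locations produce most of the messages. The sketch below assumes each log line carries a `source=File:line` token, as in the sample line quoted in the first post; `tally_sources` is a hypothetical helper, not part of EOS.

```shell
#!/bin/sh
# Hypothetical helper: count log lines per "source=" token, most frequent first.
# Assumes the token format seen in the quoted sample (e.g. source=XrdMgmOfs:858).
# Assumed usage:  tally_sources < /var/log/eos/mgm/xrdlog.mgm
tally_sources() {
    awk '{
        # Scan every whitespace-separated field for a source= token.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^source=/) count[$i]++
    }
    END { for (s in count) printf "%d %s\n", count[s], s }' | sort -rn
}
```

A location that dominates the count is a reasonable place to start reading the surrounding log lines; it does not by itself identify the cause.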

Cheers,
Elvin
