Help troubleshooting EOS

Hi All, We’re half of the ALICE-USA team. Our EOS has been down for the past four days and we are seeking assistance in troubleshooting it.

Restarting the MGM brings it back for only an hour or so.

One of our three MGM nodes is reporting a NODE-HEALTH of YELLOW in redis raft-info while the other two are GREEN. The command “fs ls” appears to hang in the eos shell, and we are seeing many errors in /var/log/eos/mgm/xrdlog.mgm like this:

....lbl.gov:1094 tid=00007f06e83f6700 source=XrdMgmOfs:858 tident= sec= uid=0 gid=0 name= geo="" Unable to No such file or directory ; No such file or directory

We are using EOS version 5.2.31-1.el8.x86_64.

What are things to check here? Please let us know if more information would be helpful.

Thanks,

-k

Hi Karen,

It’s hard to say what the problem might be only from this information. Could you please send us the output of the following commands?
redis-cli -p 7777 raft-info

eos ns

eos fs ls
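
Since "fs ls" was reported to hang, it may help to run the three commands above under a time limit and save each output to a file for sending. The following is only a minimal shell sketch, not an EOS tool: the function names `run_capture` and `collect_eos_diag`, the `DIAG_TIMEOUT` variable and the output directory are hypothetical, and it assumes `timeout` from GNU coreutils is installed.

```shell
# Run one diagnostic command with a time limit and save its output.
# Usage: run_capture OUTDIR NAME COMMAND [ARGS...]
run_capture() {
    local outdir=$1 name=$2 rc=0
    shift 2
    mkdir -p "$outdir" || return 1
    # timeout(1) stops a command that hangs; it exits with 124 in that case
    timeout "${DIAG_TIMEOUT:-60}" "$@" >"$outdir/$name.txt" 2>&1 || rc=$?
    # record the exit status so a hang or failure is visible in the file
    printf 'exit status: %s\n' "$rc" >>"$outdir/$name.txt"
}

# Collect the three requested outputs into one directory.
# Usage: collect_eos_diag [OUTDIR] [PORT]
collect_eos_diag() {
    local outdir=${1:-eos-diag}
    local port=${2:-7777}   # port from the message above; adjust if yours differs
    run_capture "$outdir" raft-info redis-cli -p "$port" raft-info
    run_capture "$outdir" eos-ns eos ns
    run_capture "$outdir" eos-fs-ls eos fs ls
}
```

After sourcing the file, `collect_eos_diag` writes `raft-info.txt`, `eos-ns.txt` and `eos-fs-ls.txt` into `eos-diag/`; a different port can be passed as the second argument.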

Did you start any new activities recently that might explain the change in behavior, such as massive draining or balancing campaigns?
If needed, you can contact me directly and we can have a look at your instance.

Cheers,
Elvin

Karen Fernsler kmfernsler@lbl.gov

Hi Elvin,

Thanks so much for your reply! We have not initiated any draining or balancing campaign. Our cybersecurity people noticed that two of our FSTs did not have IPv6, so we brought up those interfaces, and that is when the initial trouble with EOS falling offline started. I brought them back down and things stabilized for a while, but now we’re having issues again. It doesn’t make sense to me, but that was the only recent activity that occurred.

Here is the requested output:

[root@alicemgm0.lbl.gov ~]# redis-cli -p 7000 raft-info

  1. TERM 141027
  2. LOG-START 2683200000
  3. LOG-SIZE 2733276333
  4. LEADER alicemgm1.lbl.gov:7000
  5. CLUSTER-ID LBL_HPCS
  6. COMMIT-INDEX 2733276332
  7. LAST-APPLIED 2733276332
  8. BLOCKED-WRITES 0
  9. LAST-STATE-CHANGE 1281 (21 minutes, 21 seconds)
  10. MYSELF alicemgm0.lbl.gov:7000
  11. VERSION 5.2.31.1
  12. STATUS FOLLOWER
  13. NODE-HEALTH YELLOW
  14. JOURNAL-FSYNC-POLICY sync-important-updates
  15. MEMBERSHIP-EPOCH 0
  16. NODES alicemgm0.lbl.gov:7000,alicemgm1.lbl.gov:7000,alicemgm2.lbl.gov:7000
  17. OBSERVERS
  18. QUORUM-SIZE 2

[root@alicemgm0.lbl.gov ~]# eos ns
error: MGM root://localhost not online/reachable

[root@alicemgm0.lbl.gov ~]# eos fs ls
error: MGM root://localhost not online/reachable

[root@alicemgm0.lbl.gov ~]#
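
To compare the three nodes side by side, the raft-info output above could be reduced to a few fields per node. This is a rough shell sketch under assumptions: `raft_field` and `raft_summary` are hypothetical helper names, the field names are taken from the output above, and the difference between COMMIT-INDEX and LAST-APPLIED is printed only as a number, without any claim about what makes a node YELLOW.

```shell
# Print the value of one field from raft-info output read on stdin.
# Accepts both "13) NODE-HEALTH YELLOW" and "13. NODE-HEALTH YELLOW".
raft_field() {
    awk -v key="$1" '
        { sub(/^[ \t]*[0-9]+[.)][ \t]*/, "") }            # drop the list number
        $1 == key { $1 = ""; sub(/^ /, ""); print; exit } # print the rest of the line
    '
}

# Summarise health and the gap between committed and applied entries.
raft_summary() {
    local info health commit applied
    info=$(cat)
    health=$(printf '%s\n' "$info" | raft_field NODE-HEALTH)
    commit=$(printf '%s\n' "$info" | raft_field COMMIT-INDEX)
    applied=$(printf '%s\n' "$info" | raft_field LAST-APPLIED)
    printf 'health=%s apply_lag=%s\n' "$health" "$(( ${commit:-0} - ${applied:-0} ))"
}
```

It could be used in a loop over the three hosts listed under NODES, for example `redis-cli -h alicemgm1.lbl.gov -p 7000 raft-info | raft_summary`.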

Thanks,

-k

Hi Karen,

There is not much that I can say about the instance from this output. Clearly there is no MGM service running on that machine, but I can’t say anything about the reason. It may be worth looking at the MGM logs to see whether there is a crash or a clear error message.

Otherwise, you can also contact me directly and we could arrange that I connect to your instance and investigate the issue. I think I’ve done this in the past with LBL.
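
As a possible starting point for the log review, the error lines could be grouped by their `source=` tag (as in the `source=XrdMgmOfs:858` line quoted earlier in the thread) to see which code locations report most often. This is only a sketch: `top_sources` is a hypothetical helper name, and it assumes every line of interest carries a `source=` field followed by a space.

```shell
# Count log lines per "source=" tag, most frequent first.
# Usage: top_sources LOGFILE [N]
top_sources() {
    local logfile=$1 n=${2:-10}
    # extract each source=... token, then count identical tokens
    grep -o 'source=[^ ]*' "$logfile" | sort | uniq -c | sort -rn | head -n "$n"
}
```

For example, `top_sources /var/log/eos/mgm/xrdlog.mgm 5` would list the five most frequent sources in the MGM log.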

Cheers,
Elvin