Hi All, We’re half of the ALICE-USA team. Our EOS has been down for the past four days and we are seeking assistance in troubleshooting it.
Restarting the mgm will resurrect it only for an hour or so.
One of our three mgm nodes is reporting a NODE-HEALTH of YELLOW in redis raft-info while the other two are green. The command “fs ls” appears to hang in the eos shell, and we are seeing many errors in the /var/log/eos/mgm/xrdlog.mgm like this:
....lbl.gov:1094 tid=00007f06e83f6700 source=XrdMgmOfs:858 tident= sec= uid=0 gid=0 name= geo="" Unable to No such file or directory ; No such file or directory
We are using EOS version 5.2.31-1.el8.x86_64.
What are things to check here? Please let us know if more information would be helpful.
It’s hard to say what the problem might be only from this information. Could you please send us the output of the following commands?
redis-cli -p 7777 raft-info
eos ns
eos fs ls
Did you start any new activities recently that might explain the change in behavior? Like massive draining or balancing campaigns?
If needed you can contact me directly and we can have a look at your instance.
Thanks so much for your reply! We have not initiated any draining or balancing campaign. Our Cyber people noticed that two of our FSTs did not have IPv6, so we brought up those interfaces, and that is when the initial trouble with EOS falling offline started. I brought them back down and things stabilized for a while, but now we’re having issues again. It doesn’t make sense to me, but that was the only recent activity that occurred.
There is not much that I can say about the instance from the output. Clearly there is no MGM service running on that machine, but I can’t say anything about the reason. It may be worth having a look at the MGM logs to see whether there is a crash or a clear error message.
Otherwise, you can also contact me directly and we could arrange that I connect to your instance and investigate the issue. I think I’ve done this in the past with LBL.
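For the log check suggested above, here is a minimal sketch of how one might pull the most recent crash-like lines out of the MGM log. The default path is the one quoted earlier in this thread. The helper name scan_mgm_log and the keyword list are only assumptions for illustration, not an official list of EOS messages, so adjust them to whatever your log actually contains.

```shell
#!/bin/sh
# Sketch only: print the last 20 lines of an MGM log that look like a crash
# or fatal error. The function name and the keyword list are assumptions.
scan_mgm_log() {
    # Default to the log path mentioned earlier in the thread.
    log="${1:-/var/log/eos/mgm/xrdlog.mgm}"
    # -i: ignore case, -n: prefix each match with its line number,
    # -E: extended regular expression.
    grep -inE 'segv|sigabrt|segmentation|stack ?trace|fatal|core dumped' "$log" | tail -n 20
}
```

Calling scan_mgm_log with no argument reads the default path, and a different file (for example a rotated log) can be passed as the first argument. An empty result only means none of these keywords appeared, not that the MGM is healthy.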