Several EOS servers with xrootd daemon dead

Hello,

Something very strange happened: over the last 2 or 3 days, I observed 4 cases of EOS servers with the xrootd/fst daemon suddenly dying. First findings:

  • nothing in the xrdlog
  • the xrootd process is no longer active
  • nothing in the system metrics (CPU, RAM, network load)
  • the service restarts normally

I have no idea what could cause this; there have been no recent changes on the servers (plus they are monitored by tripwire).

Any similar experiences?

This is:
Scientific Linux 6.6
Kernel : 2.6.32-696.10.3.el6.x86_64
eos-server-4.5.6-1.el6.x86_64

Thank you

JM

Hi JM,

We experienced similar issues with FST services silently failing on Cent6 last December with a 2.6.32-696 kernel.

The FST had been stable for many months, was under no significant load, and resource contention was not applicable, but suddenly began failing - often within minutes of a restart, leading to the thread:

In our case, a kernel change at the time caused the havoc. Reverting to an older kernel immediately resolved the issue, and rebooting to the problematic kernel reproduced the issue immediately.

We subsequently moved to a newer kernel and the issue did not return.

The kernel we were experiencing the issue on (2.6.32-696.30.1.el6.x86_64) appears to be of roughly similar vintage to the one you reported (2.6.32-696.10.3.el6.x86_64).

We reverted at the time to kernel-2.6.32-696.16.1.

Currently running 2.6.32-754.17.1.el6.x86_64 on that system since Aug

Pete

Hi Pete,

Thank you very much for the hint. I had forgotten your post, and it is very similar to what happened here. I am going to look for a recent kernel and will have to schedule a reboot of the servers. What is strange is that there were no problems for months, there have been no system updates, and suddenly I see this problem on 4 different servers. It is as if the issue were externally triggered…

JM

News on this:
There have been several crashes during the Christmas holiday (17 on Dec 26th, 5 at the beginning of January). I had installed a cron job to restart the EOS fst in case it crashed, so I believe the consequences were minimal.
However, I added an instruction to the restart script to list the last lines in the fst log and it seems that the last line is always related to MgmSyncer. Example:

200105 02:18:20 time=1578187100.067398 func=MgmSyncer level=INFO logid=static… unit=fst@nanxrd09.in2p3.fr:1095 tid=00007f8e832fb700 source=MgmSyncer:81 tident= sec=(null) uid=99 gid=99 name=- geo="" fxid=06275b85 mtime=1578186992

I can very well update all servers to the latest kernel (this is Scientific Linux 6), but I am still not sure that this is related to a particular kernel.

JM

Hi JM,

Just curious - what kernel are the affected FSTs currently running?

Pete

Hi Pete,

This is 2.6.32-696.10.3.el6.x86_64

I could almost say “this was”: today I will finish updating the remaining servers to the latest kernel in SL6.

JM

Hello all,
A crash happened early this morning, despite having updated the operating system (SL6.6). Again, the last line in the FST log was related to MgmSyncer:

200210 06:55:27 time=1581314127.051342 func=MgmSyncer level=INFO logid=static.............................. unit=fst@nanxrd04.in2p3.fr:1095 tid=00007fd854eb7700 source=MgmSyncer:81 tident= sec=(null) uid=99 gid=99 name=- geo="" fxid=066737a1 mtime=1581314036
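
For anyone comparing the last FST log lines across crashes, the key=value layout of such a line can be split mechanically. The following is only a sketch in Python 3: the function name `parse_fields` is made up for illustration, and reading `mtime` as a file modification time and `fxid` as a hexadecimal file id is an assumption based on the field names, not something confirmed in this thread.

```python
import re
from datetime import datetime, timezone

# Sample line copied from the post above (the logid value is elided in the post).
LINE = (
    '200210 06:55:27 time=1581314127.051342 func=MgmSyncer level=INFO '
    'logid=static.............................. '
    'unit=fst@nanxrd04.in2p3.fr:1095 tid=00007fd854eb7700 '
    'source=MgmSyncer:81 tident= sec=(null) uid=99 gid=99 name=- geo="" '
    'fxid=066737a1 mtime=1581314036'
)


def parse_fields(line):
    """Return the key=value pairs of one log line as a dict of strings.

    Hypothetical helper: empty values such as 'tident=' become ''.
    """
    return dict(re.findall(r"(\w+)=(\S*)", line))


fields = parse_fields(LINE)
logged_at = float(fields["time"])    # epoch seconds of the log entry
file_mtime = float(fields["mtime"])  # assumed: epoch seconds, file modification
lag = logged_at - file_mtime

print(fields["func"], fields["source"], "fxid", fields["fxid"])
print("logged at (UTC):", datetime.fromtimestamp(logged_at, tz=timezone.utc))
print("seconds between mtime and log entry: %.1f" % lag)
```

Running the same parsing over the last line saved at each crash would show whether the same `fxid` or the same `source` location keeps coming back, which may or may not be significant.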

The load did not seem extreme at this time. There may be something to dig into…
I had installed a script that automatically restarts the FST daemon if it is dead and still has a lock.
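
For readers who want something similar, here is a minimal sketch of such a watchdog in Python 3. It is not the script used on these servers. The lock file path, log path, restart command and process pattern are all placeholders (assumptions) that must be adapted to the local installation, and the function names are made up for illustration.

```python
import os
import subprocess
from collections import deque

# All four values are assumptions, NOT taken from this thread.
LOCK_FILE = "/var/lock/subsys/xrootd.fst"           # assumed lock location
LOG_FILE = "/var/log/eos/fst/xrdlog.fst"            # assumed FST log path
RESTART_CMD = ["service", "eos", "restart", "fst"]  # assumed restart command
PROCESS_PATTERN = "xrootd.*fst"                     # assumed pgrep -f pattern


def process_alive(pattern):
    """Return True if `pgrep -f` finds at least one matching process."""
    return subprocess.call(
        ["pgrep", "-f", pattern], stdout=subprocess.DEVNULL
    ) == 0


def tail(path, n=20):
    """Return the last n lines of a text file."""
    with open(path, errors="replace") as fh:
        return list(deque(fh, maxlen=n))


def should_restart(alive, lock_present):
    """Dead daemon plus leftover lock: treat it as a crash, not a clean stop."""
    return (not alive) and lock_present


def watchdog(dry_run=True):
    """Check once; show the log tail and restart if the daemon crashed.

    Returns True when a crash was detected. With dry_run=True nothing
    is restarted, the log tail is only printed.
    """
    alive = process_alive(PROCESS_PATTERN)
    if not should_restart(alive, os.path.exists(LOCK_FILE)):
        return False
    if os.path.exists(LOG_FILE):
        for line in tail(LOG_FILE):
            print(line, end="")
    if not dry_run:
        subprocess.call(RESTART_CMD)
    return True
```

A cron entry would call `watchdog(dry_run=False)` every few minutes. Keeping the decision in `should_restart` separate from the side effects makes it easy to check the logic without touching the service.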

JM

Hello,
It continues to happen but I have seen messages in /var/log/messages that I did not notice before:

Feb 21 05:06:45 nanxrd12 kernel: xrootd[11021] general protection ip:7f05faf87d50 sp:7f05f0b88ce0 error:0 in libXrdEosFst-4.so[7f05fae0d000+33d000]
Feb 21 05:28:39 nanxrd12 kernel: xrootd[20259]: segfault at 0 ip 00007fed56e86a4d sp 00007fed41cf6cf0 error 4 in libXrdEosFst-4.so[7fed56d0c000+33d000]
Feb 21 05:30:06 nanxrd12 kernel: xrootd[23245]: segfault at 140 ip 0000003a4f634f30 sp 00007f4446860b70 error 4 in libc-2.12.so[3a4f600000+18b000]
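
A note for anyone decoding these lines: the value in brackets is the load address and size of the library, so subtracting the load address from `ip` gives the offset of the faulting instruction inside the library. For `segfault` lines, the `error` value is the x86 page-fault error code (bit 0: protection violation rather than page not present, bit 1: write, bit 2: user mode). The sketch below, in Python 3, applies this to the three lines above. The regular expression and function names are written for this example only and are not part of any tool.

```python
import re

LINES = [
    "Feb 21 05:06:45 nanxrd12 kernel: xrootd[11021] general protection ip:7f05faf87d50 sp:7f05f0b88ce0 error:0 in libXrdEosFst-4.so[7f05fae0d000+33d000]",
    "Feb 21 05:28:39 nanxrd12 kernel: xrootd[20259]: segfault at 0 ip 00007fed56e86a4d sp 00007fed41cf6cf0 error 4 in libXrdEosFst-4.so[7fed56d0c000+33d000]",
    "Feb 21 05:30:06 nanxrd12 kernel: xrootd[23245]: segfault at 140 ip 0000003a4f634f30 sp 00007f4446860b70 error 4 in libc-2.12.so[3a4f600000+18b000]",
]

# Covers both formats seen above ("ip:ADDR" and "ip ADDR").
FAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]:? "
    r"(?P<kind>general protection|segfault)(?: at (?P<addr>[0-9a-f]+))? "
    r"ip[: ](?P<ip>[0-9a-f]+) sp[: ](?P<sp>[0-9a-f]+) "
    r"error[: ](?P<err>[0-9a-f]+) in "
    r"(?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def describe_page_fault(code):
    """Decode the low three bits of an x86 page-fault error code."""
    mode = "user" if code & 4 else "kernel"
    access = "write" if code & 2 else "read"
    cause = "protection violation" if code & 1 else "page not present"
    return "%s-mode %s, %s" % (mode, access, cause)


def analyse(line):
    """Return library, offset and fault details for one kernel log line."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    info = {
        "pid": int(m.group("pid")),
        "kind": m.group("kind"),
        "lib": m.group("lib"),
        # Offset of the faulting instruction relative to the library base.
        "offset": int(m.group("ip"), 16) - int(m.group("base"), 16),
    }
    if m.group("kind") == "segfault":
        # The bit decoding only applies to page faults, not to GP faults.
        info["fault_addr"] = int(m.group("addr"), 16)
        info["detail"] = describe_page_fault(int(m.group("err"), 16))
    return info


results = [analyse(line) for line in LINES]
for r in results:
    print(r["lib"], hex(r["offset"]), r.get("detail", r["kind"]))
```

For these lines the offsets come out as 0x17ad50 and 0x17aa4d in libXrdEosFst-4.so and 0x34f30 in libc-2.12.so. The two FST offsets are only a few hundred bytes apart, which may point to the same function, although only symbol resolution could confirm that. With matching debuginfo installed, an offset can usually be given to `addr2line -e <library> <offset>` to get a source location.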

There are references to a similar bug when googling “linux crash ‘error 4 in libc-2.12.so’”, but I have been unable to relate them directly to what I observe.
This is probably something that happens only on CentOS6.
JM

Last night around 8 PM, 4 EOS servers crashed simultaneously, with the same messages as in my previous post. It looks like this is a problem with xrootd that is triggered by some kind of activity.
Though I could subscribe and post to the xrootd-l mailing list, I think it would be better if a member of the EOS development team did it and followed the issue. It may be a known issue in xrootd (reminder: this is on SL6).
Thanks JM

Which EOS/xrootd version are you running?

Hi Andreas,
This is 4.5.6:

[root@nanxrd01 ~]# rpm -qa | grep eos
    eos-client-4.5.6-1.el6.x86_64
    libmicrohttpd-0.9.38-eos.yves.slc6.x86_64
    eos-folly-2017.09.18.00-4.el6.x86_64
    eos-apmon-1.1.8-1.el6.x86_64
    eos-server-4.5.6-1.el6.x86_64
[root@nanxrd01 ~]# rpm -qa | grep xrootd
    xrootd-client-libs-4.9.1-1.el6.x86_64
    xrootd-4.9.1-1.el6.x86_64
    xrootd-libs-4.9.1-1.el6.x86_64
    xrootd-client-4.9.1-1.el6.x86_64
    xrootd-server-4.9.1-1.el6.x86_64
    xrootd-selinux-4.9.1-1.el6.noarch
    xrootd-server-libs-4.9.1-1.el6.x86_64
    xrootd-private-devel-4.9.1-1.el6.x86_64

Hello,
During the weekend, we again observed a series of crashes of the xrootd daemon on several FST servers. This happened between April 12th 8 PM and April 13th 2 AM. I believe that a specific activity combined with a bug or a vulnerability in the xrootd daemon causes this. I hope the origin of this issue can be found, as it is rather disruptive. I have a cron job which restarts EOS on the FST when the daemon is stopped and the lock is still there, but it does not work 100% of the time, so manual intervention is sometimes required.
JM