We experienced similar issues with FST services silently failing on CentOS 6 last December, with a 2.6.32-696 kernel.
The FST had been stable for many months, was under no significant load, and resource contention was not applicable, but suddenly began failing - often within minutes of a restart, leading to the thread:
In our case, a kernel change at the time caused the havoc. Reverting to an older kernel immediately resolved the issue, and rebooting to the problematic kernel reproduced the issue immediately.
We subsequently moved to a newer kernel and the issue did not return.
The kernel we were experiencing the issue on (2.6.32-696.30.1.el6.x86_64) appears to be of roughly the same vintage as the one you reported (2.6.32-696.10.3.el6.x86_64).
We reverted at the time to kernel-2.6.32-696.16.1.
We have been running 2.6.32-754.17.1.el6.x86_64 on that system since August.
Thank you very much for the hint. I had forgotten your post, and it is very similar to what happened here. I am going to look for a recent kernel and will have to schedule a reboot of the servers. What is strange is that there were no problems for months and there have been no system updates, yet suddenly I see this problem on 4 different servers. It is as if the issue were externally triggered…
News on this:
There were several crashes during the Christmas holiday (17 on Dec 26th!, 5 at the beginning of January). I had installed a cron job to restart the EOS FST in case it crashed, so I believe the consequences were minimal.
However, I added an instruction to the restart script to list the last lines in the fst log and it seems that the last line is always related to MgmSyncer. Example:
It continues to happen but I have seen messages in /var/log/messages that I did not notice before:
Feb 21 05:06:45 nanxrd12 kernel: xrootd general protection ip:7f05faf87d50 sp:7f05f0b88ce0 error:0 in libXrdEosFst-4.so[7f05fae0d000+33d000]
Feb 21 05:28:39 nanxrd12 kernel: xrootd: segfault at 0 ip 00007fed56e86a4d sp 00007fed41cf6cf0 error 4 in libXrdEosFst-4.so[7fed56d0c000+33d000]
Feb 21 05:30:06 nanxrd12 kernel: xrootd: segfault at 140 ip 0000003a4f634f30 sp 00007f4446860b70 error 4 in libc-2.12.so[3a4f600000+18b000]
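For anyone comparing such lines across servers, here is a minimal Python sketch that pulls the useful fields out of them. It assumes the two message layouts shown above; the function and field names are my own, not from any EOS or xrootd tool. It computes the offset of the faulting instruction inside the library (ip minus the load base in brackets) and, for the "segfault" form only, decodes the x86 page-fault error code (bit 0: protection violation rather than unmapped page, bit 1: write rather than read, bit 2: user mode).

```python
import re

# Common tail of both message forms: "in <lib>[<base>+<size>]"
_TAIL = r"in (?P<lib>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"

SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp [0-9a-f]+ "
    r"error (?P<err>[0-9a-f]+) " + _TAIL
)
GP_RE = re.compile(
    r"general protection ip:(?P<ip>[0-9a-f]+) sp:[0-9a-f]+ "
    r"error:(?P<err>[0-9a-f]+) " + _TAIL
)


def decode(line):
    """Return a dict describing one kernel fault line, or None if it does not match."""
    seg = SEGFAULT_RE.search(line)
    m = seg or GP_RE.search(line)
    if m is None:
        return None
    result = {
        "kind": "segfault" if seg else "general protection",
        "lib": m.group("lib"),
        # Offset of the faulting instruction relative to the library load address.
        "offset": hex(int(m.group("ip"), 16) - int(m.group("base"), 16)),
    }
    if seg:
        # The page-fault bit layout applies only to the "segfault" form.
        err = int(m.group("err"), 16)
        result["fault_addr"] = hex(int(m.group("addr"), 16))
        result["access"] = "write" if err & 2 else "read"
        result["page"] = "protection violation" if err & 1 else "not mapped"
        result["mode"] = "user" if err & 4 else "kernel"
    return result
```

Applied to the three lines above, this gives offsets 0x17ad50 and 0x17aa4d in libXrdEosFst-4.so and 0x34f30 in libc-2.12.so, and "error 4" decodes as a user-mode read of an unmapped page. The two FST offsets are only 0x303 bytes apart, which may mean the same or neighbouring functions, but only symbol resolution would confirm that (for example with addr2line -f -e on the library, if matching debug symbols are installed).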
Googling “linux crash ‘error 4 in libc-2.12.so’” turns up references to a similar bug, but I have been unable to relate them directly to what I observe.
This is probably something that happens only on CentOS 6.
Last night around 8PM, 4 EOS servers crashed simultaneously with the same message as in my previous post. It looks like this is a problem with xrootd that is triggered by some kind of activity.
Though I could subscribe and post to the xrootd-l mailing list, I think it would be better if a member of the EOS development team did so and followed the issue. It may be a known issue in xrootd (reminder: this is on SL6).
During the weekend, we again observed a series of crashes of the xrootd daemon on several FST servers. They happened between April 12th 8PM and April 13th 2AM. I believe that a specific activity combined with a bug or a vulnerability in the xrootd daemon causes this. I hope the origin of this issue can be found, as it is rather disruptive. I have a cron job which restarts EOS on the FST when the daemon is stopped and the lock is still there, but it does not always work, so manual intervention is sometimes required.
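In case it helps others running a similar workaround, the check described above (daemon gone, lock still present, then save the last log lines and restart) could be sketched roughly as follows. This is an illustration only, not the script actually used here. It assumes Python 3 is available on the server, and the process pattern, lock file, log file and restart command are all left as parameters because the real values depend on the installation.

```python
import os
import subprocess
from collections import deque


def tail(path, n=20):
    """Return the last n lines of a text file (fewer if the file is shorter)."""
    with open(path, errors="replace") as f:
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


def is_running(pattern):
    """True if pgrep finds a process whose command line matches pattern."""
    return subprocess.call(["pgrep", "-f", pattern],
                           stdout=subprocess.DEVNULL) == 0


def crashed(running, lock_exists):
    """A lock left behind with no live process is treated as a crash."""
    return lock_exists and not running


def check_once(pattern, lock_file, log_file, restart_cmd,
               running=is_running, run=subprocess.call):
    """Restart the service if it looks crashed.

    Returns the log tail captured before the restart, or None if
    nothing was done. All arguments are site-specific (hypothetical here).
    """
    if not crashed(running(pattern), os.path.exists(lock_file)):
        return None
    # Capture the log tail first, so the lines just before the crash are kept.
    last_lines = tail(log_file) if os.path.exists(log_file) else []
    run(restart_cmd)
    return last_lines
```

A small script run from cron would call check_once with the local values and append the returned lines to a file with a timestamp, so that the last message before each crash can be compared across servers afterwards.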