A long-overdue upgrade at JRC, some suggestions?

Hello,

I wanted to inform you that the upgrade of the whole instance to the latest 4.8.x version (4.8.91) was performed successfully last week, without any problems on the EOS side (we had our own hardware issues…). I want to thank all of you for your suggestions and your help.

Running this brand-new version after such a big jump didn’t noticeably affect the functioning of our systems. As expected, it works for the users at least as well as before, maybe better, though long-term stability remains to be confirmed. On the management side we certainly have a lot of improvements.

Unfortunately, though, we already have to report a first incident: the MGM stopped working last night, after 8 days of operation. It stopped serving the FUSE clients at 00:00:00 (log activity dropped immediately after that time), the FSTs started to log query errors, and 6 hours later the MGM also stopped answering statistics requests. The daemon hadn’t crashed; it was still running but hung. A restart restored the service.

Does the fact that it got blocked exactly at midnight ring a bell to you?

Nothing else very useful was found in the logs: no particular error message, just a sudden drop in activity. Would it be useful for us to provide you with the logs (quite large in our case)?

Hi Franck,

We definitely don’t have any special hooks that slow down the service at midnight. :wink: I believe this is just a coincidence. Can you give us a bit more detail about the symptoms? Were any of the admin commands working, e.g. eos node ls, eos ns, eos fs ls?
Did you take a stacktrace of the MGM process? This might point to the thread that (possibly) got stuck and was then affecting all the others.
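For illustration, here is a minimal sketch of how such a stacktrace could be captured next time, assuming the MGM runs as the usual xrootd process and that elfutils (eu-stack) or gdb is available on the node; the pgrep pattern and output paths are only examples:

# find the PID of the MGM xrootd process (adjust the pattern if several xrootd daemons run on the host)
MGM_PID=$(pgrep -f "xrootd.*mgm" | head -n 1)

# dump the stacks of all threads with elfutils (minimal impact on the running process)
eu-stack -p "$MGM_PID" > /tmp/mgm-stack-$(date +%s).txt

# alternative with gdb (briefly pauses the process while the backtraces are taken)
gdb -p "$MGM_PID" -batch -ex "thread apply all bt" > /tmp/mgm-gdb-bt-$(date +%s).txt

Taking two or three dumps a minute apart makes it easier to tell which threads are genuinely stuck rather than just momentarily busy.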
Yes, please send us the logs, maybe there is something relevant that we can use.

Cheers,
Elvin

Thank you @esindril for your answer.

Unfortunately, the MGM was restarted before a stacktrace could be extracted. We will keep it in mind for next time (which hopefully will not occur).

About the symptoms, I can say that the admin commands kept working for around 6 hours after the clients had stopped working. This is inferred from our metrics, which were extracted correctly for about 6 hours (so between midnight and 6 am). However, when we arrived at the office around 8 am, they were no longer working (timing out).
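Just as a sketch of what such a probe could look like (assuming the eos CLI and the coreutils timeout command are available on the monitoring host), running a few cheap admin commands with a hard timeout would make a hung MGM show up explicitly rather than as a silent gap in the metrics:

# hypothetical monitoring probe: cheap admin commands with a hard timeout
for cmd in "ns stat" "node ls" "fs ls"; do
    timeout 10 eos $cmd > /dev/null 2>&1 || echo "eos $cmd failed or timed out at $(date)"
done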

About the coincidence with the clock, I was wondering whether something particular runs at that time, because several of these log lines appear daily at 00:00:00.

221021 00:00:00 62154 Copr.  2004-2012 Stanford University, xrd version v4.12.8
221021 00:00:00 62154 xrootd mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 running Linux 3.10.0-1160.76.1.el7.x86_64

And literally, in the logs, everything started blocking at 00:00:00 or 00:00:01.
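For what it is worth, a rough way to visualise that drop (assuming the default MGM log location /var/log/eos/mgm/xrdlog.mgm; with our hourly rotation the file name has to be adjusted to the rotated log covering that hour) is to count log lines per second around midnight:

# count MGM log lines per second in the first minutes after midnight (221021 = 21/10/2022)
grep "^221021 00:0" /var/log/eos/mgm/xrdlog.mgm | awk '{print $2}' | sort | uniq -c | head -20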

I am sending you a link in a private message where you can download the MGM logs from that period (between 20/10/2022 19:00 and 21/10/2022 10:00, which also includes the restart). Note that our logs are rotated hourly and that the log setting is the following one (to avoid too many log lines):

eos debug info --filter IdMap,Commit,_access,_utimes,readlink,_readlink,BroadcastReleaseFromExternal,BroadcastRefreshFromExternal,BroadcastMD,OpSetFile,RefreshEntry,HandleMD,FillContainerCAP,Store,OpGetLs,OpGetCap,log_info,Imply,OpSetDirectory
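If the hang reappears, it might also help to raise the verbosity for a short window before restarting; a possible sequence, only a sketch using the same eos debug command as above (full debug can be very chatty on a busy instance):

eos debug debug
# ... observe for a few minutes, take the stack dumps, collect the logs ...
eos debug info
# then re-apply the --filter list shown above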

Hi Franck,

Thanks for the logs, I will check them out and let you know if there is anything suspicious there.

The log message at 00:00:00 is printed by default by the XRootD daemon at midnight, so it is expected to appear in the logs.
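A quick sanity check (a sketch; it assumes older rotated logs are still on disk under /var/log/eos/mgm/) is to verify that the same banner shows up at 00:00:00 on every previous day as well, not only on the day of the hang:

zgrep -h "Copr.*Stanford University" /var/log/eos/mgm/xrdlog.mgm* 2>/dev/null | sort | head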

Cheers,
Elvin