A long-overdue upgrade at JRC, some suggestions?

For small changes you can probably use the script we use for updates and adapt it.
Essentially it puts an FST into RO mode when nobody is writing, updates the software, restarts it and puts it back into RW.
Still, for this big update, just announce a short downtime and update all the FSTs in one go.
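For reference, the per-FST sequence is roughly the following (only a sketch: the fsid placeholder, the package selection and the systemd unit name depend on your deployment):

eos fs config <fsid> configstatus=ro    # stop accepting new writes on this FST
# wait until ongoing transfers have finished
yum update eos-server eos-client        # or whatever packages you deploy
systemctl restart eos@fst
eos fs config <fsid> configstatus=rw    # put the FST back into production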

Hello,

We are finally planning this upgrade for the coming weeks, as we unfortunately had to work on other priorities until now.

We will therefore have a shutdown period for our whole production instance, upgrade to the latest version, restart the MGM, then the FSTs, and hopefully end up with a working instance on the latest 4.8.x version.

Do you have any suggestions on the following points:

  • which version to pick? We see that the stable repository has 4.8.62, the testing repository has 4.8.91, and the version mentioned on the website is 4.8.66. But maybe you have advice based on your experience running your instances over the last months.
  • is there anything in the configuration that might have changed and would need to be reviewed during this large upgrade step?
  • we will of course make a backup of the QuarkDB data; do you consider it necessary to also back up the LevelDB data on each FST? If yes, what would be the correct way to do it?
  • anything else to consider apart from the points above?

Hi Franck,
we run 4.9.91 in production on CERNBOX. That should be good for you!

We also now run all LHC instances on EOS5.

From which version are you upgrading?

There is no need to backup LevelDB on the FST.

Hi Andreas,

Thank you for your answer. We are upgrading all the way from 4.5.15/4.5.17, which is why we are asking for extra suggestions, if you think of any.

I suppose you mean 4.8.91 for your CERNBox production (I do not see any 4.9.91 version).

We also plan to upgrade to EOS5 after that, but we first need to align on 4.x.x.

Just for the sake of completeness:

I noticed that the package eos-debuginfo is no longer available since version 4.8.23, is that normal? Is there a replacement package that we are not aware of, or a new mechanism for getting debug information?

Hi Franck,

Yes, the debug symbols are no longer stripped from the binaries/libraries, since stripping them was crashing gdb when trying to debug things. Therefore, there is no need for an eos-debuginfo package.
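If you want to double-check on an installed node, something like this should do (the library path is only an example, point it at whatever binary or library you want to debug):

file /usr/lib64/libXrdEosMgm.so                     # should report "not stripped"
objdump -h /usr/lib64/libXrdEosMgm.so | grep debug  # the .debug_* sections should be listed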

Cheers,
Elvin

OK, thank you for your answer. So debugging will keep working as before, right?

Another question; I'm not sure if it is linked to the upgrade, but I noticed that the Converter (used by the GroupBalancer) seems to have an incompatibility with the eos vid publicaccesslevel limitation. Many files get rejected because the source cannot be read when the path is too deep in the hierarchy, with the following error:

220929 12:08:00 ERROR ConversionJob:383              msg="[ERROR] Server responded with an error: [3010] Unable to access - public access level restriction /eos/contingency/new/franck/test/copy/dir-52/dir-26/file-09; Permission denied" tpc_src=root://eos-contingency-master.cidsn.jrc.it:1094//eos/contingency/new/franck/test/copy/dir-52/dir-26/file-09 tpc_dst=root://eos-contingency-master.cidsn.jrc.it:1094//eos/contingency/proc/conversion/00000000000ca9e1:default.0#00100112^groupbalancer^ conversion_id=00000000000ca9e1:default.0#00100112^groupbalancer^ 

It is strange since, as I understand it, these third-party copies for conversions are done with maximum privileges.

When increasing eos vid publicaccesslevel to a high value (I did not find a way to disable the limitation entirely), the conversion of these deep files works.

Hi Franck,

Yes, you will still have the debug info.

For non-CERNBox-enabled instances the public access level is 1024 by default, which should cover even the most extreme cases.

The conversion jobs are run by the daemon user, so in order to have them run with max privileges the daemon user needs to be in the list of sudoers on your instance. Also, the public access level only applies to requests done for user nobody, i.e. uid=99, so it means that the TPC job is not able to authenticate with sss to the MGM. Is sss in the list of sec.protbind protocols accepted for both localhost and *?
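For reference, it would be something along these lines (adapt to your instance; the sec.protbind lines below are only an illustration of what the MGM's /etc/xrd.cf.mgm typically contains):

eos vid set membership daemon +sudo    # grant sudo rights to the daemon user
eos vid ls                             # verify the sudoer list and the publicaccesslevel

# in /etc/xrd.cf.mgm, sss must be accepted for both localhost and *, e.g.:
sec.protbind localhost.localdomain sss unix
sec.protbind localhost sss unix
sec.protbind * only sss krb5 unix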

Cheers,
Elvin

Thank you for your answer.

The public access level was reduced only for the sake of testing; no problem keeping the default value of 1024.

The problem was indeed the sudoer membership. I think the daemon user has never been a member of it since we started our instances. It seems this does not cause any issue for the conversions, apart from overriding the public access level limitation. Now that the daemon user has been given sudoer privileges, the conversions work even with a low public access level.

The sss protocol is correctly accepted for both localhost and *.

After upgrading the test instance to the latest version, 4.8.91, we noticed a considerably longer startup time of the MGM compared to the previous version, and we were wondering if this is normal or whether we might have an issue in the configuration. It is not very long and is quite reasonable for a restart, i.e. around 20 s, but previously it was more like 4-5 s on the test instance, which has few files and few FSTs (3).

By startup time I mean the time after which the EOS server responds to commands (for instance eos ns or eos ls). This is not due to the namespace booting, which eos ns reports as taking less than 2 s, but seems to be linked to the preparation of the different threads.

Do you think this might be normal, or could there be something that blocks it temporarily? We would not like this time to become too long when we upgrade the much bigger production instance.
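For context, one simple way to measure this (assuming the MGM runs as the eos@mgm systemd unit) is to restart it and poll until it answers:

systemctl restart eos@mgm
start=$(date +%s)
until eos ns >/dev/null 2>&1; do sleep 1; done
echo "MGM answered after $(( $(date +%s) - start )) s"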

Yet another quick question:

Is the variable EOS_USE_NEW_INODES still necessary when working with high file id numbers (we hit this issue a few years ago), or is it now the default?

Hi Franck,

The only thing I can think of is that in the latest versions we moved the IoStat information from the file-based setup to QuarkDB, and the conversion happens automatically the first time you run the newer version. I would take a few more samples, i.e. restart the MGM and see exactly what the distribution of the startup time is. Besides this, there were no particular modifications that would explain it.

Concerning the EOS_USE_NEW_INODES variable: yes, it still needs to be there. We will probably default to the new schema at some point in EOS 5, but the variable is still required even in the latest EOS 5 release.

Cheers,
Elvin

OK, thank you for your answer. The iostat conversion was probably done transparently at first launch.

How can I see the distribution of the startup time? By looking at the logs, or is there some more detailed tool that I can trigger?

Thank you,
Franck

There are no special metrics collected for the startup time. You can collect this yourself based on the first timestamp in the logs and the moment when the master MGM is elected. So look for this log message:

220819 03:55:25 time=1660874125.953557 func=Supervisor               level=INFO  logid=35c44b7c-1e4f-11ed-b4d4-dc4a3e6b9f27 unit=mgm@esdss000.cern.ch:1094 tid=00007f62e6352700 source=QdbMaster:268                  tident=<service> sec=      uid=0 gid=0 name= geo="" old_is_master=true, is_master=true, old_master_id=esdss000.cern.ch:1094, master_id=esdss000.cern.ch:1094
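Something along these lines should give you the number for each restart (just a sketch, assuming the default log location /var/log/eos/mgm/xrdlog.mgm; adapt the path and the grep pattern to your setup):

LOG=/var/log/eos/mgm/xrdlog.mgm
first=$(head -n1 "$LOG" | awk '{print $1" "$2}')                        # e.g. "221004 12:53:52"
start=$(date -d "20${first:0:2}-${first:2:2}-${first:4:2} ${first#* }" +%s)
elected=$(grep -m1 'source=QdbMaster' "$LOG" | grep -o 'time=[0-9]*' | cut -d= -f2)
echo "master elected $(( elected - start )) s after the first log line"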

Cheers,
Elvin

Actually, this line you mention comes very quickly, about 1 s after start:

221004 12:53:52 29813 Starting on Linux 3.10.0-1160.76.1.el7.x86_64
221004 12:53:53 time=1664880833.280162 func=Supervisor               level=INFO  logid=d58172d6-43d2-11ed-ab6a-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f61337fd700 source=QdbMaster:268                  tident=<service> sec=      uid=0 gid=0 name= geo="" old_is_master=false, is_master=true, old_master_id=, master_id=s-jrciprcids90v.cidsn.jrc.it:1094

However, the moment when the MGM starts answering queries is long after that. I identified this line as the answer to the eos ns command, so 16 seconds later:

221004 12:54:09 time=1664880849.128100 func=open                     level=INFO  logid=df8ccb90-43d2-11ed-b68d-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f6170407700 source=XrdMgmOfsFile:502              tident=root.30231:389@localhost.localdomain sec=sss   uid=0 gid=0 name=daemon geo="JRC" op=read path=/proc/admin/ info=eos.rgid=0&eos.ruid=0&mgm.cmd.proto=GgQKAigB+AEB

I noticed a few things:

  1. There are two moments when no messages are written for quite a long time (5 and then 4 seconds), meaning the MGM is not doing anything:
221004 12:53:58 time=1664880838.431826 func=Supervisor               level=INFO  logid=d58172d6-43d2-11ed-ab6a-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f61337fd700 source=QdbMaster:268                  tident=<service> sec=      uid=0 gid=0 name= geo="" old_is_master=true, is_master=true, old_master_id=s-jrciprcids90v.cidsn.jrc.it:1094, master_id=s-jrciprcids90v.cidsn.jrc.it:1094
221004 12:54:03 time=1664880843.367706 func=UpdateTransferList       level=INFO  logid=static.............................. unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f60e4bfe700 source=GroupBalancer:101              tident= sec=(null) uid=99 gid=99 name=- geo="" msg="group_balancer update transfers" scheduledtransfers=0

221004 12:54:04 time=1664880844.404425 func=RepairErrs               level=INFO  logid=d5772902-43d2-11ed-ab6a-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f60dc7fc700 source=Fsck:481                       tident=<service> sec=      uid=0 gid=0 name= geo="" msg="loop in fsck repair thread"
221004 12:54:08 time=1664880848.450424 func=AcquireLease             level=INFO  logid=d58172d6-43d2-11ed-ab6a-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f61337fd700 source=QdbMaster:506                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="qclient acquire lease call took 6ms"

  2. A few seconds before the MGM is unblocked, there are these messages:
++++++ (c) 2020 CERN/IT-ST EOS MGM Authorization (XrdMgmAuthz) 4.8.91
------ XrdMgmAuthz plugin initialization successful
Config warning: asynchronous I/O has been disabled!
Config warning: sendfile I/O has been disabled!
Config warning: 'xrootd.prepare logdir' not specified; prepare tracking disabled.
------ xrootd protocol initialization completed.
------ xrootd mgm@s-jrciprcids90v.cidsn.jrc.it:1094 initialization completed.
221004 12:54:09 29819 XrootdXeq: ?:386@s-jrciprcids94v disc 0:00:00
221004 12:54:09 time=1664880849.096161 func=Configure                level=WARN  logid=static.............................. unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f6174290780 source=XrdMgmOfsConfigure:2351        tident= sec=(null) uid=99 gid=99 name=- geo="" write lock [ eosView ] held for 15000 milliseconds
221004 12:54:09 time=1664880849.096195 func=Configure                level=INFO  logid=static.............................. unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f6174290780 source=XrdMgmOfsConfigure:2355        tident= sec=(null) uid=99 gid=99 name=- geo="" msg="after mutex test
221004 12:54:09 time=1664880849.098461 func=main                     level=WARN  logid=static.............................. unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f61307f7700 source=MutexLatencyWatcher:61         tident= sec=(null) uid=99 gid=99 name=- geo="" acquisition of mutex eosViewRWMutex took 14833 milliseconds
221004 12:54:09 time=1664880849.102495 func=backgroundThread         level=INFO  logid=static.............................. unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f60cb3ff700 source=BulkRequestProcCleaner:71      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="BulkRequestProcCleaner did one round of cleaning, nbDeletedBulkRequests=0"
221004 12:54:09 29820 XrootdXeq: root.30231:389@localhost.localdomain pvt IPv4 login as daemon
221004 12:54:09 29819 XrootdXeq: root.30240:387@localhost.localdomain pvt IPv4 login as daemon
221004 12:54:09 time=1664880849.128017 func=IdMap                    level=INFO  logid=static.............................. unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f6170407700 source=Mapping:1073                   tident= sec=(null) uid=99 gid=99 name=- geo="" sec.prot=sss sec.name="daemon" sec.host="localhost.localdomain" sec.vorg="" sec.grps="daemon" sec.role="" sec.info="" sec.app="" sec.tident="root.30231:389@localhost.localdomain" vid.uid=0 vid.gid=0

Namely: the authorization plugin is loaded a bit late, and there is this message about the mutex being held for 15 seconds.

  3. There are these error messages; we might have errors in our configuration, but I cannot identify which ones:
221004 12:54:04 time=1664880844.097849 func=get                      level=ERROR logid=d699f21a-43d2-11ed-ab6a-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f60bbbe0700 source=ConfigStore:73                 tident=<service> sec=      uid=0 gid=0 name= geo="" msg="Failed to load key from Configstore" err="msg=Failed Numeric conversion" key= error_msg=Invalid argument
221004 12:54:04 time=1664880844.097893 func=get                      level=ERROR logid=d699f21a-43d2-11ed-ab6a-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f60bbbe0700 source=ConfigStore:73                 tident=<service> sec=      uid=0 gid=0 name= geo="" msg="Failed to load key from Configstore" err="msg=Failed Numeric conversion" key= error_msg=Invalid argument

Maybe some of these are good leads.

Anyway, I am attaching the full log of the startup phase so that you can perhaps compare it to what it should look like.

Hi Franck,

In the 4.8.* branch there is still a dummy test [1] to check that the lock-warning functionality works, and it indeed sleeps for 15 seconds during the configuration step of the MGM. I was under the impression you were already testing EOS 5; that part of the code is removed in EOS 5.

Cheers,
Elvin

[1] https://gitlab.cern.ch/dss/eos/-/blob/master/mgm/XrdMgmOfsConfigure.cc#L2352

OK, thank you.

So this 15-second delay is normal and fixed for every instance, so no particular surprise is to be expected when we upgrade the production instance.

As written above, given the big version step we have, we will first upgrade to the latest 4.8.x version and then upgrade to v5 at a later stage.

Hello,

I wanted to inform you that the upgrade of the whole instance to the latest 4.8.x version (4.8.91) was performed successfully last week, without any problem on the EOS side (we had our own hardware issues…). I want to thank all of you for your suggestions and your help.

Running this brand-new version after such a huge step did not affect the functioning of our systems in any notable way. As expected, it works for the users at least as well as before, maybe better, though long-term stability remains to be confirmed. For sure, on the management side we have a lot of improvements.

Unfortunately, though, we already have to report a first incident: the MGM stopped working last night, after 8 days of operation. The MGM stopped serving the FUSE clients at 00:00:00 (log activity dropped immediately after that time), the FSTs started to log query errors, and it finally also stopped answering statistics requests 6 hours later. But the daemon had not crashed; it was still running, just hung. A restart restored the service.

Does getting blocked exactly at midnight ring a bell to you?

Nothing else very useful was found in the logs: no particular error message, just a sudden drop in activity. Do you think it would be useful to provide you with the logs (quite large in our case)?

Hi Franck,

We definitely don’t have any special hooks to slow down the service at midnight. :wink: I believe this is just a coincidence. Can you give us a bit more detail about the symptoms? Were any of the admin commands working, e.g. eos node ls, eos ns, eos fs ls?
Did you take a stacktrace of the MGM process? This might point to the thread that (possibly) got stuck and then affected all the others.
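If it happens again, something like this should capture it (a sketch; it assumes gdb is installed and that the MGM is the only xrootd process on the node, otherwise take the MGM PID explicitly, e.g. from systemctl status eos@mgm):

gdb -p $(pidof -s xrootd) -batch -ex 'thread apply all bt' > /tmp/mgm_stacktrace.txt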
Yes, please send us the logs, maybe there is something relevant that we can use.

Cheers,
Elvin

Thank you @esindril for your answer.

Unfortunately, the MGM was restarted before a stacktrace could be taken. We will keep it in mind for next time (which hopefully will not occur).

About the symptoms, I can say that the admin commands kept working for around 6 hours after the clients stopped working. This is extrapolated from what we observe in our metrics, which were extracted correctly for about 6 hours (so between midnight and 6 am). However, when we arrived at the office around 8 am, they were no longer working (timing out).

About the coincidence with the clock, I was wondering whether some particular tasks run at that time, because there are several of these log lines which appear daily at 00:00:00.

221021 00:00:00 62154 Copr.  2004-2012 Stanford University, xrd version v4.12.8
221021 00:00:00 62154 xrootd mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 running Linux 3.10.0-1160.76.1.el7.x86_64

And literally, in the logs, everything started blocking at 00:00:00 or 00:00:01.

I am sending you a link in a private message where you can download the MGM logs of that period (between 20/10/2022 19:00 and 21/10/2022 10:00; this also includes the restart). Note that our logs are rotated hourly and the log filter is the following (to avoid too many log lines):

eos debug info --filter IdMap,Commit,_access,_utimes,readlink,_readlink,BroadcastReleaseFromExternal,BroadcastRefreshFromExternal,BroadcastMD,OpSetFile,RefreshEntry,HandleMD,FillContainerCAP,Store,OpGetLs,OpGetCap,log_info,Imply,OpSetDirectory

Hi Franck,

Thanks for the logs, I will check them out and let you know if there is anything suspicious there.

The log message at 00:00:00 is printed by default by the XRootD daemon at midnight, and it is expected to see it in the logs.

Cheers,
Elvin