A long-due upgrade at JRC, any suggestions?


We have been delaying the upgrade of our EOS production instance for too long, partly for internal reasons and partly because our instance is very stable with the versions we are currently running (MGM 4.5.15, FSTs 4.5.17), and we prefer to avoid overly frequent upgrades.

But now it is time to take a step forward, especially with the promising EOS5 we saw at last week’s workshop, so we would like to move to a recent 4.x version first.

First question, for the EOS experts or other site managers who have upgraded recently: which version would you recommend we pick?

Second question, more practical: what is the suggested upgrade scenario, between these two:

  • Gradual: upgrade the MGM, then within a few days do a rolling upgrade of the FSTs. This limits the downtime to just the MGM upgrade (using the HA setup requires a DNS change, so we prefer not to switch and to stick with the current master), but the MGM and FSTs would run with a version mismatch. We know this makes the Converter (and therefore the GroupBalancer) unusable because of a change, and it needs to be disabled (which is already the case), but is there any other known downside? We are running a test instance with an MGM at 4.8.x and FSTs at 4.5.17 and it runs well, but the load is not comparable with the production instance, so the behavior could differ a lot.
  • Stop the whole instance, upgrade all nodes, and start them all at once. This would require a longer downtime (more than 60 FSTs to upgrade, bring back up, and check that none of them has issues with the upgrade).

Any other suggestion about the large version change (intermediate versions, configuration changes, etc.) is welcome.

Thank you,


Any feedback on this?


Hi Armin,
we don’t really have hands-on experience with such a large version jump. It is clear that a short downtime and a consistent upgrade is safer. I will check with Elvin tomorrow whether there is some incompatible change in the communication from 4.5.17 FSTs to new MGMs … will let you know!

Thanks for checking!

So once we have an updated version, we should run updates more frequently, right? How is this done at CERN for EOS instances with many users where downtimes should probably be avoided, like the one used for CERNBox? For smaller version jumps I assume the updates can be done without downtime.

For small changes you can probably use the script we are using for updates and adapt it.
Essentially it puts an FST into RO mode when nobody writes, updates the software, restarts it and puts it back into RW.
Still, for this big update, just announce a short downtime and update all the FSTs in one go.
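
For what it’s worth, a rolling update along those lines could be sketched like this (a sketch, not our actual script: filesystem ids, hostnames, the package name and the service name are placeholders for your setup, and the drain wait is deliberately crude):

```shell
# Sketch of a rolling FST update, one filesystem at a time.
# EOS_CMD/SSH_CMD can be overridden (e.g. with `echo`) for a dry run.
rolling_update() {
  # stdin: "<fsid> <host>" pairs, one per line
  while read -r fsid host; do
    # stop new writes on this filesystem
    ${EOS_CMD:-eos} fs config "$fsid" configstatus=ro
    # crude wait for in-flight writes to finish
    sleep "${DRAIN_WAIT:-30}"
    # update the software and restart the FST daemon on the node
    ${SSH_CMD:-ssh} "$host" 'yum -y update eos-server && systemctl restart eos@fst'
    # back to read-write once the node is up again
    ${EOS_CMD:-eos} fs config "$fsid" configstatus=rw
  done
}
```

A real script should of course check that nobody is writing and that the FST booted cleanly before re-enabling RW, as described above.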


We are finally planning this upgrade for the coming weeks; unfortunately we had to work on other priorities until now.

We will therefore have a shutdown period for our whole production instance, upgrade to the latest version, then restart the MGM, then the FSTs, and hopefully end up with a working instance on the latest 4.8.x version.

Do you have suggestions about the following points:

  • which version to pick? We see that the stable repository has 4.8.62, the testing repository has 4.8.91, and the version mentioned on the website is 4.8.66. But maybe you have advice based on your experience running your instances over the last months.
  • is there anything that might have changed in the configuration that we should review during this large upgrade step?
  • we will of course make a backup of the QuarkDB data; do you consider it necessary to also back up the LevelDB data on each FST? If yes, what would be the correct way to do it?
  • anything else to consider apart from the points above?

Hi Franck,
we run 4.9.91 in production on CERNBOX. That should be good for you!

We also run now all LHC instances on EOS5.

From which version are you updating?

There is no need to backup LevelDB on the FST.
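
For the QuarkDB backup itself, the approach from the QuarkDB documentation is to take a point-in-time checkpoint through the redis interface; a sketch (port and target path are examples, and as far as we recall the checkpoint should land on the same filesystem as the QuarkDB data so it can use hard links):

```shell
# Sketch: snapshot a QuarkDB node; archive the checkpoint elsewhere afterwards.
# REDIS_CMD can be overridden (e.g. with `echo`) for a dry run.
qdb_backup() {
  port=$1; dest=$2
  # ask QuarkDB for a consistent point-in-time checkpoint
  ${REDIS_CMD:-redis-cli} -p "$port" quarkdb-checkpoint "$dest"
}
```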

Hi Andreas,

Thank you for your answer. We are upgrading from 4.5.15/4.5.17, which is why we are asking for extra suggestions, if you can think of any.

I suppose you mean 4.8.91 for your CERNBox production (I do not see any 4.9.91 version).

We also plan to upgrade to EOS5 after that, but we first need to align on 4.x.x.

Just for the sake of completeness:

I noticed that the package eos-debuginfo is no longer available since version 4.8.23; is that normal? Is there a replacement package that we are not aware of, or a new mechanism for potential debug information?

Hi Franck,

Yes, the debug symbols are no longer stripped from the binaries/libraries, since stripping was crashing gdb when trying to debug. Therefore, there is no need for an eos-debuginfo package.


OK, thank you for your answer. So debugging will keep working as before, right?

Another question; I’m not sure if this is linked to the upgrade, but I noticed that the Converter (used by the GroupBalancer) seems to have an incompatibility with the eos vid publicaccesslevel limitation. Many files get rejected because the source cannot be read when the path is too deep in the hierarchy, with this error:

220929 12:08:00 ERROR ConversionJob:383              msg="[ERROR] Server responded with an error: [3010] Unable to access - public access level restriction /eos/contingency/new/franck/test/copy/dir-52/dir-26/file-09; Permission denied" tpc_src=root://eos-contingency-master.cidsn.jrc.it:1094//eos/contingency/new/franck/test/copy/dir-52/dir-26/file-09 tpc_dst=root://eos-contingency-master.cidsn.jrc.it:1094//eos/contingency/proc/conversion/00000000000ca9e1:default.0#00100112^groupbalancer^ conversion_id=00000000000ca9e1:default.0#00100112^groupbalancer^ 

This is strange since, as I understand it, these third-party copies for conversions are done with maximum privileges.

When increasing eos vid publicaccesslevel to a high value (I didn’t find how to disable the limitation entirely), the conversion of these deep files works.

Hi Franck,

Yes, you will still have the debug info.

For non-CERNBox-enabled instances the public access level defaults to 1024, which should cover even the most extreme cases.

The conversion jobs are run by the daemon user, so in order to have them run with maximum privileges the daemon user needs to be in the list of sudoers in your instance. Also, the public access level only applies to requests done as user nobody, i.e. uid=99, so it means the TPC job is not able to authenticate with sss to the MGM. Is sss in the list of sec.protbind protocols accepted for both localhost and *?
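
Concretely, the two settings above would look roughly like this (a sketch; the host patterns and the exact protocol lists depend on your setup):

```
# /etc/xrd.cf.mgm -- accept sss both locally and from everywhere
sec.protbind localhost sss unix
sec.protbind * only sss krb5 unix

# grant sudo to the daemon user (alternatively by uid, e.g. 2), then verify:
#   eos vid set membership daemon +sudo
#   eos vid ls
```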


Thank you for your answer.

The public access level was only reduced for the sake of testing; no problem keeping the default value of 1024.

The problem was indeed the sudoer membership. I think the daemon user was never a member since we started our instances. It seems this causes no issue for the conversions, other than not overriding the public access level limitation. Now that the user has been given sudoer privileges, the conversions work even with a low public access level.

The sss protocol is correctly accepted for both localhost and *.

After upgrading the test instance to the latest version, 4.8.91, we noticed a considerably longer startup time of the MGM compared to the previous version, and we were wondering whether this is normal or whether we might have an issue in the configuration. It is not a very long time and is quite reasonable for a restart, around 20 s, but previously it was more like 4-5 s on the test instance, which has few files and few FSTs (3).

By startup time I mean the time after which the EOS server responds to commands (for instance eos ns or eos ls). This is not due to the namespace booting, which eos ns reports as less than 2 s, but seems to be linked to the preparation of the different threads.

Do you think this is normal, or might something be blocking it temporarily? We would not want this time to become too long when we upgrade the much bigger production instance.

Yet another quick question:

Is the variable EOS_USE_NEW_INODES still necessary when working with high file id numbers (we hit this issue a few years ago), or is it now the default?

Hi Franck,

The only thing I can think of is that in the latest versions we moved the IoStat information from the file-based setup to QuarkDB, and the conversion happens automatically the first time you use the newer versions. I would take a few more samples, i.e. restart the MGM and see exactly what the distribution of the startup time is. Besides this, there were no particular modifications that would explain it.

Concerning the EOS_USE_NEW_INODES variable: yes, this still needs to be there. We will probably default to the new schema at some point in EOS 5, but it is still mandatory even in the latest EOS 5 release.
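
For reference, this is typically set in the MGM environment file (a sketch; the file location assumes the standard EOS sysconfig layout, adapt to yours):

```
# /etc/sysconfig/eos_env -- keep the new inode scheme enabled
EOS_USE_NEW_INODES=1
```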


OK, thank you for your answer. The IoStat conversion was probably done transparently at first launch.

How can I see the distribution of the startup time? By looking at the log, or is there some more detailed tool that I can use?

Thank you,

There are no special metrics collected for the startup time. You can compute this yourself based on the first timestamp in the logs and the moment when the master MGM is elected. So look for this log message:

220819 03:55:25 time=1660874125.953557 func=Supervisor               level=INFO  logid=35c44b7c-1e4f-11ed-b4d4-dc4a3e6b9f27 unit=mgm@esdss000.cern.ch:1094 tid=00007f62e6352700 source=QdbMaster:268                  tident=<service> sec=      uid=0 gid=0 name= geo="" old_is_master=true, is_master=true, old_master_id=esdss000.cern.ch:1094, master_id=esdss000.cern.ch:1094
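
That gap can be extracted with a small script; a sketch of a hypothetical helper, assuming the `YYMMDD HH:MM:SS` prefix format of the log lines above (it ignores a possible midnight wrap-around):

```shell
# Print the seconds between the first log line and the master-election line.
mgm_startup_seconds() {
  awk '
    function secs(t,  p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
    NR == 1                         { start = secs($2) }
    /QdbMaster/ && /is_master=true/ { print secs($2) - start; exit }
  ' "$1"
}
```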


Actually, this line you mention comes very quickly, about 1 s after the start:

221004 12:53:52 29813 Starting on Linux 3.10.0-1160.76.1.el7.x86_64
221004 12:53:53 time=1664880833.280162 func=Supervisor               level=INFO  logid=d58172d6-43d2-11ed-ab6a-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f61337fd700 source=QdbMaster:268                  tident=<service> sec=      uid=0 gid=0 name= geo="" old_is_master=false, is_master=true, old_master_id=, master_id=s-jrciprcids90v.cidsn.jrc.it:1094

However, the moment when the MGM starts answering queries is long after that. I identified this line as the answer to the eos ns command, so 16 seconds later:

221004 12:54:09 time=1664880849.128100 func=open                     level=INFO  logid=df8ccb90-43d2-11ed-b68d-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f6170407700 source=XrdMgmOfsFile:502              tident=root.30231:389@localhost.localdomain sec=sss   uid=0 gid=0 name=daemon geo="JRC" op=read path=/proc/admin/ info=eos.rgid=0&eos.ruid=0&mgm.cmd.proto=GgQKAigB+AEB

I noticed a few things:

  1. there are two periods when no messages are written for quite a long time (5 and then 4 seconds), suggesting the MGM is not doing anything:
221004 12:53:58 time=1664880838.431826 func=Supervisor               level=INFO  logid=d58172d6-43d2-11ed-ab6a-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f61337fd700 source=QdbMaster:268                  tident=<service> sec=      uid=0 gid=0 name= geo="" old_is_master=true, is_master=true, old_master_id=s-jrciprcids90v.cidsn.jrc.it:1094, master_id=s-jrciprcids90v.cidsn.jrc.it:1094
221004 12:54:03 time=1664880843.367706 func=UpdateTransferList       level=INFO  logid=static.............................. unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f60e4bfe700 source=GroupBalancer:101              tident= sec=(null) uid=99 gid=99 name=- geo="" msg="group_balancer update transfers" scheduledtransfers=0

221004 12:54:04 time=1664880844.404425 func=RepairErrs               level=INFO  logid=d5772902-43d2-11ed-ab6a-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f60dc7fc700 source=Fsck:481                       tident=<service> sec=      uid=0 gid=0 name= geo="" msg="loop in fsck repair thread"
221004 12:54:08 time=1664880848.450424 func=AcquireLease             level=INFO  logid=d58172d6-43d2-11ed-ab6a-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f61337fd700 source=QdbMaster:506                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="qclient acquire lease call took 6ms"

  2. a few seconds before the MGM is unblocked, there are these messages:
++++++ (c) 2020 CERN/IT-ST EOS MGM Authorization (XrdMgmAuthz) 4.8.91
------ XrdMgmAuthz plugin initialization successful
Config warning: asynchronous I/O has been disabled!
Config warning: sendfile I/O has been disabled!
Config warning: 'xrootd.prepare logdir' not specified; prepare tracking disabled.
------ xrootd protocol initialization completed.
------ xrootd mgm@s-jrciprcids90v.cidsn.jrc.it:1094 initialization completed.
221004 12:54:09 29819 XrootdXeq: ?:386@s-jrciprcids94v disc 0:00:00
221004 12:54:09 time=1664880849.096161 func=Configure                level=WARN  logid=static.............................. unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f6174290780 source=XrdMgmOfsConfigure:2351        tident= sec=(null) uid=99 gid=99 name=- geo="" write lock [ eosView ] held for 15000 milliseconds
221004 12:54:09 time=1664880849.096195 func=Configure                level=INFO  logid=static.............................. unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f6174290780 source=XrdMgmOfsConfigure:2355        tident= sec=(null) uid=99 gid=99 name=- geo="" msg="after mutex test
221004 12:54:09 time=1664880849.098461 func=main                     level=WARN  logid=static.............................. unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f61307f7700 source=MutexLatencyWatcher:61         tident= sec=(null) uid=99 gid=99 name=- geo="" acquisition of mutex eosViewRWMutex took 14833 milliseconds
221004 12:54:09 time=1664880849.102495 func=backgroundThread         level=INFO  logid=static.............................. unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f60cb3ff700 source=BulkRequestProcCleaner:71      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="BulkRequestProcCleaner did one round of cleaning, nbDeletedBulkRequests=0"
221004 12:54:09 29820 XrootdXeq: root.30231:389@localhost.localdomain pvt IPv4 login as daemon
221004 12:54:09 29819 XrootdXeq: root.30240:387@localhost.localdomain pvt IPv4 login as daemon
221004 12:54:09 time=1664880849.128017 func=IdMap                    level=INFO  logid=static.............................. unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f6170407700 source=Mapping:1073                   tident= sec=(null) uid=99 gid=99 name=- geo="" sec.prot=sss sec.name="daemon" sec.host="localhost.localdomain" sec.vorg="" sec.grps="daemon" sec.role="" sec.info="" sec.app="" sec.tident="root.30231:389@localhost.localdomain" vid.uid=0 vid.gid=0

Note the authentication plugin being loaded a bit late, and the message about the mutex being held for 15 seconds.

  3. there are these error messages; we might have errors in our configuration, but I can’t identify which ones:
221004 12:54:04 time=1664880844.097849 func=get                      level=ERROR logid=d699f21a-43d2-11ed-ab6a-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f60bbbe0700 source=ConfigStore:73                 tident=<service> sec=      uid=0 gid=0 name= geo="" msg="Failed to load key from Configstore" err="msg=Failed Numeric conversion" key= error_msg=Invalid argument
221004 12:54:04 time=1664880844.097893 func=get                      level=ERROR logid=d699f21a-43d2-11ed-ab6a-62cb3bd524b1 unit=mgm@s-jrciprcids90v.cidsn.jrc.it:1094 tid=00007f60bbbe0700 source=ConfigStore:73                 tident=<service> sec=      uid=0 gid=0 name= geo="" msg="Failed to load key from Configstore" err="msg=Failed Numeric conversion" key= error_msg=Invalid argument

Maybe some of these are good leads.

Anyway, I am attaching the full log of the startup phase so that you can compare it with what it should look like.

Hi Franck,

In the 4.8.* branch there is still a dummy test [1] to check that the lock-warning functionality works, and it indeed sleeps for 15 seconds during the configuration step of the MGM. I was under the impression you were already testing EOS 5; that part of the code is removed in EOS 5.


[1] https://gitlab.cern.ch/dss/eos/-/blob/master/mgm/XrdMgmOfsConfigure.cc#L2352

OK, thank you.

So this 15-second delay is normal and fixed for every instance, so no particular surprise is to be expected when we upgrade the production instance.

As written above, given the big version step we have, we will first upgrade to the latest 4.8.x version, then upgrade to v5 at a later stage.