Critical : Newly created folders can't be seen in fusex, overflow issue with id number?

franck-jrc · November 28, 2019, 8:35am

OK, probably /etc/sysconfig/eos_env indeed for us. On both MGM and FST, right ?

gbitzes · November 28, 2019, 8:35am

Technically yes, but for file IDs the limitation is far greater: Around 34B I think.

With the new inode encoding scheme, the limitations become 2^63 for both file IDs and container IDs, which is 9223372036854775808. For our purposes, this might as well be infinite.

Cheers,
Georgios
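
The figures quoted above can be checked with plain integer arithmetic. This is only an illustrative sketch: reading "around 34B" as 2^35 is an assumption, not something stated in the thread.

```python
# Illustrative arithmetic only.
NEW_SCHEME_LIMIT = 2**63       # limit quoted for both file IDs and container IDs
LEGACY_FILE_ID_GUESS = 2**35   # ASSUMPTION: possible origin of the "around 34B" figure

print(NEW_SCHEME_LIMIT)                       # the exact value quoted in the post
print(f"{LEGACY_FILE_ID_GUESS / 1e9:.1f}B")   # order of magnitude of the legacy file ID limit
```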

franck-jrc · November 28, 2019, 8:42am

What about the MQ? Do we also need to restart it with the new version?

gbitzes · November 28, 2019, 8:45am

It’s probably a good idea to do so, since the previous one is quite a few versions behind. Since you’re restarting all MGMs + FSTs anyway, restarting the MQ wouldn’t hurt.

No need to restart QDB.

franck-jrc · November 28, 2019, 11:38am

We tested the upgrade and the inode scheme switch on a test instance (which hasn’t reached the limit yet). It works, but we didn’t observe the auto-crash. Is that normal?

# eos version -m
eos.instance.name=contingency eos.instance.version=4.5.15 eos.instance.release=1 xrootd.version=v4.10.1 eos.encodepath=curl eos.inodeencodingscheme=1 eos.lazyopen=true 
EOS_CLIENT_VERSION=4.5.15 EOS_CLIENT_RELEASE=1
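
To check the reported encoding scheme from a script rather than by eye, the key=value tokens in this output can be split into a dictionary. A minimal sketch, assuming the output keeps the whitespace-separated key=value layout shown above; the helper name parse_eos_version is hypothetical.

```python
def parse_eos_version(text: str) -> dict:
    """Collect key=value tokens from `eos version -m` style output."""
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore any token that is not a key=value pair
            fields[key] = value
    return fields


# Sample copied from the output posted above.
sample = (
    "eos.instance.name=contingency eos.instance.version=4.5.15 "
    "eos.instance.release=1 xrootd.version=v4.10.1 eos.encodepath=curl "
    "eos.inodeencodingscheme=1 eos.lazyopen=true\n"
    "EOS_CLIENT_VERSION=4.5.15 EOS_CLIENT_RELEASE=1"
)

info = parse_eos_version(sample)
print(info["eos.instance.version"], info["eos.inodeencodingscheme"])
```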

gbitzes · November 28, 2019, 12:04pm

You are right: it turns out the auto-crash is only implemented in eosd. I had a quick look at the eosxd code, and it will likely continue working fine through the inode update, since it flushes its local cache on MGM restart.

Would you mind testing that? That is: switch the test instance back to old inodes, start eosxd, switch to new again, and observe how eosxd reacts.

Cheers,
Georgios

gbitzes · November 28, 2019, 12:27pm

By the way: You can run eos ns reserve-ids 300000000 300000000 to artificially bump the current container and file IDs to 300M. This way you can properly confirm that containers with high IDs work.
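
As a rough sanity check on why 300M is a useful target: the sketch below assumes the legacy scheme capped container IDs at 2^28 (about 268M). That cap is an assumption here; it is not stated in this part of the thread.

```python
# ASSUMPTION: legacy container ID cap of 2**28; not confirmed in this thread.
ASSUMED_LEGACY_CONTAINER_LIMIT = 2**28
RESERVED_ID = 300_000_000  # value passed to `eos ns reserve-ids` above

# If the assumption holds, IDs reserved at 300M are beyond what the legacy scheme could address.
print(RESERVED_ID > ASSUMED_LEGACY_CONTAINER_LIMIT)
print(RESERVED_ID - ASSUMED_LEGACY_CONTAINER_LIMIT)
```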

franck-jrc · November 28, 2019, 3:21pm

An update on our upgrade: it seems to have gone well. We upgraded the production instance directly, without running further tests on the test instance.

Yes, the eosd clients stopped with this message, after the FSTs were also back :

191128 14:38:30 t=1574948310.300521 f=InodeToFid       l=CRIT  tid=00007f0b33bff700 s=InodeTranslator:57       Configured to use legacy encoding scheme, but encountered inode which is recognized as new: 9223372037199260508
Configured to use legacy encoding scheme, but encountered inode which is recognized as new: 9223372037199260508
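
The inode in this message is slightly larger than 2^63, which fits with it being "recognized as new". A minimal sketch of that check, assuming (this is not confirmed in the thread) that the new scheme marks file inodes by setting the most significant bit and keeps the ID in the low bits:

```python
# ASSUMPTION: new-scheme file inodes carry the ID in the low 63 bits with bit 63 set.
FILE_BIT = 1 << 63


def looks_new_scheme(inode: int) -> bool:
    """True if the top bit of the 64-bit inode is set."""
    return bool(inode & FILE_BIT)


def low_bits(inode: int) -> int:
    """Strip the top bit; under the assumption above this would be the ID."""
    return inode & (FILE_BIT - 1)


inode = 9223372037199260508  # value taken from the CRIT log line above
print(looks_new_scheme(inode), low_bits(inode))
```

Under that assumption the low bits come out at roughly 344M.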

From our production instance, I can tell you that the eosxd clients behave well. They threw some messages because the MGM and the FSTs were down, but most of them then recovered without any particular message. Some did write messages that seem to be linked only to the restart itself:

191128 14:12:01 t=1574946721.940905 f=mdcommunicate    l=NOTE  tid=00007f93ed3f9700 s=md:2669                  MGM asked us to drop all known caps
191128 14:12:01 t=1574946721.940943 f=mdcommunicate    l=WARN  tid=00007f93ed3f9700 s=md:2682                  MGM asked us to set our heartbeat interval to 10 seconds, enable dentry-messaging, enable writesizeflush, accepts appname, accepts mdquery and server-version=4.5.15::1
191128 14:12:02 t=1574946722.073881 f=mdcommunicate    l=WARN  tid=00007f4f277fa700 s=md:2682                  MGM asked us to set our heartbeat interval to 10 seconds, enable dentry-messaging, enable writesizeflush, accepts appname, accepts mdquery and server-version=4.5.15::1
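
For sifting through many client logs after a restart, lines like these can be split into fields. A minimal sketch, assuming every line follows the layout visible above (date, time, then t=, f=, l=, tid=, s= and the message); the regular expression is inferred from these samples only.

```python
import re

# Pattern inferred from the sample lines above; other log lines may differ.
LOG_LINE = re.compile(
    r"^(?P<date>\d{6}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"t=(?P<epoch>[\d.]+) f=(?P<func>\S+)\s+l=(?P<level>\S+)\s+"
    r"tid=(?P<tid>\S+) s=(?P<source>\S+)\s+(?P<msg>.*)$"
)


def parse_log_line(line: str):
    """Return a dict of fields, or None if the line does not match the layout."""
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None


line = (
    "191128 14:12:01 t=1574946721.940905 f=mdcommunicate    l=NOTE  "
    "tid=00007f93ed3f9700 s=md:2669                  "
    "MGM asked us to drop all known caps"
)
rec = parse_log_line(line)
print(rec["level"], rec["source"], rec["msg"])
```

Filtering on the level field (for example keeping only CRIT) is then a one-line comprehension over the parsed records.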

The clients could then correctly access files/folders with any ID (including the newer ones with an ID above the previous limit) and could create files. We did observe some cases where unmounting eos left the command stuck, either taking a long time or requiring us to kill the eosxd daemon manually. This may not be linked to the change, but rather to some earlier state of the client.
In any case, we will systematically restart the clients, and upgrade them when they are too old.

Some extra observations after the upgrade to v4.5.15 :

We had to change the line ofs.tpc pgm /usr/bin/xrdcp to ofs.tpc pgm /opt/eos/xrootd/bin/xrdcp in /etc/xrd.cf.fst, following the switch from the xrootd-client package to the eos-xrootd package; otherwise the FST would refuse to start. Is that correct?
The eos client 4.5.15 might have a problem: when running the eos file check command from the MGM, we get the following error: eos: symbol lookup error: eos: undefined symbol: _ZN5XrdCl3URLC1EPKc . The same command works from an FST (client 4.5.15) or from another host (client 4.5.9). Or is it a package installation issue?

Thank you again for your precious help !

gbitzes · November 28, 2019, 3:47pm

That’s great news, glad that it worked. Yes, restarting the eosxd clients is a good idea in any case, despite the fact they survived the inode switchover.

I’ll let Elvin confirm, but yes, I think changing ofs.tpc is required when using eos-xrootd package.
What is the output of ldd -r /usr/bin/eos? Looks related to xrootd RPMs installed on a particular machine. Is there any relationship between the versions, and whether the command works or not?

esindril · November 28, 2019, 3:50pm

Hi Franck,

Yes, it’s correct and expected to replace the ofs.tpc directive with the new location of xrdcp in opt.

Just to work around your issue with undefined symbol: _ZN5XrdCl3URLC1EPKc you need to also install the xrootd-client package since in the 4.5.* branch the RPATH was not properly set for the executables. If you do this then you can also leave the ofs.tpc directive unchanged.

Cheers,
Elvin

franck-jrc · November 28, 2019, 4:01pm

You are right: on this machine, both the eos-xrootd and the xrootd-* packages are installed, and according to the ldd -r /usr/bin/eos output, eos uses the /usr/lib64/libXrd*.so libraries. But a QuarkDB member is also running on it, and the quarkdb package depends on the xrootd-* packages, so we couldn’t remove them.

I wonder whether other vital eos commands would also fail when run from the MGM. Could we just upgrade the xrootd-* packages? They are currently at version 4.8.x. Or is there a way to select the eos-xrootd libraries when running the eos command?
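
To compare several machines quickly, the libXrd* lines can be pulled out of the ldd output and their paths inspected. A minimal sketch, assuming the usual "name => path (address)" ldd line layout; the sample text below is made up for illustration and is not output from the machine discussed here.

```python
import re

# Matches "libfoo.so.N => /some/path (0x...)" and "libfoo.so.N => not found".
LDD_LINE = re.compile(r"^\s*(?P<name>\S+) => (?P<path>/\S+|not found)")


def xrootd_libs(ldd_output: str) -> dict:
    """Map each libXrd* library name to the path it resolves to."""
    libs = {}
    for line in ldd_output.splitlines():
        m = LDD_LINE.match(line)
        if m and m.group("name").startswith("libXrd"):
            libs[m.group("name")] = m.group("path")
    return libs


# Hypothetical sample, for illustration only.
sample = (
    "\tlibXrdCl.so.2 => /usr/lib64/libXrdCl.so.2 (0x00007f0000000000)\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000100000)\n"
)
print(xrootd_libs(sample))
```

On a real host the input would come from something like subprocess.run(["ldd", "-r", "/usr/bin/eos"], capture_output=True, text=True).stdout. A path under /usr/lib64 rather than /opt/eos/xrootd would suggest the system xrootd libraries are being picked up.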

franck-jrc · November 29, 2019, 9:37am

Is there a recommended version of xrootd-client to install, to be used by both QuarkDB and the eos client? On the test instance, xrootd-client 4.11.0 seems to get rid of the undefined symbol error and still allows QuarkDB to run, but we would like confirmation that this is reasonable.

gbitzes · November 29, 2019, 9:49am

Hi Franck,

xrootd-client is not really used by QuarkDB, QuarkDB should work with any 4.x xrootd version. Your setup looks good.

Cheers,
Georgios

esindril · November 29, 2019, 9:50am

Yes, one should match the xrootd-* package version with the version of eos-xrootd. QuarkDB is not that picky and should work fine with any 4.* xrootd release.

franck-jrc · November 29, 2019, 10:14am

Yes, but upgrading it will upgrade all xrootd-* packages, and this does indeed trigger a restart of the quarkdb service.

OK, so we will select version 4.10.1, this is the version of eos-xrootd that was installed when we upgraded to 4.5.15.

dszkola · December 2, 2019, 4:32pm

Just so you are aware, at FNAL, we have run into this issue with xrootd 4.10.1: https://github.com/xrootd/xrootd/issues/1038

It prevents 'gfal-rm -r' from removing an empty directory. It’s supposed to be fixed in 4.11.0.
