Critical : Newly created folders can't be seen in fusex, overflow issue with id number?

OK, probably /etc/sysconfig/eos_env indeed for us. On both MGM and FST, right ?

Technically yes, but for file IDs the limitation is far greater: Around 34B I think.

With the new inode encoding scheme, the limitations become 2^63 for both file IDs and container IDs, which is 9223372036854775808. For our purposes, this might as well be infinite.
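For anyone who wants to check the numbers, here is a minimal sketch of the arithmetic. It assumes that the legacy scheme shifts file IDs left by 28 bits inside a 63-bit inode space. The shift width is not stated in this thread; it is simply the value that reproduces the "around 34B" figure.

```python
# Sketch of the ID limits discussed above.
# ASSUMPTION: the legacy scheme shifts file IDs left by 28 bits inside a
# 63-bit inode space. The shift width is not stated in this thread.
LEGACY_SHIFT = 28
INODE_BITS = 63

# Limit quoted for both file and container IDs with the new scheme.
new_scheme_limit = 2 ** INODE_BITS

# Number of file IDs that fit once 28 bits are consumed by the shift.
legacy_file_id_limit = 2 ** (INODE_BITS - LEGACY_SHIFT)

print(f"new scheme limit:     {new_scheme_limit}")
print(f"legacy file ID limit: {legacy_file_id_limit} "
      f"(~{legacy_file_id_limit / 1e9:.1f}B)")
```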

Cheers,
Georgios


What about the MQ ? We also need to restart it with the new version ?

It’s probably a good idea to do so, since the previous one is quite a few versions behind. Since you’re restarting all MGMs + FSTs anyway, restarting the MQ wouldn’t hurt.

No need to restart QDB.


We tested the upgrade and the inode scheme switch on a test instance (which hasn’t reached the limit yet). It works, but we didn’t observe this auto-crash. Is that normal ?

# eos version -m
eos.instance.name=contingency eos.instance.version=4.5.15 eos.instance.release=1 xrootd.version=v4.10.1 eos.encodepath=curl eos.inodeencodingscheme=1 eos.lazyopen=true 
EOS_CLIENT_VERSION=4.5.15 EOS_CLIENT_RELEASE=1
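If you want to check the scheme from a script rather than by eye, the first line of this output is a list of space-separated key=value tokens, so it is easy to parse. A minimal sketch follows; the helper name `parse_version_line` is made up for illustration, and reading `eos.inodeencodingscheme=1` as "new scheme" is an assumption based on the context of this thread.

```python
def parse_version_line(line):
    """Split space-separated key=value tokens into a dict of strings."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


# First line of the `eos version -m` output quoted above.
line = (
    "eos.instance.name=contingency eos.instance.version=4.5.15 "
    "eos.instance.release=1 xrootd.version=v4.10.1 eos.encodepath=curl "
    "eos.inodeencodingscheme=1 eos.lazyopen=true"
)

info = parse_version_line(line)
print("version:", info["eos.instance.version"])
print("inode encoding scheme:", info["eos.inodeencodingscheme"])
```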

You are right, turns out the auto-crash is only implemented on eosd. I had a quick look at the eosxd code, it’ll likely continue working fine through the inode update as it flushes its local cache on MGM restart.

Would you mind testing that? I.e. switch the test instance back to old inodes, start eosxd, switch to the new scheme again, and observe how eosxd reacts.

Cheers,
Georgios

By the way: You can run eos ns reserve-ids 300000000 300000000 to artificially bump the current container and file IDs to 300M. This way you can properly confirm that containers with high IDs work.
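If you script your test instance setup, the command above can be wrapped like this. This is only a sketch: the helper name `reserve_ids_command` is hypothetical, and it passes the same value for both arguments, exactly as in the command quoted above.

```python
import shlex


def reserve_ids_command(target_id):
    """Build the argv for `eos ns reserve-ids`, using one value for both IDs."""
    if target_id <= 0:
        raise ValueError("target_id must be positive")
    return ["eos", "ns", "reserve-ids", str(target_id), str(target_id)]


cmd = reserve_ids_command(300_000_000)
print(shlex.join(cmd))

# To actually run it against a test instance (not executed here):
#   import subprocess
#   subprocess.run(cmd, check=True)
```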

An update on our upgrade: it seems to have gone well. We upgraded the production instance directly, without running further tests on the test instance.

Yes, the eosd clients stopped with this message, after the FSTs were also back :

191128 14:38:30 t=1574948310.300521 f=InodeToFid       l=CRIT  tid=00007f0b33bff700 s=InodeTranslator:57       Configured to use legacy encoding scheme, but encountered inode which is recognized as new: 9223372037199260508
Configured to use legacy encoding scheme, but encountered inode which is recognized as new: 9223372037199260508
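The inode in this message sits just above the 2^63 boundary mentioned earlier in the thread. A quick sketch of the check, assuming (this is not confirmed here) that the new scheme marks such inodes by setting the top bit of a 64-bit value:

```python
# ASSUMPTION: an inode at or above 2**63 (top bit set) is what the client
# "recognizes as new". This is inferred from the thread, not confirmed.
NEW_SCHEME_BIT = 1 << 63

inode = 9223372037199260508  # value from the CRIT message above

is_new = inode >= NEW_SCHEME_BIT
low_bits = inode & (NEW_SCHEME_BIT - 1)  # what remains once the top bit is cleared

print("recognized as new:", is_new)
print("lower 63 bits:", low_bits)
```

Under that assumption, the lower 63 bits would be the underlying ID.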

From our production instance, I can tell you that the eosxd clients behave well. They threw some messages while the MGM and the FSTs were down, but most of them then recovered without any particular message. Some did log messages that seem to be linked only to the restart itself :

191128 14:12:01 t=1574946721.940905 f=mdcommunicate    l=NOTE  tid=00007f93ed3f9700 s=md:2669                  MGM asked us to drop all known caps
191128 14:12:01 t=1574946721.940943 f=mdcommunicate    l=WARN  tid=00007f93ed3f9700 s=md:2682                  MGM asked us to set our heartbeat interval to 10 seconds, enable dentry-messaging, enable writesizeflush, accepts appname, accepts mdquery and server-version=4.5.15::1
191128 14:12:02 t=1574946722.073881 f=mdcommunicate    l=WARN  tid=00007f4f277fa700 s=md:2682                  MGM asked us to set our heartbeat interval to 10 seconds, enable dentry-messaging, enable writesizeflush, accepts appname, accepts mdquery and server-version=4.5.15::1
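To sift through many client logs for such lines, the format can be split with a regular expression. This is a sketch only: the field names (`func`, `level`, `source`) are my reading of the `f=`, `l=` and `s=` prefixes, not something documented in this thread.

```python
import re

# Field names are guesses based on the f= / l= / s= prefixes in the log lines.
LOG_RE = re.compile(
    r"^(?P<date>\d{6})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"t=(?P<t>\S+)\s+f=(?P<func>\S+)\s+l=(?P<level>\S+)\s+"
    r"tid=(?P<tid>\S+)\s+s=(?P<source>\S+)\s+(?P<msg>.*)$"
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


# First log line quoted above.
sample = (
    "191128 14:12:01 t=1574946721.940905 f=mdcommunicate    l=NOTE  "
    "tid=00007f93ed3f9700 s=md:2669                  "
    "MGM asked us to drop all known caps"
)

record = parse_log_line(sample)
print(record["level"], record["source"], record["msg"])
```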

The clients could then correctly access files and folders with any ID (including the newer ones with IDs above the previous limit) and could create files. We did observe some cases where unmounting eos left the command stuck, either taking a long time or requiring us to kill the eosxd daemon manually. This might not be linked to the change, but rather to some earlier state of the client.
In any case, we will systematically restart the clients, and upgrade those that are too old.

Some extra observations after the upgrade to v4.5.15 :

  • We had to change line ofs.tpc pgm /usr/bin/xrdcp to ofs.tpc pgm /opt/eos/xrootd/bin/xrdcp in /etc/xrd.cf.fst following change of xrootd-client package to eos-xrootd package, otherwise the FST would refuse to start. Is that correct ?
  • eos client 4.5.15 might have a problem : when running eos file check command from MGM, we get the following error eos: symbol lookup error: eos: undefined symbol: _ZN5XrdCl3URLC1EPKc . But the same command works from FST (client 4.5.15) or other host (client 4.5.9). Or is it some package installation issue ?
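For the first point, if the same change has to be applied on many FSTs, the edit can be scripted. A minimal sketch, assuming the directive appears in the simple one-line form quoted above (other forms of the directive are not handled); the helper name `set_tpc_pgm` is made up for illustration.

```python
import re

# Matches the simple form "ofs.tpc pgm <path>" at the start of a line.
TPC_PGM_RE = re.compile(r"^([ \t]*ofs\.tpc[ \t]+pgm[ \t]+)\S+", re.MULTILINE)


def set_tpc_pgm(config_text, new_path):
    """Replace the program path of an `ofs.tpc pgm` line, keeping the prefix."""
    return TPC_PGM_RE.sub(lambda m: m.group(1) + new_path, config_text)


before = "ofs.tpc pgm /usr/bin/xrdcp\n"
after = set_tpc_pgm(before, "/opt/eos/xrootd/bin/xrdcp")
print(after, end="")

# To apply it to the real file (not executed here), read /etc/xrd.cf.fst,
# pass its contents through set_tpc_pgm, and write the result back.
```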

Thank you again for your precious help !

That’s great news, glad that it worked. :slight_smile: Yes, restarting the eosxd clients is a good idea in any case, despite the fact they survived the inode switchover.

  • I’ll let Elvin confirm, but yes, I think changing ofs.tpc is required when using eos-xrootd package.
  • What is the output of ldd -r /usr/bin/eos? Looks related to xrootd RPMs installed on a particular machine. Is there any relationship between the versions, and whether the command works or not?

Hi Franck,

Yes, it’s correct and expected to replace the ofs.tpc directive with the new location of xrdcp in opt.

Just to work around your issue with undefined symbol: _ZN5XrdCl3URLC1EPKc you need to also install the xrootd-client package since in the 4.5.* branch the RPATH was not properly set for the executables. If you do this then you can also leave the ofs.tpc directive unchanged.

Cheers,
Elvin

You are right: on this machine, both the eos-xrootd and xrootd-* packages are installed, and according to the ldd -r /usr/bin/eos output, eos uses the /usr/lib64/libXrd*.so libraries. But a QuarkDB member is also running on this machine, and the quarkdb package depends on the xrootd-* packages, so we couldn’t remove them.

I wonder whether other vital eos commands would also fail when run from the MGM. Could we just upgrade the xrootd-* packages ? They are currently at version 4.8.x. Or is there a way to select the eos-xrootd libraries when running the eos command ?

Is there a recommended xrootd-client version to install for use by both QuarkDB and the eos client ? On the test instance, xrootd-client 4.11.0 seems to get rid of the undefined symbol error while still allowing QuarkDB to run, but we would like confirmation that this is reasonable.

Hi Franck,

xrootd-client is not really used by QuarkDB, QuarkDB should work with any 4.x xrootd version. Your setup looks good.

Cheers,
Georgios

Yes, one should match the xrootd-* package version with the version of eos-xrootd. QuarkDB is not that picky and should work fine with any 4.* xrootd release.

Yes, but upgrading it will upgrade all xrootd-* packages, and indeed this triggers the restart of quarkdb service.

OK, so we will select version 4.10.1, this is the version of eos-xrootd that was installed when we upgraded to 4.5.15.

Just so you are aware, at FNAL, we have run into this issue with xrootd 4.10.1: https://github.com/xrootd/xrootd/issues/1038

It prevents ‘gfal-rm -r’ from removing an empty directory. It’s supposed to be fixed in 4.11.0.
