OK, probably /etc/sysconfig/eos_env indeed for us. On both MGM and FST, right?
Technically yes, but for file IDs the limit is far higher: around 34B, I think.
With the new inode encoding scheme, the limit becomes 2^63 for both file IDs and container IDs, which is 9223372036854775808. For our purposes, this might as well be infinite.
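To put the two figures side by side, here is a small sketch of the arithmetic. The 34 billion value is only the approximate figure quoted above, not an exact limit, and the variable names are mine.

```python
# Limit quoted above for the new inode encoding scheme.
NEW_ID_LIMIT = 2 ** 63

# "Around 34B" is the approximate legacy file ID limit quoted above;
# the exact value is not given in this thread, so this is a rough figure.
LEGACY_FILE_ID_LIMIT_APPROX = 34 * 10 ** 9

# How many times larger the new limit is, using integer division.
headroom = NEW_ID_LIMIT // LEGACY_FILE_ID_LIMIT_APPROX

print(f"new limit: {NEW_ID_LIMIT}")
print(f"roughly {headroom:,} times the approximate legacy file ID limit")
```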
Cheers,
Georgios
What about the MQ? Do we also need to restart it with the new version?
It’s probably a good idea to do so, since the previous one is quite a few versions behind. Since you’re restarting all MGMs + FSTs anyway, restarting the MQ wouldn’t hurt.
No need to restart QDB.
We tested the upgrade and the inode scheme switch on a test instance (which hadn’t yet reached the limit). It works, but we didn’t observe this auto-crash. Is that normal?
# eos version -m
eos.instance.name=contingency eos.instance.version=4.5.15 eos.instance.release=1 xrootd.version=v4.10.1 eos.encodepath=curl eos.inodeencodingscheme=1 eos.lazyopen=true
EOS_CLIENT_VERSION=4.5.15 EOS_CLIENT_RELEASE=1
You are right, turns out the auto-crash is only implemented on eosd. I had a quick look at the eosxd code, it’ll likely continue working fine through the inode update as it flushes its local cache on MGM restart.
Would you mind testing that? I.e., switch the test instance back to old inodes, start eosxd, switch to new inodes again, and observe how eosxd reacts.
Cheers,
Georgios
By the way: you can run eos ns reserve-ids 300000000 300000000 to artificially bump the current container and file IDs to 300M. This way you can properly confirm that containers with high IDs work.
An update on our upgrade: it seems to have gone well. We upgraded the production instance directly, without running further tests on the test instance.
Yes, the eosd clients stopped with this message once the FSTs were also back:
191128 14:38:30 t=1574948310.300521 f=InodeToFid l=CRIT tid=00007f0b33bff700 s=InodeTranslator:57 Configured to use legacy encoding scheme, but encountered inode which is recognized as new: 9223372037199260508
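As a quick sanity check on that inode, the sketch below shows that it lies above 2^63, i.e. its most significant bit (of 64) is set. Reading the remaining low bits as the underlying file ID is my assumption about the new scheme and is not confirmed in this thread.

```python
# Inode value copied from the log message above.
INODE = 9223372037199260508

HIGH_BIT = 1 << 63  # 2^63

# True if the inode is at or above 2^63, i.e. the top bit is set.
has_high_bit = INODE >= HIGH_BIT

# Low 63 bits of the inode. ASSUMPTION: under the new scheme this may be
# the file ID; the thread does not state the exact encoding.
low_bits = INODE & (HIGH_BIT - 1)

print(has_high_bit, low_bits)
```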
From our production instance, I can tell you that the clients behave well. They threw some messages while the MGM and the FSTs were down, but most of them then recovered without any particular message. Some logged what seem to be messages linked to the restart itself:
191128 14:12:01 t=1574946721.940905 f=mdcommunicate l=NOTE tid=00007f93ed3f9700 s=md:2669 MGM asked us to drop all known caps
191128 14:12:01 t=1574946721.940943 f=mdcommunicate l=WARN tid=00007f93ed3f9700 s=md:2682 MGM asked us to set our heartbeat interval to 10 seconds, enable dentry-messaging, enable writesizeflush, accepts appname, accepts mdquery and server-version=4.5.15::1
191128 14:12:02 t=1574946722.073881 f=mdcommunicate l=WARN tid=00007f4f277fa700 s=md:2682 MGM asked us to set our heartbeat interval to 10 seconds, enable dentry-messaging, enable writesizeflush, accepts appname, accepts mdquery and server-version=4.5.15::1
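For anyone who wants to filter such client log lines by level or function, here is a minimal parsing sketch. The field layout (date, time, then t=, f=, l=, tid=, s=, then free text) is inferred only from the sample lines in this thread, not from any documented log format, and parse_line is a hypothetical helper name.

```python
import re

# Pattern inferred from the sample lines above; not an official format.
LOG_RE = re.compile(
    r"^(?P<date>\d{6}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"t=(?P<t>\d+\.\d+) f=(?P<func>\S+) l=(?P<level>\S+) "
    r"tid=(?P<tid>[0-9a-f]+) s=(?P<source>\S+) (?P<msg>.*)$"
)

def parse_line(line):
    """Return a dict of the fields in one log line, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None

sample = (
    "191128 14:12:01 t=1574946721.940905 f=mdcommunicate l=NOTE "
    "tid=00007f93ed3f9700 s=md:2669 MGM asked us to drop all known caps"
)
fields = parse_line(sample)
print(fields["level"], fields["source"], fields["msg"])
```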
The clients could then correctly access files and folders with any ID (including the newer ones with IDs above the previous limit) and could create files. We did observe some cases where unmounting eos left the command stuck, either taking a long time or requiring us to manually kill the eosxd daemon. This may not be linked to the change, but rather to some earlier state of the client.
In any case, we will systematically restart the clients, and upgrade those that are too old.
Some extra observations after the upgrade to v4.5.15:
- We had to change the line ofs.tpc pgm /usr/bin/xrdcp to ofs.tpc pgm /opt/eos/xrootd/bin/xrdcp in /etc/xrd.cf.fst, following the change from the xrootd-client package to the eos-xrootd package; otherwise the FST would refuse to start. Is that correct?
- The eos client 4.5.15 might have a problem: when running the eos file check command from the MGM, we get the following error: eos: symbol lookup error: eos: undefined symbol: _ZN5XrdCl3URLC1EPKc. But the same command works from an FST (client 4.5.15) or from another host (client 4.5.9). Or is it some package installation issue?
Thank you again for your precious help!
That’s great news, glad that it worked. Yes, restarting the eosxd clients is a good idea in any case, despite the fact they survived the inode switchover.
- I’ll let Elvin confirm, but yes, I think changing ofs.tpc is required when using the eos-xrootd package.
- What is the output of ldd -r /usr/bin/eos? This looks related to the xrootd RPMs installed on a particular machine. Is there any relationship between the versions and whether the command works or not?
Hi Franck,
Yes, it’s correct and expected to replace the ofs.tpc directive with the new location of xrdcp in opt.
Just to work around your issue with undefined symbol: _ZN5XrdCl3URLC1EPKc, you need to also install the xrootd-client package, since in the 4.5.* branch the RPATH was not properly set for the executables. If you do this, then you can also leave the ofs.tpc directive unchanged.
Cheers,
Elvin
You are right: on this machine, both the eos-xrootd and xrootd-* packages are installed. And according to the ldd -r /usr/bin/eos output, it uses the /usr/lib64/libXrd*.so libraries. But a QuarkDB member is also running there, and the quarkdb package depends on the xrootd-* packages, so we couldn’t remove them.
I wonder whether other vital eos commands would also fail when using eos from the MGM. Could we maybe just upgrade the xrootd-* packages? They are at version 4.8.x. Or is there a way to select the eos-xrootd libraries when running the eos command?
Is there a recommended version of xrootd-client to install so that it can be used by both QuarkDB and the eos client? On the test instance, xrootd-client 4.11.0 seems to correctly remove the undefined symbol error and to allow QuarkDB to run, but we would like a confirmation that this is reasonable.
Hi Franck,
xrootd-client is not really used by QuarkDB; QuarkDB should work with any 4.x xrootd version. Your setup looks good.
Cheers,
Georgios
Yes, one should match the xrootd-* package version with the version of eos-xrootd. QuarkDB is not that picky and should work fine with any 4.* xrootd release.
Yes, but upgrading it will upgrade all xrootd-* packages, and indeed this triggers a restart of the quarkdb service.
OK, so we will select version 4.10.1, which is the version of eos-xrootd that was installed when we upgraded to 4.5.15.
Just so you are aware, at FNAL, we have run into this issue with xrootd 4.10.1: https://github.com/xrootd/xrootd/issues/1038
It prevents 'gfal-rm -r' from removing an empty directory. It’s supposed to be fixed in 4.11.0.