Critical : Newly created folders can't be seen in fusex, overflow issue with id number?

franck-jrc · November 27, 2019, 4:05pm

Hi,

We have a quite serious issue, concerning newly created folders that can’t be seen from any fusex client. They are seen as zero size files with always the same date and uid/gid pair.

We might have hit a threshold in the number of folders created… Current container id is 269146625.

We could see that when running eos file info on these new folders, their id looks like fxid:10081bf3, i.e. above 0x10000000. Could this be linked? Is this a known bug? If yes, would a version upgrade solve it? Presumably an upgrade of the MGM (4.4.39), since even newer clients (4.5.9) have the issue.

esindril · November 27, 2019, 4:23pm

Hi Franck,

Unfortunately, you hit a problem that is independent of the type of fuse client, namely the maximum number of container ids that the fuse daemon can see at the moment. We encode it in 28 bits, which matches your container id perfectly.
28 bits can hold at most 268435456 container ids.
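
As a quick sanity check of these numbers, the following bash sketch compares the ids quoted in this thread against the 28-bit limit. It only tests whether an id fits in 28 bits; how EOS actually packs the id into an inode is not shown here, and the helper name `fits_in_28_bits` is made up for illustration.

```shell
#!/usr/bin/env bash
# Number of distinct ids that fit in 28 bits.
limit=$(( 1 << 28 ))

# Ids quoted in the first post of this thread.
current_cid=269146625
new_folder_id=$(( 0x10081bf3 ))   # fxid:10081bf3 converted to decimal

# Hypothetical helper: succeeds (exit 0) when the id is representable in 28 bits.
fits_in_28_bits() {
  (( $1 < (1 << 28) ))
}

echo "28-bit limit: $limit"
for id in "$current_cid" "$new_folder_id" $(( limit - 1 )); do
  if fits_in_28_bits "$id"; then
    echo "$id fits in 28 bits"
  else
    echo "$id does not fit in 28 bits"
  fi
done
```

Both ids from the first post are above the limit, which is consistent with the new folders being the ones that are not displayed correctly.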

Fortunately, Georgios has done some preparation work for this but it’s not in any release. So we’ll prepare a hot fix release for this and you can enable the new functionality with an env variable. Can you please let me know which version of eos you run in your cluster?

Thanks,
Elvin

franck-jrc · November 27, 2019, 4:33pm

Thank you for your quick answer.
We are running MGM 4.4.39. Fusex clients are in the range of 4.4.23 to 4.5.6.
Which component would need an upgrade to enable this hot fix ?

But we can confirm that old fuse/eosd clients can correctly access the folders, or is there some hidden drawback that we didn’t detect yet ?

esindril · November 27, 2019, 4:41pm

Hi Franck,

At the moment no fuse client can properly deal with this, either old or new.
We will do a hot fix for this on the 4.5.y branch so you will need to upgrade everything to this new release (4.5.15).
You need to update the MGM, the FSTs and potentially the fuse clients.
The old clients might work actually but we never tested this thoroughly.

Georgios will push a commit to enable the switch and I will tag 4.5.15 asap.

Cheers,
Elvin

franck-jrc · November 27, 2019, 4:56pm

Thank you again

“You need to update the MGM, the FSTs and potentially the fuse clients.”

Would the FSTs also need to be updated? At the same time as the MGM (i.e. a full shutdown of the instance), or could the FSTs be done at a later moment? Or before?

Potentially the fuse clients: under which conditions?

“The old clients might work actually but we never tested this thoroughly.”

In fact, it really seems that the old eosd clients behave correctly, folders and files can be created from there, and can be read back.

gbitzes · November 27, 2019, 4:58pm

Hi Franck,

The plan for the inode encoding switchover has been:

1. The administrator flips the flag in the MGM and all FSTs, all at the same time, and restarts all server processes.
2. All clients detect the inode change, and crash on purpose. Users can just restart them.
3. Upon restart, the clients will now begin using the new inode encoding scheme.

Parts 2 & 3 have already been implemented; the inode scheme detection logic has been there for many past versions. There’s a good possibility the clients will continue working as-is; they’ll “only” all crash and restart once the inode encoding scheme is switched.

I’m implementing part 1 for you to try out immediately, hang tight.

armin-jrc · November 27, 2019, 4:58pm

Hi Elvin

for a quick workaround to make EOS usable again, would it help if we emptied the recycle bin? That is, would this reduce the number of container IDs and give us more time to inform users before we apply the patches?

Best
Armin

esindril · November 27, 2019, 4:59pm

Hi Armin,

Not really, since the ids of new containers would anyway be beyond this threshold no matter how many containers you delete.
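
To illustrate why deleting containers does not help, here is a minimal bash sketch. It assumes, as the answer above implies, that container ids come from a counter that only moves forward and is never rewound on deletion; the variable names and the starting numbers are hypothetical.

```shell
#!/usr/bin/env bash
limit=$(( 1 << 28 ))

# Hypothetical state: the id counter is already past the 28-bit limit.
next_id=$(( limit + 5 ))
live_containers=1000

# Emptying the recycle bin removes containers but does not touch the counter.
live_containers=$(( live_containers - 900 ))

# The next container still takes its id from the counter.
new_id=$next_id
next_id=$(( next_id + 1 ))

if (( new_id >= limit )); then
  echo "new container id $new_id is still above the limit ($limit)"
fi
```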

Cheers,
Elvin

franck-jrc · November 27, 2019, 5:04pm

This is on the client side, provided the version is new enough, right? Can you find out which is the minimum version in which this is already available in the client?

gbitzes · November 27, 2019, 5:08pm

Any 4.4.x client and later should have it. This includes both eosd and eosxd.

franck-jrc · November 27, 2019, 5:16pm

OK, so can you guarantee that the old fuse clients will keep working even after this switch?

gbitzes · November 27, 2019, 5:19pm

Unfortunately not - the changeover was not supposed to happen until some time in the future (months), and I’ve wanted to test the implementation a lot more. I believe it’ll work, but can’t ensure it.

If we had known you were close to the inode limit, we’d have rushed it… Having this happen so suddenly is quite unfortunate.

franck-jrc · November 27, 2019, 5:24pm

We cannot run the maintenance before tomorrow, even if the fix is ready earlier.

So for us, the plan is approximately :

1. stop all MGM and FSTs
2. install the new version (moving from 4.4.x to 4.5.x also includes an xrootd upgrade)
3. set the parameter
4. restart the MGM and FSTs
5. observe the clients

We might decide to stop all the clients together with the servers, then start a few of them to test whether this works.

“I believe it’ll work, but can’t ensure it.”

OK, at least it is foreseen to work, this is what I meant.

“If we had known you were close to the inode limit, we’d have rushed it… Having this happen so suddenly is quite unfortunate.”

Sorry for this, a lot of folders are indeed created on our instance

gbitzes · November 27, 2019, 6:22pm

Sounds like a good plan. The environment variable to use is EOS_USE_NEW_INODES, set it to 1.

No worries, we probably weren’t very good at communicating that this limitation even existed…

gbitzes · November 27, 2019, 7:21pm

Just implemented, committed, and tested the changes with huge container and file IDs, things seem to work. Expect a release soon.

esindril · November 28, 2019, 7:34am

Hi Franck,

The new release 4.5.15 which contains the fix is available in the usual testing repository:
http://storage-ci.web.cern.ch/storage-ci/eos/citrine/tag/testing/el-7/

Let us know if you encounter any issues.

Thanks,
Elvin

franck-jrc · November 28, 2019, 8:02am

Thank you both for your help, we will plan the update and let you know.

So the limitation also concerns file IDs ? What is the limit for them ? The current file ID on our instance is 1413156299 (0x543b0dcb)

franck-jrc · November 28, 2019, 8:12am

Just to be sure, this needs to be added inside file /etc/sysconfig/eos as export EOS_USE_NEW_INODES=1, correct ?

Installing version 4.5.15 would also upgrade xrootd to 4.10, is that OK?

And nothing to be done on the side of the QuarkDB instance ?

esindril · November 28, 2019, 8:27am

Hi Franck,

Yes, the env variable needs to be put in /etc/sysconfig/eos_env or /etc/sysconfig/eos, depending on which one you use. I assume it’s the first one, since it’s used by systemd, and there you don’t need the export.
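
For illustration, the two variants would look roughly like this. The file paths and the variable name are the ones given in this thread; that /etc/sysconfig/eos is read by a shell (and therefore takes the `export` form) is an assumption based on the question above. Only one of the two files is expected to be in use on a given node.

```shell
# Variant 1: /etc/sysconfig/eos_env
# Read by systemd as an environment file, so a plain KEY=value line is enough.
EOS_USE_NEW_INODES=1

# Variant 2: /etc/sysconfig/eos
# Assumed to be sourced by a shell, hence the "export" form from the question.
export EOS_USE_NEW_INODES=1
```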

Yes, you also need to upgrade xrootd and eos-xrootd if you have it installed on your machine.

There is nothing to be done on the QuarkDB side.

There is no problem with the file ids; in any case, this change will considerably increase the max id values for both.

Cheers,
Elvin

gbitzes · November 28, 2019, 8:33am

Hi Franck,

To verify the MGM is using the new inodes, run eos version -m: You should see eos.inodeencodingscheme=1.
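
A small bash sketch of how this check could be scripted. The command `eos version -m` and the string `eos.inodeencodingscheme=1` are taken from the line above; the function name `check_inode_scheme` is hypothetical, and the rest of the command's output is not shown in this thread, so the demonstration below feeds in a sample string instead of running `eos`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "eos version -m" style text on stdin and
# succeeds only if the new inode encoding scheme is reported.
check_inode_scheme() {
  grep -q 'eos\.inodeencodingscheme=1'
}

# On an MGM node one would run:
#   eos version -m | check_inode_scheme && echo "new inodes active"

# Offline demonstration with a sample line:
if echo "eos.inodeencodingscheme=1" | check_inode_scheme; then
  echo "new inodes active"
fi
```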

Cheers,
Georgios
