Fusex client and its use of cache

Good morning,

We have started using the eosxd 5 client, and it seems quite a bit more stable than the previous one. However, we have had some issues with the cache folder it uses, which has been filling up too much; this caused the client to crash with a message like this:

230210 14:04:15 t=1676037855.239435 f=leveler          l=WARN  tid=00007f9d907fd700 s=dircleaner:302           [ dc ] diskspace on partition path /var/cache/eos/fusex/cache/xxxx less than 5% free : free-bytes=73728 total-bytes=4798700060672 filled=100.00 % - cleaning cache
230210 14:04:21 t=1676037861.043786 f=attach           l=CRIT  ino:8000000092f87301 s=data:731                 attach to cache failed - ino=0x8000000092f87301 errno=28
terminate called after throwing an instance of 'std::runtime_error'
  what():  attach to cache failed - ino=0x8000000092f87301 errno=28
Stack trace (most recent call last) in thread 61004:
#13   Object ", at 0xffffffffffffffff, in 
#12   Object "/lib64/libc.so.6, at 0x7f9dac865b0c, in clone
#11   Source "pthread_create.c", line 0, in start_thread [0x7f9dacb3cea4]
#10   Object "/lib64/libfuse.so.2, at 0x7f9daf4cf400, in fuse_session_loop
#9    Object "/lib64/libfuse.so.2, at 0x7f9daf4d2b6a, in fuse_reply_iov
#8    Object "/lib64/libfuse.so.2, at 0x7f9daf4d352b, in fuse_reply_open
#7    Source "/root/rpmbuild/BUILD/eos-5.1.8-1/fusex/eosfuse.cc", line 4534, in EosFuse::open(fuse_req*, unsigned long, fuse_file_info*) [0x472dce]
#6    Source "/root/rpmbuild/BUILD/eos-5.1.8-1/fusex/data/data.cc", line 732, in data::datax::attach(fuse_req*, std::string&, int) [0x42a4d3]
#5    Object "/lib64/libstdc++.so.6, at 0x7f9dad2c7c52, in __cxa_throw
#4    Object "/lib64/libstdc++.so.6, at 0x7f9dad2c7a32, in std::terminate()
#3    Object "/lib64/libstdc++.so.6, at 0x7f9dad2c7a05, in std::rethrow_exception(std::__exception_ptr::exception_ptr)
#2    Object "/lib64/libstdc++.so.6, at 0x7f9dad2c9a94, in __gnu_cxx::__verbose_terminate_handler()
#1    Object "/lib64/libc.so.6, at 0x7f9dac79ea77, in abort
#0    Object "/lib64/libc.so.6, at 0x7f9dac79d387, in gsignal
Aborted (Signal sent by tkill() 31115 0)

How is this cache supposed to work with the fusex client? Is there a way to keep this folder from growing above a certain limit? The cache.size-mb setting does not seem to be effective for this, since the default value of 52 does not prevent it from consuming many GB.
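For reference, this is roughly the shape of the cache section in /etc/eos/fuse.conf we are referring to (key names as we understand them from the eosxd documentation; the values here are only illustrative, not our production settings):

{
  "cache": {
    "type": "disk",
    "size-mb": 512,
    "size-ino": 65536,
    "clean-threshold": 85.0,
    "location": "/var/cache/eos/fusex/cache/",
    "journal": "/var/cache/eos/fusex/journal/"
  }
}

We expected size-mb to bound the on-disk size, but the cache directory keeps growing well beyond that value.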

Our setup is meant to share this partition between several eosxd clients, but overall the auto-clean feature does not seem to be enough: at some point the partition filled up with more than 100 GB, causing some clients to crash, and only afterwards was the space released. So maybe a single client has been causing this, since the volume is used only by the clients.

We found several old files with the extensions jc.recover or dc.recover in the /var/log/eos/fusex/cache/clientname/00X folders, which we understood can be cleaned up before restarting the client; we do not know whether this could be related.

There is also another fusex cache (for the atlas instance, for example) at:
/var/cache/eos/fusex/md-cache/atlas/<UUID>/LOG
It can grow to 10 GB in just a few days.
How can it be turned off, or its verbosity decreased?

Hi Franck,
these large files come from recoveries. I leave them there when a recovery fails, so that they can be grabbed manually, but in practice I agree that this is useless. I will add two improvements:

1. try to pre-book the space and allow setting a maximum size for recoveries
2. unlink them on failure (see the sketch below)
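
Roughly, the idea in plain POSIX calls would look like this sketch (only an illustration of the two points above, not the actual eosxd implementation; the function name, path handling and size limit are made up):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Sketch: reserve space for a recovery file up-front so that we fail early
// with ENOSPC instead of filling the cache partition, and remove the file
// again when the recovery does not succeed.
static bool run_recovery(const char* path, off_t max_recovery_size)
{
  int fd = open(path, O_CREAT | O_RDWR, 0600);
  if (fd < 0) {
    return false;
  }

  // Improvement 1: pre-book the space; posix_fallocate returns an errno value.
  if (posix_fallocate(fd, 0, max_recovery_size) != 0) {
    close(fd);
    unlink(path);   // nothing useful written yet, drop the file
    return false;
  }

  bool ok = false;  // ... the actual recovery work would go here ...

  close(fd);
  if (!ok) {
    unlink(path);   // improvement 2: do not keep failed recoveries around
  }
  return ok;
}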

Cheers Andreas.


That is something else: that is the meta-data cache.
Its size is defined by the number of inodes the client has visited.
There is no cleanup mechanism implemented for it other than remounting. We don’t see a problem here, because we use AUTOFS everywhere, but if you have a static mount it can be a problem if you do something like ‘find /eos/atlas/’.

I will also add a clean-up based on the size of the ROCKSDB cache (LOG).
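
For context, RocksDB itself has options that bound how large and how verbose its LOG files get; the sketch below only illustrates those upstream knobs (rocksdb::Options), it is not the eosxd code, and the database path is just an example:

#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main()
{
  rocksdb::Options options;
  options.create_if_missing = true;
  // Keep the info LOG small and quiet:
  options.info_log_level = rocksdb::InfoLogLevel::WARN_LEVEL; // log warnings and above only
  options.max_log_file_size = 16 * 1024 * 1024;               // roll the LOG file after 16 MB
  options.keep_log_file_num = 5;                               // keep at most 5 rolled LOG files

  rocksdb::DB* db = nullptr;
  // Example path only; the eosxd md-cache lives under /var/cache/eos/fusex/md-cache/
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/md-cache-example", &db);
  delete db;
  return s.ok() ? 0 : 1;
}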

Here are the two JIRA tickets dealing with your issues:

https://its.cern.ch/jira/browse/EOS-6103
https://its.cern.ch/jira/browse/EOS-6102

Thank you, Andreas, for your answer. As for the recovery files, maybe do not unlink them immediately, in case they can still be used, but rather clean them up once they are too old.

Unfortunately, as external users we no longer have access to the JIRA tickets, which is a pity because it would be useful to check the existing issues to understand whether ours is already dealt with somewhere.

I have addressed both issues in the eosxd code. They should be released with EOS 5.2.22.

Thank you, we will start using this version when it is released :+1: