Quickest way to find the largest files in an EOS namespace

Does anyone have an elegant quick way to find the largest files on an EOS namespace?

Is there a way other than crawling the entire namespace and listing each directory?

Hi David,

You could use the eos-ns-inspect tool to scan (offline) one of your QDB backups, dump all files, and keep track of the largest ones. That way, at least, you will not thrash your active QDB instance.
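A minimal sketch of the "keep track of the largest" step, assuming the offline dump has first been turned into one line per file made of whitespace-separated `key=value` fields that include `size=` and `path=`. That line format and the field names are assumptions for illustration, not the documented eos-ns-inspect output, so the parser would need adjusting to whatever your dump actually prints (paths containing spaces, for instance, would need a different split).

```python
import heapq


def parse_line(line):
    """Parse one 'key=value key=value ...' line into a dict (hypothetical format)."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def largest_files(lines, n=10):
    """Return the n largest (size, path) pairs, largest first, in one streaming pass."""
    heap = []  # min-heap holding at most n entries, smallest kept at heap[0]
    for line in lines:
        fields = parse_line(line)
        try:
            size = int(fields["size"])
        except (KeyError, ValueError):
            continue  # skip lines without a usable size field
        entry = (size, fields.get("path", "?"))
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)  # drop the current smallest
    return sorted(heap, reverse=True)
```

Because only `n` entries are held in memory at any time, the dump can be streamed through it, for example by passing `sys.stdin` as `lines`, without loading the whole namespace listing.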

Cheers,
Elvin

Along the same lines of not disturbing the active MGM/QDB instance, is there a QuarkDB alternative for getting the result of the eos fs dumpmd --path command? When this is run on the MGM for an FS containing millions of files, it makes the MGM unavailable for the duration of the process.

Hi Franck,

Using --path means the entire hierarchy has to be resolved, so this is not possible with the eos-ns-inspect tool without reimplementing the same logic as in the namespace. Running eos fs dumpmd without resolving the paths should at least be faster.

Cheers,
Elvin

Thank you. Yes, sure, --path is the killer, and it is probably what causes the issue. We almost never need to run it, except when a drain has finished and some files are left over that we want to inspect. In that case --path is not a problem: the files are very few, hundreds at most, and it is very handy. It just happened once that the wrong fsid was used in the command by mistake, and it started dumping a full disk instead of the almost empty one…

Isn’t there an equivalent to eos fs dumpmd without --path that we could use in such a situation, to avoid similar mistakes? The matching with paths can be done afterwards, on just a list of ids.
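The "match afterwards" idea could look roughly like the sketch below. It assumes you already hold a list of fids for the file system and a separately built fid-to-path mapping (for example from an offline scan of a backup); both inputs, the function name and the `max_expected` guard are hypothetical and only illustrate the approach. The guard reflects the mistake described above: if far more ids come back than a drained file system should hold, it stops before doing any further work.

```python
def match_paths(fids, fid_to_path, max_expected=None):
    """Join a list of fids against an offline fid -> path mapping.

    Returns (resolved, missing): a dict of fid -> path for known ids,
    and a list of fids that the mapping does not contain.
    """
    fids = list(fids)
    # Hypothetical safety check: a nearly drained file system should only
    # hold a handful of files, so a huge list suggests the wrong fsid.
    if max_expected is not None and len(fids) > max_expected:
        raise ValueError(
            f"{len(fids)} fids returned, expected at most {max_expected}"
        )
    resolved = {}
    missing = []
    for fid in fids:
        path = fid_to_path.get(fid)
        if path is None:
            missing.append(fid)
        else:
            resolved[fid] = path
    return resolved, missing
```

Ids that land in `missing` are those created after the backup was taken or already removed, so they would still need a lookup against the live instance.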

Hi Franck,

For such a case, eos fs dumpmd <fsid> --fid is actually the quickest way, since this info is already cached at the MGM anyway. It will give you a list of all fids on that file system.

Cheers,
Elvin

Thank you Elvin for your previous answer; it did indeed seem quite a reasonable approach.

However, I observe that when the information is not in the MGM cache, retrieving only the fids is also quite an expensive operation, since it requires querying the QuarkDB cluster. The speed on our test instance (virtual machine based) is about 2 kHz, so querying a 100k-file FS takes a bit less than a minute. During that period, MGM responsiveness is greatly reduced (almost unusable). Retrieving path information doesn’t seem much more expensive (at most 2 times slower). Is that correct?

If so, this is why I was wondering whether the list of fids on a filesystem could be retrieved directly from the QuarkDB cluster, even with a direct redis-cli query instead of eos-ns-inspect.
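As a rough check of the figures quoted above, the small calculation below assumes that "2 kHz" means about 2,000 entries retrieved per second and that the time scales linearly with the number of files. Both are assumptions, and the function name is purely illustrative.

```python
def estimated_seconds(n_files, rate_per_s, slowdown=1.0):
    """Estimate how long a dump takes at a fixed retrieval rate.

    slowdown models the extra cost of resolving paths (e.g. 2.0 for
    "at most 2 times slower"); 1.0 means fids only.
    """
    if rate_per_s <= 0:
        raise ValueError("rate_per_s must be positive")
    return n_files / rate_per_s * slowdown


# 100k files at ~2,000 entries/s, fids only
fid_only = estimated_seconds(100_000, 2_000)
# same file system with paths, taking the observed upper bound of 2x
with_paths = estimated_seconds(100_000, 2_000, slowdown=2.0)
```

Under these assumptions the fid-only dump comes out at 50 seconds, consistent with "a bit less than a minute", and the path variant at up to 100 seconds. A file system with millions of files would scale to many minutes of reduced MGM responsiveness at the same rate.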

Hi Franck,

Yes, you are right. The current implementation of eos fs dumpmd --fid can be greatly improved. There is almost no difference between retrieving only the fids or retrieving the full path of the files on a particular file system. I will fix this for the next release.

Thanks,
Elvin