Does anyone have an elegant quick way to find the largest files on an EOS namespace?
Is there a way other than crawling the entire namespace and listing each directory?
Hi David,
You could use the eos-ns-inspect tool to scan (offline) one of your QDB backups, dump all files, and keep track of the largest one. This way, at least you will not thrash your active QDB instance.
Cheers,
Elvin
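To illustrate the "dump everything and keep track of the largest" idea, here is a minimal Python sketch. It is not taken from the EOS tooling: the record format is an assumption (one file per line, carrying `path=<...>` and `size=<bytes>` tokens), and the function name `largest_files` is hypothetical. Adapt the two regular expressions to whatever your offline dump actually prints.

```python
import heapq
import re
from typing import Iterable, List, Tuple

# ASSUMPTION: each dumped record is a single line containing
# "path=<path>" and "size=<bytes>" tokens. This format is hypothetical;
# adjust the patterns to match the real dump output.
# Note: paths containing spaces would need a different pattern.
_SIZE_RE = re.compile(r"\bsize=(\d+)\b")
_PATH_RE = re.compile(r"\bpath=(\S+)")


def largest_files(lines: Iterable[str], top_n: int = 10) -> List[Tuple[int, str]]:
    """Return the top_n (size, path) pairs, largest first.

    A bounded min-heap keeps memory use at O(top_n), so the dump can be
    streamed line by line even for namespaces with many millions of files.
    """
    heap: List[Tuple[int, str]] = []  # smallest kept entry sits at heap[0]
    for line in lines:
        size_match = _SIZE_RE.search(line)
        path_match = _PATH_RE.search(line)
        if not size_match or not path_match:
            continue  # skip lines that are not file records
        entry = (int(size_match.group(1)), path_match.group(1))
        if len(heap) < top_n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # New entry beats the smallest one we kept: swap it in.
            heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True)
```

Since the function accepts any iterable of lines, it can be fed an open file object holding the saved dump, so the whole analysis runs away from the production MGM/QDB.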
Along the same lines of not bothering the active MGM/QDB instance, is there a QuarkDB alternative to get the result of the eos fs dumpmd --path command? When this is run on the MGM for a FS containing millions of files, it makes the MGM unavailable during the process.
Hi Franck,
Using --path means it needs to resolve the entire hierarchy, so this is not possible with the eos-ns-inspect tool without reimplementing the same logic as in the namespace. Doing the eos fs dumpmd without resolving the paths should at least be faster.
Cheers,
Elvin
Thank you. Yes, sure, the --path option is the killer, and it is probably what causes the issue. We usually never need to run it, except when a drain has finished and some files are left over that we want to inspect. In that case --path causes no problem: the files are very few, hundreds at most, and it is very handy. It just happened once that the wrong fsid was used in the command by mistake, and it started dumping a full disk instead of the almost empty one…
Isn’t there also an equivalent to eos fs dumpmd without --path that we could use in such a situation, to avoid another mistake? The matching with the path can be done afterwards, just on a list of ids.
Hi Franck,
For such a case, doing eos fs dumpmd <fsid> --fid is actually the quickest way, since this info is already cached at the MGM anyway. This will give you a list of all fids on that file system.
Cheers,
Elvin
Thank you Elvin for your previous answer, it seemed indeed quite a reasonable approach.
However, I observe that when the information is not in the MGM cache, retrieving only the fids is also quite an expensive operation, since it requires querying the QuarkDB cluster. The speed on our test instance (virtual machine based) is about 2 kHz, so querying a FS with 100k files takes a bit less than a minute. During that period, MGM responsiveness is very reduced (almost unusable). Retrieving path information doesn’t seem much more expensive (at most 2 times slower). Is that correct?
If yes, this is why I was wondering whether the list of fids on a filesystem could be retrieved directly from the QuarkDB cluster, even directly with a redis-cli query instead of eos-ns-inspect.
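For reference, the timing quoted above is simply the number of files divided by the lookup rate. The small Python sketch below reproduces it and extrapolates to a larger file system. It assumes the rate stays constant as the file count grows, which is an assumption and not something measured here; the function name is hypothetical.

```python
def estimated_dump_seconds(num_files: int, lookups_per_second: float) -> float:
    """Estimate how long a dump takes at a constant lookup rate.

    ASSUMPTION: linear scaling, i.e. the per-file lookup cost does not
    change with the size of the file system.
    """
    if lookups_per_second <= 0:
        raise ValueError("lookups_per_second must be positive")
    return num_files / lookups_per_second


# 100k files at about 2 kHz: 50 seconds, i.e. "a bit less than a minute".
small_fs = estimated_dump_seconds(100_000, 2_000)

# Under the same assumption, a FS with 1 million files would take 500 seconds.
large_fs = estimated_dump_seconds(1_000_000, 2_000)
```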
Hi Franck,
Yes, you are right. The current implementation of eos fs dumpmd --fid can be greatly improved. There is almost no difference between retrieving only the fids or retrieving the full path of the files on a particular file system. I will fix this for the next release.
Thanks,
Elvin