
FST memory footprint

Hello,

We have had several cases of an EOS FST being killed by the OS for excessive memory use.
This happened after we increased the number of active filesystems (FS) on these nodes from 24 to 48. Could there be a link to this increased activity and undersized servers? They currently manage a total of ~30M files each, with 150GB of memory. Some servers report 500GB vsize and up to 129GB RSS in the `eos node ls --sys` output, but the values vary a lot between nodes.
When restarted, and after all FSes have booted, the memory footprint is much lower (a few GB), but it then seems to increase over time. Could there be a memory leak? The EOS version on these FSTs is 4.2.20; maybe a newer version fixes a known bug?

@hroussea have you noticed this problem here at @CERN? I think we also have 48 FSes per server, right?

Hi there,

I had a quick look at some of our instances and the biggest FST reports ~22GB of RSS running 4.2.26-1 on a CentOS kernel 3.10.0-862.3.3.el7.x86_64 for ~2.9 million files on its disks.

Also, note that we use jemalloc (but you should have it as well).

Hi Hervé,

Thank you for your answer, it helps to figure this out.

Yes, we also use jemalloc.

When you say 2.9M, you mean the total per node, not per FS, right?
Our FSTs handle 30M files per node, so around 10 times more. RSS can also be lower than 22GB (it starts at 10GB), but also much higher (120GB max, currently).
Our EOS version is 4.2.20, and the kernel version is almost the same as yours (862.3.2).

Maybe RSS size is not meant to be proportional to the number of handled files, but rather to file usage.
I’ll keep on monitoring, especially after we upgrade the FST versions.
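For what it's worth, a back-of-envelope comparison of RSS per file from the numbers quoted in this thread (your ~22GB for ~2.9M files vs. our worst case of ~120GB for ~30M files):

```shell
# Rough RSS-per-file from the figures quoted in this thread.
awk 'BEGIN {
    printf "CERN node: %.1f kB/file\n", 22e9 / 2.9e6 / 1e3    # ~22GB / 2.9M files
    printf "our node:  %.1f kB/file\n", 120e9 / 30e6 / 1e3    # ~120GB / 30M files
}'
# -> CERN node: 7.6 kB/file
# -> our node:  4.0 kB/file
```

Same order of magnitude either way, so this alone doesn't distinguish a leak from per-file caching.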

hello!

we’ve been seeing something similar very recently on EOS 4.2.24 - just had one FST killed by the OS for too much memory use. however, our FSTs have always had 44 FSes, so a change in the number of active FSes wouldn’t have been the trigger for us… prior to this we’ve been seeing xrootd traps in dmesg causing crashes, not entirely sure if related.

the fst that crashed today had about 20M files, 150G vsize and 130G rss, and was using about 95-96% of memory when it died - after reboot, 11G vsize and 7G rss.

happen to have any updates on your situation, @franck-jrc?

Hi Crystal,

Thank you for your input. For now we don't observe this memory increase any more; all FSTs seem to have settled after some time.
What has changed since then is that we upgraded the FSTs that were crashing from 4.2.20 to 4.2.28, and we also paused the balancing and group-balancing activities.
We observed a total of 9 memory kills on 5 FSTs over a 15-day period. The FSTs affected were the ones involved in this balancing, so maybe something is linked there. And maybe this situation also led to the strange ghost files we have in this other issue.

interesting! 9 memory kills in 15 days is quite a lot.

i just double-checked and all our balancing is turned off, so it is maybe a different issue for us - will keep looking into it :slight_smile: thanks for the update anyway!