Last week we had an incident linked to very high activity from one user (using fusex mounts on an HTCondor cluster, so a very large number of processes), which led to a huge log being written to disk. More than the sheer space used, the problem seems to be that logrotate wasn't compressing fast enough to keep up with the log messages and free space.
Of course, the user launched the culprit processes on Friday evening, so there was nobody around to intervene, and by Saturday evening the disk was full, even though 300GB were free when we left.
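For illustration only (the path, size and retention values below are placeholders, not our actual setup), a size-triggered logrotate policy along these lines is the kind of mitigation we have in mind so that rotation and compression can keep up with such bursts:

/var/eos/log/mgm/xrdlog.mgm {
    # rotate as soon as the file exceeds 1GB
    # (maxsize only helps if logrotate itself runs often, e.g. hourly from cron)
    maxsize 1G
    rotate 24
    # compress immediately rather than waiting one rotation cycle
    compress
    nodelaycompress
    missingok
    notifempty
    # truncate in place so the running MGM keeps writing to the same file descriptor
    copytruncate
}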
This time, freeing up space wasn't sufficient to restore the MGM service: it had to be restarted. After the restart, the previous activity of that user resumed, with about 3 kHz of Open operations. Thinking this might be too much, we tried to limit it using eos access set limit <rate> rate:user:*:Open (until now we had never really needed this feature).
However, we observed that when the rate value is low enough, it limits the rate observed in the eos stat output to around 400Hz, but any value above that does not limit at all (or limits to something higher than what the user is actually doing). Is that normal behaviour?
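Concretely, what we tried looks roughly like the following (the numeric values are just examples; we read the observed rates from the eos ns stat output):

# with a low enough value, the observed Open rate caps at around 400Hz
eos access set limit 300 rate:user:*:Open
# with a higher value, we see no limiting effect on the user's rate
eos access set limit 2000 rate:user:*:Open
# what we watch to read the Open rates
eos ns stat | grep Open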
In any case, since this activity doesn't actually disturb the MGM (it is very resilient on this point), we were wondering whether there is a way to still have a useful log file (so that in case of an issue we can analyse what is going on); in particular, we find it useful to still have the open information in the logs.
The filter we use for the log file is the following (it was defined some years ago to remove useless messages, so it may no longer be appropriate for this version):
eos debug info --filter IdMap,Commit,_access,readlink,_readlink,BroadcastReleaseFromExternal,BroadcastRefreshFromExternal,BroadcastMD,OpSetFile,RefreshEntry,HandleMD,FillContainerCAP,Store,OpGetLs,OpGetCap,log_info,Imply,OpSetDirectory
With this filter, and the activity we observed these days, the xrdlog.mgm log file grows at more than 8MB/s, which is about 30GB per hour. In addition, the Clients.log file is almost as big.
The questions are:
- How do we correctly set rate limits for users?
- Has anyone of you already had this issue of log file growth, and what configuration do you use for the log files so that the logs remain usable for analysis? Would it be possible to filter out messages only in case of high activity? Or do you mainly filter them heavily on big instances?
- What causes the unavailability of the MGM when the /var/eos partition (where the logs are written) is full? Is it the log writing itself, or something else (report, ns-queue)? Would it be better to separate the logs from the /var/eos partition? We put them there because in the past they were on the root partition, which is much smaller and gave other types of issues… But maybe now we could reconsider… The /var/eos partition used to be big in order to host the files backing the in-memory namespace, but now the metadata is on a separate QuarkDB one…
Our server version: 5.2.32