On our production EOS instance ALICE::Subatech::EOS, the namespace is huge. Backing it up takes a long time, so I decided to try compactification (which is currently off). The process started fine, but the MGM crashed and I had to restart it.
I have saved the parts of the log xrdlog.mgm that contain a stack trace.
I do not know whether this has been seen before. This is eos-server-4.2.25-1.el6.x86_64 on SL6.
Hi, we have also observed this behaviour in the past, although I think that was still with Aquamarine. What we usually try to do is run compaction during periods of reduced (write) activity on the server; that seems to reduce the risk of such a crash.
At other times the server did not crash but the compaction still failed. The problem in that case is that the compaction state remains “compacting”, and we cannot start online compaction again without restarting the server (which takes time with a large namespace file).
[root@naneosmgr01(EOSMASTER) ~]#df -kh
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 247G 179G 56G 77% /
tmpfs 32G 0 32G 0% /dev/shm
/dev/sda1 93M 71M 18M 81% /boot
[root@naneosmgr01(EOSMASTER) ~]#ls -ltrh /var/eos/md/
total 61G
-rw-r--r-- 1 daemon daemon 2.6G Dec 13 10:54 files.naneosmgr01.in2p3.fr.mdlog.oc
-rw-r--r-- 1 daemon root 92K Dec 13 10:54 stacktrace
-rw-r--r-- 1 daemon daemon 23G Dec 20 08:22 directories.naneosmgr01.in2p3.fr.mdlog
-rw-rw-rw- 1 daemon daemon 1.3M Dec 20 08:50 so.mgm.dump
-rwxr--r-- 1 daemon daemon 3.2K Dec 20 08:51 iostat.naneosmgr01.in2p3.fr.dump
-rw-r--r-- 1 daemon daemon 3.2K Dec 20 08:51 iostat.naneosmgr02.in2p3.fr.dump
-rw-r--r-- 1 daemon daemon 35G Dec 20 08:51 files.naneosmgr01.in2p3.fr.mdlog
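Since online compaction writes the compacted copy (the .oc file) alongside the original changelog, it may be worth checking beforehand that the partition holding /var/eos/md can hold a second copy. A minimal sketch; the helper name and example path are mine, not part of EOS:

```shell
#!/bin/sh
# Sketch: check that the partition holding a namespace changelog has at
# least as much free space as the changelog itself, since online
# compaction writes a compacted copy (.oc) alongside the original.
check_space() {
    mdlog=$1
    need_kb=$(du -k "$mdlog" | awk '{print $1}')
    free_kb=$(df -Pk "$(dirname "$mdlog")" | awk 'NR==2 {print $4}')
    if [ "$free_kb" -ge "$need_kb" ]; then
        echo "OK: ${free_kb} kB free, ${need_kb} kB needed"
    else
        echo "WARNING: only ${free_kb} kB free, ${need_kb} kB needed"
    fi
}

# Example (path taken from the listing above):
# check_space /var/eos/md/files.naneosmgr01.in2p3.fr.mdlog
```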
Starting compactification:
[root@naneosmgr01(EOSMASTER) ~]#eos -b ns compact on
success: configured online compacting to run in 0 seconds from now (might be delayed up to 60 seconds)
[root@naneosmgr01(EOSMASTER) ~]#eos -b ns | grep -i compac
ALL Compactification status=starting waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
[root@naneosmgr01(EOSMASTER) ~]#eos -b ns | grep -i compac
ALL Compactification status=starting waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
[root@naneosmgr01(EOSMASTER) ~]#eos -b ns | grep -i compac
ALL Files 17051907 [compacting] (371s)
ALL Compactification status=compacting waitstart=0 ratio-file=0.0:1 ratio-dir=0.0:1
[root@naneosmgr01(EOSMASTER) ~]#eos -b ns | grep -i compac
ALL Files 17051907 [compacting] (371s)
ALL Compactification status=compacting waitstart=0 ratio-file=0.0:1 ratio-dir=0.0:1
[root@naneosmgr01(EOSMASTER) ~]#eos -b ns | grep -i compac
ALL Files 17051907 [compacting] (371s)
ALL Compactification status=compacting waitstart=0 ratio-file=0.0:1 ratio-dir=0.0:1
[root@naneosmgr01(EOSMASTER) ~]#eos -b ns | grep -i compac
=> Here the eos command hangs and the MGM has crashed. I restarted it immediately to minimise unavailability:
[root@naneosmgr01(EOSMASTER) ~]#service eos restart mgm
Stopping xrootd: mgm [ OK ]
Starting xrootd as mgm with -n mgm -c /etc/xrd.cf.mgm -m -l /var/log/eos/xrdlog.mgm -b -Rdaemon
[ OK ]
I do not know what to do in order to compact the namespace.
As far as I know, those reports merely track file accesses to help analyse how the storage is used; they are not otherwise needed, so removing them is harmless. Just consider whether you want to keep a copy somewhere else for reference.
In the end, though, your problem does not seem to be linked to disk space.
I cannot say why the compaction failed this time as well; maybe there is some information in /var/eos/md/stacktrace, /var/log/eos/mgm/Master.log, or even /var/log/eos/mgm/xrdlog.mgm.
One thought I have is that this last crash could be linked to the leftover files.naneosmgr01.in2p3.fr.mdlog.oc file. It is the temporary file used for online compaction (.oc), and it dates from last week. Rename it and retry; you should also be able to simply delete it.
If you still have problems with online compaction, there is an offline tool as an alternative to online compaction that might help with troubleshooting:
You need to run it while the MGM is off. Then, if it succeeds, you can use the newly created files as the namespace files by giving them the original names (keeping the originals safe, of course) and restarting the MGM.
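The swap described above could be scripted along these lines. This is only a sketch with hypothetical names: in particular the .compacted suffix is an assumption, and the offline tool's actual output names may differ:

```shell
#!/bin/sh
# Sketch: with the MGM stopped, move a freshly compacted namespace file
# into place under the original name, keeping a timestamped backup of the
# original. File names are hypothetical; adapt to the tool's real output.
swap_in() {
    dir=$1     # e.g. /var/eos/md
    base=$2    # e.g. files.naneosmgr01.in2p3.fr.mdlog
    ts=$(date +%s)
    # keep the original safe under a timestamped name
    mv "$dir/$base" "$dir/$base.$ts"
    # promote the compacted file (assumed suffix) to the original name
    mv "$dir/$base.compacted" "$dir/$base"
    echo "kept original as $base.$ts"
}
```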
I gave a new try to compact the namespace and it looks like it worked:
[root@naneosmgr01(EOSMASTER) ~]#ls -ltrh /var/eos/md/
total 61G
-rw-r--r-- 1 daemon daemon 2.6G Dec 20 08:53 files.naneosmgr01.in2p3.fr.mdlog.oc
-rw-r--r-- 1 daemon root 64K Dec 20 08:53 stacktrace
-rw-r--r-- 1 daemon daemon 23G Feb 18 14:57 directories.naneosmgr01.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 36G Feb 18 15:00 files.naneosmgr01.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 460K Feb 18 15:06 so.mgm.dump
-rw-r--r-- 1 daemon daemon 3.2K Feb 18 15:07 iostat.naneosmgr02.in2p3.fr.dump
-rwxr--r-- 1 daemon daemon 3.2K Feb 18 15:07 iostat.naneosmgr01.in2p3.fr.dump
[root@naneosmgr01(EOSMASTER) ~]#eos -b ns compact on
success: configured online compacting to run in 0 seconds from now (might be delayed up to 60 seconds)
[root@naneosmgr01(EOSMASTER) ~]#eos -b ns | grep -i compac
ALL Compactification status=starting waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
[...]
[root@naneosmgr01(EOSMASTER) ~]#eos -b ns | grep -i compac
ALL Files 16609893 [compacting] (534s)
ALL Compactification status=compacting waitstart=0 ratio-file=0.0:1 ratio-dir=0.0:1
[...]
[root@naneosmgr01(EOSMASTER) ~]#eos -b ns | grep -i compac
ALL Compactification status=off waitstart=0 interval=0 ratio-file=14.2:1 ratio-dir=0.0:1
[root@naneosmgr01(EOSMASTER) ~]#ls -ltrh /var/eos/md/
total 61G
-rw-r--r-- 1 daemon root 64K Dec 20 08:53 stacktrace
-rw-r--r-- 1 daemon daemon 23G Feb 18 14:57 directories.naneosmgr01.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 36G Feb 18 15:00 files.naneosmgr01.in2p3.fr.mdlog.1550498906
-rw-r--r-- 1 daemon daemon 2.5G Feb 18 15:09 files.naneosmgr01.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 459K Feb 18 15:10 so.mgm.dump
-rwxr--r-- 1 daemon daemon 3.2K Feb 18 15:11 iostat.naneosmgr01.in2p3.fr.dump
-rw-r--r-- 1 daemon daemon 3.2K Feb 18 15:11 iostat.naneosmgr02.in2p3.fr.dump
=> I do not know why it worked this time. Is the above trace correct?
More questions:
files.naneosmgr01.in2p3.fr.mdlog is now 2.5G. Am I supposed to remove files.naneosmgr01.in2p3.fr.mdlog.1550498906 at some point (to recover disk space)?
Should I switch compaction to auto (so it reschedules itself regularly)? With which parameters?
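For reference, the numeric suffix on the kept file looks like a Unix epoch timestamp, presumably the moment the compaction ran; it can be decoded with GNU date:

```shell
# Decode the suffix of files.naneosmgr01.in2p3.fr.mdlog.1550498906,
# assuming it is a Unix epoch timestamp (GNU date syntax):
date -u -d @1550498906
# → Mon Feb 18 14:08:26 UTC 2019
```

That matches the time the new compacted file appeared in the listing above, so the suffix seems to be the time the original was set aside.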
Hi Jean Michel,
For your use case, the directory changelog is actually the most useful one to compact. Since ALICE does not update files, file compaction should only give you a 2:1 or 3:1 ratio, while every file write rewrites the directory metadata of the parent.
So yes, compact the directories, but do not use the autorepair flags.
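On the scheduling question: `ns compact on` also accepts an interval and a type argument, and the type is what selects directory compaction. The exact token (`directories` below) is my reading of the eos CLI help and should be verified on your version before use:

```shell
#!/bin/sh
# Sketch: schedule automatic compaction of the directory changelog once a
# week. The "directories" type token is an assumption; check the `eos ns`
# help on your EOS version before running this.
delay=60                      # first run in 60 seconds
interval=$((7 * 24 * 3600))   # then re-run every 7 days
cmd="eos -b ns compact on $delay $interval directories"
echo "$cmd"
# run it with: $cmd
```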