EOS stopped working

franck-jrc · July 31, 2020, 9:56pm

Some more information about the issue above reported by @marco.

Here is a small backtrace we found in the xrdlog.mgm file at the time of the first issue and that might explain the first crash.

error: received signal 11:
/lib64/libXrdEosMgm.so(_Z20xrdmgmofs_stacktracei+0x44)[0x7fb32e2e8cb4]
/lib64/libc.so.6(+0x36280)[0x7fb33355d280]
/usr/lib64/libjemalloc.so.1(+0xd5cd)[0x7fb334cb65cd]
/usr/lib64/libjemalloc.so.1(+0x266ff)[0x7fb334ccf6ff]
/usr/lib64/libjemalloc.so.1(malloc+0x22f)[0x7fb334caf14f]
/lib64/libstdc++.so.6(_Znwm+0x1d)[0x7fb333e6aecd]
/lib64/libstdc++.so.6(_ZNSs4_Rep9_S_createEmmRKSaIcE+0x59)[0x7fb333ec9a19]
/lib64/libstdc++.so.6(_ZNSs12_S_constructIPKcEEPcT_S3_RKSaIcESt20forward_iterator_tag+0x21)[0x7fb333ecb2a1]
/lib64/libstdc++.so.6(_ZNSsC2EPKcRKSaIcE+0x38)[0x7fb333ecb6d8]
/lib64/libXrdEosMgm.so(_ZN9TableCellC1EdRKSsS1_b19TableFormatterColor+0x1a8)[0x7fb32e53c508]

At that time, the GroupBalancer was enabled and active, busy balancing our groups. I do not know if this had an impact, it has been going on correctly for around a week.

After that, the daemon was still alive, but only the regular messages regarding the AcquireLease, Supervisor and Converter function were logged.

Currently, when the mgm daemon is started, it start booting correctly, but after few seconds, everything is again frozen, as if some dead lock was occurring somewhere, preventing it from operating correctly. The only messages that come after this are the lease management. At some point, I managed to have a correct output for eos ls and eos node ls (FSTs were reporting their information) but it stopped working after a while. I could provide the logs via email if that could help.

I tried to disable the balancer (which gave some issue in the past during startup) and groupbalancer with the following commands on the QuarkDB cluster

redis-cli -p 7777 hset eos-config:default global:/config/jeodpp/space/default#balancer off  
redis-cli -p 7777 hset eos-config:default global:/config/jeodpp/space/default#groupbalancer off

But this didn’t change anything in the behavior at startup.

So anything that could help us start again the MGM would be helpful.

Thank you.

CERN Accelerating science

EOS stopped working