EOS MGM cannot start because of BackgroundFlusher corruption

Hi there,
We have a test EOS 4.8.35 instance with QuarkDB 0.4.3, and recently EOS crashed with the following error message:

210331 11:32:03 time=1617161523.613302 func=Configure                level=NOTE  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@besh04.ihep.ac.cn:1094 tid=00007fe3726e9780 source=XrdMgmOfsConfigure:1533        tident=<single-exec> sec=      uid=0 gid=0 name= geo="" MGM_HOST=besh04.ihep.ac.cn MGM_PORT=1094 VERSION=4.8.35 RELEASE=1 KEYTABADLER=5a5f24fa SYMKEY=2U+B8jO/OYwptDUIbrCAJ50Fv5Y=
210331 11:32:03 time=1617161523.658986 func=Init                     level=INFO  logid=a91957b8-91d1-11eb-8611-38eaa7a302a2 unit=mgm@besh04.ihep.ac.cn:1094 tid=00007fe3726e9780 source=Master:83                      tident=<service> sec=      uid=0 gid=0 name= geo="" systemd found on the machine = 1
210331 11:32:03 time=1617161523.720189 func=set                      level=INFO  logid=static.............................. unit=mgm@besh04.ihep.ac.cn:1094 tid=00007fe3726e9780 source=InstanceName:39                tident= sec=(null) uid=99 gid=99 name=- geo="" Setting global instance name => eosdev
210331 11:32:03 time=1617161523.720501 func=AddBroker                level=INFO  logid=static.............................. unit=mgm@besh04.ihep.ac.cn:1094 tid=00007fe3726e9780 source=XrdMqClient:173                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="add broker" url="root://besh04.ihep.ac.cn:1097//eos/besh04.ihep.ac.cn/mgm?xmqclient.advisory.status=1&xmqclient.advisory.query=1&xmqclient.advisory.flushbacklog=1"
210331 11:32:03 time=1617161523.729596 func=Subscribe                level=INFO  logid=static.............................. unit=mgm@besh04.ihep.ac.cn:1094 tid=00007fe3726e9780 source=XrdMqClient:618                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="successfully subscribed to broker" url="root://besh04.ihep.ac.cn:1097//eos/besh04.ihep.ac.cn/mgm?xmqclient.advisory.status=1&xmqclient.advisory.query=1&xmqclient.advisory.flushbacklog=1"
###### mq messaging: starting thread 
210331 11:32:03 time=1617161523.834687 func=BootNamespace            level=ALERT logid=a91957b8-91d1-11eb-8611-38eaa7a302a2 unit=mgm@besh04.ihep.ac.cn:1094 tid=00007fe3726e9780 source=Master:1702                    tident=<service> sec=      uid=0 gid=0 name= geo="" msg="running boot sequence (as master)"
210331 11:32:03 time=1617161523.835004 func=CreateObject             level=INFO  logid=a912b688-91d1-11eb-8611-38eaa7a302a2 unit=mgm@besh04.ihep.ac.cn:1094 tid=00007fe3726e9780 source=PluginManager:287              tident=<service> sec=      uid=0 gid=0 name= geo="" created plugin object type=NamespaceGroup
210331 11:32:03 time=1617161523.840309 func=enforceQuarkDBVersion    level=INFO  logid=static.............................. unit=mgm@besh04.ihep.ac.cn:1094 tid=00007fe3726e9780 source=VersionEnforcement:38          tident= sec=(null) uid=99 gid=99 name=- geo="" QuarkDB version: "0.4.3"
BackgroundFlusher corruption, could not retrieve entry with index 12256660885
terminate called without an active exception

I did some research. Is “I-79405346” the key name corresponding to index 12256660885, as obtained by getKey(12256660885)? Looking up either “I-79405346” or “I79405346” returns nothing from QuarkDB.

In this case, how should we recover the EOS service?

Best regards.

Hi Yujiang,

This index is not related to any actual file in the system; it means the local flusher is corrupted. You can simply delete the directories inside /var/eos/ns-queue/ and then restart the MGM daemon, which should then start successfully. This can happen when a crash or an unclean shutdown corrupts the local db used for flushing. It does not affect the entries in the namespace.
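
If you would rather keep a copy of the corrupted queue for later inspection, a minimal sketch along these lines could move the directories aside instead of deleting them. Moving rather than deleting is only a cautious variation on the step above; the function name and the backup location are hypothetical, and the MGM should be stopped before running it.

```python
import shutil
import time
from pathlib import Path


def move_queue_dirs_aside(queue_root, backup_root):
    """Move each subdirectory of queue_root into a timestamped folder under backup_root.

    Returns the names of the directories that were moved. Plain files in
    queue_root are left alone, and a missing queue_root yields an empty list.
    """
    queue_root = Path(queue_root)
    if not queue_root.is_dir():
        return []

    # Hypothetical naming scheme for the backup folder, e.g. ns-queue-20210331-113203
    target = Path(backup_root) / time.strftime("ns-queue-%Y%m%d-%H%M%S")

    moved = []
    for entry in sorted(queue_root.iterdir()):
        if entry.is_dir():
            target.mkdir(parents=True, exist_ok=True)
            shutil.move(str(entry), str(target / entry.name))
            moved.append(entry.name)
    return moved
```

For example, with the MGM stopped, calling move_queue_dirs_aside("/var/eos/ns-queue", "/root/ns-queue-backup") (the second path is only an illustration) empties the queue directory; restart the MGM daemon afterwards as usual.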

Cheers,
Elvin