EOS fails to start because the namespace boot fails

Dear Experts,

Our MGM suddenly stopped working, and I would like to ask for some guidance. It fails to restart, and the following messages repeat on every attempt:

230201 09:17:53 17084 Starting on Linux 3.10.0-1160.6.1.el7.x86_64
Copr.  2004-2012 Stanford University, xrd version v4.12.8
++++++ xrootd mgm@eos-mgm1.alice-af.wigner.hu initialization started.
Config using configuration file /etc/xrd.cf.mgm
=====> all.sitename ALICE::KFKI::EOS
=====> xrd.sched mint 8 maxt 256 idle 64
=====> xrd.protocol XrdHttp:9000 /usr/lib64/libXrdHttp.so
230201 09:17:53 17084 XrdConfig: sitename already specified, using ' ALICE::KFKI::EOS '.
=====> all.sitename ALICE::KFKI::EOS
Config maximum number of connections restricted to 65000
Plugin loaded 
Copr.  2012 Stanford University, xrootd protocol 4.0.0 version v4.12.8
++++++ xrootd protocol initialization started.
=====> xrootd.fslib libXrdEosMgm.so
=====> xrootd.seclib libXrdSec.so
=====> xrootd.async off nosf
=====> xrootd.chksum adler32
=====> all.export / nolock
Config exporting /
Plugin loaded 
++++++ Authentication system initialization started.
Plugin loaded 
=====> sec.protocol unix
Plugin loaded 
=====> sec.protocol sss -c /etc/eos.keytab -s /etc/eos.keytab
Plugin loaded 
230201 09:17:53 17084 secgsi_InitOpts: *** ------------------------------------------------------------ ***
230201 09:17:53 17084 secgsi_InitOpts:  Mode: server
230201 09:17:53 17084 secgsi_InitOpts:  Debug: 0
230201 09:17:53 17084 secgsi_InitOpts:  CA dir: /etc/grid-security/certificates/
230201 09:17:53 17084 secgsi_InitOpts:  CA verification level: 1
230201 09:17:53 17084 secgsi_InitOpts:  CRL dir: /etc/grid-security/certificates/
230201 09:17:53 17084 secgsi_InitOpts:  CRL extension: .r0
230201 09:17:53 17084 secgsi_InitOpts:  CRL check level: 0
230201 09:17:53 17084 secgsi_InitOpts:  Certificate: /etc/grid-security/daemon/hostcert.pem
230201 09:17:53 17084 secgsi_InitOpts:  Key: /etc/grid-security/daemon/hostkey.pem
230201 09:17:53 17084 secgsi_InitOpts:  Proxy delegation option: 0
230201 09:17:53 17084 secgsi_InitOpts:  GRIDmap file: /etc/grid-security/grid-mapfile
230201 09:17:53 17084 secgsi_InitOpts:  GRIDmap option: 2
230201 09:17:53 17084 secgsi_InitOpts:  GRIDmap cache entries expiration (secs): 600
230201 09:17:53 17084 secgsi_InitOpts:  Client proxy availability in XrdSecEntity.endorsement: 0
230201 09:17:53 17084 secgsi_InitOpts:  VOMS option: 1
230201 09:17:53 17084 secgsi_InitOpts:  MonInfo option: 1
230201 09:17:53 17084 secgsi_InitOpts:  Crypto modules: ssl
230201 09:17:53 17084 secgsi_InitOpts:  Ciphers: aes-128-cbc:bf-cbc:des-ede3-cbc
230201 09:17:53 17084 secgsi_InitOpts:  MDigests: sha1:md5
230201 09:17:53 17084 secgsi_InitOpts:  Trusting DNS for hostname checking
230201 09:17:53 17084 secgsi_InitOpts: *** ------------------------------------------------------------ ***
230201 09:17:53 17084 secgsi_GetSrvCertEnt: problems loading srv cert: invalid
230201 09:17:53 17084 secgsi_Init: problems loading srv cert
=====> sec.protocol gsi -crl:0 -cert:/etc/grid-security/daemon/hostcert.pem -key:/etc/grid-security/daemon/hostkey.pem -gridmap:/etc/grid-security/grid-mapfile -d:0 -gmapopt:2 -vomsat:1 -moninfo:1 -exppxy:/var/eos/auth/gsi#<uid>
=====> sec.protbind localhost.localdomain unix sss
=====> sec.protbind localhost unix sss
=====> sec.protbind * only sss unix
Config 6 authentication directives processed in /etc/xrd.cf.mgm
------ Authentication system initialization completed.
++++++ Protection system initialization started.
Config warning: Security level is set to none; request protection disabled!
Config Local  protection level: none
Config Remote protection level: none
------ Protection system initialization completed.
Config Routing for 172.16.152.16: local pub4 prv4 
Config Route all4: 172.16.152.16 Dest=[::172.16.152.16]:1094
Plugin loaded 
++++++ (c) 2015 CERN/IT-DSS MgmOfs (meta data redirector) 4.8.62
=====> mgmofs enforces SSS authentication for XROOT clients
jemalloc is loaded!
jemalloc heap profiling is disabled
=====> mgmofs.hostname: eos-mgm1.alice-af.wigner.hu
=====> mgmofs.hostpref: eos-mgm1
=====> mgmofs.managerid: eos-mgm1.alice-af.wigner.hu:1094
=====> mgmofs.fs: /
=====> mgmofs.targetport: 1095
=====> mgmofs.authlib : /usr/lib64/libXrdAliceTokenAcc.so
=====> mgmofs.authorize : true
=====> mgmofs.instance : eosalice
=====> mgmofs.metalog: /var/eos/md
=====> mgmofs.txdir:   /var/eos/tx
=====> mgmofs.authdir:   /var/eos/auth
=====> mgmofs.reportstorepath: /var/eos/report
=====> mgmofs.cfgtype: quarkdb
=====> mgmofs.nslib : /usr/lib64/libEosNsQuarkdb.so
=====> mgmofs.qdbcluster : localhost:7001 localhost:7002 localhost:7003 
=====> mgmofs.qdbpassword length : 89
=====> ofs.tpc redirect to: eos-gateway-node.cern.ch1094
=====> mgmofs.redirector : false
=====> mgmofs.broker : root://localhost:1097//eos/eos-mgm1.alice-af.wigner.hu/mgm
=====> mgmofs.defaultreceiverqueue : /eos/*/fst
=====> mgmofs.fs: /
=====> mgmofs.errorlog : enabled
++++++ (c) 2008 CERN/IT-DM-SMD AliceTokenAcc (Alice Token Access Authorization) v 1.0
=====> alicetokenacc.noauthzhost: localhost
=====> alicetokenacc.noauthzhost: localhost.localdomain
=====> alicetokenacc.truncateprefix: /eos/alice/grid
=====> XrdAliceTokenAcc: No Authorizationfile set via environment variable 'TTOKENAUTHZ_AUTHORIZATIONFILE'
=====> XrdAliceTokenAcc: Using Authorizationfile '/etc/grid-security/xrootd/TkAuthz.Authorization'!
------ AliceTokenAcc initialization completed
=====> all.role: manager
=====> setting message filter: Process,AddQuota,Update,UpdateHint,Deletion,PrintOut,SharedHash,work
=====> comment log in /var/log/eos/mgm/logbook.log
=====> eosxd stacktraces log in /var/log/eos/mgm/eosxd-stacktraces.log
=====> eosxd logtraces log in /var/log/eos/mgm/eosxd-logtraces.log
=====> mgmofs.alias: eos-mgm.alice-af.wigner.hu
230201 09:17:53 time=1675243073.796342 func=Configure                level=NOTE  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f6035b26780 source=XrdMgmOfsConfigure:1540        tident=<single-exec> sec=      uid=0 gid=0 name= geo="" MGM_HOST=eos-mgm1.alice-af.wigner.hu MGM_PORT=1094 VERSION=4.8.62 RELEASE=1 KEYTABADLER=deba251a SYMKEY=I1HQvI4qbzhCNdw464x2Jf6vPRk=
230201 09:17:53 time=1675243073.798220 func=set                      level=INFO  logid=static.............................. unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f6035b26780 source=InstanceName:39                tident= sec=(null) uid=99 gid=99 name=- geo="" Setting global instance name => eosalice
230201 09:17:53 time=1675243073.798231 func=Supervisor               level=NOTE  logid=4ec3ff0a-a211-11ed-96ab-00259074c8e8 unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f5ff23fc700 source=QdbMaster:238                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="set up booting stall rule"
230201 09:17:53 time=1675243073.798406 func=AddBroker                level=INFO  logid=static.............................. unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f6035b26780 source=XrdMqClient:179                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="add broker" url="root://localhost:1097//eos/eos-mgm1.alice-af.wigner.hu/mgm?xmqclient.advisory.status=1&xmqclient.advisory.query=1&xmqclient.advisory.flushbacklog=1"
###### mq messaging: starting thread 
230201 09:17:53 time=1675243073.803418 func=Subscribe                level=INFO  logid=static.............................. unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f6035b26780 source=XrdMqClient:605                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="successfully subscribed to broker" url="root://localhost:1097//eos/eos-mgm1.alice-af.wigner.hu/mgm?xmqclient.advisory.status=1&xmqclient.advisory.query=1&xmqclient.advisory.flushbacklog=1"
230201 09:17:53 time=1675243073.907452 func=CreateObject             level=INFO  logid=4ebf30ce-a211-11ed-96ab-00259074c8e8 unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f6035b26780 source=PluginManager:287              tident=<service> sec=      uid=0 gid=0 name= geo="" created plugin object type=NamespaceGroup
230201 09:17:53 time=1675243073.917964 func=enforceQuarkDBVersion    level=INFO  logid=static.............................. unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f6035b26780 source=VersionEnforcement:38          tident= sec=(null) uid=99 gid=99 name=- geo="" QuarkDB version: "0.4.2"
230201 09:17:54 time=1675243074.097625 func=synchronize              level=INFO  logid=static.............................. unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f6035b26780 source=MetadataFlusher:157            tident= sec=(null) uid=99 gid=99 name=- geo="" starting-index=2311658799 ending-index=2311658799 msg="waiting until queue item 2311658798 has been acknowledged.."
230201 09:17:54 time=1675243074.097647 func=synchronize              level=INFO  logid=static.............................. unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f6035b26780 source=MetadataFlusher:170            tident= sec=(null) uid=99 gid=99 name=- geo="" starting-index=2311658799 ending-index=2311658799 msg="queue item 2311658798 has been acknowledged"
230201 09:17:54 time=1675243074.224703 func=synchronize              level=INFO  logid=static.............................. unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f6035b26780 source=MetadataFlusher:157            tident= sec=(null) uid=99 gid=99 name=- geo="" starting-index=359094629 ending-index=359094629 msg="waiting until queue item 359094628 has been acknowledged.."
230201 09:17:54 time=1675243074.224736 func=synchronize              level=INFO  logid=static.............................. unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f6035b26780 source=MetadataFlusher:170            tident= sec=(null) uid=99 gid=99 name=- geo="" starting-index=359094629 ending-index=359094629 msg="queue item 359094628 has been acknowledged"
[QCLIENT - INFO - getNext:57] Received redirection to localhost:7003
230201 09:17:54 time=1675243074.261859 func=configure                level=INFO  logid=static.............................. unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f6035b26780 source=FileSystemView:58              tident= sec=(null) uid=99 gid=99 name=- geo="" msg="FileSystemView loadFromBackend" duration=0s
230201 09:17:54 17084 XrootdConfig: Unable to create file system object via libXrdEosMgm.so
230201 09:17:54 17084 XrootdConfig: Unable to load file system.
------ xrootd protocol initialization failed.
230201 09:17:54 time=1675243074.278885 func=BootNamespace            level=NOTE  logid=4ec3ff0a-a211-11ed-96ab-00259074c8e8 unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f6035b26780 source=QdbMaster:151                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="container initialization failed" duration=0s, errc=17, reason="SafetyCheck FATAL: Risk of data loss, found container (41564) with id bigger than max container id (41563)"
230201 09:17:54 17084 XrdProtocol: Protocol xrootd could not be loaded
230201 09:17:54 time=1675243074.278923 func=Configure                level=CRIT  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@eos-mgm1.alice-af.wigner.hu:1094 tid=00007f6035b26780 source=XrdMgmOfsConfigure:1700        tident=<single-exec> sec=      uid=0 gid=0 name= geo="" msg="namespace boot failed"
------ xrootd mgm@eos-mgm1.alice-af.wigner.hu:-1 initialization failed.

I’m not really sure which containers it refers to. QuarkDB seems healthy.

Thanks,
Gabor
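
For a startup log this long, it can help to filter out just the lines that report a failure. The following is a minimal Python sketch, not an EOS tool: the function names `parse_fields` and `find_problems` are hypothetical, and the patterns it looks for (`level=CRIT`, a `reason=` field, and the plain-text phrases "problems loading" and "could not be loaded") are taken only from the log above, so other failure messages may need additional patterns.

```python
import re

# Matches key=value and key="quoted value" pairs as they appear in the
# structured MGM log lines above (e.g. func=..., level=..., msg="...").
FIELD_RE = re.compile(r'(\w+)=("([^"]*)"|\S*)')


def parse_fields(line):
    """Return a dict of the key=value fields found in one log line."""
    fields = {}
    for key, raw, quoted in FIELD_RE.findall(line):
        if raw.startswith('"'):
            fields[key] = quoted
        else:
            # Unquoted values such as "errc=17," carry a trailing comma.
            fields[key] = raw.rstrip(",")
    return fields


def find_problems(lines):
    """Collect (func, msg, reason) tuples for lines that look like failures.

    Assumption: only the patterns seen in this thread's log are checked.
    """
    hits = []
    for line in lines:
        fields = parse_fields(line)
        if fields.get("level") == "CRIT" or "reason" in fields:
            hits.append((fields.get("func", ""),
                         fields.get("msg", ""),
                         fields.get("reason", "")))
        elif "problems loading" in line or "could not be loaded" in line:
            # Plain xrootd messages have no key=value fields; keep the text.
            hits.append(("", line.strip(), ""))
    return hits
```

Run against the log above, for example with `find_problems(open(path))` where `path` is wherever the MGM log is stored on your system, this would surface the `secgsi` certificate messages alongside the `BootNamespace` and `Configure` failures, so an early error is harder to overlook.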

230201 09:17:53 17084 secgsi_GetSrvCertEnt: problems loading srv cert: invalid
230201 09:17:53 17084 secgsi_Init: problems loading srv cert

Something is wrong with your certificate …

Yes, I realized that in the meantime too: the file permissions were not OK. The srv cert message is gone now, but unfortunately the problem still exists.

My apologies, it seems that this really was the source of the problem: after a cold restart everything works as it is supposed to. Thanks!
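
For anyone hitting the same "problems loading srv cert" message, here is a rough Python sketch for inspecting the permissions on the host certificate and key. The function name `check_cert_access` is hypothetical, and the check is deliberately simplified: it looks only at the classic owner/group/other mode bits and ignores supplementary groups, ACLs, and root's ability to bypass permissions. The thread does not say which permissions were wrong, so this only reports what it finds. It does not encode a "correct" mode.

```python
import os
import stat


def check_cert_access(path, uid, gid):
    """Report how the mode bits of `path` apply to a process with uid/gid.

    Simplified: supplementary groups, ACLs and root privileges are ignored.
    """
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)
    # Pick the permission class that applies to the given uid/gid.
    if st.st_uid == uid:
        readable = bool(mode & stat.S_IRUSR)
    elif st.st_gid == gid:
        readable = bool(mode & stat.S_IRGRP)
    else:
        readable = bool(mode & stat.S_IROTH)
    return {
        "path": path,
        "mode": oct(mode),
        "owner_uid": st.st_uid,
        "owner_gid": st.st_gid,
        "readable_by_daemon": readable,
        # True if group or others have any access; often unwanted for keys.
        "open_to_group_or_other": bool(mode & (stat.S_IRWXG | stat.S_IRWXO)),
    }
```

A possible use, assuming the MGM runs under an account named `daemon` (check this on your own installation), would be to look up that account with `pwd.getpwnam("daemon")` and pass its `pw_uid` and `pw_gid` together with the paths from the log, `/etc/grid-security/daemon/hostcert.pem` and `/etc/grid-security/daemon/hostkey.pem`. A key that is not readable by that account, or that is open to group or others, would be the first thing to look at.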