
MGM fails to boot QuarkDB namespace

The SE is being redeployed. We added the first FSTs without issue, then added the next FST; upon reloading the MGM, it fails with the error below.

We restarted QuarkDB on the three member nodes. A new QuarkDB leader was elected and QuarkDB seems content, though eos@mgm still fails to boot the namespace.

I saw the closing comment in the thread "EOS mgm with qdb crashes".

I’d like to understand how the inconsistent state occurred and how to resolve it should it occur in production.

Unsure how to proceed - any suggestions?

200130 20:27:58 time=1580412478.129895 func=BootNamespace            level=CRIT  logid=9ee9a660-4396-11ea-b382-0060dd4265f8 unit=mgm@ornl-eos-01.ornl.gov:1094 tid=00007f2e87ce9740 source=Master:1890
ident=<service> sec=      uid=0 gid=0 name= geo="" initialization returned ec=17 SafetyCheck FATAL: Risk of data loss, found container with id bigger than max container id
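For anyone scanning many MGM log lines for this failure, the key=value fields in the line above can be pulled out mechanically. The following is a minimal sketch, not an EOS tool: the function name `parse_fields` is hypothetical, and the regular expression only assumes the `key=value` layout visible in this one line. It does not interpret what `ec=17` means.

```python
import re

# The CRIT line from the post, joined back into a single line.
LOG_LINE = (
    '200130 20:27:58 time=1580412478.129895 func=BootNamespace            '
    'level=CRIT  logid=9ee9a660-4396-11ea-b382-0060dd4265f8 '
    'unit=mgm@ornl-eos-01.ornl.gov:1094 tid=00007f2e87ce9740 '
    'source=Master:1890 ident=<service> sec=      uid=0 gid=0 name= geo="" '
    'initialization returned ec=17 SafetyCheck FATAL: Risk of data loss, '
    'found container with id bigger than max container id'
)


def parse_fields(line):
    """Return a dict of key=value pairs found in one log line.

    Values are either a double-quoted string or a run of non-space
    characters; an empty value (as in "sec=") becomes an empty string.
    """
    return dict(re.findall(r'(\w+)=("[^"]*"|\S*)', line))


fields = parse_fields(LOG_LINE)
# Report only the fields that identify the failure.
print(fields["func"], fields["level"], fields["source"], "ec=" + fields["ec"])
```

Filtering on `level` and `func` this way makes it easier to check whether the same boot failure repeats after each restart.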

raft_info

[root@ornl-eos-01 qdb1]#  redis-cli -p 7001 raft_info     
 1) TERM 106601
 2) LOG-START 0
 3) LOG-SIZE 214196
 4) LEADER ornl-eos-02.ornl.gov:7001
 5) CLUSTER-ID 0123456789
 6) COMMIT-INDEX 214195
 7) LAST-APPLIED 214195
 8) BLOCKED-WRITES 0
 9) LAST-STATE-CHANGE 2 (2 seconds)
10) ----------
11) MYSELF ornl-eos-01.ornl.gov:7001
12) VERSION 0.4.1
13) STATUS FOLLOWER
14) NODE-HEALTH GREEN
15) JOURNAL-FSYNC-POLICY sync-important-updates
16) ----------
17) MEMBERSHIP-EPOCH 95141
18) NODES ornl-eos-01.ornl.gov:7001,ornl-eos-02.ornl.gov:7001,warp-ornl-cern-05.ornl.gov:7001
19) OBSERVERS 
20) QUORUM-SIZE 2
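The raft_info output above can be checked programmatically when comparing the three nodes. This is a rough sketch under stated assumptions: the helper names `parse_raft_info` and `summarize` are hypothetical, and the meaning of each field is inferred from its name in the output above rather than from QuarkDB documentation. Note that these checks only describe raft state; in this thread raft looked fine while the MGM still refused to boot the namespace.

```python
import re

# raft_info output copied from the post above.
SAMPLE = """\
 1) TERM 106601
 2) LOG-START 0
 3) LOG-SIZE 214196
 4) LEADER ornl-eos-02.ornl.gov:7001
 5) CLUSTER-ID 0123456789
 6) COMMIT-INDEX 214195
 7) LAST-APPLIED 214195
 8) BLOCKED-WRITES 0
 9) LAST-STATE-CHANGE 2 (2 seconds)
10) ----------
11) MYSELF ornl-eos-01.ornl.gov:7001
12) VERSION 0.4.1
13) STATUS FOLLOWER
14) NODE-HEALTH GREEN
15) JOURNAL-FSYNC-POLICY sync-important-updates
16) ----------
17) MEMBERSHIP-EPOCH 95141
18) NODES ornl-eos-01.ornl.gov:7001,ornl-eos-02.ornl.gov:7001,warp-ornl-cern-05.ornl.gov:7001
19) OBSERVERS 
20) QUORUM-SIZE 2
"""


def parse_raft_info(text):
    """Parse redis-cli raft_info output into a dict of field -> value."""
    info = {}
    for line in text.splitlines():
        # Strip the "N) " index that redis-cli puts before each element.
        m = re.match(r"\s*\d+\)\s*(.*)$", line)
        if not m:
            continue
        body = m.group(1).strip()
        if not body or set(body) == {"-"}:
            continue  # skip separator rows such as "----------"
        key, _, value = body.partition(" ")
        info[key] = value.strip()
    return info


def summarize(info):
    """Derive a few simple checks from the parsed fields."""
    nodes = [n for n in info.get("NODES", "").split(",") if n]
    return {
        "has_leader": bool(info.get("LEADER")),
        # Assumption: equal indexes mean all committed entries are applied.
        "fully_applied": info.get("COMMIT-INDEX") == info.get("LAST-APPLIED"),
        "healthy": info.get("NODE-HEALTH") == "GREEN",
        "majority": len(nodes) // 2 + 1,
        "quorum_size": int(info["QUORUM-SIZE"]) if "QUORUM-SIZE" in info else None,
    }


summary = summarize(parse_raft_info(SAMPLE))
print(summary)
```

Running the same check against the output of each of the three nodes would show whether they agree on the leader and commit index.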

Hi Pete,

Could you describe the process you followed during the NS migration? Did you use QuarkDB in bulkload mode? If so, how did you convert from bulkload to raft?

If you copied the state machine data for the bulkload -> raft transition, did you make sure to shut down the bulkload instance first?

This looks like an issue which might occur if one copies the files of a live, running QuarkDB instance in bulkload.

Cheers,
Georgios

Hi Georgios,

There was no NS migration; the SE is being newly deployed, with a new QuarkDB from the start.

QuarkDB was working okay, the ns was booting, we’d copied some test files in, and things seemed stable. We were just adding some fsids, then restarted the MGM and got the error.

The eos ns cfg is file-based, not in QuarkDB, so I don’t think adding the fsids is related. We rolled the eos ns config back, but no luck.

Pete

Interesting, which MGM version were you running?

Cheers,
Georgios

eos-server-4.5.9-1.el7.cern.x86_64
eos-xrootd-4.11.1-1.el7.cern.x86_64
eos-rocksdb-5.7.3-1.el7.cern.x86_64
eos-folly-2017.09.18.00-4.el7.cern.x86_64