Repair QuarkDB inconsistency

Hello,

I just upgraded to v5.1.19. One of the QDB nodes is continually crashing; perhaps it did not get shut down cleanly.

[1686866077719] Reading configuration file from /etc/xrd.cf.quarkdb
[1686866077720] INFO: Openning state machine '/var/quarkdb/node-2/current/state-machine'.
[1686866077795] INFO: Opening raft journal '/var/quarkdb/node-2/current/raft-journal'
------ quarkdb protocol plugin initialization completed.
------ xrootd qdb@eos-qdb-2.eos-qdb.eos.svc.kermes-dev.local:7777 initialization completed.
[1686866079308] EVENT: eos-qdb-2.eos-qdb.eos.svc.kermes-dev.local:7777: TIMEOUT after 1446ms, I am not receiving heartbeats. Attempting to start election.
[1686866079308] INFO: Starting pre-vote round for term 37
[1686866079310] INFO: Pre-vote requests have been sent off, will allow a window of 1000ms to receive replies.
[1686866079322] INFO: Pre-vote round unsuccessful for term 37. Contacted 2 nodes, received 2 replies with a tally of 0 positive votes, 0 refused votes, and 2 vetoes.
[1686866079324] INFO: Pre-vote round for term 37 resulted in a veto. This means, the next leader of this cluster cannot be me. Stopping election attempts until I receive a heartbeat.
230615 21:54:50 010 XrdProtocol: anon.0:48@10-5-7-71.kube-prometheus-kubelet.kube-system.svc.kermes-dev.local terminated handshake not received
[1686866090865] INFO: New link from localhost [607404b8-3eaa-483f-9716-8779f2e1bef8]
[1686866090866] INFO: Shutting down link from localhost [607404b8-3eaa-483f-9716-8779f2e1bef8]
[1686866095698] INFO: New link from eos-qdb-1.eos-qdb.eos.svc.kermes-dev.local [127883fc-c1a7-4b04-98ad-36b21d0190e8]
[1686866095700] INFO: Connection with UUID 127883fc-c1a7-4b04-98ad-36b21d0190e8 identifying as 'internal-heartbeat-sender'
[1686866095719] INFO: New link from eos-qdb-1.eos-qdb.eos.svc.kermes-dev.local [3d0c2641-4440-422a-97d3-a5cf4c505d11]
[1686866095720] INFO: Connection with UUID 3d0c2641-4440-422a-97d3-a5cf4c505d11 identifying as 'internal-replicator'
[1686866095943] EVENT: Recognizing leader eos-qdb-1.eos-qdb.eos.svc.kermes-dev.local:7777 for term 36
[1686866095968] WARNING: Detected inconsistency for entry #3021686. Contents of my journal: term: 20 -> ['TIMESTAMPED_LEASE_ACQUIRE' 'master_lease' 'eos-mgm-0.eos-mgm.eos.svc.kermes-dev.local:1094' '10000' 'G�C']. Contents of what the leader sent: term: 21 -> ['JOURNAL_LEADERSHIP_MARKER' '21' 'eos-qdb-1.eos-qdb.eos.svc.kermes-dev.local:7777']
terminate called after throwing an instance of 'quarkdb::FatalException'
  what():  detected inconsistent entries for index 3021686.  Leader attempted to overwrite a committed entry with one with different contents. ----- Stack trace (most recent call last) in thread 11:
#13   Object ", at 0xffffffffffffffff, in 
#12   Object ", at 0x7f01c5bdb96c, in 
#11   Object ", at 0x7f01c5eb2ea4, in 
#10   Object ", at 0x7f01c6d4e206, in 
#9    Object ", at 0x7f01c6dbe338, in 
#8    Object ", at 0x7f01c6dbe216, in 
#7    Object ", at 0x7f01c6dbaeac, in 
#6    Object ", at 0x7f01c15543a3, in 
#5    Object ", at 0x7f01c156bb41, in 
#4    Object ", at 0x7f01c158d14f, in 
#3    Object ", at 0x7f01c15945df, in 
#2    Object ", at 0x7f01c15ff984, in 
#1    Object ", at 0x7f01c15fcd6d, in 
#0    Object ", at 0x7f01c1558428, in 

Stack trace (most recent call last) in thread 11:
#18   Object ", at 0xffffffffffffffff, in 
#17   Object ", at 0x7f01c5bdb96c, in 
#16   Object ", at 0x7f01c5eb2ea4, in 
#15   Object ", at 0x7f01c6d4e206, in 
#14   Object ", at 0x7f01c6dbe338, in 
#13   Object ", at 0x7f01c6dbe216, in 
#12   Object ", at 0x7f01c6dbaeac, in 
#11   Object ", at 0x7f01c15543a3, in 
#10   Object ", at 0x7f01c156bb41, in 
#9    Object ", at 0x7f01c158d14f, in 
#8    Object ", at 0x7f01c15945df, in 
#7    Object ", at 0x7f01c15ff984, in 
#6    Object ", at 0x7f01c15fcdde, in 
#5    Object ", at 0x7f01c663dc52, in 
#4    Object ", at 0x7f01c663da32, in 
#3    Object ", at 0x7f01c663da05, in 
#2    Object ", at 0x7f01c663fa94, in 
#1    Object ", at 0x7f01c5b14a77, in 
#0    Object ", at 0x7f01c5b13387, in 
Aborted (Signal sent by tkill() 1 0)
command terminated with exit code 137

The other two members are up and running, so presumably their copy of the data can be considered correct. How can I repair the inconsistent data?
I tried to find documentation about how to view and delete keys but all I found was
https://quarkdb.web.cern.ch/quarkdb/docs/master/ref/hash

Any key I try to get or hgetall doesn't seem to exist, and I can't find a way to list all keys.
Even if I had such a command, I'm not sure how I would apply it, because the inconsistent member crashes immediately. Can I tell the leader to overwrite the inconsistent data on the corrupted member?
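For reference, here is roughly what I have been trying against one of the healthy members (QuarkDB speaks the Redis wire protocol, so I am using plain redis-cli; the hostname and port are from my deployment, and the key name is just a placeholder):

```shell
# Commands I have been trying against one of the healthy members.
# QuarkDB speaks the Redis wire protocol, so plain redis-cli works;
# hostname/port below are from my deployment.
QDB=eos-qdb-1.eos-qdb.eos.svc.kermes-dev.local

# Raft state: leader, term, log indices, membership.
redis-cli -h "$QDB" -p 7777 raft-info

# Cursor-based key iteration, as in stock Redis ('0' starts a scan).
redis-cli -h "$QDB" -p 7777 scan 0

# Looking up a specific key ('some-key' is a placeholder).
redis-cli -h "$QDB" -p 7777 hgetall some-key
```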

Thanks.

Hi Ryan,

Can you give us a few more details about the upgrade? From which version did you upgrade?
Are you using the eos-quarkdb packages for deploying QuarkDB? Are all your QDB instances running the same version? How exactly did you go about doing the upgrade?

At this point I would suggest starting a fresh re-silvering process, leaving the current corrupted QDB instance completely aside. You can find instructions on re-silvering here:
https://quarkdb.web.cern.ch/quarkdb/docs/master/resilvering/

Cheers,
Elvin

Hi Elvin,

Previously it was 5.0.31. It runs on Kubernetes. All the pods run the same EOS container image, except during an upgrade, when pods are replaced one at a time.

I can delete the volume of the corrupted pod, forcing it to start over as an empty member. It would be nice if there were a way to force a repair on it, though.
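Concretely, the wipe-and-rejoin I have in mind looks roughly like this (the PVC name is a guess based on the usual StatefulSet naming convention; the namespace comes from the service FQDN in the logs):

```shell
# Wipe the corrupted member so it rejoins empty and gets re-silvered
# from the leader. Resource names are guesses based on my StatefulSet
# (eos-qdb-2 in namespace eos); adjust for your deployment.

# Mark the volume claim for deletion (--wait=false, since PVC
# protection keeps it in Terminating while the pod still runs),
# then delete the pod; the StatefulSet recreates both, and the new
# pod starts with an empty data directory.
kubectl -n eos delete pvc data-eos-qdb-2 --wait=false
kubectl -n eos delete pod eos-qdb-2

# Afterwards, check that the node caught up with the leader.
redis-cli -h eos-qdb-2.eos-qdb.eos.svc.kermes-dev.local -p 7777 raft-info
```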

Thanks.