Hello,
I just upgraded to v5.1.19. One of the QDB nodes keeps crashing; it may not have been shut down cleanly. The log from the crashing node:
[1686866077719] Reading configuration file from /etc/xrd.cf.quarkdb
[1686866077720] INFO: Openning state machine '/var/quarkdb/node-2/current/state-machine'.
[1686866077795] INFO: Opening raft journal '/var/quarkdb/node-2/current/raft-journal'
------ quarkdb protocol plugin initialization completed.
------ xrootd qdb@eos-qdb-2.eos-qdb.eos.svc.kermes-dev.local:7777 initialization completed.
[1686866079308] EVENT: eos-qdb-2.eos-qdb.eos.svc.kermes-dev.local:7777: TIMEOUT after 1446ms, I am not receiving heartbeats. Attempting to start election.
[1686866079308] INFO: Starting pre-vote round for term 37
[1686866079310] INFO: Pre-vote requests have been sent off, will allow a window of 1000ms to receive replies.
[1686866079322] INFO: Pre-vote round unsuccessful for term 37. Contacted 2 nodes, received 2 replies with a tally of 0 positive votes, 0 refused votes, and 2 vetoes.
[1686866079324] INFO: Pre-vote round for term 37 resulted in a veto. This means, the next leader of this cluster cannot be me. Stopping election attempts until I receive a heartbeat.
230615 21:54:50 010 XrdProtocol: anon.0:48@10-5-7-71.kube-prometheus-kubelet.kube-system.svc.kermes-dev.local terminated handshake not received
[1686866090865] INFO: New link from localhost [607404b8-3eaa-483f-9716-8779f2e1bef8]
[1686866090866] INFO: Shutting down link from localhost [607404b8-3eaa-483f-9716-8779f2e1bef8]
[1686866095698] INFO: New link from eos-qdb-1.eos-qdb.eos.svc.kermes-dev.local [127883fc-c1a7-4b04-98ad-36b21d0190e8]
[1686866095700] INFO: Connection with UUID 127883fc-c1a7-4b04-98ad-36b21d0190e8 identifying as 'internal-heartbeat-sender'
[1686866095719] INFO: New link from eos-qdb-1.eos-qdb.eos.svc.kermes-dev.local [3d0c2641-4440-422a-97d3-a5cf4c505d11]
[1686866095720] INFO: Connection with UUID 3d0c2641-4440-422a-97d3-a5cf4c505d11 identifying as 'internal-replicator'
[1686866095943] EVENT: Recognizing leader eos-qdb-1.eos-qdb.eos.svc.kermes-dev.local:7777 for term 36
[1686866095968] WARNING: Detected inconsistency for entry #3021686. Contents of my journal: term: 20 -> ['TIMESTAMPED_LEASE_ACQUIRE' 'master_lease' 'eos-mgm-0.eos-mgm.eos.svc.kermes-dev.local:1094' '10000' 'G�C']. Contents of what the leader sent: term: 21 -> ['JOURNAL_LEADERSHIP_MARKER' '21' 'eos-qdb-1.eos-qdb.eos.svc.kermes-dev.local:7777']
terminate called after throwing an instance of 'quarkdb::FatalException'
what(): detected inconsistent entries for index 3021686. Leader attempted to overwrite a committed entry with one with different contents. ----- Stack trace (most recent call last) in thread 11:
#13 Object ", at 0xffffffffffffffff, in
#12 Object ", at 0x7f01c5bdb96c, in
#11 Object ", at 0x7f01c5eb2ea4, in
#10 Object ", at 0x7f01c6d4e206, in
#9 Object ", at 0x7f01c6dbe338, in
#8 Object ", at 0x7f01c6dbe216, in
#7 Object ", at 0x7f01c6dbaeac, in
#6 Object ", at 0x7f01c15543a3, in
#5 Object ", at 0x7f01c156bb41, in
#4 Object ", at 0x7f01c158d14f, in
#3 Object ", at 0x7f01c15945df, in
#2 Object ", at 0x7f01c15ff984, in
#1 Object ", at 0x7f01c15fcd6d, in
#0 Object ", at 0x7f01c1558428, in
Stack trace (most recent call last) in thread 11:
#18 Object ", at 0xffffffffffffffff, in
#17 Object ", at 0x7f01c5bdb96c, in
#16 Object ", at 0x7f01c5eb2ea4, in
#15 Object ", at 0x7f01c6d4e206, in
#14 Object ", at 0x7f01c6dbe338, in
#13 Object ", at 0x7f01c6dbe216, in
#12 Object ", at 0x7f01c6dbaeac, in
#11 Object ", at 0x7f01c15543a3, in
#10 Object ", at 0x7f01c156bb41, in
#9 Object ", at 0x7f01c158d14f, in
#8 Object ", at 0x7f01c15945df, in
#7 Object ", at 0x7f01c15ff984, in
#6 Object ", at 0x7f01c15fcdde, in
#5 Object ", at 0x7f01c663dc52, in
#4 Object ", at 0x7f01c663da32, in
#3 Object ", at 0x7f01c663da05, in
#2 Object ", at 0x7f01c663fa94, in
#1 Object ", at 0x7f01c5b14a77, in
#0 Object ", at 0x7f01c5b13387, in
Aborted (Signal sent by tkill() 1 0)
command terminated with exit code 137
The other two members are up and running, so presumably their version of the data can be considered correct. How can I repair the inconsistent data on this node?
I tried to find documentation on how to view and delete keys, but all I found was:
https://quarkdb.web.cern.ch/quarkdb/docs/master/ref/hash
Nothing I try with get or hgetall seems to exist, and I can’t find a way to list all keys.
Even if I had such a command, I’m not sure how I would apply it, because the inconsistent member crashes immediately. Can I tell the leader to overwrite the inconsistent data on the corrupted member?
Thanks.