QuarkDB recovery from checkpoint

georgep · August 8, 2022, 9:54am

Hello,

I am trying to understand the procedure of restoring QuarkDB from a checkpoint as per
https://quarkdb.web.cern.ch/quarkdb/docs/master/backup/

We create daily qdb checkpoints on all three qdb nodes and we archive them on tape.

Does running quarkdb-recovery on the checkpoint from a particular qdb node mean that we restore that node? Is this what the “How to restore” section in the above link shows? I am not sure I understand the command’s syntax, though: what are the --path and --command flags for?

In an extreme disaster scenario that includes a total EOS namespace loss, what do we need to do to recreate all three qdb nodes?

Many thanks,

George

esindril · August 10, 2022, 7:25am

Hi George,

I think there might be some misunderstanding about what the QuarkDB cluster is doing and how it works. When you have a cluster with raft enabled, all three (or whatever number of) nodes hold the same information. Therefore, it does not make sense to back up each of them: a backup from the current leader is already good enough.

As highlighted in the documentation link that you pointed to, restoring works by creating an entirely new cluster from a checkpoint. The quarkdb-recovery command just helps you do this and also update the cluster members, which will most likely have different hostnames. It can also happen that you restore a checkpoint on the exact same machines; in this case you just need to make sure you start from a clean setup, and you don’t need to update the hostnames of the machines/clusterID in the cluster. To restore the cluster, you recover each of the nodes from the same checkpoint and then start the cluster.

Hope this helps,
Elvin

georgep · August 19, 2022, 2:43pm

Hi Elvin,

Many thanks for clarifying, and apologies for the confusion. OK, we can use the checkpoint created on the leader to re-create the info (SSTs) on all three quarkdb nodes: we just need to make sure to delete the original contents of /var/lib/quarkdb/ and then copy the info extracted from the checkpoint into this directory. Is this correct?

George

esindril · August 22, 2022, 6:43am

Hi George,

Yes, exactly.

Cheers,
Elvin
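
For reference, the step confirmed above (clear the old contents of the QuarkDB data directory, then copy in the contents extracted from the checkpoint) could be sketched roughly as follows. This is only an illustration under assumptions: the function name replace_data_dir and the checkpoint path in the usage comment are hypothetical, the QuarkDB service is assumed to be stopped on the node beforehand, and the sketch does not cover the quarkdb-recovery step that updates the cluster members when hostnames change.

```python
import shutil
from pathlib import Path
from typing import List


def replace_data_dir(checkpoint_dir: str, data_dir: str) -> List[str]:
    """Delete everything inside data_dir, then copy checkpoint_dir's contents into it.

    Hypothetical helper for illustration; run it only while QuarkDB is stopped.
    Returns the sorted names of the top-level entries now present in data_dir.
    """
    src = Path(checkpoint_dir).resolve()
    dst = Path(data_dir).resolve()

    if not src.is_dir():
        raise FileNotFoundError(f"checkpoint directory not found: {src}")
    # Refuse to proceed if the checkpoint lives inside the directory we are
    # about to wipe, since that would destroy the checkpoint itself.
    if src == dst or dst in src.parents:
        raise ValueError("checkpoint must not be located inside the data directory")

    dst.mkdir(parents=True, exist_ok=True)

    # Start from a clean setup: remove the original contents.
    for entry in dst.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()

    # Copy the extracted checkpoint contents into the now-empty directory.
    shutil.copytree(src, dst, dirs_exist_ok=True)

    return sorted(p.name for p in dst.iterdir())


# Example call (paths are assumptions, repeat on each of the three nodes):
# replace_data_dir("/restore/qdb-checkpoint", "/var/lib/quarkdb")
```

The same checkpoint would be used on every node, and the cluster started only after all nodes have been populated. File ownership and permissions may also need to be adjusted afterwards, depending on how the checkpoint was extracted.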
