We should use the command ‘quarkdb-checkpoint’ to create a checkpoint/snapshot that can then be synced. That tool does not seem to be part of the rpm package.
Installed Packages
Name : quarkdb
Arch : x86_64
Version : 0.4.2
Release : 1.el7.cern
However, /bin/quarkdb-validate-checkpoint is there. What am I missing?
Crystal is correct, raft-checkpoint will work. quarkdb-checkpoint is a command aliased to raft-checkpoint; they do exactly the same thing. It’s a redis command that you send to the QuarkDB node (for example with redis-cli), not a standalone tool. I’m adding a clarification in the docs.
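A minimal sketch of what that looks like in practice: the checkpoint is requested through redis-cli rather than through a separate binary. Port 9999 and the target path are borrowed from the backup script posted further down in this thread; the make_checkpoint wrapper and the REDIS_CLI / QDB_PORT variables are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Hypothetical wrapper: ask QuarkDB to write a checkpoint via its redis interface.
# REDIS_CLI and QDB_PORT are illustrative knobs, not part of QuarkDB itself.
make_checkpoint() {
    local target="$1"
    # raft-checkpoint and quarkdb-checkpoint are aliases, so either name should work.
    "${REDIS_CLI:-redis-cli}" -p "${QDB_PORT:-9999}" raft-checkpoint "${target}"
}

# Example (assumed path): make_checkpoint "/srv/metadata/backup_$(date +%s)"
```

The target directory is created on the node that receives the command, so the path has to be writable by the QuarkDB process there.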
maybe useful for others: we’re using this as a backup now:
we run this on all quarkdb nodes in the cluster, but the script will only execute the backup on the leader.
roughly it does:
create checkpoint
calculate checkpoint size
validate checkpoint - and bail if validation fails
rotate away previous backup run (the target is a snapshotted NFS mount)
rsync current backup
and write some prometheus metrics
for sure it’s not perfect, but we’ll start with that.
comments welcome.
#!/bin/bash
RAFT_STATUS=$(redis-cli -p 9999 raft-info | grep 'STATUS')
BACKUP_PATH=/mnt/meta_backup/eos_ns_metadata
START_TIME=$(date -Iseconds)
echo "##################################################"
echo "now is ${START_TIME}"
echo "raft status: $RAFT_STATUS"
PROM_METRIC_TIME="# HELP eos_backup_metadata_timestamp time of last successful eos NS metadata backup
# TYPE eos_backup_metadata_timestamp gauge
eos_backup_metadata_timestamp"
PROM_METRIC_BYTES="# HELP eos_backup_metadata_bytes size of eos NS metadata backup
# TYPE eos_backup_metadata_bytes gauge
eos_backup_metadata_bytes"
if [[ "$RAFT_STATUS" == "STATUS LEADER" ]] ; then
    echo "starting backup"
    STAMP=$(date +%s)
    CHECKPOINT_PATH=/srv/metadata/backup_${STAMP}
    echo "writing snapshot..."
    redis-cli -p 9999 raft-checkpoint "${CHECKPOINT_PATH}"
    echo "showing checkpoint size"
    du -sch "${CHECKPOINT_PATH}"
    BACKUP_BYTES=$(du -s --bytes "${CHECKPOINT_PATH}" | awk '{print $1;}')
    echo "validating checkpoint"
    quarkdb-validate-checkpoint --path "${CHECKPOINT_PATH}" --eos 2>&1
    VALIDATE_STATUS=$?
    if [[ "$VALIDATE_STATUS" -ne 0 ]]; then
        echo "=== SNAPSHOT VALIDATION FAILED, BAILING ==="
        exit 1
    fi
    echo "rotate old backup"
    rm -rf "${BACKUP_PATH}.old"
    # on the very first run there is no previous backup to rotate
    if [[ -e "${BACKUP_PATH}" ]]; then
        mv "${BACKUP_PATH}" "${BACKUP_PATH}.old"
    fi
    echo "rsync snapshot"
    # bail before touching the metrics, so the "last successful" timestamp stays honest
    if ! rsync -at "${CHECKPOINT_PATH}/" "${BACKUP_PATH}"; then
        echo "=== RSYNC FAILED, BAILING ==="
        exit 1
    fi
    echo "cleaning up checkpoint"
    rm -rf "${CHECKPOINT_PATH}"
    echo "update prometheus metric"
    echo "${PROM_METRIC_TIME} $(date +%s)" > /opt/prometheus_data/eos_backup.prom
    echo "${PROM_METRIC_BYTES} ${BACKUP_BYTES}" >> /opt/prometheus_data/eos_backup.prom
    FINISH_TIME=$(date -Iseconds)
    echo "completion time: ${FINISH_TIME}"
    echo "backup complete, done."
    exit 0
else
    echo "not leader, done."
    exit 0
fi
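One possible refinement, sketched under assumptions: the script writes the .prom file with two separate redirections, so whatever reads it could catch a half-written file. If the file is picked up by something like the node_exporter textfile collector (an assumption, the post does not say), writing to a temporary file and renaming it into place avoids that. The function name write_backup_metrics is hypothetical; the metric names are the ones from the script. Both metrics are declared as gauges here, which is the usual Prometheus type for a "last success" timestamp.

```shell
#!/bin/bash
# Hypothetical helper: write both backup metrics to a temp file, then rename it
# into place so a reader never sees a partially written file.
write_backup_metrics() {
    local prom_file="$1" timestamp="$2" bytes="$3"
    local tmp_file
    # Create the temp file next to the target so the final mv is a plain rename.
    tmp_file=$(mktemp "${prom_file}.XXXXXX") || return 1
    cat > "${tmp_file}" <<EOF
# HELP eos_backup_metadata_timestamp time of last successful eos NS metadata backup
# TYPE eos_backup_metadata_timestamp gauge
eos_backup_metadata_timestamp ${timestamp}
# HELP eos_backup_metadata_bytes size of eos NS metadata backup
# TYPE eos_backup_metadata_bytes gauge
eos_backup_metadata_bytes ${bytes}
EOF
    # mktemp creates the file owner-only; the scraper may run as another user.
    chmod 644 "${tmp_file}"
    mv "${tmp_file}" "${prom_file}"
}

# Example: write_backup_metrics /opt/prometheus_data/eos_backup.prom "$(date +%s)" "${BACKUP_BYTES}"
```

With this in place, the two echo lines at the end of the backup would collapse into a single call.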
Based on my testing, raft-checkpoint can run on any member of the cluster. However, if you run it on a follower that is still catching up to the leader, I suppose you will get a RAFT journal checkpoint that is valid as of some time in the recent past? (That becomes true of any checkpoint as soon as you finish writing it to disk anyway, so it doesn’t seem like an issue to me.)
I can see how it might nevertheless be preferable to run it on the leader though.
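If the backup should stay leader-only, the check can be isolated into a small function. This is only a sketch: it assumes raft-info prints a line that reads exactly STATUS LEADER, which is what the script above compares against, and is_leader is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the raft-info text on stdin reports this
# node as leader. Reading from stdin keeps it testable without a running server.
is_leader() {
    # -x matches the whole line, so lines that merely contain the word do not count.
    grep -qx 'STATUS LEADER'
}

# Intended use on a node (port 9999 as in the script above):
#   if redis-cli -p 9999 raft-info | is_leader; then
#       echo "leader, running backup"
#   fi
```

Dropping the check would let the same script run on a follower, at the cost of a checkpoint that may lag slightly behind the leader.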