QuarkDB backup: missing quarkdb-checkpoint cmd

Dear all,

We’re running an EOS instance with 3 MGMs and the namespace (ns) in QuarkDB.
According to the QuarkDB docs (https://quarkdb.web.cern.ch/quarkdb/docs/master/backup/), we should use the command ‘quarkdb-checkpoint’ to create a checkpoint/snapshot that can then be synced. That tool does not seem to be part of the RPM package:
Installed Packages
Name : quarkdb
Arch : x86_64
Version : 0.4.2
Release : 1.el7.cern

However, /bin/quarkdb-validate-checkpoint is there. What am I missing?

Best,
Erich

Hi, we currently run the checkpoint command from redis-cli, e.g.:
redis-cli -p ${QDB_PORT} raft-checkpoint ${BACKUP_PATH}

From memory, it also needs to be run on the QDB master node.
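
A minimal sketch of that leader-only guard, assuming "master" means the raft leader and that raft-info prints a line of the form "STATUS <ROLE>" (e.g. "STATUS LEADER"). The helper names raft_role and checkpoint_if_leader are made up for illustration; the port and target directory are placeholders, as in the command above.

```shell
# Print the raft role (e.g. LEADER) from raft-info output read on stdin.
# Assumption: the output contains a line "STATUS <ROLE>".
raft_role() {
  tr -d '\r' | awk '$1 == "STATUS" { print $2; exit }'
}

# Create a checkpoint only if this node reports itself as leader.
# Hypothetical helper: $1 = QuarkDB port, $2 = checkpoint target directory.
checkpoint_if_leader() {
  role=$(redis-cli -p "$1" raft-info | raft_role)
  if [ "$role" = "LEADER" ]; then
    redis-cli -p "$1" raft-checkpoint "$2"
  else
    echo "role is '${role:-unknown}', skipping checkpoint"
  fi
}

# Usage: checkpoint_if_leader "${QDB_PORT}" "${BACKUP_PATH}"
```

Keeping the parsing in its own function means it can be tried against saved raft-info output without touching a live node.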


Thanks, I’ll give this a try!

Crystal is correct: raft-checkpoint will work. quarkdb-checkpoint is a command aliased to raft-checkpoint; they do exactly the same thing. It’s a redis command, not a tool. I’m adding a clarification in the docs :)

quarkdb-validate-checkpoint is indeed a tool.

Cheers,
Georgios

Thanks, @crystal and @gbitzes. I just ran
redis-cli -p 9999 raft-checkpoint /srv/metadata/first_backup
and it works like a charm :)
Best,
Erich

Maybe useful for others: we’re using the script below for backups now.
We run it on all QuarkDB nodes in the cluster, but the script only executes the backup on the leader.

roughly it does:

  • create checkpoint
  • calculate checkpoint size
  • validate checkpoint - and bail if validation fails
  • rotate away previous backup run (the target is a snapshotted NFS mount)
  • rsync current backup
  • and write some prometheus metrics

for sure it’s not perfect, but we’ll start with that.
comments welcome.


#!/bin/bash
RAFT_STATUS=$(redis-cli -p 9999 raft-info | grep 'STATUS')
BACKUP_PATH=/mnt/meta_backup/eos_ns_metadata

START_TIME=$(date -Iseconds)
echo "##################################################"
echo "now is ${START_TIME}"
echo "raft status: $RAFT_STATUS"

PROM_METRIC_TIME="# HELP eos_backup_metadata_timestamp time of last successful eos NS metadata backup
# TYPE eos_backup_metadata_timestamp gauge
eos_backup_metadata_timestamp"

PROM_METRIC_BYTES="# HELP eos_backup_metadata_bytes size of eos NS metadata backup
# TYPE eos_backup_metadata_bytes gauge
eos_backup_metadata_bytes"

if [[ "$RAFT_STATUS" == "STATUS LEADER" ]] ; then
  echo "starting backup"
  STAMP=$(date +%s)
  CHECKPOINT_PATH=/srv/metadata/backup_${STAMP}
  echo "writing snapshot..."
  redis-cli -p 9999 raft-checkpoint "${CHECKPOINT_PATH}"
  echo "showing checkpoint size"
  du -sch "${CHECKPOINT_PATH}"
  BACKUP_BYTES=$(du -s --bytes "${CHECKPOINT_PATH}" | awk '{print $1}')
  echo "validating checkpoint"
  quarkdb-validate-checkpoint --path "${CHECKPOINT_PATH}" --eos 2>&1
  VALIDATE_STATUS=$?
  if [[ "$VALIDATE_STATUS" -ne 0 ]]; then
    echo "=== SNAPSHOT VALIDATION FAILED, BAILING ==="
    exit 1
  fi
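  echo "rotate old backup"
  rm -rf "${BACKUP_PATH}.old"
  # On the first run there is no previous backup to rotate.
  if [[ -e "${BACKUP_PATH}" ]]; then
    mv "${BACKUP_PATH}" "${BACKUP_PATH}.old"
  fi
  echo "rsync snapshot"
  # -a already implies -t; stop before cleanup and metrics if the copy fails.
  if ! rsync -a "${CHECKPOINT_PATH}/" "${BACKUP_PATH}"; then
    echo "=== RSYNC FAILED, BAILING ==="
    exit 1
  fi
  echo "cleaning up checkpoint"
  rm -rf "${CHECKPOINT_PATH}"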
  echo "update prometheus metric"
  echo "${PROM_METRIC_TIME} $(date +%s)" > /opt/prometheus_data/eos_backup.prom
  echo "${PROM_METRIC_BYTES} ${BACKUP_BYTES}" >> /opt/prometheus_data/eos_backup.prom
  FINISH_TIME=$(date -Iseconds)
  echo "completion time: ${FINISH_TIME}"
  echo "backup complete, done."
  exit 0
else
  echo "not leader, done."
  exit 0
fi

exit 0
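
One possible refinement of the metrics step, as a sketch only: write both metrics to a temporary file and rename it into place, so a scraper never reads a half-written file. This assumes /opt/prometheus_data is read by something like the node_exporter textfile collector, which the thread does not state. The function name write_backup_metrics and its arguments are made up for illustration, and the timestamp is declared as a gauge here, the usual type for a "last success" time.

```shell
# Hypothetical helper: $1 = target .prom file, $2 = unix timestamp, $3 = size in bytes.
write_backup_metrics() {
  prom_file="$1"; backup_ts="$2"; backup_bytes="$3"
  # The temp file lives in the same directory, so the final mv is a rename.
  # Its name does not end in .prom.
  tmp_file="${prom_file}.$$"
  {
    echo "# HELP eos_backup_metadata_timestamp time of last successful eos NS metadata backup"
    echo "# TYPE eos_backup_metadata_timestamp gauge"
    echo "eos_backup_metadata_timestamp ${backup_ts}"
    echo "# HELP eos_backup_metadata_bytes size of eos NS metadata backup"
    echo "# TYPE eos_backup_metadata_bytes gauge"
    echo "eos_backup_metadata_bytes ${backup_bytes}"
  } > "${tmp_file}" && mv "${tmp_file}" "${prom_file}"
}

# Usage, with the path from the script above:
# write_backup_metrics /opt/prometheus_data/eos_backup.prom "$(date +%s)" "${BACKUP_BYTES}"
```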

Based on my testing, raft-checkpoint can run on any member of the cluster. However, if you run it on a follower that is still catching up to the leader, I suppose you will get a raft journal checkpoint that is valid as of some time in the recent past. (That becomes true of any checkpoint as soon as you finish writing it to disk anyway, so it doesn’t seem like an issue to me.)

I can see how it might nevertheless be preferable to run it on the leader though.