QuarkDB membership problem

Hi,

I was trying to add a server host as a cluster member using the steps described at the following link:
https://quarkdb.web.cern.ch/quarkdb/docs/master/membership/

When I added the server as an observer using “raft-add-observer”, a directory called /var/lib/quarkdb/node-1/resilvering-arena was created rather than /var/lib/quarkdb/node-1/temp-snapshots. I observed the following error message in the leader logs (it also appears in the observer logs).

----- The above stacktrace does NOT signify a crash! It’s used to show the location of a serious error.
[1672911477810] INFO: Creating directory: /var/lib/quarkdb/node-1/temp-snapshots/d02e3c40-5580-46f6-a8da-5c013e695954
[1672911477811] CRITICAL: cannot create state machine checkpoint in /var/lib/quarkdb/node-1/temp-snapshots/d02e3c40-5580-46f6-a8da-5c013e695954/state-machine: IO error: while link file to /var/lib/quarkdb/node-1/temp-snapshots/d02e3c40-5580-46f6-a8da-5c013e695954/state-machine.tmp/011735.sst: /var/lib/quarkdb/node-1/current/state-machine/011735.sst: Operation not permitted
----- Stack trace (most recent call last) in thread 52409:
#6 Object ", at 0xffffffffffffffff, in
#5 Object ", at 0x7f3cd7e5d98c, in
#4 Object ", at 0x7f3cd8b5cea4, in
#3 Object ", at 0x7f3cd7093bfe, in
#2 Object ", at 0x7f3cd6ca86a4, in
#1 Object ", at 0x7f3cd6d14e39, in
#0 Object ", at 0x7f3cd6cd4dd8, in
----- The above stacktrace does NOT signify a crash! It’s used to show the location of a serious error.
[1672911477811] CRITICAL: Attempt to resilver antares-eos99:9999 has failed: Could not create snapshot:
----- Stack trace (most recent call last) in thread 52409:

For clarity, here is the raft-info output from the observer and leader nodes.

OBSERVER NODE:
[root@antares-eos99 resilvering-arena]# redis-cli -p 9999 raft-info

  1. TERM 2535
  2. LOG-START 0
  3. LOG-SIZE 1
  4. LEADER antares-eos98:9999
  5. CLUSTER-ID 0123456789
  6. COMMIT-INDEX 0
  7. LAST-APPLIED 0
  8. BLOCKED-WRITES 0
  9. LAST-STATE-CHANGE 479 (7 minutes, 59 seconds)

  10. MYSELF antares-eos99:9999
  11. VERSION 0.4.2
  12. STATUS FOLLOWER
  13. NODE-HEALTH GREEN
  14. JOURNAL-FSYNC-POLICY sync-important-updates

  15. MEMBERSHIP-EPOCH 0
  16. NODES #!^NULL-HOSTNAME^!#:0
  17. OBSERVERS
  18. QUORUM-SIZE 1

LEADER NODE:

  1. TERM 2535
  2. LOG-START 22000000
  3. LOG-SIZE 72424684
  4. LEADER antares-eos98:9999
  5. CLUSTER-ID 0123456789
  6. COMMIT-INDEX 72424683
  7. LAST-APPLIED 72424683
  8. BLOCKED-WRITES 0
  9. LAST-STATE-CHANGE 87406 (1 days, 16 minutes, 46 seconds)

  10. MYSELF antares-eos98:9999
  11. VERSION 0.4.2
  12. STATUS LEADER
  13. NODE-HEALTH GREEN
  14. JOURNAL-FSYNC-POLICY sync-important-updates

  15. MEMBERSHIP-EPOCH 72424678
  16. NODES antares-eos14:9999,antares-eos98:9999,antares-eos99:9999
  17. OBSERVERS
  18. QUORUM-SIZE 2

  19. REPLICA antares-eos14:9999 | ONLINE | UP-TO-DATE | NEXT-INDEX 72424684 | VERSION 0.4.2
  20. REPLICA antares-eos99:9999 | ONLINE | UP-TO-DATE | NEXT-INDEX 72424684 | VERSION 0.4.2

Kindly share your input on how to get this server host synced with the leader host.

Thank you!

Thanks and Regards,
Maha

Hi Maha,

It seems that the user under which your QuarkDB service runs is not allowed to create directories under the following path: /var/lib/quarkdb/node-1/
Could you double-check that the permissions are correct for that user and then retry?

Thanks,
Elvin

Hi Elvin,

User-level permissions and ownership have been set properly. The working server has the same permissions. I am sharing the directory listings for your reference.

Problematic Server:
[root@antares-eos99 node-1]# ls -ltr
total 368
-rw-r--r-- 1 daemon daemon 7 Jan 6 09:56 SHARD-ID
-rw-r--r-- 1 daemon daemon 21 Jan 6 09:56 RESILVERING-HISTORY
drwxr-xr-x 4 daemon daemon 47 Jan 6 09:58 current
drwxr-xr-x 5898 daemon daemon 290816 Jan 6 10:50 resilvering-arena

[root@antares-eos99 node-1]# ls -ld /var/lib/quarkdb/node-1/
drwxr-xr-x 4 daemon daemon 89 Jan 6 10:01 /var/lib/quarkdb/node-1/
[root@antares-eos99 node-1]#

Working Server:
[root@antares-eos14 node-1]# ls -ltr
total 28
-rw-r--r-- 1 daemon daemon 7 Feb 23 2022 SHARD-ID
-rw-r--r-- 1 daemon daemon 21 Feb 23 2022 RESILVERING-HISTORY
drwxr-xr-x 4 daemon daemon 47 Feb 23 2022 current
drwxr-xr-x 300 daemon daemon 16384 Jan 3 10:23 temp-snapshots

[root@antares-eos14 node-1]# ls -ld /var/lib/quarkdb/node-1/
drwxr-xr-x 4 daemon daemon 86 Jan 3 10:18 /var/lib/quarkdb/node-1/
[root@antares-eos14 node-1]#

The only difference we observed is the “resilvering-arena” directory, which exists only on the problematic server, antares-eos99.

Thanks and Regards,
Maha

Hi Maha,

Thanks for the info. Unfortunately, I can’t immediately see why you get the “Operation not permitted” error. Could you paste the configuration of your QuarkDB XRootD daemons from /etc/xrootd/xrootd-quarkdb.cfg or similar?

Also, please run: ps aux | grep xrootd on that machine.

Can you explain in a few words the initial state? Were you running in raft mode with a single QDB, and now you want to add a couple more for redundancy? Were you running in standalone mode at any point in the past?

Thanks,
Elvin

Hi Elvin,

Please find the requested configuration file and the xrootd process list below.

PROBLEMATIC HOST:antares-eos99

CONFIGURATION FILE:
[root@antares-eos99 ~]# cat /etc/xrd.cf.quarkdb
xrd.port 9999
xrd.protocol redis:9999 libXrdQuarkDB.so

xrd.network keepalive

redis.mode raft
redis.database /var/lib/quarkdb/node-1

#redis.myself localhost:9999
redis.myself antares-eos99:9999

redis.password_file /etc/eos.keytab

PROCESS:

[root@antares-eos99 ~]# ps aux | grep xrootd
root 32845 0.0 0.0 112816 2220 pts/0 S+ 14:34 0:00 grep --color=auto xrootd
daemon 33578 1.5 0.0 6702812 67876 ? SLsl Jan09 48:13 /opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -Rdaemon
daemon 33602 0.0 0.0 437296 12848 ? S Jan09 0:37 /opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -Rdaemon

NOTE: We have stopped the QuarkDB process on the problematic host.

I will reply shortly to the questions about the initial state.

Thanks and Regards,
Maha

Hi Elvin,

Our initial aim was to add a newly re-installed node to a 2-node QuarkDB cluster by following the procedure outlined in Membership updates - QuarkDB Documentation.
Because we didn’t want to try this procedure for the first time in production, we first tried to “emulate” it on our dev instance, which has a working 3-node cluster. We did this by removing a node (antares-eos99) from the cluster, completely deleting its /var/lib/quarkdb/node-1/ directory, and then trying to add it back into the cluster.

In particular, we followed the steps below:

  1. On observer antares-eos99: stopped the QuarkDB service
  2. On leader: redis-cli -p 9999 raft-remove-member antares-eos99:9999
  3. On observer antares-eos99: rm -rf /var/lib/quarkdb/node-1
  4. On observer antares-eos99: quarkdb-create --path /var/lib/quarkdb/node-1 --clusterID 0123456789
  5. On observer antares-eos99: chown -R daemon:daemon /var/lib/quarkdb/node-1
  6. On observer antares-eos99: started the service with systemctl start eos@quarkdb
  7. On leader: redis-cli -p 9999 raft-add-observer antares-eos99:9999
  8. On leader: redis-cli -p 9999 raft-promote-observer antares-eos99:9999

Please let us know if you need more details.

Thanks and Regards,
Maha

Hi Maha,

Given that you are running QuarkDB 0.4.2 (which is quite old), I would suggest updating to eos-quarkdb-5.1.8 and then retrying the procedure. Several bugs related to this operation have been fixed in QuarkDB in the meantime, and the hope is that these fixes also cover your issue.

You can grab the rpms from this location (note this requires XRootD 5):
https://storage-ci.web.cern.ch/storage-ci/eos/diopside/tag/testing/el-7/x86_64/

Let me know how it goes.

Cheers,
Elvin

Hi Elvin,

Our production environment runs EOS version 4 (4.8.88) and CTA version 4 (4.7.12). To upgrade to eos-quarkdb-5.1.8 we would have to upgrade the other EOS packages to version 5 too. We will consider this solution during the upgrade activity. Thanks.
We completely agree that eos-quarkdb version 5.1.8 contains several bug fixes. Planning the upgrade will take time, though. In the meantime, is there any other solution or fix that we could try for the issue reported with the older version of QuarkDB?

Thanks and Regards,
Maha

Hi Maha,

The only option is to do a “manual” re-silvering. This will still require downtime, as you need to stop your main QDB instance during the process. For this, you create a new QDB instance using the same clusterID and, after stopping the main QDB, you copy over the following subtrees:
/var/lib/quarkdb/current/state-machine and /var/lib/quarkdb/current/raft-journal
Once this is done, you can restart your main QDB instance and add the new QDB as an observer. It should now be able to follow updates from the main instance.
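
The copy step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: copy_qdb_subtrees is a hypothetical helper (not a QuarkDB tool), both database directories are assumed to be reachable as local paths (for example via a staging directory), and the main QDB instance is assumed to be already stopped. Note that it deletes the two destination subtrees before copying, so the result is an exact copy rather than a merge.

```shell
# Hypothetical helper: copy the state-machine and raft-journal subtrees
# from one QuarkDB database directory to another.
# ASSUMPTION: the source QuarkDB instance is stopped before this runs.
# Usage: copy_qdb_subtrees SRC_DB_DIR DST_DB_DIR
copy_qdb_subtrees() {
    local src="$1" dst="$2" sub

    # Refuse to touch the destination unless both source subtrees exist.
    for sub in state-machine raft-journal; do
        if [ ! -d "$src/current/$sub" ]; then
            echo "missing source subtree: $src/current/$sub" >&2
            return 1
        fi
    done

    mkdir -p "$dst/current" || return 1

    for sub in state-machine raft-journal; do
        # Drop the freshly created (empty) subtree so cp does not nest
        # the copy inside an existing directory.
        rm -rf "${dst:?}/current/$sub"
        # -a keeps modes and timestamps (and ownership when run as root).
        cp -a "$src/current/$sub" "$dst/current/$sub" || return 1
    done
}
```

With the paths used in this thread, a call might look like copy_qdb_subtrees /var/lib/quarkdb/node-1 /path/to/staging/node-1, followed by a chown to the service user on the new node if ownership was not preserved.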

Cheers,
Elvin

Hi Elvin,

Thanks for the suggestion. Could you please confirm whether we have to take a backup by checkpointing (Backup & restore - QuarkDB Documentation), then delete the QuarkDB contents, then create a new instance with the clusterID, and then restore the data?

Thanks and Regards,
Maha

Hi Maha,

Yes, in any case you should take a regular backup of the QDB instance once every couple of days and move it to some other location for disaster recovery. A backup is especially recommended before performing a procedure like this one.

Once you have the backup, you don’t need to delete anything. You just create a new QDB from scratch on a different machine (with the same cluster ID), stop the main QDB instance (very important), and copy the contents I mentioned before.

Cheers,
Elvin

Hi Elvin,

The problem actually was that there was some incorrect user ownership in the RocksDBs on the remaining QuarkDB nodes, which prevented them from carrying out the re-silvering (or taking a checkpoint when trying your suggestion). After correcting the RocksDBs’ ownership, the normal node addition procedure (Membership updates - QuarkDB Documentation) worked as expected.
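
A quick way to spot this kind of ownership problem is to list every file under the database directory that is not owned by the service user. The sketch below is a minimal illustration; list_foreign_owned is a hypothetical helper name, and the daemon user is taken from the directory listings earlier in this thread. On Linux, one mechanism that can turn a foreign-owned file into “Operation not permitted” on a hard link is the fs.protected_hardlinks restriction, although the thread does not confirm that this was the exact trigger here.

```shell
# Hypothetical helper: print every path under DIR that is NOT owned by USER.
# USER may be a user name or a numeric uid (find accepts both).
# Empty output means everything under DIR belongs to USER.
# Usage: list_foreign_owned DIR USER
list_foreign_owned() {
    local dir="$1" user="$2"
    find "$dir" ! -user "$user" -print
}
```

For example, running list_foreign_owned /var/lib/quarkdb/node-1 daemon on each remaining node should print nothing once ownership is correct; any .sst file it lists is a candidate for the failed checkpoint.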

What was confusing was that from the raft info, the node looked like it had finished re-silvering while in fact it hadn’t.
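
One rough cross-check for this is to compare the LOG-SIZE that each node reports about itself, rather than relying on the leader's REPLICA lines alone. In the output posted earlier, the leader listed antares-eos99 as UP-TO-DATE while that node's own raft-info showed LOG-SIZE 1. The sketch below assumes raft-info text in the shape shown in this thread; raft_field and raft_log_lag are hypothetical helper names, and on a busy cluster a small non-zero lag is expected.

```shell
# Hypothetical helper: read raft-info text on stdin and print the value
# that follows KEY (works with or without the leading "N." / "N)" index).
# Usage: redis-cli -p 9999 raft-info | raft_field LOG-SIZE
raft_field() {
    awk -v key="$1" '{
        for (i = 1; i < NF; i++)
            if ($i == key) { print $(i + 1); exit }
    }'
}

# Hypothetical helper: print leader LOG-SIZE minus follower LOG-SIZE.
# Usage: raft_log_lag "LEADER_RAFT_INFO_TEXT" "FOLLOWER_RAFT_INFO_TEXT"
raft_log_lag() {
    local l f
    l=$(printf '%s\n' "$1" | raft_field LOG-SIZE)
    f=$(printf '%s\n' "$2" | raft_field LOG-SIZE)
    if [ -z "$l" ] || [ -z "$f" ]; then
        echo "LOG-SIZE not found in input" >&2
        return 1
    fi
    echo $(( l - f ))
}
```

Assuming both nodes are reachable from one shell, a call could look like raft_log_lag "$(redis-cli -h antares-eos98 -p 9999 raft-info)" "$(redis-cli -h antares-eos99 -p 9999 raft-info)". A lag close to the leader's full log size suggests the new node has not actually been resilvered.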

Once we understood the process from the corresponding lines in the xrdlog.quarkdb, the whole thing became a lot clearer…

Thanks again for your patience and support!

Maha, Tom, George