When I added the server as an observer using “raft-add-observer”, a directory called /var/lib/quarkdb/node-1/resilvering-arena is created rather than /var/lib/quarkdb/node-1/temp-snapshots. I observed the following error message in the leader logs (also seen in the observer logs).
[1672911477810] INFO: Creating directory: /var/lib/quarkdb/node-1/temp-snapshots/d02e3c40-5580-46f6-a8da-5c013e695954
[1672911477811] CRITICAL: cannot create state machine checkpoint in /var/lib/quarkdb/node-1/temp-snapshots/d02e3c40-5580-46f6-a8da-5c013e695954/state-machine: IO error: while link file to /var/lib/quarkdb/node-1/temp-snapshots/d02e3c40-5580-46f6-a8da-5c013e695954/state-machine.tmp/011735.sst: /var/lib/quarkdb/node-1/current/state-machine/011735.sst: Operation not permitted
----- Stack trace (most recent call last) in thread 52409:
#6 Object ", at 0xffffffffffffffff, in
#5 Object ", at 0x7f3cd7e5d98c, in
#4 Object ", at 0x7f3cd8b5cea4, in
#3 Object ", at 0x7f3cd7093bfe, in
#2 Object ", at 0x7f3cd6ca86a4, in
#1 Object ", at 0x7f3cd6d14e39, in
#0 Object ", at 0x7f3cd6cd4dd8, in
----- The above stacktrace does NOT signify a crash! It’s used to show the location of a serious error.
[1672911477811] CRITICAL: Attempt to resilver antares-eos99:9999 has failed: Could not create snapshot:
----- Stack trace (most recent call last) in thread 52409:
The following output is from the raft-info command on the observer and leader nodes, for more clarity.
It seems that the user under which your quarkdb service runs is not allowed to create directories under the following path: /var/lib/quarkdb/node-1/
Could you double-check that the permissions are correct for the given user and then retry?
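One quick way to test this (a hedged sketch: the helper name is ours, not a QuarkDB tool, and the path is the one from this thread) is to try creating a directory under the node path as the service user:

```shell
# can_create_dir: small helper (name is ours) that reports whether the
# current user can create a directory under the given parent path.
can_create_dir() {
  parent=$1
  probe="$parent/.qdb-perm-probe.$$"
  if mkdir "$probe" 2>/dev/null; then
    rmdir "$probe"
    echo yes
  else
    echo no
  fi
}

# Example: run this in a shell owned by the user the QuarkDB daemon
# runs as (assumed 'daemon' based on the ls output in this thread):
# can_create_dir /var/lib/quarkdb/node-1
```

If this prints "no" for the service user, the resilvering snapshot directory cannot be created and you would see exactly this kind of failure.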
User-level permissions and ownership have been set properly. The working server has similar permissions. I am sharing the output of the folder listing for your reference.
Problematic Server:
[root@antares-eos99 node-1]# ls -ltr
total 368
-rw-r--r-- 1 daemon daemon 7 Jan 6 09:56 SHARD-ID
-rw-r--r-- 1 daemon daemon 21 Jan 6 09:56 RESILVERING-HISTORY
drwxr-xr-x 4 daemon daemon 47 Jan 6 09:58 current
drwxr-xr-x 5898 daemon daemon 290816 Jan 6 10:50 resilvering-arena
[root@antares-eos99 node-1]# ls -ld /var/lib/quarkdb/node-1/
drwxr-xr-x 4 daemon daemon 89 Jan 6 10:01 /var/lib/quarkdb/node-1/
[root@antares-eos99 node-1]#
Working Server:
[root@antares-eos14 node-1]# ls -ltr
total 28
-rw-r--r-- 1 daemon daemon 7 Feb 23 2022 SHARD-ID
-rw-r--r-- 1 daemon daemon 21 Feb 23 2022 RESILVERING-HISTORY
drwxr-xr-x 4 daemon daemon 47 Feb 23 2022 current
drwxr-xr-x 300 daemon daemon 16384 Jan 3 10:23 temp-snapshots
[root@antares-eos14 node-1]# ls -ld /var/lib/quarkdb/node-1/
drwxr-xr-x 4 daemon daemon 86 Jan 3 10:18 /var/lib/quarkdb/node-1/
[root@antares-eos14 node-1]#
The difference we observed is the “resilvering-arena” folder, which appears on the problematic server antares-eos99 (the working server antares-eos14 has “temp-snapshots” instead).
Thanks for the info. Unfortunately, I can’t immediately see why you get the “Operation not permitted” error. Could you paste the configuration of your QuarkDB XRootD daemons from /etc/xrootd/xrootd-quarkdb.cfg or similar?
Also, please run: ps aux | grep xrootd on that machine.
Can you explain in a few words the initial state? You were running in raft mode with a single QDB and now you want to add a couple more for redundancy? Were you running in standalone mode at any point in the past?
Our initial aim was to add a newly re-installed node to a 2-node QuarkDB cluster by following the procedure outlined in Membership updates - QuarkDB Documentation.
Because we didn’t want to try this procedure for the first time in production, we first “emulated” it on our dev instance with a working 3-node cluster: we removed a node (antares-eos99) from the cluster, completely deleted the /var/lib/quarkdb/node-1/ directory, and then tried to add it back into the cluster.
In particular we followed the below steps:
On observer antares-eos99: stopped the QuarkDB service, then
On leader : “redis-cli -p 9999 raft-remove-member antares-eos99:9999”
On Observer antares-eos99: rm -rf /var/lib/quarkdb/node-1
On Observer antares-eos99: quarkdb-create --path /var/lib/quarkdb/node-1 --clusterID 0123456789
On Observer antares-eos99 : chown -R daemon:daemon /var/lib/quarkdb/node-1
On observer: started the service with “systemctl start eos@quarkdb”
On Leader : redis-cli -p 9999 raft-add-observer antares-eos99:9999
On Leader : redis-cli -p 9999 raft-promote-observer antares-eos99:9999
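For reference, the steps above can be collected into one sketch (assembled purely from the commands already listed in this thread; note the commands run on two different machines, so this is illustrative rather than a single runnable script, and host names, port, and clusterID are the ones used above):

```shell
# Sketch of the node re-addition procedure described above.
# Commands are taken verbatim from this thread; they alternate between
# the observer (antares-eos99) and the leader, so do not run blindly.
readd_node() {
  # On the observer antares-eos99:
  systemctl stop eos@quarkdb
  # On the leader:
  redis-cli -p 9999 raft-remove-member antares-eos99:9999
  # Back on the observer: wipe and recreate the node directory
  rm -rf /var/lib/quarkdb/node-1
  quarkdb-create --path /var/lib/quarkdb/node-1 --clusterID 0123456789
  chown -R daemon:daemon /var/lib/quarkdb/node-1
  systemctl start eos@quarkdb
  # On the leader: re-add and promote the node
  redis-cli -p 9999 raft-add-observer antares-eos99:9999
  redis-cli -p 9999 raft-promote-observer antares-eos99:9999
}
```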
Given that you are running QuarkDB 0.4.2 (which is quite old), I would suggest updating to eos-quarkdb-5.1.8 and then retrying the procedure. Several bugs related to this operation were fixed in QuarkDB in the meantime, and the hope is this will also cover your issue.
Our production environment has EOS version 4 (4.8.88) and CTA version 4 (4.7.12). To upgrade to eos-quarkdb-5.1.8 we would have to upgrade the other EOS packages to version 5 too. We will consider this solution during the upgrade activity. Thanks.
We completely agree that eos-quarkdb version 5.1.8 has several bug fixes. Planning the upgrade activity may take time; meanwhile, is there any other solution or fix we can try for the issue reported with the older version of QuarkDB?
The only option is to do a “manual” resilvering. This will still require downtime, as you need to stop your main QDB instance during the process. For this, you create a new QDB instance using the same clusterID and, after stopping the main QDB, copy over the following sub-trees: /var/lib/quarkdb/current/state-machine and /var/lib/quarkdb/current/raft-journal.
Once this is done, you can restart your main QDB instance and add the new QDB as an observer. It should now be able to follow updates from the main instance.
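The copy step can be sketched as follows (hedged: the function name is ours, the paths are placeholders, and both the source and destination instances must be stopped while copying):

```shell
# copy_qdb_state: copies the two sub-trees named above from a stopped
# source QDB node directory into a freshly created one, preserving
# permissions and timestamps (cp -a).
copy_qdb_state() {
  src=$1   # e.g. /var/lib/quarkdb/node-1 on the main (stopped) instance
  dst=$2   # e.g. /var/lib/quarkdb/node-1 on the new node
  mkdir -p "$dst/current" || return 1
  cp -a "$src/current/state-machine" "$dst/current/" || return 1
  cp -a "$src/current/raft-journal" "$dst/current/" || return 1
}

# Afterwards, make sure ownership matches the service user, e.g.:
# chown -R daemon:daemon /var/lib/quarkdb/node-1
```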
Hi Elvin
Thanks for the suggestion. Could you please confirm whether we have to take a backup by checkpointing (Backup & restore - QuarkDB Documentation), then delete the QuarkDB contents, then create a new instance with the clusterID, and then restore the data?
Thanks and Regards,
Maha
Yes, you should in any case do a regular backup of the QDB instance once every couple of days and move it to some other location, so that you have it for disaster recovery. Especially when performing such a procedure, a backup is recommended.
Once you have the backup, you don’t need to delete anything. You just create a new QDB from scratch on a different machine (with the same cluster ID), stop (very important) the main QDB instance, and copy the contents I mentioned before.
The problem was actually that there was some incorrect user ownership in the RocksDBs on the remaining QuarkDB nodes, which prevented them from carrying out the resilvering (or taking a checkpoint when trying your suggestion). After correcting the RocksDBs’ ownership, the normal node-addition procedure (Membership updates - QuarkDB Documentation) worked as expected.
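For anyone hitting the same symptom: the failing log line above is a link() call, and RocksDB checkpoints work by hard-linking the existing SST files, so a single file owned by the wrong user can make the whole snapshot fail with “Operation not permitted” (for example when the fs.protected_hardlinks sysctl is enabled). A quick ownership audit can be sketched like this (hedged: the helper name is ours, and the path and user are the ones from this thread):

```shell
# find_mis_owned: lists files under a QuarkDB node path that are NOT
# owned by the expected service user (helper name is ours).
find_mis_owned() {
  path=$1
  user=$2
  find "$path" ! -user "$user"
}

# Example on the nodes from this thread:
# find_mis_owned /var/lib/quarkdb/node-1 daemon
# Fix anything listed with: chown -R daemon:daemon /var/lib/quarkdb/node-1
```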
What was confusing was that, from the raft-info output, the node looked like it had finished resilvering while in fact it hadn’t.
Once we understood the process from the corresponding lines in xrdlog.quarkdb, the whole thing became a lot clearer.