When I added the server as an observer using “raft-add-observer”, a directory called /var/lib/quarkdb/node-1/resilvering-arena was created rather than /var/lib/quarkdb/node-1/temp-snapshots. I observed the following error messages in the leader logs (also seen in the observer logs).
----- The above stacktrace does NOT signify a crash! It’s used to show the location of a serious error.
[1672911477810] INFO: Creating directory: /var/lib/quarkdb/node-1/temp-snapshots/d02e3c40-5580-46f6-a8da-5c013e695954
[1672911477811] CRITICAL: cannot create state machine checkpoint in /var/lib/quarkdb/node-1/temp-snapshots/d02e3c40-5580-46f6-a8da-5c013e695954/state-machine: IO error: while link file to /var/lib/quarkdb/node-1/temp-snapshots/d02e3c40-5580-46f6-a8da-5c013e695954/state-machine.tmp/011735.sst: /var/lib/quarkdb/node-1/current/state-machine/011735.sst: Operation not permitted ----- Stack trace (most recent call last) in thread 52409:
#6 Object ", at 0xffffffffffffffff, in
#5 Object ", at 0x7f3cd7e5d98c, in
#4 Object ", at 0x7f3cd8b5cea4, in
#3 Object ", at 0x7f3cd7093bfe, in
#2 Object ", at 0x7f3cd6ca86a4, in
#1 Object ", at 0x7f3cd6d14e39, in
#0 Object ", at 0x7f3cd6cd4dd8, in
----- The above stacktrace does NOT signify a crash! It’s used to show the location of a serious error.
[1672911477811] CRITICAL: Attempt to resilver antares-eos99:9999 has failed: Could not create snapshot: ----- Stack trace (most recent call last) in thread 52409:
For more clarity, the following output is from the raft-info command on the observer and leader nodes.
It seems that the user under which your QuarkDB service runs is not allowed to create directories under the following path: /var/lib/quarkdb/node-1/
Could you double-check that the permissions are correct for the given user and then retry?
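Since the log reports the failure as "while link file to ... Operation not permitted", one way to narrow this down is to try a hard link by hand as the service user. The helper below is only a sketch: the function name try_hardlink is made up for illustration, and it assumes the snapshot is built by hard-linking the .sst files, as the error text suggests. On Linux, the fs.protected_hardlinks sysctl can also cause this error when the caller does not own the source file, so that may be worth checking as well.

```shell
#!/bin/sh
# Hypothetical diagnostic helper (not part of QuarkDB).
# try_hardlink FILE DEST_DIR
#   Attempts to hard-link FILE into DEST_DIR, removes the link again,
#   and reports the result. Returns 0 on success, 1 if linking failed,
#   2 if FILE does not exist.
try_hardlink() {
    src_file=$1
    dest_dir=$2
    link="$dest_dir/.hardlink-probe.$$"

    if [ ! -f "$src_file" ]; then
        echo "no such file: $src_file"
        return 2
    fi

    # Same kind of operation the log complains about: link(2) on an SST file.
    if ln "$src_file" "$link" 2>/dev/null; then
        rm -f "$link"
        echo "hard link OK"
        return 0
    fi

    echo "hard link FAILED"
    return 1
}
```

To reproduce the situation from the log, the function would be called from a shell running as the service user (daemon in this thread), with /var/lib/quarkdb/node-1/current/state-machine/011735.sst as FILE and /var/lib/quarkdb/node-1 as DEST_DIR.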
User-level permissions and ownership have been set properly. The working server has similar permissions too. I am sharing the directory listings for your reference.
Problematic Server:
[root@antares-eos99 node-1]# ls -ltr
total 368
-rw-r--r-- 1 daemon daemon 7 Jan 6 09:56 SHARD-ID
-rw-r--r-- 1 daemon daemon 21 Jan 6 09:56 RESILVERING-HISTORY
drwxr-xr-x 4 daemon daemon 47 Jan 6 09:58 current
drwxr-xr-x 5898 daemon daemon 290816 Jan 6 10:50 resilvering-arena
[root@antares-eos99 node-1]# ls -ld /var/lib/quarkdb/node-1/
drwxr-xr-x 4 daemon daemon 89 Jan 6 10:01 /var/lib/quarkdb/node-1/
[root@antares-eos99 node-1]#
Working Server:
[root@antares-eos14 node-1]# ls -ltr
total 28
-rw-r--r-- 1 daemon daemon 7 Feb 23 2022 SHARD-ID
-rw-r--r-- 1 daemon daemon 21 Feb 23 2022 RESILVERING-HISTORY
drwxr-xr-x 4 daemon daemon 47 Feb 23 2022 current
drwxr-xr-x 300 daemon daemon 16384 Jan 3 10:23 temp-snapshots
[root@antares-eos14 node-1]# ls -ld /var/lib/quarkdb/node-1/
drwxr-xr-x 4 daemon daemon 86 Jan 3 10:18 /var/lib/quarkdb/node-1/
[root@antares-eos14 node-1]#
The difference we observed is the “resilvering-arena” folder, which appears only on the problematic server, antares-eos99.
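The listings above only cover the top level of node-1, so they do not show who owns the files inside current/state-machine. A recursive ownership check could look like the sketch below; list_not_owned_by is a hypothetical helper name, and daemon is taken from the listings as the expected owner.

```shell
#!/bin/sh
# Hypothetical helper (not part of QuarkDB).
# list_not_owned_by DIR USER
#   Prints every entry under DIR (recursively) whose owner is not USER.
#   Empty output means everything under DIR belongs to USER.
list_not_owned_by() {
    find "$1" ! -user "$2" -print
}
```

For the servers in this thread the call would be list_not_owned_by /var/lib/quarkdb/node-1 daemon, run on both the leader and antares-eos99; any path it prints would be a candidate for the "Operation not permitted" error.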
Thanks for the info. Unfortunately, I can’t immediately see why you get the "Operation not permitted" error. Could you paste the configuration of your QuarkDB XRootD daemons from /etc/xrootd/xrootd-quarkdb.cfg or similar?
Also, please run: ps aux | grep xrootd on that machine.
Can you explain in a few words what the initial state was? Were you running in raft mode with a single QDB, and now you want to add a couple more for redundancy? Were you running in standalone mode at any point in the past?
Our initial aim was to add a newly re-installed node to a 2-node QuarkDB cluster by following the procedure outlined in Membership updates - QuarkDB Documentation.
Because we didn’t want to try this procedure for the first time in production, we first tried to “emulate” it on our dev instance with a working 3-node cluster: we removed a node (antares-eos99) from the cluster, completely deleted the /var/lib/quarkdb/node-1/ directory, and then tried to add the node back into the cluster.
In particular, we followed the steps below:
On observer antares-eos99: stopped the QuarkDB service
On leader: “redis-cli -p 9999 raft-remove-member antares-eos99:9999”
On observer antares-eos99: rm -rf /var/lib/quarkdb/node-1
On observer antares-eos99: quarkdb-create --path /var/lib/quarkdb/node-1 --clusterID 0123456789
On observer antares-eos99: chown -R daemon:daemon /var/lib/quarkdb/node-1
On observer antares-eos99: started the service with “systemctl start eos@quarkdb”
On leader: redis-cli -p 9999 raft-add-observer antares-eos99:9999
On leader: redis-cli -p 9999 raft-promote-observer antares-eos99:9999
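For reference, the steps above can be written down as a small dry-run script, which makes it easier to review what runs on which host. This is only a sketch: run(), DRY_RUN and the function names are made up for illustration, the "systemctl stop" line is an assumption (the thread only says the service was stopped), and the raft-info call between adding and promoting the observer is there to inspect the cluster state by hand, not an automated check.

```shell
#!/bin/sh
# Dry-run sketch of the steps listed above. run(), DRY_RUN and the
# function names are illustrative; the commands themselves are taken
# from the thread, except "systemctl stop", which is an assumption.
NODE=${NODE:-antares-eos99}
PORT=${PORT:-9999}
DB_PATH=${DB_PATH:-/var/lib/quarkdb/node-1}
CLUSTER_ID=${CLUSTER_ID:-0123456789}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, do not execute

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On the node being re-added: stop the service.
node_stop() {
    run systemctl stop eos@quarkdb
}

# On the leader: remove the node from the cluster.
leader_remove() {
    run redis-cli -p "$PORT" raft-remove-member "$NODE:$PORT"
}

# On the node: wipe and recreate the database directory, then start.
node_recreate() {
    run rm -rf "$DB_PATH" &&
    run quarkdb-create --path "$DB_PATH" --clusterID "$CLUSTER_ID" &&
    run chown -R daemon:daemon "$DB_PATH" &&
    run systemctl start eos@quarkdb
}

# On the leader: add the node as an observer and print the raft state.
leader_add_observer() {
    run redis-cli -p "$PORT" raft-add-observer "$NODE:$PORT" &&
    run redis-cli -p "$PORT" raft-info
}

# On the leader: promote the observer to a full member.
leader_promote() {
    run redis-cli -p "$PORT" raft-promote-observer "$NODE:$PORT"
}
```

With the default DRY_RUN=1 each function only prints its commands prefixed with "+". Keeping the promotion in its own function leaves room to look at the raft-info output and the logs before calling leader_promote.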
Given that you are running QuarkDB 0.4.2 (which is quite old), I would suggest updating to eos-quarkdb-5.1.8 and then retrying the procedure. Several bugs related to this operation have been fixed in QuarkDB in the meantime, and the hope is that this will also cover your issues.
Our production environment runs EOS version 4 (4.8.88) and CTA version 4 (4.7.12). To upgrade to eos-quarkdb-5.1.8 we would have to upgrade the other EOS packages to version 5 too. We will consider this solution during the upgrade activity. Thanks.
We completely agree that eos-quarkdb 5.1.8 includes several bug fixes. Planning the upgrade might take time; in the meantime, is there any other solution or fix that we can try for the issue reported with the older version of QuarkDB?