A nagios test for quarkdb

barbet · June 18, 2019, 9:37am

Hello,

Has anyone already thought about designing a nagios plugin for testing the health of a quarkdb cluster? It could:

make sure the daemon is running
report the node’s role
assert the quorum of the cluster
[other ideas]

I suppose that would mean using redis commands and parsing the result?

JM

gbitzes · June 18, 2019, 10:47am

Hi Jean Michel,

That would be interesting indeed - to my knowledge, there’s no nagios probe written for QuarkDB yet.
Yes, a probe should probably parse the output of raft-info and quarkdb-info.

I’m planning on adding a health command soon, such as quarkdb-health where a node would self-assess its own health, to be used in probes and such. But for now, raft-info and quarkdb-info should already give plenty of information.

Cheers,
Georgios

barbet · June 19, 2019, 12:36pm

Hi Georgios,

Providing a “quarkdb-health” command is a very good idea! It is true that “raft-info” and “quarkdb-info” give a lot of information, but what would be the best way to assess the health of the cluster as a whole, for example whether the quorum is reached and whether replication is OK?

The most useful information for me is the REPLICA lines from “raft-info”, which are only visible when querying the leader:

TERM 2755
LOG-START 0
LOG-SIZE 3460802
LEADER nanxrd16.in2p3.fr:7777
CLUSTER-ID 8d9ce854-2ebe-4a93-95a9-c918256d8679
COMMIT-INDEX 3460801
LAST-APPLIED 3460801
BLOCKED-WRITES 0
LAST-STATE-CHANGE 181566 (2 days, 2 hours, 26 minutes, 6 seconds)
----------
MYSELF nanxrd16.in2p3.fr:7777
STATUS LEADER
----------
MEMBERSHIP-EPOCH 0
NODES nanxrd15.in2p3.fr:7777,nanxrd16.in2p3.fr:7777,nanxrd17.in2p3.fr:7777
OBSERVERS 
QUORUM-SIZE 2
----------
REPLICA nanxrd15.in2p3.fr:7777 ONLINE | UP-TO-DATE | NEXT-INDEX 3460802
REPLICA nanxrd17.in2p3.fr:7777 OFFLINE

Here, for example, one of the cluster members is down.

JM

gbitzes · June 19, 2019, 1:14pm

Hi Jean Michel,

Checking for quorum is easy: if the LEADER field is filled out, the queried node believes the cluster currently has a quorum, that it is part of said quorum, and that it is receiving heartbeats from the leader, with the last heartbeat received less than 1-2 seconds ago. (Alternatively, if the node is the leader, it means the heartbeats it sends out are being acknowledged by at least a quorum of nodes.)

Otherwise, LEADER will be empty.

Note that different nodes could believe different things: if a node is unable to communicate with the other two (network partition), the two could form a quorum, but the third would remain out of the cluster and thus have an empty LEADER field. Under normal circumstances, however, all nodes should agree.

The following would be quite reasonable for a nagios setup:

The probe is run against all QuarkDB nodes.
If LEADER is empty, this node is at red health.
If the node is the cluster leader, check the REPLICA section. If there’s an offline entry, the leader node is at yellow health. The same applies if some replica is LAGGING instead of UP-TO-DATE.
If all checks pass, the node is green.
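
The steps above could be sketched roughly as follows in Python. This is only an illustration, assuming the raft-info output reaches the probe as plain text laid out like the sample earlier in this thread. The function name classify_raft_info and the mapping of green/yellow/red to exit codes 0/1/2 (the usual Nagios plugin convention) are assumptions, not part of QuarkDB.

```python
# Hypothetical helper: not part of QuarkDB. Assumes raft-info text laid out
# like the sample in this thread ("KEY value" lines, "REPLICA ..." lines).
OK, WARNING, CRITICAL = 0, 1, 2  # usual Nagios plugin exit codes


def classify_raft_info(text):
    """Return (exit_code, message) for one node's raft-info output."""
    fields = {}
    replicas = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("-"):
            continue  # skip blank lines and "----------" separators
        key, _, value = line.partition(" ")
        if key == "REPLICA":
            replicas.append(value.strip())
        else:
            fields[key] = value.strip()

    # Red: the node does not know any leader, so it sees no quorum.
    if not fields.get("LEADER"):
        return CRITICAL, "LEADER is empty"

    # Yellow: only the leader lists replicas; flag OFFLINE or LAGGING ones.
    if fields.get("STATUS") == "LEADER":
        degraded = []
        for replica in replicas:
            tokens = replica.replace("|", " ").split()
            if "OFFLINE" in tokens[1:] or "LAGGING" in tokens[1:]:
                degraded.append(tokens[0])
        if degraded:
            return WARNING, "degraded replicas: " + ", ".join(degraded)

    return OK, "leader is " + fields["LEADER"]
```

A probe would run this once per node, print the message, and exit with the returned code.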

barbet · June 19, 2019, 1:27pm

Thank you Georgios, I am going to try to write something. It may take me a little time but if it works OK I will tell the community.

JM

gbitzes · August 2, 2019, 1:49pm

Hi Jean Michel,

Today I wrote an implementation for a health command to be included in QDB 0.3.9. Sample output from a leader node:

$ redis-cli -p 7777 quarkdb-health-local
1) [GREEN] Free space in state-machine filesystem: 137847242752 bytes (58.0055%)
2) [GREEN] Part of cluster: Yes, current leader is localhost:7777
3) [GREEN] Quorum stability: Good
4) [GREEN] Replica localhost:7778: ONLINE | UP-TO-DATE | NEXT-INDEX 38829 | VERSION 0.3.8.27.8041c21.dirty
5) [GREEN] Replica localhost:7780: ONLINE | UP-TO-DATE | NEXT-INDEX 38829 | VERSION 0.3.8.27.8041c21.dirty

And from a follower:

$ redis-cli -p 7778 quarkdb-health-local
1) [GREEN] Free space in state-machine filesystem: 137847234560 bytes (58.0055%)
2) [GREEN] Part of cluster: Yes, current leader is localhost:7777

Exact output format is subject to change, don’t write any parsers for the above yet.

What do you think? Anything we should add?

gbitzes · August 2, 2019, 1:50pm

Some more examples - when one node is down, but the cluster is still operational:

1) [GREEN] Free space in state-machine filesystem: 138005041152 bytes (58.0719%)
2) [GREEN] Part of cluster: Yes, current leader is localhost:7777
3) [YELLOW] Quorum stability: Shaky
4) [GREEN] Replica localhost:7778: ONLINE | UP-TO-DATE | NEXT-INDEX 38829 | VERSION 0.3.8.27.8041c21.dirty
5) [YELLOW] Replica localhost:7780: OFFLINE

No quorum, node unavailable:

1) [GREEN] Free space in state-machine filesystem: 138173075456 bytes (58.1426%)
2) [RED] Part of cluster: No, I don't know who the cluster leader is, node is unavailable

barbet · August 5, 2019, 7:29am

Thank you Georgios,

This looks very promising, and the associated nagios test becomes very basic: just testing the tag for [GREEN], [YELLOW] or [RED]. Waiting for the release.

JM
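
A tag-based test like the one described above might look like the following Python sketch. It assumes the health output keeps the "[GREEN]", "[YELLOW]", "[RED]" tags shown in the samples in this thread, which the author warned may still change. The function names and the exit-code mapping (0/1/2, plus 3 for UNKNOWN, following the usual Nagios convention) are assumptions for illustration.

```python
import re

# Assumed mapping from health tags to Nagios exit codes (OK/WARNING/CRITICAL).
SEVERITY = {"GREEN": 0, "YELLOW": 1, "RED": 2}
TAG = re.compile(r"\[(GREEN|YELLOW|RED)\]")
UNKNOWN = 3  # Nagios exit code when no tag could be found


def worst_health(output):
    """Return the most severe tag found in the output, or None if there is none."""
    worst = None
    for line in output.splitlines():
        match = TAG.search(line)
        if match and (worst is None or SEVERITY[match.group(1)] > SEVERITY[worst]):
            worst = match.group(1)
    return worst


def nagios_code(output):
    """Map the worst tag in the health output to a Nagios exit code."""
    tag = worst_health(output)
    return UNKNOWN if tag is None else SEVERITY[tag]
```

Taking the worst tag over all lines means a single [RED] indicator makes the whole check critical, even if the other indicators are green.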

peby · January 29, 2020, 1:21am

Hi Georgios,
Was this implemented? I don’t see it available in quarkdb-0.4.1-1.

Cheers,
Pete

barbet · January 29, 2020, 6:38am

Hello Pete,

Yes, it is there in QuarkDB 0.4.1:

$ redis-cli -p 7777 -h localhost quarkdb-info
 1) MODE RAFT
 2) BASE-DIRECTORY /var/spool/quarkdb
 3) CONFIGURATION-PATH 
 4) QUARKDB-VERSION 0.4.1
 5) ROCKSDB-VERSION 6.2.4
 6) XROOTD-HEADERS v4.11.1
 7) NODE-HEALTH GREEN
 8) MONITORS 0
 9) BOOT-TIME 1 (1 seconds)
10) UPTIME 148093 (1 days, 17 hours, 8 minutes, 13 seconds)

JM
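
For a probe that only needs the overall status, the NODE-HEALTH line of the quarkdb-info output above could be extracted with a few lines of Python. This is a minimal sketch assuming the numbered "N) KEY value" layout shown above; the function name node_health is hypothetical.

```python
import re

# Matches e.g. " 7) NODE-HEALTH GREEN"; the "N) " prefix is optional.
NODE_HEALTH = re.compile(r"\s*(?:\d+\)\s*)?NODE-HEALTH\s+(\S+)")


def node_health(info_output):
    """Return the NODE-HEALTH value from quarkdb-info text, or None if absent."""
    for line in info_output.splitlines():
        match = NODE_HEALTH.match(line)
        if match:
            return match.group(1)
    return None
```

Returning None when the line is missing lets the caller report an unknown state, for instance on a version that does not print NODE-HEALTH.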

peby · February 1, 2020, 5:46pm

Ah, thanks JM

( quarkdb-info vs quarkdb-health-local )

gbitzes · February 1, 2020, 6:20pm

Hi all,

Yes, this should be available starting from 0.3.9: I decided to name the command quarkdb-health instead of quarkdb-health-local.

It’ll show you why and how QDB came to a conclusion on GREEN, YELLOW, or RED health status. The leader node will show more information than the followers, and include current replication status.

quarkdb-info and raft-info just show the overall node health, without details.

I’ll add some documentation on health monitoring.

Cheers,
Georgios
