A nagios test for quarkdb

Hello,

Has anyone already thought about designing a nagios plugin to test the health of a quarkdb cluster?

  • make sure the daemon is running
  • report the node’s role
  • assert the quorum of the cluster
  • [other ideas]

I suppose that would mean using redis commands and parsing the result?

JM

Hi Jean Michel,

That would be interesting indeed - to my knowledge, there’s no nagios probe written for QuarkDB yet.
Yes, a probe should probably parse the output of raft-info and quarkdb-info.

I’m planning on adding a health command soon, such as quarkdb-health, where a node would assess its own health, to be used in probes and such. But for now, raft-info and quarkdb-info should already give plenty of information.

Cheers,
Georgios

Hi Georgios,

Providing a “quarkdb-health” command is a very good idea! It is true that “raft-info” and “quarkdb-info” give a lot of information, but what would be the best way to assess the health of the cluster as a whole, for example whether the quorum is reached and replication is OK?

For me, the most useful information is in the REPLICA lines of “raft-info”, which are only visible when querying the leader:

TERM 2755
LOG-START 0
LOG-SIZE 3460802
LEADER nanxrd16.in2p3.fr:7777
CLUSTER-ID 8d9ce854-2ebe-4a93-95a9-c918256d8679
COMMIT-INDEX 3460801
LAST-APPLIED 3460801
BLOCKED-WRITES 0
LAST-STATE-CHANGE 181566 (2 days, 2 hours, 26 minutes, 6 seconds)
----------
MYSELF nanxrd16.in2p3.fr:7777
STATUS LEADER
----------
MEMBERSHIP-EPOCH 0
NODES nanxrd15.in2p3.fr:7777,nanxrd16.in2p3.fr:7777,nanxrd17.in2p3.fr:7777
OBSERVERS 
QUORUM-SIZE 2
----------
REPLICA nanxrd15.in2p3.fr:7777 ONLINE | UP-TO-DATE | NEXT-INDEX 3460802
REPLICA nanxrd17.in2p3.fr:7777 OFFLINE

Here, for example, one of the cluster members is down.
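
For anyone who wants to parse this from a probe, here is a minimal Python sketch. It assumes the text arrives exactly in the form pasted above (one "KEY value" pair per line, dashed separators, no redis-cli numbering); the function name parse_raft_info is made up for illustration and is not part of QuarkDB.

```python
def parse_raft_info(text):
    """Split raft-info output into plain fields and per-replica status.

    Returns (fields, replicas):
      fields   maps a key such as "LEADER" to its value (possibly empty),
      replicas maps "host:port" to a list of status tokens.
    Assumes the layout shown in the sample output; other versions may differ.
    """
    fields = {}
    replicas = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the "----------" separators.
        if not line or set(line) == {"-"}:
            continue
        key, _, value = line.partition(" ")
        if key == "REPLICA":
            # "host:port ONLINE | UP-TO-DATE | NEXT-INDEX n" or "host:port OFFLINE"
            node, _, status = value.partition(" ")
            replicas[node] = [part.strip() for part in status.split("|")]
        else:
            fields[key] = value.strip()
    return fields, replicas
```

With the output above, fields["LEADER"] is the leader address and replicas["nanxrd17.in2p3.fr:7777"] is ["OFFLINE"].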

JM

Hi Jean Michel,

Checking for quorum is easy: if the LEADER field is filled out, the queried node believes the cluster currently has a quorum, that it is part of said quorum, and that it is receiving heartbeats from the leader, with the last heartbeat received less than 1-2 seconds ago. (Alternatively, if the node is the leader, it means the heartbeats it sends out are being acknowledged by at least a quorum of nodes.)

Otherwise, LEADER will be empty.

Note that different nodes could believe different things: if a node is unable to communicate with the other two (network partition), those two could form a quorum, but the third would remain out of the cluster and thus have an empty LEADER field. Under normal circumstances, however, all should agree.

The following would be quite reasonable for a nagios setup:

  • The probe is run against all QuarkDB nodes.
  • If LEADER is empty, this node is at red health.
  • If the node is the cluster leader, check the REPLICA section. If there’s an offline entry, the leader node is at yellow health. Same if some replica is LAGGING instead of UP-TO-DATE.
  • If all checks pass, node is green.
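
The rules above could be sketched in Python roughly as follows. This is only an illustration: the function name and arguments are hypothetical, and it assumes the LEADER, MYSELF and REPLICA values have already been extracted from raft-info, with each replica status given as the raw text after the host:port.

```python
def node_health(leader, myself, replicas):
    """Classify one node as "RED", "YELLOW" or "GREEN".

    leader   -- value of the LEADER field ("" when the node knows no leader)
    myself   -- value of the MYSELF field
    replicas -- dict of "host:port" -> status text, e.g.
                "ONLINE | UP-TO-DATE | NEXT-INDEX 3460802" or "OFFLINE"
                (only the leader reports REPLICA lines)
    """
    # No known leader: the node is not part of a quorum.
    if not leader:
        return "RED"
    # Only the leader can judge replication status.
    if leader == myself:
        for status in replicas.values():
            tokens = {part.strip() for part in status.split("|")}
            # Anything other than ONLINE and UP-TO-DATE (for example
            # OFFLINE or LAGGING) degrades the leader to yellow.
            if "ONLINE" not in tokens or "UP-TO-DATE" not in tokens:
                return "YELLOW"
    return "GREEN"
```

Applied to the raft-info sample earlier in the thread (leader with one OFFLINE replica), this returns "YELLOW".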

Thank you Georgios, I am going to try to write something. It may take me a little time but if it works OK I will tell the community.

JM

Hi Jean Michel,

Today I wrote an implementation for a health command to be included in QDB 0.3.9. Sample output from a leader node:

$ redis-cli -p 7777 quarkdb-health-local
1) [GREEN] Free space in state-machine filesystem: 137847242752 bytes (58.0055%)
2) [GREEN] Part of cluster: Yes, current leader is localhost:7777
3) [GREEN] Quorum stability: Good
4) [GREEN] Replica localhost:7778: ONLINE | UP-TO-DATE | NEXT-INDEX 38829 | VERSION 0.3.8.27.8041c21.dirty
5) [GREEN] Replica localhost:7780: ONLINE | UP-TO-DATE | NEXT-INDEX 38829 | VERSION 0.3.8.27.8041c21.dirty

And from a follower:

$ redis-cli -p 7778 quarkdb-health-local
1) [GREEN] Free space in state-machine filesystem: 137847234560 bytes (58.0055%)
2) [GREEN] Part of cluster: Yes, current leader is localhost:7777

The exact output format is subject to change, so don’t write any parsers for the above yet. :)

What do you think? Anything we should add?

Some more examples - when one node is down, but the cluster is still operational:

1) [GREEN] Free space in state-machine filesystem: 138005041152 bytes (58.0719%)
2) [GREEN] Part of cluster: Yes, current leader is localhost:7777
3) [YELLOW] Quorum stability: Shaky
4) [GREEN] Replica localhost:7778: ONLINE | UP-TO-DATE | NEXT-INDEX 38829 | VERSION 0.3.8.27.8041c21.dirty
5) [YELLOW] Replica localhost:7780: OFFLINE

No quorum, node unavailable:

1) [GREEN] Free space in state-machine filesystem: 138173075456 bytes (58.1426%)
2) [RED] Part of cluster: No, I don't know who the cluster leader is, node is unavailable

Thank you Georgios,

This looks very promising, and the associated nagios test becomes very basic: just testing for the [GREEN], [YELLOW] or [RED] tag. Waiting for the release :)
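
As a rough idea of what such a test could look like, here is a small Python sketch. It assumes the bracketed tags stay as in the samples above (the format was announced as subject to change) and uses the standard Nagios plugin exit codes 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN; the function name nagios_status is only illustrative.

```python
import re

# Mapping from health tag to Nagios plugin exit code.
NAGIOS_CODES = {"GREEN": 0, "YELLOW": 1, "RED": 2}
UNKNOWN = 3

TAG = re.compile(r"\[(GREEN|YELLOW|RED)\]")


def nagios_status(health_output):
    """Return the Nagios exit code for the text of a health report.

    The worst tag found wins; if no tag is found at all, the
    result is UNKNOWN rather than OK.
    """
    tags = TAG.findall(health_output)
    if not tags:
        return UNKNOWN
    return max(NAGIOS_CODES[tag] for tag in tags)
```

A real plugin would obtain the text from the health command (for example through redis-cli), print a one-line summary and exit with the returned code.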

JM

Hi Georgios,
Was this implemented? I don’t see it available in quarkdb-0.4.1-1.

Cheers,
Pete

Hello Pete,

Yes, it is there in QuarkDB 0.4.1:

redis-cli -p 7777 -h localhost quarkdb-info
 1) MODE RAFT
 2) BASE-DIRECTORY /var/spool/quarkdb
 3) CONFIGURATION-PATH 
 4) QUARKDB-VERSION 0.4.1
 5) ROCKSDB-VERSION 6.2.4
 6) XROOTD-HEADERS v4.11.1
 7) NODE-HEALTH GREEN
 8) MONITORS 0
 9) BOOT-TIME 1 (1 seconds)
10) UPTIME 148093 (1 days, 17 hours, 8 minutes, 13 seconds)

JM

Ah, thanks JM

( quarkdb-info vs quarkdb-health-local )

Hi all,

Yes, this should be available starting from 0.3.9: I decided to name the command quarkdb-health instead of quarkdb-health-local.

It’ll show you why and how QDB came to a conclusion on GREEN, YELLOW, or RED health status. The leader node will show more information than the followers, and include current replication status.

quarkdb-info and raft-info just show the overall node health, without details.
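
If only the overall status is needed, a probe could pull the NODE-HEALTH line out of the quarkdb-info output. The Python sketch below assumes the layout pasted earlier in the thread (optionally prefixed by redis-cli numbering such as "7)"); the helper name is hypothetical.

```python
import re

# Matches e.g. " 7) NODE-HEALTH GREEN" or "NODE-HEALTH GREEN".
NODE_HEALTH = re.compile(r"^\s*(?:\d+\)\s*)?NODE-HEALTH\s+(\S+)\s*$", re.MULTILINE)


def node_health_from_info(info_output):
    """Return the NODE-HEALTH value from quarkdb-info text, or None if absent."""
    match = NODE_HEALTH.search(info_output)
    return match.group(1) if match else None
```

A missing NODE-HEALTH line (for example on a version older than the one that introduced it) yields None, which a probe would probably want to report as unknown rather than healthy.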

I’ll add some documentation on health monitoring. :)

Cheers,
Georgios