That would be interesting indeed - to my knowledge, there’s no nagios probe written for QuarkDB yet.
Yes, a probe should probably parse the output of raft-info and quarkdb-info.
I’m planning on adding a health command soon, something like quarkdb-health, where a node would assess its own health, to be used in probes and such. But for now, raft-info and quarkdb-info should already give plenty of information.
Very good idea to provide a “quarkdb-health” command! It is true that the commands “raft-info” and “quarkdb-info” give a lot of information, but what would be the best way to assess the health of the cluster as a whole, for example whether quorum is reached and whether replication is OK?
The most useful information for me is the REPLICA lines from “raft-info”, which are only visible when asking the leader:
Checking for quorum is easy: if the LEADER field is filled out, it means the queried node believes the cluster currently has a quorum, that it is part of said quorum, and that it is receiving heartbeats from the leader, with the last heartbeat received less than 1-2 seconds ago. (Alternatively, if the node is the leader, it means the heartbeats it sends out are being acknowledged by at least a quorum of nodes.)
Otherwise, LEADER will be empty.
Note that different nodes could believe different things: if a node is unable to communicate with the other two (network partition), those two could still form a quorum, but the third would remain out of the cluster, and thus have an empty LEADER field. Under normal circumstances, however, all nodes should agree.
The following would be quite reasonable for a nagios setup:
The probe is run against all QuarkDB nodes.
If LEADER is empty, this node is at red health.
If the node is the cluster leader, check the REPLICA section. If there’s an offline entry, the leader node is at yellow health. The same applies if some replica is LAGGING instead of UP-TO-DATE.
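The rules above could be sketched roughly as follows in Python. This is only an illustration under assumptions: it assumes raft-info returns one entry per line in the form "LEADER host:port" and "REPLICA host:port ONLINE | UP-TO-DATE | ...", which may not match the real output exactly, and the function name assess_raft_info and its node_address parameter are made up for this example.

```python
def assess_raft_info(lines, node_address):
    """Return 'RED', 'YELLOW' or 'GREEN' for one node's raft-info output.

    lines: the raft-info reply, one entry per string (format is an assumption).
    node_address: the host:port of the node that was queried (hypothetical
    parameter, used to decide whether this node is the leader).
    """
    leader = ""
    replicas = []
    for line in lines:
        parts = line.strip().split(None, 1)
        if not parts:
            continue
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "LEADER":
            leader = value
        elif key == "REPLICA":
            replicas.append(value)

    # Rule 1: an empty LEADER field means the node sees no quorum.
    if not leader:
        return "RED"

    # Rule 2: only the leader reports REPLICA lines; any offline or
    # lagging replica degrades the leader to yellow.
    if leader == node_address:
        for replica in replicas:
            if "OFFLINE" in replica or "LAGGING" in replica:
                return "YELLOW"

    return "GREEN"
```

A probe would run this once per node, passing the address it connected to, and report the worst result across the cluster.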
Today I wrote an implementation for a health command to be included in QDB 0.3.9. Sample output from a leader node:
$ redis-cli -p 7777 quarkdb-health-local
1) [GREEN] Free space in state-machine filesystem: 137847242752 bytes (58.0055%)
2) [GREEN] Part of cluster: Yes, current leader is localhost:7777
3) [GREEN] Quorum stability: Good
4) [GREEN] Replica localhost:7778: ONLINE | UP-TO-DATE | NEXT-INDEX 38829 | VERSION 0.3.8.27.8041c21.dirty
5) [GREEN] Replica localhost:7780: ONLINE | UP-TO-DATE | NEXT-INDEX 38829 | VERSION 0.3.8.27.8041c21.dirty
And from a follower:
$ redis-cli -p 7778 quarkdb-health-local
1) [GREEN] Free space in state-machine filesystem: 137847234560 bytes (58.0055%)
2) [GREEN] Part of cluster: Yes, current leader is localhost:7777
The exact output format is subject to change, so don’t write any parsers for the above yet.
Some more examples - when one node is down, but the cluster is still operational:
1) [GREEN] Free space in state-machine filesystem: 138005041152 bytes (58.0719%)
2) [GREEN] Part of cluster: Yes, current leader is localhost:7777
3) [YELLOW] Quorum stability: Shaky
4) [GREEN] Replica localhost:7778: ONLINE | UP-TO-DATE | NEXT-INDEX 38829 | VERSION 0.3.8.27.8041c21.dirty
5) [YELLOW] Replica localhost:7780: OFFLINE
No quorum, node unavailable:
1) [GREEN] Free space in state-machine filesystem: 138173075456 bytes (58.1426%)
2) [RED] Part of cluster: No, I don't know who the cluster leader is, node is unavailable
This looks very promising, and the associated nagios test becomes very basic: just testing the tag for [GREEN], [YELLOW] or [RED]. Waiting for the release.
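For illustration, such a tag test might look like the Python sketch below. It assumes the [GREEN], [YELLOW] and [RED] tags stay as shown in the sample output (the format was said to be subject to change), and it maps them onto the conventional Nagios plugin exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN). The function name nagios_status is made up for this example.

```python
import re

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed mapping from the health tags in the sample output.
TAG_TO_CODE = {"GREEN": OK, "YELLOW": WARNING, "RED": CRITICAL}

TAG_PATTERN = re.compile(r"\[(GREEN|YELLOW|RED)\]")


def nagios_status(health_lines):
    """Return the worst Nagios exit code found in the health output lines."""
    worst = None
    for line in health_lines:
        match = TAG_PATTERN.search(line)
        if match is None:
            continue  # ignore lines that carry no health tag
        code = TAG_TO_CODE[match.group(1)]
        if worst is None or code > worst:
            worst = code
    # No tag at all: the probe cannot tell, so report UNKNOWN.
    return UNKNOWN if worst is None else worst
```

The result can be passed to sys.exit() by a small wrapper that feeds in the lines returned by the health command. Taking the worst tag means a single [RED] line marks the node as critical even if every other indicator is green.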
Yes, this should be available starting from 0.3.9: I decided to name the command quarkdb-health instead of quarkdb-health-local.
It’ll show you why and how QDB came to a conclusion on GREEN, YELLOW, or RED health status. The leader node will show more information than the followers, and include current replication status.
quarkdb-info and raft-info just show the overall node health, without details.