Regression in QuarkDB 0.4.0 and 0.4.1: Replication may get stuck

gbitzes · March 11, 2020, 3:45pm

Hi all,

This is a quick heads-up about a rare regression recently discovered in QuarkDB versions 0.4.0 and 0.4.1. Under complicated conditions (a follower is very far behind the leader and there are network instabilities), replication for a particular follower might get stuck.

This means that some followers might be receiving journal entries normally, while others may not. In the very unlikely case that both followers in a 3-node cluster get stuck, all writes towards QDB will hang, since the leader can no longer replicate new entries to a majority of nodes.

It’s quite rare: you could run a cluster for months without this happening. If you suspect you’ve been affected, check whether the leader’s raft-info output shows LAGGING replicas whose NEXT-INDEX stays unchanged for a long time. Moreover, health on the leader node will turn yellow.
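
If you would rather script this check than eyeball it, here is a minimal Python sketch of the comparison described above. It assumes you have already sampled the leader’s raft-info output several times and extracted, for each sample, the NEXT-INDEX of every replica plus the set of replicas currently shown as LAGGING. The function name, the data layout and the host names are all hypothetical, and parsing raft-info is deliberately left out, since the exact output format is not shown here.

```python
from typing import Dict, Iterable, List


def stalled_replicas(samples: List[Dict[str, int]],
                     lagging: Iterable[str]) -> List[str]:
    """Return lagging replicas whose NEXT-INDEX never advanced.

    samples: time-ordered NEXT-INDEX readings, one dict per raft-info call,
             mapping replica name -> NEXT-INDEX (hypothetical layout).
    lagging: replicas reported as LAGGING in the most recent sample.
    """
    stuck = []
    for replica in sorted(set(lagging)):
        values = [s[replica] for s in samples if replica in s]
        # Need at least two readings to say anything about progress.
        if len(values) < 2:
            continue
        # No reading is higher than the first one: no progress was made.
        if max(values) <= values[0]:
            stuck.append(replica)
    return stuck


if __name__ == "__main__":
    # Three readings taken some minutes apart (made-up numbers).
    readings = [
        {"qdb-node-2:7777": 1000, "qdb-node-3:7777": 5000},
        {"qdb-node-2:7777": 1000, "qdb-node-3:7777": 5400},
        {"qdb-node-2:7777": 1000, "qdb-node-3:7777": 5900},
    ]
    print(stalled_replicas(readings, ["qdb-node-2:7777", "qdb-node-3:7777"]))
```

A replica that is LAGGING but whose NEXT-INDEX keeps growing is just catching up and is not affected. How long to wait between samples is a judgment call that depends on how busy the cluster is.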

As a workaround, simply restart the leader QDB process.

The fix will land in 0.4.2, to be released tomorrow. Many thanks to Pete Eby (@peby, ORNL) for finding and reporting this bug.

Cheers,
Georgios
