QuarkDB force leader election

gbitzes · January 13, 2020, 6:08pm

It would be quite a coincidence if it was not.

Was QuarkDB unavailable for long, or just 1-10 seconds? If only a few seconds, the bug must be that for some reason the MQ does not retry its requests toward QDB, gets an unexpected NULL reply object somewhere, and crashes when trying to access it. I haven’t had time today to look into that, will do soon.
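To illustrate the suspected failure mode, here is a minimal sketch of the retry-and-check pattern, written in Python rather than taken from the MQ's actual code. All names (`send`, `request_with_retry`, `BackendUnavailable`) are hypothetical; the point is only that an empty reply during a brief leader switch should trigger a retry, not be dereferenced.

```python
import time


class BackendUnavailable(Exception):
    """Raised when no valid reply arrives within the retry budget."""


def request_with_retry(send, attempts=5, delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-None reply, or give up.

    `send` is a hypothetical zero-argument callable that returns a reply
    object, or None when the backend gave no answer (for example while a
    new leader is being elected).
    """
    for attempt in range(1, attempts + 1):
        reply = send()
        if reply is not None:
            # Only a checked, non-empty reply is handed back to the caller.
            return reply
        if attempt < attempts:
            # Linear back-off: wait a little longer after each failed attempt.
            sleep(delay * attempt)
    raise BackendUnavailable(f"no reply after {attempts} attempts")
```

With five attempts and a delay on the order of a second, a caller like this would ride out an outage of a few seconds and raise a clear error instead of crashing on a longer one.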

franck-jrc · January 14, 2020, 4:01pm

We didn’t detect that it was unavailable. It was probably unavailable only for the time it took to switch leader after the previous one crashed, so less than 10 seconds. The MQ crash was what made EOS unavailable. We only understood what happened on the QDB side from the same logs we sent you, which showed that there was also a QuarkDB crash, so you probably understand more than we do. In fact, the quorum was always reached, but at some point there was probably one leader node plus one outdated follower, which then caught up quite quickly, and the former leader came back thanks to the systemd auto-restart.

By the way, do you know if it is possible/desirable to also set up auto-restart for the MQ, in case this happens again in the future?

gbitzes · January 15, 2020, 2:36pm

I don’t see why not, it’s generally a good idea to have systemd auto-restart crashed servers to minimize disruption in cases like these.
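As a rough sketch of what that could look like, a systemd drop-in along these lines enables automatic restarts. The unit name `eos@mq.service` and the file path are assumptions here, so check what your installation actually uses; `Restart=` and `RestartSec=` are standard systemd service options.

```ini
# Hypothetical path: /etc/systemd/system/eos@mq.service.d/restart.conf
# (the unit name is an assumption; adjust it to match your MQ service)
[Service]
# Restart after a crash (unclean exit code or signal), but not after a clean stop
Restart=on-failure
# Wait a few seconds before restarting to avoid a tight restart loop
RestartSec=5
```

After adding the file, run `systemctl daemon-reload` so systemd picks up the change.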

By the way, you can check out the changes to fsync policy here: Fsync policy - QuarkDB Documentation

Cheers,
Georgios

franck-jrc · January 17, 2020, 11:04am

Thank you very much, Georgios!

Is this planned for release in an upcoming version?

gbitzes · January 17, 2020, 2:29pm

Yep, you’re in luck, I just did the 0.4.1 release today as Luca would like to have the fsync improvements in our own instances as well:

Cheers,
Georgios

franck-jrc · January 17, 2020, 4:32pm

Thank you, we might also install it soon, as we had planned the upgrade to v0.4.0.
