QuarkDB force leader election

Awesome, glad that it worked! :slight_smile:

Yes, because signing challenges are a QuarkDB-specific thing, and redis-cli does not understand them. From redis-cli only AUTH is possible right now; signing challenges would be very cumbersome to do manually (though technically possible).

I’ve been thinking of writing a quarkdb-cli tool to overcome this limitation, which could be pointed at a password file, but for now a password file with a single line of simple copy-pasteable ASCII characters is the most convenient option.
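
For reference, here’s a rough sketch of what that looks like from the shell. The host, port and /etc/quarkdb/password path are just placeholders for illustration; redis-cli, AUTH and raft-info are the real pieces:

    # Authenticate with the file's single-line password, then run one command.
    # (-a simply issues AUTH under the hood; adjust host, port and file path.)
    redis-cli -h qdb-host.example.org -p 7777 \
              -a "$(cat /etc/quarkdb/password)" raft-info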

Yes: both the MGM and the eos-ns-inspect tool use signing challenges, and never send the password in cleartext. QuarkDB inter-cluster communication uses signing challenges as well.

Hello @gbitzes,

This morning the failing node was stopped, fixed, and started again.
But this wasn’t harmless for the eos instance: although the QDB cluster recovered correctly (changing leader again), we observed that the eos instance was not usable: no reads or writes were possible. We soon understood that the MQ service had crashed with a segmentation fault, making all disks appear offline. Restarting the MQ service fixed everything. Not sure if the MQ produces a stacktrace that we could send.

In addition, by looking further into the QuarkDB logs, it appeared that when the QDB node came back, the leader crashed with a FatalException, which caused the third node to take over as leader. Maybe these two crashes are linked. For this QuarkDB issue we can provide the logs (there is a stacktrace in them); would you like us to send them to help understand what could have happened?

Hi @franck-jrc,

Certainly, please send me any stacktraces you have - thanks. I’m not sure if the MQ prints a stacktrace in the logs when it crashes, it’s probably worth having a look. (/var/log/eos/mq)

I see two issues:

  • The QuarkDB node crashing with FatalException
  • MQ being unable to survive QDB temporarily going down for a few seconds. Maybe the client there is misconfigured and does not use retries; I’ll have a look. Normally, QClient should completely hide any transient unavailability of QDB…

Is the QDB node that crashed back online, or are you running on 2/3?

I just sent you the logs via email (for info, the MGM/MQ runs on the same host as the QuarkDB node, currently the leader)

I didn’t find any additional trace about the MQ crash; I attached the xrdlog.mq file to the mail. The only additional info might be these lines in the /var/log/messages file:

Jan 10 13:45:21 ...jeop214p kernel: xrootd[73952]: segfault at 7f6b1e49c001 ip 00007f6aadcc0eae sp 00007f6a93ffd580 error 4 in libjemalloc.so.1[7f6aadcb3000+31000]
Jan 10 13:45:21 ...jeop214p sh: /bin/sh: line 1: 70959 Segmentation fault      $XRDPROG -n mq -c /etc/xrd.cf.mq -l /var/log/eos/xrdlog.mq -s /tmp/xrootd.mq.pid -Rdaemon
Jan 10 13:45:21 ...jeop214p systemd: eos@mq.service: main process exited, code=exited, status=139/n/a
Jan 10 13:45:21 ...jeop214p systemd: Unit eos@mq.service entered failed state.
Jan 10 13:45:21 ...jeop214p systemd: eos@mq.service failed.

The QuarkDB cluster is fully back to 3/3: systemd restarted the QuarkDB node that crashed, and it rejoined as a follower without further complaint.

Another thing we observed is that while the failed node was powered off for fixing (forcibly, because there was no terminal access), the previous leader still showed it as ONLINE and UP-TO-DATE and kept increasing its corresponding NEXT-INDEX value, even though the QuarkDB daemon was not reachable. It was shown as OFFLINE only when they tried to power the server back on. Not sure if that’s normal, so I’m reporting it.

Thanks for the logs, very helpful. This is what happened: an assertion was erroneously triggered because a follower appeared to go back in time (see the sketch after this list):

  • 216p went offline at index 6052409268
  • The forced poweroff caused rocksdb to lose the most recent still-buffered writes. This is something rocksdb (the underlying storage in QDB) can handle gracefully without data corruption; it’ll only lose recent writes.
  • 216p went online at index 6052408879
  • The leader remembered that the follower was supposed to be at 6052409268, and freaked out.
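
To illustrate, here is a minimal sketch of the kind of leader-side sanity check that fired - not the actual QuarkDB code, just the shape of the invariant that the lost buffered writes violated:

    #include <cstdint>
    #include <sstream>
    #include <stdexcept>

    using LogIndex = int64_t;

    // Hypothetical bookkeeping for one replication target. The leader remembers
    // how far it believes each follower has progressed.
    struct ReplicaTracker {
      LogIndex lastKnownIndex = 0;

      // Called when the follower reports the size of its journal. An over-strict
      // assertion like this one fires if the follower comes back slightly behind
      // where the leader remembered it, e.g. after losing buffered writes in a
      // forced poweroff.
      void observe(LogIndex reportedIndex) {
        if(reportedIndex < lastKnownIndex) {
          std::ostringstream ss;
          ss << "follower went back in time: expected " << lastKnownIndex
             << ", got " << reportedIndex;
          throw std::runtime_error(ss.str()); // stand-in for FatalException
        }
        lastKnownIndex = reportedIndex;
      }
    };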

No worries, everything should be back to normal after the restart. I’ll fix it in version 0.4.1, thanks for reporting this bug!

Interesting that NEXT-INDEX kept increasing; in the logs the server is correctly reported as going offline. I’ll take a look.

As for the MQ, I’m still not quite sure what happened; I’ll have a look on Monday too. Have a nice weekend!

By the way, NEXT-INDEX increasing is probably due to a high TCP timeout. When the server went dark and all incoming packets were DROPped, the asynchronous replication machinery kept on buffering entries, unaware the TCP connection was hanging.

This is just a cosmetic issue: NEXT-INDEX keeps track of which entries have been sent into qclient, which can buffer a large number of requests. MATCH-INDEX tracks those which have been sent and acknowledged, and is what’s taken into account when deciding whether a write has been safely replicated. Raft was still working as it should. :slight_smile:
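
In raft terms the two counters look roughly like this - a sketch using the usual raft naming, not QuarkDB’s actual data structures:

    #include <cstdint>
    #include <vector>

    using LogIndex = int64_t;

    // Per-follower replication state, as commonly described for raft.
    struct FollowerProgress {
      LogIndex nextIndex = 0;   // next entry to send; advances as soon as entries
                                // are pushed into the qclient buffer, even if the
                                // TCP connection is silently hanging
      LogIndex matchIndex = 0;  // highest entry the follower has acknowledged;
                                // only this matters for commit decisions
    };

    // An entry counts as safely replicated once enough matchIndex values (plus
    // the leader's own copy) have reached it; nextIndex plays no part in it.
    inline bool safelyReplicated(LogIndex entry,
                                 const std::vector<FollowerProgress>& followers,
                                 size_t quorumSize) {
      size_t acks = 1; // the leader itself holds the entry
      for(const FollowerProgress& f : followers) {
        if(f.matchIndex >= entry) acks++;
      }
      return acks >= quorumSize;
    }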

Now that I think about it, maybe we should display MATCH-INDEX in raft-info, not NEXT-INDEX… it would be less confusing. MATCH-INDEX would not have kept increasing.

How long was the server down? I’ve heard stories of the default TCP timeout sometimes lasting half an hour. This would explain why it was shown as offline as soon as the server came back - it started REJECTing packets, the TCP connection died immediately, and QDB showed the node as offline.

Maybe I should decrease the TCP timeout to something more reasonable in QDB, like 30 seconds.
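
On Linux, one way to bound how long a silently-dropping connection can hang is the TCP_USER_TIMEOUT socket option - a sketch of the idea, not necessarily how QDB will end up doing it:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    // Cap how long unacknowledged data may sit on a connection before the kernel
    // declares it dead. With ~30 seconds, a peer whose firewall silently DROPs
    // packets is detected far sooner than with the kernel defaults.
    bool setConnectionTimeout(int fd, unsigned int timeoutMilliseconds = 30000) {
      return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                        &timeoutMilliseconds, sizeof(timeoutMilliseconds)) == 0;
    }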

By the way, both issues (crash + TCP connection stalling) were caused by the server being forcefully powered off - I should add such scenarios to the QDB tests to better control what happens in these cases; this is something that’s not currently being extensively tested. Thanks a lot for reporting the issue!

Thank you Georgios for your explanations.

Yes, I agree that this is probably the scenario, and that it was caused by the server disappearing without any notice. And I understand that this is a case one doesn’t really think to test, and that it is also not straightforward to simulate.

I’m not sure exactly how long the server was off, but around 30 minutes, indeed. Around the time it was started, these messages appeared in the leader’s log; I cannot say whether they are a consequence of the server restart or of the TCP timeout.

[1578648680749] QCLIENT (ERROR): Bad return value from send(): -1, errno: 32,Broken pipe
[1578648680814] WARNING: Hiccup during streaming replication of ...jeop216p:7777, switching back to conservative replication.
[1578648681336] EVENT: Replication target ...jeop216p:7777 went offline.

I also agree that displaying MATCH-INDEX in raft-info seems more useful, now that I understand the difference. Is there currently any other way to query the MATCH-INDEX value?

But if MATCH-INDEX is what the followers have acknowledged, isn’t it strange that it decreased after the forced shutdown? Or does the acknowledgement occur once the write reaches the cache, rather than after a flush to disk?

Anyway, I’m glad that we could help in identifying some potential improvements. In any case, QuarkDB was quite robust in this situation, especially in surviving an OS disk disappearance!

You are correct that acknowledgement happens after the write lands in the OS page cache, not after a flush to disk. A recent local write might not be persisted if the kernel crashes badly, or the power goes off. It will be persisted if just QuarkDB crashes, however.

fsyncing to disk for every write would decrease performance by a factor of 100 to 1000… There is such an option in rocksdb, but we don’t enable it due to the performance penalty.
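
For context, the rocksdb knob in question is WriteOptions::sync - a minimal sketch of the difference, not QuarkDB’s actual write path:

    #include <rocksdb/db.h>
    #include <string>

    // Append one journal entry. With sync=false (the default) the write only has
    // to reach the OS page cache; with sync=true every write waits for the WAL to
    // hit stable storage, which is where the 100x-1000x slowdown comes from.
    rocksdb::Status appendEntry(rocksdb::DB* db, const std::string& key,
                                const std::string& value, bool fsyncEveryWrite) {
      rocksdb::WriteOptions options;
      options.sync = fsyncEveryWrite;
      return db->Put(options, key, value);
    }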

So QuarkDB could in theory lose recent, acknowledged distributed writes if the kernels of two or more of the three nodes crash (with the leader being one of them), or are powered off at the exact same moment. If only one node is powered off (as in your case), no data will be lost, since some other node will have it.

Given the increase in performance, I think such a tradeoff is worth it - I could add an option to enable fsync after every journal write if you’re interested in experimenting with it. Since QuarkDB batches multiple writes into each journal entry whenever possible, the performance penalty might even be acceptable.

Nope, I’ll add it :slight_smile:

Thank you for these additional explanations. I asked about fsync just to better understand what happened; I wasn’t expecting fsync to occur on every write, as that would indeed be performance-killing. I was wondering whether maybe one fsync occurs every 1000 or 10K entries, or every 10 seconds, and all the writes up to that point are then acknowledged. But since acknowledged does not automatically mean written to disk, the scenario you describe sounds realistic.

No need (for us) to add a systematic fsync option; it is surely not desirable for performance (and maybe also for disk stress). I agree with you that the risk of losing data is close to 0 (and little data at that). As long as the leader no longer crashes when a follower loses some entries, it seems OK as it is now. Hopefully this situation will not occur again anytime soon :slight_smile:

Have a nice weekend.

Hi Georgios,

I have another question: are journal entries identified just by their sequential index number, or is there some signature/hash to discriminate between them?

I was thinking of the scenario where all nodes are stopped at the same time (e.g. by a general power failure) and end up with different indices written to disk (let’s call them 100, 150 and 200). If the one with the highest index (200) comes back last, the two others might have resumed writing at 150 and reached an index above 200. Can this last node detect that what it holds in the interval 150-200 differs from what the other nodes have, or would it take over from 200, resulting in a kind of corrupted database?

Hi Franck,

Nice scenario! Raft is quite the rabbit hole, eh?

Each journal entry is identified by its position in the journal, and the term of the leader who wrote it. In the situation you describe:

  • Due to lost entries from a sudden poweroff, node #1, #2, and #3 are at 100, 150, and 200 respectively.
  • Node #1 and node #2 form a quorum after the poweroff, without #3.
  • The cluster operates normally; the journals of #1 and #2 reach 300 total entries.
  • Node #3 comes back online.

We’ll be able to tell the difference, as the conflicting journal entries will have been written during different terms. There’s actually a mechanism in raft to remove non-committed conflicting journal entries, detectable through the term number; this can happen rarely during leader transitions and is a normal part of raft. (The redis commands corresponding to any conflicting entries will not receive a positive acknowledgment, so no acknowledged writes are lost.)
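
To make the (index, term) identification concrete, here is a hedged sketch of how such a conflict can be detected - simplified, not the actual QuarkDB journal code:

    #include <cstdint>
    #include <vector>

    using LogIndex = int64_t;
    using RaftTerm = int64_t;

    // A journal entry is identified by its position plus the term of the leader
    // that wrote it; the payload (the redis command) is omitted here.
    struct JournalEntry {
      RaftTerm term;
    };

    // When the leader replicates entries starting at startIndex, the follower
    // compares terms position by position. The first mismatch marks a conflict:
    // everything from there on in the follower's journal is stale.
    inline LogIndex firstConflict(const std::vector<JournalEntry>& mine,
                                  const std::vector<JournalEntry>& fromLeader,
                                  LogIndex startIndex) {
      for(size_t i = 0; i < fromLeader.size(); i++) {
        size_t local = static_cast<size_t>(startIndex) + i;
        if(local >= mine.size()) break; // nothing local left to compare against
        if(mine[local].term != fromLeader[i].term) {
          return startIndex + static_cast<LogIndex>(i);
        }
      }
      return -1; // no conflict in the overlapping range
    }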

In our scenario, I suspect the above conflicting-entries-removal mechanism will be engaged, with the leader asking #3 to remove its conflicting journal entries. However, as these have already been committed in #3, it will crash with an assertion and not allow any changes.

RaftDispatcher.cc:

    if(firstInconsistency <= journal.getCommitIndex()) {
      qdb_throw("detected inconsistent entries for index " << firstInconsistency << ". "
      << " Leader attempted to overwrite a committed entry with one with different contents.");
    }

(There are tests which attempt to remove committed entries and verify that a FatalException is thrown.)

Basically, the first quorum that is formed after a severe incident such as this one “wins”, and any later nodes with conflicting committed entries that attempt to join will keep crashing and refuse to participate.

We’d then have to manually inspect the contents of #3 to recover any additional data (in case those original entries contained something we’d like to keep), then wipe out the contents of #3 entirely and let the leader resilver it from scratch.

By the way, I’m adding an option controlling whether to fsync the journal on every write, so we can check the performance impact.

Moreover, a background thread will fsync once every second. This should not affect performance, and it limits any potential data loss from a poweroff of all nodes to the writes of the last second. (same as official redis)
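
A minimal sketch of such a background thread, assuming a rocksdb-backed journal (FlushWAL(true) syncs the write-ahead log) - an illustration of the idea rather than the actual QuarkDB implementation:

    #include <rocksdb/db.h>
    #include <atomic>
    #include <chrono>
    #include <thread>

    // Sync the write-ahead log once per second in the background: writes are
    // still acknowledged from the page cache, but a poweroff of all nodes can
    // only lose roughly the last second of entries.
    class JournalSyncer {
    public:
      explicit JournalSyncer(rocksdb::DB* db)
        : db_(db), thread_([this] { run(); }) {}

      ~JournalSyncer() {
        stop_ = true;
        thread_.join();
      }

    private:
      void run() {
        while(!stop_) {
          db_->FlushWAL(true /* sync */);
          std::this_thread::sleep_for(std::chrono::seconds(1));
        }
      }

      rocksdb::DB* db_;
      std::atomic<bool> stop_ {false};
      std::thread thread_;
    };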

Thank you very much for this very detailed explanation. Yes, if the raft term is also used to identify a journal entry, then I understand that we can discriminate, and preventing a follower from joining the cluster once it has diverged seems like the right protective measure.

Thank you also for the changes with the fsync option and background thread, that sounds perfect!

About the MQ crash, can we legitimately conclude that it was caused by the QuarkDB unavailability?

It would be quite a coincidence if it was not. :slight_smile:

Was QuarkDB unavailable for long, or just 1-10 seconds? If only a few seconds, the bug must be that for some reason the MQ does not retry its requests toward QDB, gets an unexpected NULL reply object somewhere, and crashes when trying to access it. I haven’t had time today to look into that; I will soon.

We didn’t detect that it was unavailable; it was probably unavailable only for the time needed to switch leader after the crash, so less than 10 seconds. The MQ crash was what made eos unavailable. We only understood from the same logs we sent you that there was also a QuarkDB crash on the QDB side, so you probably understand more of it than we do. In fact, the quorum was always maintained, but at some point there was probably one leader node plus one outdated follower; the follower then caught up quite quickly, and the former leader came back thanks to the automatic restart by systemd.

By the way, do you know if it is possible/desirable to also set up auto-restart of the MQ, in case this happens again in the future?

I don’t see why not; it’s generally a good idea to have systemd auto-restart crashed servers to minimize disruption in cases like these.
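
For example, a systemd drop-in along these lines would do it (the eos@mq unit name follows the log excerpt above; adjust paths and timings to your setup):

    # Create an override so systemd restarts the MQ whenever it exits abnormally.
    mkdir -p /etc/systemd/system/eos@mq.service.d
    printf '[Service]\nRestart=on-failure\nRestartSec=10\n' \
      > /etc/systemd/system/eos@mq.service.d/restart.conf
    systemctl daemon-reload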

By the way, you can check out the changes to fsync policy here: Fsync policy - QuarkDB Documentation

Cheers,
Georgios

Thank you very much Georgios !

Is this planned for release in an upcoming version?

Yep, you’re in luck: I just did the 0.4.1 release today, as Luca would like to have the fsync improvements in our own instances as well:

Cheers,
Georgios

Thank you, we might also install it soon, as we had planned the upgrade to v0.4.0.