QuarkDB force leader election

franck-jrc · January 8, 2020, 3:07pm

Hi,

We would like to force a leader election on a our running cluster. Is there a way to do it without stopping the leader ?

We have the current situation where the RAID controller managing the OS on the leader node has failed, which results in the terminal being unavailable on the server (nor SSH nor console), but the QuarkDB service still runs. Stopping the server would result in the cluster running with only 2 nodes until we manage to fix the server, or provision a replacing member. We would prefer to have it still running, but only as a follower.

@gbitzes maybe you have some suggestion about this ?

Thank you

gbitzes · January 8, 2020, 3:22pm

Hi Franck,

Yes, this is possible: Run command raft-attempt-coup in one of the follower nodes. This will trigger an artificial timeout (as if the node had not been contacted by the leader for a long time), and thus, elections.

There’s no way to force the outcome of an election though, that would break the raft algorithm. There’s a chance the old leader will win again the election round, becoming leader again. If that happens, run raft-attempt-coup in a follower again and again until it works.

Cheers,
Georgios

gbitzes · January 8, 2020, 3:27pm

By the way, I’m not quite sure what you mean here:

I’m confused by “server is unavailable, but the QuarkDB service still runs”.

Is it possible to connect to the QuarkDB leader with redis-cli?
Has this problem brought down the MGM?
Is the failed server SSH-able?

franck-jrc · January 8, 2020, 3:37pm

Hi Giorgos,

Thank you for your answer, it is exactly what we were looking for, we will try that.

Yes, right, I was not specific enough.

As you guessed, the server is not SSH-able, since the OS disk has failed.
The problem didn’t bring down the MGM, the QuarkDB data are on another disk, and the follower still sees the node as leader. The redis-cli command can contact the leader, however we have a password file, and not sure if we can use this with from the command line.

gbitzes · January 8, 2020, 3:46pm

Thanks for the explanation, makes sense that you wouldn’t want this node to remain the leader.

Yes it’s possible:

Through redis-cli, use AUTH to authenticate by pasting the contents of the password-file. (Only convenient if your password file is a single line with no weird binary bytes - ending newline is ignored). AUTH password-file-contents. A disadvantage is that the password is sent in plaintext over the wire, just like in official redis-server.
Testing if the node is contactable is also possible through eos-ns-inspect, it accepts a --password-file parameter.

More on password authentication here: Password authentication - QuarkDB Documentation

franck-jrc · January 8, 2020, 4:36pm

Thank you again Giorgos!

The coup succeeded at first try, Vive la révolution !

About authentication, when using a password file, it is better to keep it in clear text on one line, so that we can connect remotely in such a case, since redis-cli can’t use the password file, right ?
It was our case, so we could double check that the failing node is not any more a leader, but still runs as a follower.
Does MGM use the signing challenge method when contacting the QuarkDB cluster ?

gbitzes · January 8, 2020, 5:31pm

Awesome, glad that it worked!

Yes, because signing challenges are a QuarkDB-specific thing, and redis-cli does not understand them. From redis-cli only AUTH is possible right now, signing challenges would be very cumbersome to do manually. (though technically possible)

I’ve been thinking of writing a quarkdb-cli tool for this reason to overcome the limitation, which could be pointed to a password file, but for now having a password file with a single line and simple copy-pasteable ASCII characters is the most convenient.

Yes: Both the MGM and eos-ns-inspect tool use signing challenges, and never send the password over cleartext. QuarkDB inter-cluster communication uses signing challenges as well.

franck-jrc · January 10, 2020, 2:54pm

Hello @gbitzes ,

This morning the failing node has been stopped, fixed, and started back.
But this wasn’t harmless for the eos instance : although the QDB cluster recovered correctly (changing leader again), we observed that the eos instance was not usable : no possible read or write. We soon understood that the MQ service went in segmentation fault, making all disks seen as offline. Starting back the MQ service fixed everything. Not sure if the MQ produces a stacktrace that we could send.

In addition, by looking further in the QuarkDB logs, it appeared that when the QDB node came back, the leader crashed with a FatalException, which caused the third node to take back as leader. Maybe these 2 crashed are linked. For this QuarkDB issue, we can provide the logs (there is a stacktrace in them), are you interested that we send them to you to help understand what could have happen ?

gbitzes · January 10, 2020, 3:15pm

Hi @franck-jrc,

Certainly, please send me any stacktraces you have - thanks. I’m not sure if the MQ prints a stacktrace in the logs when it crashes, it’s probably worth to have a look. (/var/log/eos/mq)

I see two issues:

The QuarkDB node crashing with FatalException
MQ being unable to survive QDB temporarily going down for a few seconds. Maybe the client there is misconfigured, and does not use retries, I’ll have a look. Normally, QClient should completely hide any transient unavailability of QDB…

Is the QDB node that crashed back online, or are you running on 2/3?

franck-jrc · January 10, 2020, 4:26pm

I just sent you the logs via email (for info, the MGM/MQ runs on the same host as the QuarkDB node, currently the leader)

I didn’f find any additional trace about the MQ crash, I inserted the xrdlog.mq file in the mail. The only additional info might be these lines in /var/log/messages file :

Jan 10 13:45:21 ...jeop214p kernel: xrootd[73952]: segfault at 7f6b1e49c001 ip 00007f6aadcc0eae sp 00007f6a93ffd580 error 4 in libjemalloc.so.1[7f6aadcb3000+31000]
Jan 10 13:45:21 ...jeop214p sh: /bin/sh: line 1: 70959 Segmentation fault      $XRDPROG -n mq -c /etc/xrd.cf.mq -l /var/log/eos/xrdlog.mq -s /tmp/xrootd.mq.pid -Rdaemon
Jan 10 13:45:21 ...jeop214p systemd: eos@mq.service: main process exited, code=exited, status=139/n/a
Jan 10 13:45:21 ...jeop214p systemd: Unit eos@mq.service entered failed state.
Jan 10 13:45:21 ...jeop214p systemd: eos@mq.service failed.

The QuarkDB cluster is fully back 3/3, systemd restarted the QuarkDB node that crashed, and it took over as follower without further complaint.

Another thing that we observed is that while the failed node was (forcedly, because no access to terminal) powered off for fixing, the previous leader was still seeing it as ONLINE and UP-TO-DATE, and increasing its corresponding NEXT-INDEX value, although the QuarkDB daemon was not accessible. It was seen as OFFLINE only when they tried to power on the server. Not sure if that’s normal, so I’m reporting it.

gbitzes · January 10, 2020, 5:00pm

Thanks for the logs, very helpful. This is what happened, an assertion was erroneously triggered as a follower appeared to go back in time:

216p went offline at index 6052409268
The forced poweroff caused rocksdb to lose the most recent still-buffered writes. This is something rocksdb (the underlying storage in QDB) can handle gracefully without data corruption, it’ll only lose recent writes.
216p went online at index 6052408879
The leader remembered that the follower was supposed to be at 6052409268, and freaked out.

No worries, everything should be back to normal after the restart. I’ll fix it in version 0.4.1, thanks for reporting this bug!

Interesting that NEXT-INDEX kept increasing, in the logs the server is correctly reported as going offline, I’ll take a look.

As for the MQ still not quite sure what happened, I’ll have a look on Monday too. Have a nice weekend!

gbitzes · January 10, 2020, 5:32pm

By the way, NEXT-INDEX increasing is probably due to a high TCP timeout. When the server went dark and all incoming packets were DROPped, the asynchronous replication machinery kept on buffering entries, unaware the TCP connection was hanging.

This is just a cosmetic issue, NEXT-INDEX keeps track of which entries have been sent into qclient, which can buffer a large number of requests. MATCH-INDEX tracks those which have been sent and acknowledged, and is what’s taken into account in decisions on whether a write has been safely replicated. Raft was still working as it should.

Now that I think about it, maybe we should display MATCH-INDEX in raft-info, not NEXT-INDEX… it would be less confusing. MATCH-INDEX would not have kept increasing.

How long was the server down? I’ve heard stories of the default TCP timeout sometimes lasting half an hour. This would explain why it was shown as offline as soon as the server came back - it started REJECTing packets, the TCP connection died immediately, and QDB showed the node as offline.

Maybe I should decrease the TCP timeout to something more reasonable in QDB, like 30 seconds.

gbitzes · January 10, 2020, 5:38pm

By the way, both issues (crash + TCP connection stalling) were caused by the server being forcefully powered off - I should add such scenarios in QDB tests to better control what happens in such cases, this is something that’s not currently being extensively tested. Thanks a lot for reporting the issue!

franck-jrc · January 10, 2020, 6:24pm

Thank you Georgios for your explanations.

Yes, I agree that the scenario is probably this one, and is due to the server disappearing without any notice. And I understand that this is a test that we don’t really think about, and also not straightforward to simulate.

I’m not sure about the exact time the server was off, but around 30 minutes, indeed. Around the time it was started, this message appeared in the leader’s log, I cannot say if this is the consequence of server restart, of the TCP timeout.

[1578648680749] QCLIENT (ERROR): Bad return value from send(): -1, errno: 32,Broken pipe
[1578648680814] WARNING: Hiccup during streaming replication of ...jeop216p:7777, switching back to conservative replication.
[1578648681336] EVENT: Replication target ...jeop216p:7777 went offline.

I also agree that displaying MATCH-INDEX in raft-info seems more useful, now that I understood the difference. Is there currently any other way to query the MATCH-INDEX value ?

But if MATCH-INDEX is what the followers acknowledge, it seems strange that it decreased after forced shutdown ? Or does the acknowledge just occurs after write in cache, and not flush on disk ?

Anyway, I’m glad that we could help in identifying some potential improvement. In any case, QuarkDB was quite robust in this situation, especially in surviving an OS disk disappearance!

gbitzes · January 10, 2020, 6:41pm

You are correct that acknowledging happens after write in OS page cache, not flush on disk. A recent local write might not be persisted if the kernel crashes badly, or the power goes off. It will be persisted if just QuarkDB crashes, however.

fsyncing to disk for every write would decrease performance by a factor of 100 to 1000… There is such an option in rocksdb, but we don’t enable it due to the performance penalty.

So QuarkDB could in theory lose recent, acknowledged distributed writes if two or more of the three node’s kernels crash (with the leader being one of them), or are powered off at the exact moment. If only one node is powered off (like in your case), no data will be lost as some other node will have it.

For the increase in performance, I think such a tradeoff is worth it - I could add an option to enable fsync after every journal write if you might be interested to experiment with it. Since QuarkDB batches multiple writes into each journal entry whenever possible, the performance penalty might even be acceptable.

Nope, I’ll add it

franck-jrc · January 11, 2020, 10:05am

Thank you for these additional explanations. I asked about fsync or not just to better understand what has happened, but I wasn’t expecting fsync to occur at every write, it would indeed be performance-killing. I was wondering that maybe one fsync occurs every 1000, 10K entries, or 10 seconds, then all the writes are acknowledge up to that point. But since acknowledged not automatically means written on disk, then the scenario of what happens sounds realistic.

No need (for us) to add a systematic fsync option, it is for sure not desirable for performance (maybe also for disk stress). I agree with you that the risk of loosing data is close to 0 (and few data). As long as the leader doesn’t crash any more when a follower loses some entries, it seems OK as it is now. Hopefully, this situation will not occur again any when soon

Have a nice week-end.

franck-jrc · January 13, 2020, 11:45am

Hi Georgios,

I have another question : are the index identified just by their sequential number, or is there some signature/hash to discriminate ?

I was thinking of the scenario were all nodes are stopped at the same time (e.g. for a general power failure) and write on the disk different indices (let’s call them 100, 150 and 200). If the one that has the highest index number (200) comes back last, the two others might have started back writing at 150, and reached an index above 200. Can this last node detect that when it has for interval 150-200 is different from what the other ones have, or would it take over from 200, resulting in kind of corrupted database ?

gbitzes · January 13, 2020, 12:23pm

Hi Franck,

Nice scenario! Raft is quite the rabbit hole, eh?

Each journal entry is identified by its position in the journal, and the term of the leader who wrote it. In the situation you describe:

Due to lost entries from a sudden poweroff, node #1, #2, and #3 are at 100, 150, and 200 respectively.
Node #1 and node #2 form a quorum after the poweroff, without #3.
The cluster operates normally, #1 and #2 journals reach 300 total entries.
Node #3 comes back online.

We’ll be able to tell the difference as the conflicting journal entries will have been written during different terms. There’s actually a mechanism in raft to remove non-committed conflicting journal entries, detectable through the term number, which can happen rarely during leader transition and is a normal part of raft. (The redis commands corresponding to any conflicting entries will not receive a positive acknowledgment, no acknowledged writes are lost)

In our scenario, I suspect the above conflicting-entries-removal mechanism will be engaged, with the leader asking #3 to remove its conflicting journal entries. However, as these have already been committed in #3, it will crash with an assertion and not allow any changes.

RaftDispatcher.cc:

    if(firstInconsistency <= journal.getCommitIndex()) {
      qdb_throw("detected inconsistent entries for index " << firstInconsistency << ". "
      << " Leader attempted to overwrite a committed entry with one with different contents.");
    }

(There’s tests which attempt to remove committed entries, and verify a FatalException is thrown)

Basically, the first quorum that is formed after a severe incident such as this one “wins”, and any later nodes with conflicting committed entries that attempt to join will keep crashing and refuse to participate.

We’d then have to manually inspect the contents of #3 to recover any additional data (in case those original entries contained something we’d like to keep), then wipe out the contents of #3 entirely and let the leader resilver it from scratch.

gbitzes · January 13, 2020, 5:17pm

By the way, I’m adding an option on whether to fsync the journal on every write, so we can check the performance impact.

Moreover, a background thread will do fsync once every second. This should not affect performance, and limit any potential data loss during poweroff in all nodes to writes during the last second. (same as official redis)

franck-jrc · January 13, 2020, 6:03pm

Thank you very much for this very detailed explanation. Yes, if the raft term is also used to identify a journal entry, then I understand that we can discriminate, and prevent a follower to join a cluster if it has diverged seems the good protective way.

Thank you also for the changes with the fsync option and thread, that sounds perfect !

About the MQ crash, can we legitimately conclude that it was caused by the QuarkDB unavailability ?

CERN Accelerating science

QuarkDB force leader election