Managing failover of MGM

I’d be interested in hearing how other sites handle DNS and virtual IP when the MGMs failover.

We have a separate DNS name and virtual IP address for the instance that points to the current master.

Most cluster management software has a way to move a virtual IP for cluster services between the nodes of the cluster when the leader changes.

Since moving to QuarkDB for the namespace and using the new failover methodology, I’ve noticed times where the MGM gets busy and the other MGM takes over as master, presumably because the busy node does not answer within a predetermined number of seconds. Since the virtual IP (to which the instance DNS points) remains on the original MGM, this can cause us some problems, like FUSE clients pointing to the wrong place, etc.

I’ve played with a cron job on test instances to move this virtual IP with some success, but this seems a bit less than ideal.
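
A minimal sketch of the idea, assuming the local MGM’s master state can be read from the eos ns output and the VIP is plumbed with plain ip commands (the VIP, netmask, interface name and the master-detection pattern below are placeholders that would need adapting):

#!/bin/bash
# Keep the instance VIP on whichever node currently runs the master MGM.
# VIP, NETMASK, IFACE and the master-detection pattern are assumptions to adapt.
VIP=192.0.2.10
NETMASK=24
IFACE=eth0

# Ask the local MGM whether it is the master (run as root); the exact "eos ns"
# output differs between versions, so adjust the pattern to what your MGM prints.
if eos ns | grep -qi 'is_master=true'; then
    # master here: add the VIP if it is not already configured
    if ! ip addr show dev "$IFACE" | grep -q "$VIP/"; then
        ip addr add "$VIP/$NETMASK" dev "$IFACE"
        # a gratuitous ARP (e.g. with arping) here helps clients notice the move
    fi
else
    # not master: remove the VIP if this node still holds it
    if ip addr show dev "$IFACE" | grep -q "$VIP/"; then
        ip addr del "$VIP/$NETMASK" dev "$IFACE"
    fi
fi

Running the same check from keepalived as a track script instead of cron would at least remove the polling interval from the failover time, but the detection problem stays the same.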

I have no ability to make quick DNS changes here (was nice when I could at the old job, lol), so I can’t rely on a CNAME or anything else that involves DNS changes. Maybe a callout when a switch is made, or some better idea.

What do the developers consider to be best practices for this setup?


Dan Szkola
FNAL

Hi Dan,

We have a new EOS setup, 4.8.4 here in Vienna.
We’re running a 3-node MGM+QuarkDB setup. So far the setup is quite stable.
xrdcp and similar clients get redirected to the current master MGM as expected.
So for most connections we can use a DNS round-robin entry that points to all 3 MGMs.

However, the fusex client gets a “too many redirects” error when pointed to that RR entry.
Therefore we have a “master.eos.example.com” record that points to the “current” MGM master.
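
A rough sketch of the records involved, with placeholder names and addresses (only the shape matters; keeping the TTL of the master record low shortens the window in which clients keep resolving the old master):

; zone eos.example.com (placeholder addresses)
mgm-1.eos.example.com.   300 IN A 192.0.2.1
mgm-2.eos.example.com.   300 IN A 192.0.2.2
mgm-3.eos.example.com.   300 IN A 192.0.2.3
; round-robin name used by xrdcp etc. - resolves to all three MGMs
eos.example.com.         300 IN A 192.0.2.1
eos.example.com.         300 IN A 192.0.2.2
eos.example.com.         300 IN A 192.0.2.3
; name the fusex clients mount, pointing to the current master, flipped on failover
master.eos.example.com.   60 IN A 192.0.2.1
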
It just so happened that we had an MGM failover during the weekend, and we temporarily lost all fusex mounts, as the DNS record was not updated automatically.
This morning I did a “ns master mgm-1.eos.example.com” to bring the mgm master back to the machine that is pointed to by the “master” DNS record.
Fusex mounts immediately started to be functional again.

I don’t consider this a real solution. Imho the fusex client should be able to do service discovery from the DNS RR entry correctly. The setup loses a lot of ease of operation if you have to manually flip a DNS record on an HA failover.

Best,
Erich

Hi,

This post may be of interest to you guys: Config in quarkDB for master/slave(s)

Elvin or someone else may be able to give an update too on whether this is still considered best practice.

I am also interested in having some more information about this.
We currently don’t use any failover, but the redirect mechanism that sends any client connecting to a slave to the known master only works if the MGM the client contacts is still alive. The same might apply to the FSTs: they boot with the current master, but if this one dies, can they still operate with the newly elected master?

From what I gathered, a way to have an immediate failover in case of failure is to use an extra MGM daemon in front of the main MGMs, and use the eos route command to define the list of MGMs serving some path, for instance:

eos route link /eos/instance1 mgm1.eos.example.com:1094,mgm2.eos.example.com:1094,mgm3.eos.example.com:1094

This frontend MGM should then be able to detect which MGM is the current master.

Of course, the next question is: what if this redirect MGM fails? Maybe a failover in front of it might be needed, but crashes of such a simple daemon are probably less likely to happen.

I have tried that at some point, but I couldn’t get it working; maybe I couldn’t find the correct configuration. Since this MGM has no namespace, could it be configured without QuarkDB? Like an empty in-memory namespace.

Does anyone have a setup like this? Or could someone from CERN give a configuration example?

Another question about the slave MGM: when launching a slave MGM, it correctly boots as a slave, recognizes the master and shows the namespace content, but the commands eos fs ls or eos group ls give empty results, and eos node ls shows ??? in the activated column. Not sure if this is the expected behavior…

Hi Franck,

I will try to answer some of your questions:

  • if a new master is elected then the FSTs will indeed connect to the new master, as this one broadcasts to all the nodes the fact that it is the new master.
  • if you use one more redirector in front, you can indeed end up in the same situation if this machine dies - there is no bulletproof solution to this
  • for such a redirector you can indeed use a simple in-memory setup or just a standalone QDB namespace on the same machine (see the sketch after this list)
  • CERN uses such a setup for the “home” instances i.e. in front of the CERNBOX subinstances
  • if you use one of the 4.8 releases, when the slave starts it will also load the configuration properly, but any changes done on the master will not be reflected on the slave. This was done to simplify the synchronization step and also since there was no clear benefit to all this. Once the slave becomes a master it will reload all the config from the QDB backend, so everything will be up to date.
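
As a very rough sketch, not a verified configuration (the nslib paths and the qdb directives below are the usual defaults and should be checked against the xrd.cf.mgm shipped with your version), the relevant part of /etc/xrd.cf.mgm on such a front-end could look like:

# front-end redirector MGM: either an empty in-memory namespace ...
mgmofs.nslib /usr/lib64/libEosNsInMemory.so

# ... or a QuarkDB namespace served by a standalone QDB on the same machine
# mgmofs.nslib /usr/lib64/libEosNsQuarkdb.so
# mgmofs.qdbcluster localhost:7777
# mgmofs.qdbpassword_file /etc/eos.keytab

and then the routes to the real instance are defined on that front-end with eos route link, as in your example above.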

Cheers,
Elvin

Thank you for your answers, that clarifies the behaviour we get when running several MGMs on the same instance.

The mechanism that notifies the FSTs of the new master is the one that is missing for the FUSE(x) clients; as a result, when a master-slave switch occurs the clients do not connect to the new master, the user gets an Input/output error, and the log says Redirect limit has been reached, as mentioned by Erich above.

In fact, I observe that after a switch the old master still shows active clients:

# ------------------------------------------------------------------------------------
ALL      eosxd caps                       0
ALL      eosxd clients                    7
ALL      eosxd active clients             2
ALL      eosxd locked clients             0

Although we can’t know which ones:

 # eos fusex ls
error: you have to be root to list VSTs (errc=1) (Operation not permitted)

If the old master is still alive, it could inform the fusex clients which one is the new master. But if it crashes, then the only way is to update the DNS RR record.

Hello,
I would like to share here some observations from a test performed today on a small EOS test cluster.
Last week I had set up a 3-node cluster, each node being a QuarkDB node, EOS manager and EOS fileserver (FST) with 2 filesystems.
Today I abruptly deleted the master manager to see what would happen. There is no DNS alias on the managers.
I observed that another MGM did get the “master” role.
But: All filesystems were “offline”.
Restarting the master MGM did not cure the problem.
I had to restart the FST service on the 2 remaining nodes in order to have the corresponding filesystems “online” on the new master MGM. Before that I could not write to EOS (no space).
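
For reference, with the standard packaging that restart is something like the following on each FST node (assuming the eos@fst template unit shipped with the EOS packages):

systemctl restart eos@fst
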
Not sure if this is expected; however, having read that the state of the configuration is not necessarily propagated to the slave managers, I am wondering if the gain is worth it: it is a bit troublesome not to observe the real state of things when running “eos …” commands on a slave MGM…
JM

Hi JM,

Can I get a bit of context related to your setup?
What version of eos are you running?
Are you running also an MQ daemon along with each MGM daemon?
Do you have the MQ daemon configuration containing the QDB cluster information? Something like the following in /etc/xrd.cf.mq:

mq.qdbcluster <some_qdb_machine:7777>
mq.qdbpassword_file /etc/eos.keytab

Thanks,
Elvin

Hi Elvin,
Sure…
EOS_SERVER_VERSION=4.7.7
Yes: MQ running along with MGM
Yes: MQ knows about QDB:

[root@eosmgr02 ~]$ more /etc/xrd.cf.mq
[…]
mq.qdbpassword my_veryloooooonnnnnnnnggg_quarkdb_password
mq.qdbcluster eosmgr01.in2p3.fr:7777 eosmgr02.in2p3.fr:7777 eosmgr03.in2p3.fr:7777

I am having other issues currently in this testbed (DNS), so I cannot assure you of proper debugging… Sorry about that.
JM

Hi JM,

Thanks for the details. When you say “I abruptly deleted the master manager”, does this mean deleting the entire VM, i.e. shutting down both the current MGM and MQ daemons? If this is the case, then indeed the FSTs will be offline, since the new master info is currently propagated through the MQ, and if you stop both daemons (MGM & MQ) then the FSTs don’t know who the new MQ (master) is and cannot receive updates from the new MGM master. Note that the new master updates are propagated through the MQ daemon.

All this will be covered once we move the MQ functionality into QDB, which should not be long now.

Cheers,
Elvin

Elvin,
Yes, I deleted the VM, so the MGM and MQ daemons on this machine disappeared simultaneously.
I understand what you say about MQ’s role. Now, what I did could correspond to a real-life use case, so it could be worth describing the steps to recover in such a case. What would you do to recover if an MGM+MQ+QDB server suddenly crashed and could not be restarted immediately?
Thanks
JM

Hi JM,

For the moment there is no workaround for this case. You need to restart the FSTs as you did initially. In the future, when we drop the MQ daemon and rely only on QDB to do this message passing, this scenario will be covered.

Cheers,
Elvin

I have made some tests with the fusex clients (and MGM server 4.7.7) in the case where the master switches and the DNS alias record is not changed, i.e. still points to the old master. The fusex clients 4.5.x+ return “Input/output error” and the message “Redirect limit has been reached”. When the DNS alias record changes and some time passes (for the cache to expire), the mount seems to be restored.

From time to time, there is a log line saying:

200813 12:16:59 t=1597313819.670364 f=TrackMgm         l=WARN  tid=00007f6fefbff700 s=eosfuse:6009             reconnecting mqtarget=tcp://dns-alias:1100 => mqtarget=tcp://master-hostname:1100
200813 12:17:01 t=1597313821.792727 f=connect          l=NOTE  tid=00007f6fefbff700 s=md:141                   connected to tcp://dns-alias:1100

Which seems correct. But this doesn’t last long; it reconnects to the old master and fails again with “Redirect limit has been reached”:

200813 12:17:01 t=1597313821.792783 f=TrackMgm         l=WARN  tid=00007f6fefbff700 s=eosfuse:6009             reconnecting mqtarget=tcp://dns-alias:1100 => mqtarget=tcp://dns-alias:1100
200813 12:17:01 t=1597313821.792797 f=fetchResponse    l=ERROR tid=00007f6fefbff700 s=backend:422              fetch-exec-ms=7.00 sum-query-exec-ms=7.00 ok=0 err=1 fatal=1 status-code=306 err-no=0
200813 12:17:01 t=1597313821.792812 f=fetchResponse    l=ERROR tid=00007f6fefbff700 s=backend:424              error=status is NOT ok : [FATAL] Redirect limit has been reached 306 0

It seems that in such a situation the eosd clients and fusex clients < 4.5.x (tested with 4.4.23) correctly access the mount, but these fusex clients do not appear in eos fusex ls on the master, and are disturbed (i.e. they hang) if the slave is stopped.

Is that the expected behavior, or could we have some configuration problem preventing it from working correctly?

I will also test with some newer versions of the MGM. Edit: with the latest version 4.8.11, the behaviour seems to be the same.