FSTs don't follow new master

Hi all,
We upgraded to EOS 4.8.27 in November, and it has been a smooth ride so far. Now we have started to observe behavior that looks like the FSTs not following correctly when the MGM master changes:

Between yesterday and today we had a full disk on the MGM master, which caused a failover - the failover itself worked correctly. However, the filesystems were then marked as “offline”, and their respective FST nodes as “unknown”:

# eos node ls
nodesview     fst-1.eos.grid.vbc.ac.at:1095 vbc::rack1::pod1    unknown           on     on          0       10      120                ~    28 
nodesview     fst-2.eos.grid.vbc.ac.at:1095 vbc::rack1::pod1    unknown           on     on          0       10      120                ~    28 

The filesystems are shown as offline; some individual ones are also shown as “off” - no more than a handful of the 250 filesystems in total.

# eos fs ls
fst-4.eos.grid.vbc.ac.at 1095    104                /srv/data/data.19       default.19 vbc::rack1::pod2       booted             rw      nodrain  offline        no mdstat 
fst-4.eos.grid.vbc.ac.at 1095    105                /srv/data/data.20       default.20 vbc::rack1::pod2       booted             rw      nodrain  offline        no mdstat 
fst-4.eos.grid.vbc.ac.at 1095    106                /srv/data/data.21       default.21 vbc::rack1::pod2       booted            off      nodrain  offline        no mdstat 

For now we have to stop/start the FST process on the FST nodes to get these filesystems back (some “off” filesystems only come back after a second restart of the FST xrootd process).
Currently I think the root cause is the node not correctly re-connecting after the MGM failover. I can reproduce this on a test system (3 MGM, 2 FST) simply by doing an “eos ns master other” to move the master away from the current MGM node.
“eos node ls” will then show the FST nodes as “offline” after some timeout. I have also tried sending an “eos fs boot *”, but this seems to have no effect on the FSTs.
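
For reference, the workaround and the reproduction look roughly like this (the MGM hostname is just an example, and I’m assuming the standard eos@ systemd units; adjust to your service setup):

On the test MGM, move the master away from the current node:
# eos ns master eos-mgm-2.example.org

On each affected FST node, restart the FST xrootd process (sometimes twice):
# systemctl restart eos@fst

Then check whether the nodes and filesystems come back:
# eos node ls
# eos fs ls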

Any advice is appreciated,
Best
Erich

Hi Erich,

I tried to reproduce your issue but I was not successful. I suspect there might be some configuration issue with the instance in your case. Could you please share the following configuration files with me?
/etc/xrd.cf.mgm
/etc/xrd.cf.mq
one of the FST configurations /etc/xrd.cf.fst
and the /etc/sysconfig/eos_env file.

Thanks,
Elvin

Hi Elvin,
Thanks for your help. I’ve put the configs together in this gist: https://gist.github.com/ebirn/1145aa1fa33141225a6f52fa05281fe7

The config is mostly unchanged since before the update. We did make some changes to sec.protocol and sec.protbind to fix some auth issues, see also https://github.com/CLIP-HPC/clip-grid-eos/blob/master/templates/xrd.cf.mgm.j2 (you should still have an invite for this).

Best,
Erich

Hi Erich,

Thanks for the files! You should (in fact, must) remove the following environment variable, since support for MQ on QDB is experimental and not complete:
EOS_USE_MQ_ON_QDB=1

After this change please restart the MQ and MGM, and let me know if you still have issues.
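
Roughly, assuming the default /etc/sysconfig/eos_env path and the eos@ systemd units (adjust to however you manage the daemons):

Comment out (or delete) the line in /etc/sysconfig/eos_env:
#EOS_USE_MQ_ON_QDB=1

Then restart the services so they pick up the change:
# systemctl restart eos@mq
# systemctl restart eos@mgm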

Thanks,
Elvin

Hi Elvin,

Thanks for that hint. Removing the variable fixed the described problem (tested on our dev installation).
We also had to remove it on the FSTs and restart the processes there (at least to re-connect cleanly).
They are now following the master MGM again.
The group balancer is also back to normal (in one group we had to swap a disk).

Best,
Erich