Hi all,
We upgraded to EOS 4.8.27 in November, and it has been a smooth ride so far. Now we have started to observe behavior that looks like the FSTs not correctly following a change of the MGM master:
Between yesterday and today a full disk on the MGM master caused a failover, which itself worked correctly. Afterwards, however, the filesystems were marked as “offline”, and their respective FST nodes as “unknown”:
# eos node ls
nodesview fst-1.eos.grid.vbc.ac.at:1095 vbc::rack1::pod1 unknown on on 0 10 120 ~ 28
nodesview fst-2.eos.grid.vbc.ac.at:1095 vbc::rack1::pod1 unknown on on 0 10 120 ~ 28
The filesystems are shown as offline, and a few individual filesystems are also shown as “off” (no more than a handful out of 250 filesystems in total):
# eos fs ls
fst-4.eos.grid.vbc.ac.at 1095 104 /srv/data/data.19 default.19 vbc::rack1::pod2 booted rw nodrain offline no mdstat
fst-4.eos.grid.vbc.ac.at 1095 105 /srv/data/data.20 default.20 vbc::rack1::pod2 booted rw nodrain offline no mdstat
fst-4.eos.grid.vbc.ac.at 1095 106 /srv/data/data.21 default.21 vbc::rack1::pod2 booted off nodrain offline no mdstat
For now we have to stop/start the FST process on the FST nodes to get these filesystems back (some “off” filesystems only come back after a second restart of the FST xrootd process).
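For reference, our current workaround looks roughly like this. This is a sketch only: it assumes the usual `eos@fst` systemd template unit, which may differ on other deployments.

```shell
# On each affected FST node: restart the FST daemon so it
# re-registers with the new MGM master (assumes the standard
# eos@fst systemd unit; adjust for your init system).
systemctl restart eos@fst

# Some filesystems only recover after a second restart,
# so wait a bit and restart once more if needed.
sleep 60
systemctl restart eos@fst

# Then verify from the MGM side:
eos node ls            # nodes should report "online" again
eos fs ls | grep offline   # should eventually return nothing
```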
Currently I think the root cause is the nodes not correctly re-connecting after an MGM failover. I can reproduce this on a test system (3 MGMs, 2 FSTs) simply by running “eos ns master other” to move the master away from the current MGM node.
“eos node ls” will then show the FST nodes as “offline” after some timeout. I have also tried issuing an “eos fs boot *”, but this seems to have no effect on the FSTs.
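The reproduction on the test system boils down to the following sequence (the MGM hostname is a placeholder; the timeout value is only what we observed, not a documented constant):

```shell
# Force a master switch on a test cluster (3 MGMs, 2 FSTs)
# and watch the FST node status degrade.
eos ns master eos-mgm-2.example.org   # move mastership to another MGM
sleep 120                             # wait past the node heartbeat timeout
eos node ls                           # FST nodes now show "offline"/"unknown"
eos fs ls                             # filesystems show "offline"
eos fs boot \*                        # attempted recovery; no visible effect
```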
Any advice is appreciated,
Best
Erich