Hello,
I have two HA EOS clusters (on k8s) that were running v5.3.21.
When I upgraded them to v5.3.22, the FST pods failed with:
eos-fst 250923 22:49:58 time=1758660598.504591 func=QdbCommunicator level=CRIT logid=static.............................. unit=fst@eos-fst-1.eos-fst.eos.svc.kermes-dev.local:1095 tid=00007f12f4af2640 source=Communicator:663 tident= sec=(null) uid=0 gid=0 name=- geo="" xt="" ob="" msg="unable to obtain manager info for node"
It was consistent and reproducible on both clusters. I then rolled one back to v5.3.21 and still saw the same issue, so it seems related to upgrading or downgrading an HA system rather than to an issue with the specific EOS version.
The chart upgrade procedure will restart the QDBs one at a time, so they maintain a quorum at all times.
In parallel, it upgrades the non-master MGM first. The master MGM then has to be terminated manually so that it too gets upgraded; while it restarts, the other MGM may take over, unless the old master comes back up on the new version faster than the failover time.
Also in parallel, the FSTs are upgraded one at a time, and each tries to register with the current master MGM.
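For reference, that sequence boils down to roughly the following (namespace and pod names are from this deployment; the QDB quorum check assumes redis-cli is available in the image, since QuarkDB speaks the redis protocol):

# Rough sketch, not the chart's exact logic: restart QDBs one by one and only
# continue once the restarted member sees a raft leader again.
for pod in eos-qdb-2 eos-qdb-1 eos-qdb-0; do
  kubectl -n eos delete pod "$pod"
  sleep 10   # give the StatefulSet time to recreate the pod
  kubectl -n eos wait --for=condition=Ready "pod/$pod" --timeout=300s
  kubectl -n eos exec "$pod" -- redis-cli -p 7777 raft-info | grep LEADER
done

# MGM step: the chart upgrades the non-master MGM; the master (eos-mgm-1 here)
# has to be deleted by hand, and either MGM may end up as master afterwards.
kubectl -n eos exec eos-mgm-0 -- sh -c 'eos ns | grep is_master'
kubectl -n eos delete pod eos-mgm-1
kubectl -n eos exec eos-mgm-0 -- sh -c 'eos ns | grep is_master'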
The MGM and FST pods discover the QDB cluster via mgmofs.qdbcluster eos-qdb.eos.svc.kermes-dev.local:7777 in /etc/xrd.cf.*, where eos-qdb.eos.svc.kermes-dev.local is a headless service (DNS A records) that resolves to all the QDBs:
$ host eos-qdb.eos.svc.kermes-dev.local
eos-qdb.eos.svc.kermes-dev.local has address 10.224.0.77
eos-qdb.eos.svc.kermes-dev.local has address 10.224.1.80
eos-qdb.eos.svc.kermes-dev.local has address 10.224.5.203
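In case it is useful for debugging, each member that the headless service resolves to can be queried directly over the redis protocol (assuming redis-cli is available where this runs; raft-info is QuarkDB's cluster-status command) to check that all three QDBs are up and agree on the leader:

# Run from any pod or host that can reach the QDBs on port 7777.
for ip in $(host eos-qdb.eos.svc.kermes-dev.local | awk '{print $NF}'); do
  echo "== $ip =="
  redis-cli -h "$ip" -p 7777 raft-info | grep -E 'STATUS|LEADER|TERM'
done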
And the FSTs talk to the current master MGM via these env vars:
EOS_MGM_URL=root://eos-mgm.eos.svc.kermes-dev.local
EOS_MGM_ALIAS=eos-mgm.eos.svc.kermes-dev.local
and there is a Kubernetes mechanism that ensures the load-balancer address eos-mgm.eos.svc.kermes-dev.local always points to the current master MGM, by selecting the pod on which eos ns | grep -q is_master=true succeeds.
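Conceptually that mechanism amounts to something like the loop below; the label key and polling interval are made up for illustration, and the chart may implement it differently (e.g. as a readiness probe on the MGM pods):

# Keep the eos-mgm Service selecting whichever MGM pod reports is_master=true.
while true; do
  for pod in eos-mgm-0 eos-mgm-1; do
    if kubectl -n eos exec "$pod" -- sh -c 'eos ns | grep -q is_master=true'; then
      kubectl -n eos label pod "$pod" eos/mgm-role=master --overwrite
    else
      kubectl -n eos label pod "$pod" eos/mgm-role- || true   # fine if the label was not set
    fi
  done
  sleep 10
done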
During the problem state, the two MGMs agree on which is the master:
[root@eos-mgm-0 /]# eos ns|grep master
ALL Replication is_master=false master_id=eos-mgm-1.eos-mgm.eos.svc.kermes-dev.local:1094
[root@eos-mgm-1 /]# eos ns|grep master
ALL Replication is_master=true master_id=eos-mgm-1.eos-mgm.eos.svc.kermes-dev.local:1094
but disagree on the state of the registered FSTs:
[root@eos-mgm-0 /]# eos fs ls
┌────────────────────────────────────────┬────┬──────┬────────────────────────────────┬────────────────┬────────────────┬────────────┬──────────────┬────────────┬──────┬────────┬────────────────┐
│host │port│ id│ path│ schedgroup│ geotag│ boot│ configstatus│ drain│ usage│ active│ health│
└────────────────────────────────────────┴────┴──────┴────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────────┴────────────┴──────┴────────┴────────────────┘
eos-fst-2.eos-fst.eos.svc.kermes-dev.local 1095 1 /eos-storage/eos-data/eos-fst-2 default.0 rw nodrain 0.00
eos-fst-1.eos-fst.eos.svc.kermes-dev.local 1095 2 /eos-storage/eos-data/eos-fst-1 default.1 docker::k8s down rw nodrain 2.39 offline N/A
eos-fst-0.eos-fst.eos.svc.kermes-dev.local 1095 3 /eos-storage/eos-data/eos-fst-0 default.2 rw nodrain 0.00
[root@eos-mgm-1 /]# eos fs ls
┌────────────────────────────────────────┬────┬──────┬────────────────────────────────┬────────────────┬────────────────┬────────────┬──────────────┬────────────┬──────┬────────┬────────────────┐
│host │port│ id│ path│ schedgroup│ geotag│ boot│ configstatus│ drain│ usage│ active│ health│
└────────────────────────────────────────┴────┴──────┴────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────────┴────────────┴──────┴────────┴────────────────┘
eos-fst-2.eos-fst.eos.svc.kermes-dev.local 1095 1 /eos-storage/eos-data/eos-fst-2 default.0 rw nodrain 0.00
eos-fst-1.eos-fst.eos.svc.kermes-dev.local 1095 2 /eos-storage/eos-data/eos-fst-1 default.1 docker::k8s down rw nodrain 2.39 offline N/A
eos-fst-0.eos-fst.eos.svc.kermes-dev.local 1095 3 /eos-storage/eos-data/eos-fst-0 default.2 docker::k8s down rw nodrain 2.27 offline N/A
I can sometimes work around the issue by failing over the master MGM again after the upgrade, but the workaround is not itself consistent or reproducible, which makes me think something in the HA setup or upgrade procedure is not entirely robust. Something must be getting stuck or left in a bad state, but I'm not sure what is needed to clear it. Does anyone have ideas?
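For reference, the "fail over again" workaround is roughly the following (pod names as in the output above; whether it helps seems to depend on timing):

# Delete the current master MGM (eos-mgm-1 above) to force another failover,
# then compare the FST view from both MGMs once a master is elected again.
kubectl -n eos delete pod eos-mgm-1
kubectl -n eos exec eos-mgm-0 -- sh -c 'eos ns | grep master; eos fs ls'
kubectl -n eos exec eos-mgm-1 -- sh -c 'eos ns | grep master; eos fs ls'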
Thanks.