Upgrade of HA EOS fails with FST errors: unable to obtain manager info for node

Hello,

I have two HA EOS clusters (on k8s) that were running v5.3.21.
When I upgraded them to v5.3.22, the FST pods failed with:

eos-fst 250923 22:49:58 time=1758660598.504591 func=QdbCommunicator          level=CRIT  logid=static.............................. unit=fst@eos-fst-1.eos-fst.eos.svc.kermes
-dev.local:1095 tid=00007f12f4af2640 source=Communicator:663               tident= sec=(null) uid=0 gid=0 name=- geo="" xt="" ob="" msg="unable to obtain manager info for node"  

It was consistent and reproducible on both. Then I rolled one back to v5.3.21 and still saw the same issue, so it seems related to upgrading or downgrading an HA system rather than to the specific EOS version.

The chart upgrade procedure restarts the QDBs one at a time, so they maintain quorum at all times.
In parallel it upgrades the non-master MGM first. I then have to manually terminate the master MGM so that it gets upgraded; during that window the other MGM may take over, unless the restarted master comes back up on the new version faster than the failover time.
Also in parallel, the FSTs are upgraded one at a time, and each tries to register with the current master MGM.
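
Concretely, the manual MGM step looks roughly like this (a sketch using the pod and namespace names from my deployment; the helm upgrade invocation itself is omitted, and the chart handles the QDB and FST rollouts on its own):

# the chart upgrade has already replaced the non-master MGM (eos-mgm-0 in this example);
# delete the current master so the StatefulSet recreates it on the new image
kubectl -n eos delete pod eos-mgm-1
# then check which MGM ended up as master afterwards
kubectl -n eos exec eos-mgm-0 -- bash -c 'eos ns | grep master'
kubectl -n eos exec eos-mgm-1 -- bash -c 'eos ns | grep master'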

The MGM and FST pods discover the QDB cluster via mgmofs.qdbcluster eos-qdb.eos.svc.kermes-dev.local:7777 in /etc/xrd.cf.*, where eos-qdb.eos.svc.kermes-dev.local is a headless service whose DNS A records list all the QDBs:

$ host eos-qdb.eos.svc.kermes-dev.local
eos-qdb.eos.svc.kermes-dev.local has address 10.224.0.77
eos-qdb.eos.svc.kermes-dev.local has address 10.224.1.80
eos-qdb.eos.svc.kermes-dev.local has address 10.224.5.203
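
As a sanity check that the quorum really survives the rolling restart, QuarkDB can be queried over the redis protocol, e.g. (a sketch, assuming redis-cli is available in the QDB pod image):

# print the raft state of each QDB member; exactly one should report being the leader
for i in 0 1 2; do
  kubectl -n eos exec "eos-qdb-$i" -- redis-cli -p 7777 raft-info
done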

The FSTs talk to the current master MGM via these environment variables:

EOS_MGM_URL=root://eos-mgm.eos.svc.kermes-dev.local
EOS_MGM_ALIAS=eos-mgm.eos.svc.kermes-dev.local

There is also a kubernetes mechanism that ensures the load-balancer address eos-mgm.eos.svc.kermes-dev.local always points to the current master MGM, by selecting the MGM on which eos ns | grep -q is_master=true succeeds.
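
A quick manual way to cross-check that mechanism (a sketch; the real selection is wired into the chart, these are just the equivalent commands run by hand):

# exactly one MGM should report is_master=true ...
kubectl -n eos exec eos-mgm-0 -- bash -c 'eos ns | grep is_master'
kubectl -n eos exec eos-mgm-1 -- bash -c 'eos ns | grep is_master'
# ... and the eos-mgm service endpoints should contain only that pod's IP
kubectl -n eos get endpoints eos-mgm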

During the problem state, the two MGMs agree on which is the master:

[root@eos-mgm-0 /]# eos ns|grep master
ALL      Replication                      is_master=false master_id=eos-mgm-1.eos-mgm.eos.svc.kermes-dev.local:1094
[root@eos-mgm-1 /]# eos ns|grep master
ALL      Replication                      is_master=true master_id=eos-mgm-1.eos-mgm.eos.svc.kermes-dev.local:1094

but disagree on the state of the registered FSTs:

[root@eos-mgm-0 /]# eos fs ls
┌────────────────────────────────────────┬────┬──────┬────────────────────────────────┬────────────────┬────────────────┬────────────┬──────────────┬────────────┬──────┬────────┬────────────────┐
│host                                    │port│    id│                            path│      schedgroup│          geotag│        boot│  configstatus│       drain│ usage│  active│          health│
└────────────────────────────────────────┴────┴──────┴────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────────┴────────────┴──────┴────────┴────────────────┘
 eos-fst-2.eos-fst.eos.svc.kermes-dev.local 1095      1  /eos-storage/eos-data/eos-fst-2        default.0                                           rw      nodrain   0.00                           
 eos-fst-1.eos-fst.eos.svc.kermes-dev.local 1095      2  /eos-storage/eos-data/eos-fst-1        default.1      docker::k8s         down             rw      nodrain   2.39  offline              N/A 
 eos-fst-0.eos-fst.eos.svc.kermes-dev.local 1095      3  /eos-storage/eos-data/eos-fst-0        default.2                                           rw      nodrain   0.00     

[root@eos-mgm-1 /]# eos fs ls
┌────────────────────────────────────────┬────┬──────┬────────────────────────────────┬────────────────┬────────────────┬────────────┬──────────────┬────────────┬──────┬────────┬────────────────┐
│host                                    │port│    id│                            path│      schedgroup│          geotag│        boot│  configstatus│       drain│ usage│  active│          health│
└────────────────────────────────────────┴────┴──────┴────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────────┴────────────┴──────┴────────┴────────────────┘
 eos-fst-2.eos-fst.eos.svc.kermes-dev.local 1095      1  /eos-storage/eos-data/eos-fst-2        default.0                                           rw      nodrain   0.00                           
 eos-fst-1.eos-fst.eos.svc.kermes-dev.local 1095      2  /eos-storage/eos-data/eos-fst-1        default.1      docker::k8s         down             rw      nodrain   2.39  offline              N/A 
 eos-fst-0.eos-fst.eos.svc.kermes-dev.local 1095      3  /eos-storage/eos-data/eos-fst-0        default.2      docker::k8s         down             rw      nodrain   2.27  offline              N/A 

I can sometimes work around the issue by failing over the master MGM again after the upgrade, but that workaround is not consistent or reproducible either, which makes me think something is not entirely robust in the HA setup or the upgrade procedure. Something must be getting stuck or left in a bad state, but I'm not sure what is needed to clear it. Anyone have ideas?
Thanks.

This seems basically the same as FST stuck in booting "waiting to know manager", but I was never able to track down the cause of that either, and it wasn't totally reproducible. That was with EOS 5.1 I believe, and a non-HA setup.

Before the FST's /opt/eos/xrootd/bin/xrootd process even starts, an initContainer in the FST pod has already tested and confirmed that the MGM is reachable, based on the success of the eos -r 0 0 ns command against the same $EOS_MGM_URL. So given that the master MGM was already contacted successfully, how can the xrootd FST process then produce errors like these?

eos-fst 250924 01:17:49 time=1758669469.463743 func=Query2Delete             level=ERROR logid=static.............................. unit=fst@eos-fst-0.eos-fst.eos.svc.kermes
-dev.local:1095 tid=00007f923d7fa640 source=XrdFstOfs:2420                 tident= sec=(null) uid=0 gid=0 name=- geo="" xt="" ob="" msg="no MGM endpoint available"
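
For reference, the initContainer check is roughly the following (a paraphrase, not the chart's exact script; the retry interval is illustrative). It talks to whatever $EOS_MGM_URL resolves to at that moment, i.e. the load-balancer address:

# block until the MGM behind $EOS_MGM_URL answers a namespace query as root
until eos -r 0 0 ns >/dev/null 2>&1; do
  echo "MGM not reachable via ${EOS_MGM_URL} yet, retrying"
  sleep 5
done
echo "MGM reachable via ${EOS_MGM_URL}, FST container can start"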

The IP address of the kubernetes loadbalancing service eos-mgm.eos.svc.kermes-dev.local never changes, and always directs traffic to the current master MGM.
The names of the individual MGMs (pods) are separate from that: eos-mgm-*.eos-mgm.eos.svc.kermes-dev.local
When an MGM pod loses its master status, its per-pod DNS record also disappears (this is how kubernetes ensures that unready pods don't receive any traffic):

[root@eos-fst-2 /]# host eos-mgm-0.eos-mgm.eos.svc.kermes-dev.local
Host eos-mgm-0.eos-mgm.eos.svc.kermes-dev.local not found: 3(NXDOMAIN)
[root@eos-fst-2 /]# host eos-mgm-1.eos-mgm.eos.svc.kermes-dev.local
eos-mgm-1.eos-mgm.eos.svc.kermes-dev.local has address 10.224.5.30

So if EOS or xrootd remembers the name of each individual MGM, rather than re-resolving the load-balancer address in $EOS_MGM_URL to discover the current master, I wonder whether that is what confuses it.
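
Some checks that can be run from inside an affected FST pod, in case it helps narrow this down (a diagnostic sketch, nothing chart-specific):

# the configured alias still resolves, since it is the stable service address
kubectl -n eos exec eos-fst-2 -- bash -c 'env | grep EOS_MGM'
kubectl -n eos exec eos-fst-2 -- bash -c 'host "$EOS_MGM_ALIAS"'
# while the per-pod name of the non-master MGM no longer resolves at all
kubectl -n eos exec eos-fst-2 -- host eos-mgm-0.eos-mgm.eos.svc.kermes-dev.local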

Hi Ryan,

Thanks for this report! It’s interesting as we also observed a similar behavior in only one of the FSTs that we’ve upgraded across our instances - but we did not have the opportunity to collect more info to understand the root cause.

It seems that in your case this is more systematic, so it would be great if you could give me some steps to reproduce, though I understand that your setup is quite complex. Or give me access to some test cluster where this happens.

In our initial evaluation we thought this was a 5.3.22-specific issue, since it happened in a setup with no HA, but it's surprising that in your case you also saw it with 5.3.21. Nevertheless, please contact me directly so that I can collect more info and hopefully fix this corner case.

Thanks,

Elvin

Hi Elvin! Thanks for the reply!

The beauty of k8s and helm charts is that, in principle, everything about a deployment is portable and reproducible, so you should be able to take the same Helm chart and values and run an EOS deployment identical to mine. In practice the storage and network are a fuzzy area though…

Anyway I am pasting my complete Helm values file for EOS server chart 0.9.5, to deploy an ATLAS T2 EOS instance. Probably large chunks of it are irrelevant to the issue though.
I anonymized some parts with an “example.org” domain.

There is EOS_FST_ALIAS (e.g. eos-fst-0.example.org, eos-fst-1.example.org, …), which should only be used when the MGM redirects a client to an FST; it ensures that the public host name is used for external access (the public names resolve to the k8s load balancers specified in externalService loadBalancerIPs).
For internal access within the cluster you can use e.g. /etc/hosts-style tricks to go directly to the FST over the private network, or just not use EOS_FST_ALIAS at all, I suppose.
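
For completeness, the /etc/hosts-style trick mentioned above is just something like this (illustrative address and name only):

# map the public FST alias to the pod's in-cluster address so that internal
# clients following an MGM redirect bypass the external load balancer
echo "10.224.1.80  eos-fst-0.example.org" >> /etc/hosts

Anyway, here is the full values file: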

global:
  pullPolicy: IfNotPresent
  clusterDomain: "cluster-dev.local"
  eos:
    instancename: "eoscluster-dev.local"
  sssKeytab:
    # use default secret name to avoid trying to read secret from file
    secret: eos-sss-keytab
  http:
    enabled: true
  extraObjects:
    - |
      apiVersion: v1
      kind: Secret
      metadata:
        name: eos-sss-keytab
      type: Opaque
      data:
        eos.keytab: "{{ eos_sss_keytab | b64encode }}"
    - |
      apiVersion: v1
      kind: Secret
      metadata:
        name: cephfs-rw-secret
      type: Opaque
      data:
        key: "{{ eos_cephfs_rw_key | b64encode }}"
    - |
      apiVersion: v1
      kind: Secret
      metadata:
        name: eos-certs
      type: Opaque
      data:
        hostcert.pem: "{{ lookup('file', 'eos-hostcert.pem') | b64encode }}"
        hostkey.pem: "{{ lookup('file', 'eos-hostkey.pem') | b64encode }}"
    - |
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: vomsdir-atlas
      data:
        lcg-voms2.cern.ch.lsc: |
          /DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch
          /DC=ch/DC=cern/CN=CERN Grid Certification Authority
        voms-atlas-auth.app.cern.ch.lsc: |
          /DC=ch/DC=cern/OU=computers/CN=atlas-auth.web.cern.ch
          /DC=ch/DC=cern/CN=CERN Grid Certification Authority
        voms-atlas-auth.cern.ch.lsc: |
          /DC=ch/DC=cern/OU=computers/CN=atlas-auth.cern.ch
          /DC=ch/DC=cern/CN=CERN Grid Certification Authority
        voms2.cern.ch.lsc: |
          /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
          /DC=ch/DC=cern/CN=CERN Grid Certification Authority
    - |
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: vomsdir-ops
      data:
        lcg-voms2.cern.ch.lsc: |
          /DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch
          /DC=ch/DC=cern/CN=CERN Grid Certification Authority
        voms-ops-auth.cern.ch.lsc: |
          /DC=ch/DC=cern/OU=computers/CN=ops-auth.cern.ch
          /DC=ch/DC=cern/CN=CERN Grid Certification Authority
        voms2.cern.ch.lsc: |
          /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
          /DC=ch/DC=cern/CN=CERN Grid Certification Authority
    - |
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: vomsdir-dteam
      data:
        voms-dteam-auth.cern.ch.lsc: |
          /DC=ch/DC=cern/OU=computers/CN=dteam-auth.cern.ch
          /DC=ch/DC=cern/CN=CERN Grid Certification Authority
        voms2.hellasgrid.gr.lsc: |
          /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr
          /C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2016
    - |
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: scitoken-mapfile
      data:
        # TODO: refactor this into a multi-purpose eos configmap (e.g. /etc/eos.configmap which currently also contains preRun and init scripts).
        # Revisit https://gitlab.cern.ch/eos/eos-charts/-/merge_requests/66/ ,  depends on https://gitlab.cern.ch/dss/eos/-/issues/13
        scitokens.map: |
          [
            {"path": "/atlasscratchdisk/",    "result": "atlas", "comment": "Owner of ATLASSCRATCHDISK"},
            {"path": "/atlasdatadisk/",       "result": "atprd", "comment": "Owner of ATLASDATADISK"},
            {"path": "/atlaslocalgroupdisk/", "result": "atcan", "comment": "Owner of ATLASLOCALGROUPDISK"}
          ]

fst:
  replicaCount: 3
  resources:
    requests:
      cpu: 1
      memory: 1Gi
  persistence:
    enabled: true
    volumeClaimTemplates: false
    storageClass: "-"
    accessModes:
      - ReadWriteMany
    size: 1E
  # Each FST uses its own subdirectory, but they all mount the same volume path (for direct CephFS local redirect).
  selfRegister:
    datadir: "/eos-storage/eos-data/${HOSTNAME}"
    # https://eos-docs.web.cern.ch/diopside/manual/using.html#shared-filesystems-as-fst-backends
    sharedfs: "cephfs"
  storageMountPath: /eos-storage
  extraEnv:
  - name: EOS_POD_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.name
  - name: EOS_FST_ALIAS
    value: $(EOS_POD_NAME).example.org
  extraVolumes:
    volumes:
    - name: localcertdir
      emptyDir:
        sizeLimit: 1Mi
    - name: certsvolume
      secret:
        secretName: eos-certs
    - name: cvmfs-grid-certificates
      hostPath:
        path: /cvmfs/grid.cern.ch/etc/grid-security/certificates/
        type: Directory
    volumeMounts:
    - name: localcertdir
      mountPath: /etc/grid-security/daemon
      readOnly: true
    - name: cvmfs-grid-certificates
      mountPath: /etc/grid-security/certificates
      readOnly: true
  initContainer:
    enabled: true
    # Note: this script goes into a configmap, not the statefulset spec, so doesn't trigger a pod restart when changed
    script: |
      #!/usr/bin/bash
      #cp -v /certsvolume/${HOSTNAME}.key /localcertdir/hostkey.pem
      #cp -v /certsvolume/${HOSTNAME}.cert /localcertdir/hostcert.pem
      cp -v /certsvolume/host*.pem /localcertdir/
      chmod -v 0400 /localcertdir/hostkey.pem
      chmod -v 0444 /localcertdir/hostcert.pem
      chown -v daemon: /localcertdir/*.pem
      chmod -v 0755 /localcertdir
    volumeMounts:
    - name: certsvolume
      mountPath: /certsvolume
    - name: localcertdir
      mountPath: /localcertdir
  externalService:
    enabled: true
    template:
      type: LoadBalancer
    loadBalancerIPs:
      - 10.5.7.87
      - 10.5.7.88
      - 10.5.7.89

mgm:
  replicaCount: 2
  resources:
    requests:
      cpu: 1
      memory: 500Mi
  # GSI configuration references:
  #  - https://eos-docs.web.cern.ch/diopside/manual/using.html#voms-role-mapping
  #  - https://xrootd.slac.stanford.edu/doc/dev56/sec_config.htm#_Toc119617433
  #  - https://xrootd.slac.stanford.edu/doc/dev56/sec_config.htm#_The_voms_plug-in
  # Note: 'vomsat' should not be needed because the default is set to 'require' (2) if vomsfun is specified
  secProtocolGSI: "sec.protocol gsi -crl:use -moninfo:1 -cert:/etc/grid-security/daemon/hostcert.pem -key:/etc/grid-security/daemon/hostkey.pem -gmapopt:nomap -vomsfun:default -d:1"
  # TODO figure out if this is secure: https://eos-community.web.cern.ch/t/sec-protbind-config-for-eos/1012
  secProtBind: |
    sec.protbind localhost.localdomain unix sss
    sec.protbind localhost unix sss
    sec.protbind * only gsi sss unix
  http:
    macaroonsSecretKey: "{{ eos_macaroons_secretkey }}"
    sciTokens:
      enabled: true
      # See https://github.com/xrootd/xrootd/tree/master/src/XrdSciTokens
      config: |
        [Global]
        # If a token is not present or does not authorize the requested action, invoke the next configured authorization plugin.
        #onmissing = passthrough
        audience = https://wlcg.cern.ch/jwt/v1/any, https://eos-mgm.example.org:8443, roots://eos-mgm.example.org:1094, eos-mgm.example.org

        [Issuer ATLAS]
        issuer = https://atlas-auth.cern.ch/
        base_path = /eos/example.org/data/atlas
        map_subject = False
        name_mapfile = /etc/eos.config/scitokens.map
        default_user = nobody
        authorization_strategy = capability

        [Issuer DTeam]
        issuer = https://dteam-auth.cern.ch/
        base_path = /eos/example.org/data/dteam
        map_subject = False
        default_user = dteam
  xrd:
    # Use default thread settings from https://xrootd.slac.stanford.edu/doc/dev57/xrd_config.htm#_sched
    sched: "mint 8 maxt 2048 avlt 512 idle 780"
  service:
    template:
      type: LoadBalancer
      loadBalancerIP: 10.5.7.86
      clusterIP: null
  extraVolumes:
    volumes:
    - name: localcertdir
      emptyDir:
        sizeLimit: 1Mi
    - name: certsvolume
      secret:
        secretName: eos-certs
    - name: cvmfs-grid-certificates
      hostPath:
        path: /cvmfs/grid.cern.ch/etc/grid-security/certificates/
        type: Directory
    - name: vomsdir-atlas-volume
      configMap:
        name: vomsdir-atlas
    - name: vomsdir-ops-volume
      configMap:
        name: vomsdir-ops
    - name: vomsdir-dteam-volume
      configMap:
        name: vomsdir-dteam
    - name: scitoken-mapfile-volume
      configMap:
        name: scitoken-mapfile
    volumeMounts:
    - name: localcertdir
      mountPath: /etc/grid-security/daemon
      readOnly: true
    - name: cvmfs-grid-certificates
      mountPath: /etc/grid-security/certificates
      readOnly: true
    - name: vomsdir-atlas-volume
      mountPath: /etc/grid-security/vomsdir/atlas
    - name: vomsdir-ops-volume
      mountPath: /etc/grid-security/vomsdir/ops
    - name: vomsdir-dteam-volume
      mountPath: /etc/grid-security/vomsdir/dteam
    - name: scitoken-mapfile-volume
      mountPath: /etc/eos.config
  preRun:
    enabled: true
    script: |
      #!/usr/bin/bash
      echo "$0 starting local account creation"
      useradd -U -M -s /sbin/nologin -u 8000 dteam
      useradd -U -M -s /sbin/nologin -u 9000 ops
      groupadd -g 10000 atlas
      useradd -M -s /sbin/nologin -u 10000 -g 10000 atlas
      useradd -M -s /sbin/nologin -u 10100 -g 10000 atprd
      useradd -M -s /sbin/nologin -u 10200 -g 10000 atcan
      echo "$0 finished local account creation"
  initContainer:
    enabled: true
    # Note: this script goes into a configmap, not the statefulset spec, so doesn't trigger a pod restart when changed
    script: |
      #!/usr/bin/bash
      cp -v /certsvolume/host*.pem /localcertdir/
      chmod -v 0400 /localcertdir/hostkey.pem
      chmod -v 0444 /localcertdir/hostcert.pem
      chown -v daemon: /localcertdir/*.pem
      chmod -v 0755 /localcertdir
    volumeMounts:
    - name: certsvolume
      mountPath: /certsvolume
    - name: localcertdir
      mountPath: /localcertdir

qdb:
  resources:
    requests:
      cpu: 100m
      memory: 500Mi
  clusterID: "eos-cluster-dev.local"
  podAssignment:
    enablePodAntiAffinity: true
  persistence:
    enabled: true
    storageClass: "-"
    accessModes:
      - ReadWriteOnce
    size: '10Gi'

Hi @esindril,
Further to our other discussion: I upgraded the same two clusters to 5.3.23, and this time neither exhibited the problem. The issue never seemed entirely deterministic/reproducible, so I can't say for sure, but this strongly suggests it has been fixed. I'm not sure which of the JIRA issues in the release notes would explain it, though. Anyway, thanks a lot!