Hello, Elvin.
The three MGM daemons are running EOS 5.2.31-1, the three QDB daemons are running 5.1.2.5.1.22, three of the FST daemons are running 5.2.29-1, and the remaining 18 FSTs are running 5.2.31-1. We first confirmed that mixing FST versions 5.1 and 5.2 was not the cause.
I configured the FST, MGM, and MQ daemons to use only the jbod-mgmt-04.sdfarm.kr server as the MGM with the settings below. Because we run the EOS daemons inside Podman containers, we cannot use systemd's EnvironmentFile=, so each value is prefixed with export; you can ignore the export keyword. A sketch of how the file is consumed inside the container follows the FST block below.
MGM
...
export EOS_MGM_HOST=jbod-mgmt-0X.sdfarm.kr
export EOS_MGM_HOST_TARGET=jbod-mgmt-04.sdfarm.kr
export EOS_HA_REDIRECT_READS=1
export EOS_MGM_MASTER1=jbod-mgmt-04.sdfarm.kr
export EOS_MGM_MASTER2=jbod-mgmt-04.sdfarm.kr
export EOS_MGM_ALIAS=jbod-mgmt-04.sdfarm.kr
export EOS_BROKER_URL=root://${EOS_MGM_ALIAS}:1097//eos/
export EOS_MGM_URL="root://${EOS_MGM_ALIAS}:1094"
export EOS_FUSE_MGM_ALIAS=${EOS_MGM_ALIAS}
...
MQ
export EOS_MGM_MASTER1=jbod-mgmt-04.sdfarm.kr
export EOS_MGM_MASTER2=jbod-mgmt-04.sdfarm.kr
export EOS_MGM_ALIAS=jbod-mgmt-04.sdfarm.kr
export EOS_MGM_URL="root://${EOS_MGM_ALIAS}:1094"
export EOS_BROKER_URL=root://${EOS_MGM_ALIAS}:1097//eos/
export EOS_FUSE_MGM_ALIAS=${EOS_MGM_ALIAS}
FST
export EOS_MGM_ALIAS=jbod-mgmt-04.sdfarm.kr
export EOS_MGM_URL="root://${EOS_MGM_ALIAS}:1094"
export EOS_FUSE_MGM_ALIAS=${EOS_MGM_ALIAS}
export EOS_BROKER_URL=root://${EOS_MGM_ALIAS}:1097//eos/
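For reference, a minimal sketch of how such an export-style file can be consumed inside the container, since systemd's EnvironmentFile= is not available there; the script and file paths are placeholders, not our exact setup.

#!/bin/bash
# Hypothetical container entrypoint: source the export-style file shown above
# so the variables reach the daemon environment, then start the daemon.
set -e
source /etc/sysconfig/eos_env   # placeholder path holding the export lines above
exec "$@"                       # e.g. the MGM/MQ/FST start command baked into the image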
With these settings in place, as long as jbod-mgmt-04 remains the MGM master (RW node), no "MGM server changed" messages are output; we assume such messages would only appear if the MGM master actually changed.
The problem is that the failure still occurs with this setup.
After starting the MGM, MQ, and FST daemons, everything works fine at first, but after about 3 hours the processing time for operations such as writes exceeds 1 minute. ALICE's ADD test then becomes critical, because it expects operations to complete in under a minute. After about 5 hours, GET operations exceed 4 minutes and also become critical.
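As a rough way to reproduce these timings by hand (the test path under /eos/gsdc/ is only an example, not the ALICE test path):

# Write a ~100 MB test file, then time a write and a read through the MGM alias.
dd if=/dev/urandom of=/tmp/eostest.bin bs=1M count=100
time xrdcp -f /tmp/eostest.bin root://jbod-mgmt-04.sdfarm.kr:1094//eos/gsdc/test/eostest.bin
time xrdcp -f root://jbod-mgmt-04.sdfarm.kr:1094//eos/gsdc/test/eostest.bin /tmp/eostest.readback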
Even while this is happening, the MGM shows all nodes online with heartbeat deltas of at most 3 seconds, as shown in the node ls output below (a simple polling sketch follows it).
EOS Console [root://localhost] |/eos/gsdc/> node ls
┌──────────┬────────────────────────────────┬────────────────┬──────────┬────────────┬────────────────┬─────┐
│type │ hostport│ geotag│ status│ activated│ heartbeatdelta│ nofs│
└──────────┴────────────────────────────────┴────────────────┴──────────┴────────────┴────────────────┴─────┘
nodesview jbod-mgmt-01.sdfarm.kr:1095 kisti::gsdc::g01 online on 1 83
nodesview jbod-mgmt-01.sdfarm.kr:1096 kisti::gsdc::g01 online on 2 83
nodesview jbod-mgmt-02.sdfarm.kr:1095 kisti::gsdc::g01 online on 0 84
nodesview jbod-mgmt-02.sdfarm.kr:1096 kisti::gsdc::g01 online on 2 84
nodesview jbod-mgmt-03.sdfarm.kr:1095 kisti::gsdc::g01 online on 3 84
nodesview jbod-mgmt-03.sdfarm.kr:1096 kisti::gsdc::g01 online on 0 84
nodesview jbod-mgmt-04.sdfarm.kr:1095 kisti::gsdc::g02 online on 1 84
nodesview jbod-mgmt-04.sdfarm.kr:1096 kisti::gsdc::g02 online on 0 83
nodesview jbod-mgmt-05.sdfarm.kr:1095 kisti::gsdc::g02 online on 2 84
nodesview jbod-mgmt-05.sdfarm.kr:1096 kisti::gsdc::g02 online on 1 83
nodesview jbod-mgmt-06.sdfarm.kr:1095 kisti::gsdc::g02 online on 1 84
nodesview jbod-mgmt-06.sdfarm.kr:1096 kisti::gsdc::g02 online on 0 84
nodesview jbod-mgmt-07.sdfarm.kr:1095 kisti::gsdc::g03 online on 1 84
nodesview jbod-mgmt-07.sdfarm.kr:1096 kisti::gsdc::g03 online on 2 84
nodesview jbod-mgmt-08.sdfarm.kr:1095 kisti::gsdc::g03 online on 3 84
nodesview jbod-mgmt-08.sdfarm.kr:1096 kisti::gsdc::g03 online on 1 84
nodesview jbod-mgmt-09.sdfarm.kr:1095 kisti::gsdc::g03 online on 1 84
nodesview jbod-mgmt-09.sdfarm.kr:1096 kisti::gsdc::g03 online on 0 84
nodesview jbod-mgmt-11.sdfarm.kr:1095 kisti::gsdc::e01 online on 0 84
nodesview jbod-mgmt-11.sdfarm.kr:1096 kisti::gsdc::e01 online on 0 84
nodesview jbod-mgmt-12.sdfarm.kr:1095 kisti::gsdc::e01 online on 1 84
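For completeness, a simple way to keep an eye on the heartbeat while the slowdown is ongoing; jbod-mgmt-02 is just an example node, and the command assumes the eos CLI can reach the MGM.

# Poll the node view every 5 seconds and watch the heartbeatdelta column for one host.
watch -n 5 "eos root://jbod-mgmt-04.sdfarm.kr:1094 node ls | grep jbod-mgmt-02"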
However, individual FST servers occasionally resolve a peer node as offline (__offline_jbod-mgmt-02.sdfarm.kr). The full message is long, so only a brief excerpt is shown below.
250120 04:48:11 time=1737348491.242950 func=fileOpen level=ERROR logid=c033e9ac-d6e9-11ef-b19a-b8599fa51310 unit=fst@jbod-mgmt-05.sdfarm.kr:1095 tid=00007f23183a0640 source=XrdIo:334 tident=<service> sec= uid=0 gid=0 name= geo="" error= "open failed url=root://1@__offline_jbod-mgmt-02.sdfarm.kr:1096//eos/gsdc/grid/02/63948/d8c38a44-6415-11e2-9717-2823a10abeef?cap.msg=~~~~&eos.clientinfo=zbase64:MD~~~~&fst.valid=1737348550&mgm.id=000647b1&mgm.logid=b6eb556a-d6e9-11ef-9f5a-b8599f9c4330&mgm.mtime=1642197616&mgm.replicaindex=3, errno=0, errc=101, msg=[FATAL] Invalid address"
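To get a feel for how often this happens on a given FST, the occurrences can be counted in its log; the path below assumes the default EOS FST log location.

# Count __offline_ resolutions per day in the FST log (adjust the path if needed).
grep "__offline_" /var/log/eos/fst/xrdlog.fst | awk '{print $1}' | sort | uniq -c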
The following messages were also found in the FST logs.
250120 07:45:04 time=1737359104.874558 func=Close level=WARN logid=46883f9a-d702-11ef-89ce-b8599f9c5190 unit=fst@jbod-mgmt-07.sdfarm.kr:1095 tid=00007fa5699ff640 source=RainMetaLayout:1727 tident=<service> sec= uid=0 gid=0 name= geo="" msg="failed close for null file"
250120 07:45:21 time=1737359121.774973 func=Read level=WARN logid=4b522450-d702-11ef-8aff-b8599f9c5190 unit=fst@jbod-mgmt-07.sdfarm.kr:1095 tid=00007fa512cbe640 source=RainMetaLayout:657 tident=<service> sec= uid=0 gid=0 name= geo="" msg="read too big resizing the read length" end_offset=1033895936 file_size=1032925807
250120 07:46:45 time=1737359205.172128 func=fileOpen level=WARN logid=b2339cee-d702-11ef-aa05-b8599f9c5190 unit=fst@jbod-mgmt-07.sdfarm.kr:1095 tid=00007fa5643ff640 source=XrdIo:339 tident=<service> sec= uid=0 gid=0 name= geo="" msg="error encountered despite errno=0; setting errno=22"