Read Checksum seems delayed but no clue found

Dear Experts,

It is quite tricky to explain an issue that we are currently experiencing. We are running a mixed setup of EOS v5.2.16 (for MGM and new FSTs) and v5.1.22 (old FSTs). Everything works fine except only remote read access occasionally failed upon new FSTs (old FSTs are perfectly working fine).

What we have observed up to now with MGM and FST logs in INFO mode is that there is a time gap between read checksum info and close about 2 minutes. In fact, there seems no clue from DEBUG message in between this gap.

Let me describe more in detail. As shown below, there is 2 minutes gap between read operation and close in MGM log.

MGM Log
240220 06:43:02 047 XrootdXeq: monalisa.958867:522@pcalimonitor.cern.ch pub IP46 login as monalisa
240220 06:43:02 time=1708411382.321968 func=IdMap                    level=DEBUG logid=static.............................. unit=mgm@rbod-mgmt-01.sdfarm.kr:1094 tid=00007f0284773640 source=Mapping:178                    tident= sec=(null) uid=99 gid=99 name=- geo="" msg="XrdSecEntity client" name="monalisa" role="(null)" group="alienmaster" tident="monalisa.958867:522@pcalimonitor.cern.ch" cred="none"
240220 06:43:02 time=1708411382.322050 func=IdMap                    level=INFO  logid=static.............................. unit=mgm@rbod-mgmt-01.sdfarm.kr:1094 tid=00007f0284773640 source=Mapping:999                    tident= sec=(null) uid=99 gid=99 name=- geo="" sec.prot=unix sec.name="monalisa" sec.host="pcalimonitor.cern.ch" sec.vorg="" sec.grps="alienmaster" sec.role="" sec.info="" sec.app="" sec.tident="monalisa.958867:522@pcalimonitor.cern.ch" vid.uid=10367 vid.gid=1395 sudo=0 gateway=0
240220 06:43:02 time=1708411382.322092 func=open                     level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=mgm@rbod-mgmt-01.sdfarm.kr:1094 tid=00007f0284773640 source=XrdMgmOfsFile:548              tident=monalisa.958867:522@pcalimonitor.cern.ch sec=unix  uid=10367 gid=1395 name=monalisa geo="" op=read path=/eos/alicekistigsdc/grid/02/06137/459e7614-cfbb-11ee-8aa2-00620ba19d1e info=authz=<...>
240220 06:43:02 time=1708411382.322646 func=open                     level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=mgm@rbod-mgmt-01.sdfarm.kr:1094 tid=00007f0284773640 source=XrdMgmOfsFile:1105             tident=monalisa.958867:522@pcalimonitor.cern.ch sec=unix  uid=10367 gid=1395 name=monalisa geo="" acl=0 r=0 w=0 wo=0 egroup=0 shared=0 mutable=1 facl=0
240220 06:43:02 time=1708411382.322843 func=LogSchedulingInfo        level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=mgm@rbod-mgmt-01.sdfarm.kr:1094 tid=00007f0284773640 source=XrdMgmOfsFile:3673             tident=monalisa.958867:522@pcalimonitor.cern.ch sec=unix  uid=10367 gid=1395 name=monalisa geo="" msg="scheduling info selectedfs: 38  proxys:   firewallentrypoints:   "
240220 06:43:02 time=1708411382.322885 func=open                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=mgm@rbod-mgmt-01.sdfarm.kr:1094 tid=00007f0284773640 source=XrdMgmOfsFile:3143             tident=monalisa.958867:522@pcalimonitor.cern.ch sec=unix  uid=10367 gid=1395 name=monalisa geo="" capability=&mgm.access=read&mgm.ruid=10367&mgm.rgid=1395&mgm.uid=99&mgm.gid=99&mgm.path=/eos/alicekistigsdc/grid/02/06137/459e7614-cfbb-11ee-8aa2-00620ba19d1e&mgm.manager=rbod-mgmt-01.sdfarm.kr:1094&mgm.fid=0a13d60c&mgm.cid=22278&mgm.sec=unix|monalisa|pcalimonitor.cern.ch||alienmaster|||xrdcp&mgm.lid=1048578&mgm.bookingsize=1024&mgm.fsid=38
240220 06:43:02 time=1708411382.322941 func=open                     level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=mgm@rbod-mgmt-01.sdfarm.kr:1094 tid=00007f0284773640 source=XrdMgmOfsFile:3309             tident=monalisa.958867:522@pcalimonitor.cern.ch sec=unix  uid=10367 gid=1395 name=monalisa geo="" op=read path=/eos/alicekistigsdc/grid/02/06137/459e7614-cfbb-11ee-8aa2-00620ba19d1e info=authz=<...>  redirection=rbod-mgmt-03.sdfarm.kr?&cap.sym=<...>&cap.msg=<...>&mgm.logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40&mgm.replicaindex=0&mgm.replicahead=0&mgm.id=0a13d60c&mgm.mtime=1708411379 xrd_port=1095 http_port=8001
240220 06:43:02 time=1708411382.323165 func=open                     level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=mgm@rbod-mgmt-01.sdfarm.kr:1094 tid=00007f0284773640 source=XrdMgmOfsFile:3333             tident=monalisa.958867:522@pcalimonitor.cern.ch sec=unix  uid=10367 gid=1395 name=monalisa geo="" path=/eos/alicekistigsdc/grid/02/06137/459e7614-cfbb-11ee-8aa2-00620ba19d1e open:rt=0.98 io:bw=inf io:sched=0 io:type=buffered io:prio=default io:redirect=rbod-mgmt-03.sdfarm.kr:1095 duration=0.985ms timing=IdMap=0.099ms Bounce=0.021ms fid::fetch=0.064ms fid::fetched=0.001ms authorize=0.005ms authorized=0.399ms path-computed=0.007ms container::fetched=0.067ms Policy::begin=0.038ms Policy::end=0.088ms Scheduler::FileAccess=0.019ms Scheduler::FileAccessed=0.068ms end=0.109ms Open=0.985ms
240220 06:45:03 045 XrootdXeq: monalisa.958867:522@pcalimonitor.cern.ch disc 0:02:02

Regarding what we can see from FST log, we observed a couple of times of read checksum info operation performed before it closes. It reads a file of size about 10M. The func=VerifyChecksum appeared again with 1 minute delay after somehow close did not work. Also close operation is done after another 1 minute from read checksum info operation.

FST Log
240220 06:43:03 1896 XrootdXeq: monalisa.958867:62@pcalimonitor.cern.ch pub IP46 login as monalisa
240220 06:43:03 time=1708411383.682941 func=open                     level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:178              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=unix  uid=0 gid=0 name=monalisa geo="" path=/02/06137/459e7614-cfbb-11ee-8aa2-00620ba19d1e info=authz=<...>&cap.msg=<...>&cap.sym=<...>&eos.clientinfo=zbase64:MDAwMDAwNmF4nBXI2wmAMAxA0VVcwFIRQQsZpjWJFu2DGBG31/6de0ul7ETBmmXuYnHhgZi56Vp3QrCN+laCcDOTELZRJRZAYn+f2vrfUWhVkFCwT1vS3o7mQvaSzCFusMv0ActMJSU=&mgm.id=0a13d60c&mgm.logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40&mgm.mtime=1708411379&mgm.replicahead=0&mgm.replicaindex=0 open_mode=0
240220 06:43:03 time=1708411383.683113 func=ProcessCapOpaque         level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:2739             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=(null) uid=99 gid=99 name=(null) geo="" capability=&mgm.access=read&mgm.ruid=10367&mgm.rgid=1395&mgm.uid=99&mgm.gid=99&mgm.path=/eos/alicekistigsdc/grid/02/06137/459e7614-cfbb-11ee-8aa2-00620ba19d1e&mgm.manager=rbod-mgmt-01.sdfarm.kr:1094&mgm.fid=0a13d60c&mgm.cid=22278&mgm.sec=unix|monalisa|pcalimonitor.cern.ch||alienmaster|||xrdcp&mgm.lid=1048578&mgm.bookingsize=1024&mgm.fsid=38&cap.valid=1708414982
240220 06:43:03 time=1708411383.683157 func=ProcessMixedOpaque       level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:2916             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" msg="checksum requested" xs_ptr=0x7fb643a1ea00 lid=1048578 mgm.checksum=""
240220 06:43:03 time=1708411383.683181 func=open                     level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:258              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=(null) uid=10367 gid=1395 name=nobody geo="" ns_path=/02/06137/459e7614-cfbb-11ee-8aa2-00620ba19d1e fst_path=/rbod-b/box_06_disk_005/0000420b/0a13d60c
240220 06:43:03 time=1708411383.683305 func=open                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:339              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=(null) uid=10367 gid=1395 name=nobody geo="" removing creation flag because of 0 0
240220 06:43:03 time=1708411383.683317 func=open                     level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:515              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=(null) uid=10367 gid=1395 name=nobody geo="" fst_path=/rbod-b/box_06_disk_005/0000420b/0a13d60c open-mode=0 create-mode=180 layout-name=plain oss-opaque=&mgm.lid=1048578&mgm.bookingsize=0
240220 06:43:03 time=1708411383.683465 func=open                     level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:697              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=(null) uid=10367 gid=1395 name=nobody geo="" msg="layout size" disk_size=10459447 db_size= 10459447
240220 06:43:03 time=1708411383.683501 func=open                     level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:784              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=(null) uid=10367 gid=1395 name=nobody geo="" open-duration=0.604ms path='/02/06137/459e7614-cfbb-11ee-8aa2-00620ba19d1e' fxid=0a13d60c path::print=0.294ms creation::barrier=0.057ms layout::exists=0.004ms clone::fst=0.073ms layout::open=0.009ms layout::opened=0.060ms get::localfmd=0.000ms resync::localfmd=0.077ms layout::stat=0.000ms layout::stat=0.001ms fileio::object=0.026ms open::accounting=0.001ms end=0.002ms open=0.604ms
240220 06:43:03 time=1708411383.683518 func=stat                     level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:1309             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" path=/02/06137/459e7614-cfbb-11ee-8aa2-00620ba19d1e fxid=0a13d60c size=10459447 mtime=1708411378.12283710
240220 06:43:03 time=1708411383.952839 func=pgRead                   level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:933              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" offset=0 len=2097152
240220 06:43:03 time=1708411383.952850 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:832              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" fileOffset=0, buffer_size=2097152
240220 06:43:03 time=1708411383.953872 func=readofs                  level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:2337             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" read 140421311619072 0 2097152 rc=2097152
240220 06:43:03 time=1708411383.953886 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:864              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" layout read 2097152 checkSum 1134684672
240220 06:43:03 time=1708411383.954611 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:902              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" rc=2097152 offset=0 size=2097152
240220 06:43:05 time=1708411385.472699 func=pgRead                   level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:933              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" offset=2097152 len=2097152
240220 06:43:05 time=1708411385.472719 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:832              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" fileOffset=2097152, buffer_size=2097152
240220 06:43:05 time=1708411385.473564 func=readofs                  level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:2337             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" read 140421311619072 2097152 2097152 rc=2097152
240220 06:43:05 time=1708411385.473579 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:864              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" layout read 2097152 checkSum 1134684672
240220 06:43:05 time=1708411385.474318 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:902              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" rc=2097152 offset=2097152 size=2097152
240220 06:43:05 time=1708411385.656226 func=pgRead                   level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:933              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" offset=4194304 len=2097152
240220 06:43:05 time=1708411385.656238 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:832              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" fileOffset=4194304, buffer_size=2097152
240220 06:43:05 time=1708411385.656674 func=readofs                  level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:2337             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" read 140421311619072 4194304 2097152 rc=2097152
240220 06:43:05 time=1708411385.656692 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:864              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" layout read 2097152 checkSum 1134684672
240220 06:43:05 time=1708411385.657420 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:902              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" rc=2097152 offset=4194304 size=2097152
240220 06:43:05 time=1708411385.921746 func=pgRead                   level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:933              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" offset=6291456 len=2097152
240220 06:43:05 time=1708411385.921757 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:832              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" fileOffset=6291456, buffer_size=2097152
240220 06:43:05 time=1708411385.922228 func=readofs                  level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:2337             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" read 140421311619072 6291456 2097152 rc=2097152
240220 06:43:05 time=1708411385.922240 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:864              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" layout read 2097152 checkSum 1134684672
240220 06:43:05 time=1708411385.922952 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:902              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" rc=2097152 offset=6291456 size=2097152
240220 06:43:05 time=1708411385.923587 func=pgRead                   level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:933              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" offset=8388608 len=2070839
240220 06:43:05 time=1708411385.923594 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:832              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" fileOffset=8388608, buffer_size=2070839
240220 06:43:05 time=1708411385.924007 func=readofs                  level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:2337             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" read 140421311619072 8388608 2070839 rc=2070839
240220 06:43:05 time=1708411385.924018 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:864              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" layout read 2070839 checkSum 1134684672
240220 06:43:05 time=1708411385.924716 func=read                     level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:902              tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" rc=2070839 offset=8388608 size=2070839
240220 06:43:05 time=1708411385.924728 func=VerifyChecksum           level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:3539             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" isrw=0 max-offset=10459447 opensize=10459447
240220 06:43:05 time=1708411385.924743 func=VerifyChecksum           level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:3672             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" msg="read checksum info" xs_type=adler xs_computed=ee06f584 xs_local=ee06f584 fxid=0a13d60c fsid=38
240220 06:44:03 time=1708411443.998812 func=_close                   level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:1541             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" viaDelete=0 writeDelete=0
240220 06:44:03 time=1708411443.998848 func=VerifyChecksum           level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:3539             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" isrw=0 max-offset=10459447 opensize=10459447
240220 06:44:03 time=1708411443.998865 func=VerifyChecksum           level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:3672             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" msg="read checksum info" xs_type=adler xs_computed=ee06f584 xs_local=ee06f584 fxid=0a13d60c fsid=38
240220 06:44:03 time=1708411443.998876 func=_close                   level=DEBUG logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:1632             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" checksumerror=0 targetsizerror=0 mMaxOffsetWritten=0 targetsize=0
240220 06:44:03 time=1708411443.999090 func=_close                   level=INFO  logid=static.............................. unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:2025             tident= sec=(null) uid=99 gid=99 name=- geo="" msg="log=4b3d7f4a-cfbb-11ee-9697-b8599f995c40&path=/eos/alicekistigsdc/grid/02/06137/459e7614-cfbb-11ee-8aa2-00620ba19d1e&fstpath=/rbod-b/box_06_disk_005/0000420b/0a13d60c&ruid=10367&rgid=1395&td=monalisa.958867:62@pcalimonitor.cern.ch&host=rbod-mgmt-03.sdfarm.kr&lid=1048578&fid=169072140&fsid=38&ots=1708411383&otms=682&cts=1708411443&ctms=999&nrc=5&nwc=0&rb=10459447&rb_min=2070839&rb_max=2097152&rb_sigma=10525.20&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&usage=0.01&iot=60316.082&idt=60308.273&lrt=6.941&lrvt=0.000&lwt=0.000&ot=0.604&ct=0.265&rt=3.20&rvt=0.00&wt=0.00&osize=10459447&csize=10459447&delete_on_close=0&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=unix&sec.name=monalisa&sec.host=pcalimonitor.cern.ch&sec.vorg=&sec.grps=alienmaster&sec.role=&sec.info=&sec.app=xrdcp"
240220 06:44:03 time=1708411443.999116 func=_close                   level=INFO  logid=4b3d7f4a-cfbb-11ee-9697-b8599f995c40 unit=fst@rbod-mgmt-03.sdfarm.kr:1095 tid=00007fb645ff4640 source=XrdFstOfsFile:2256             tident=monalisa.958867:62@pcalimonitor.cern.ch sec=      uid=10367 gid=1395 name=nobody geo="" msg="done close" rc=0 errc=0
240220 06:45:04 14597 XrootdXeq: monalisa.958867:62@pcalimonitor.cern.ch disc 0:02:02

It would be very much grateful if you give us any clue about where to look to dig into this issue or if you can find something wrong in the full trace attached below.

Full trace of MGM and FST logs can be found here: mgm_log, fst_log

And also please just let us know if you need any further information from us.

Thank you.

Best regards,
Sang-Un

Hi Sang-Un,

Sorry for the late reply, I was on holidays.
Can you try reproducing this behavior with a simple xrdcp -d 3 and send me the logs for both the client and the entry-point FST which is involved in this read operation? I assume all your FSTs are now 5.2.16, right?

Thanks,
Elvin

Hi Elvin,

Yes. FSTs are all running with 5.2.16 except one, 5.2.17, which is newly installed with AlmaLinux 9.3 yesterday.

In fact, it is really difficult to reproduce on our side but I will try it. The timeout happened with periodic availability test jobs run by ALICE. I will keep in touch with ALICE and let you know the progress.

By the way, we have tried various ways to figure out the root cause of the timeout from our side. Long story short, it seems working with AlmaLinux 9.3.

At first, we suspected networks and storage. LACP bond topology applied to new FSTs was investigated and reconfigured. FSTs were reconfigured to use only local disks. All of them did not help.

Then we found the difference on kernel between CentOS 9 stream and AlmaLinux 9.3: 5.14.0-391.el9.x86_64 vs. 5.14.0-362.18.1.el9_3.x86_64. Downgrading kernel of CentOS 9 stream to the one of AlmaLinux 9.3 did not help. Just to make sure that EOS 5.2.16 works fine with new FST server and disks, we installed CentOS 7.9 from scratch and confirmed that it work, no timeout issues.

Lastly, we installed AlmaLinux 9.3 and deployed EOS 5.2.17 on one FST and so far it looks fine. But still there’s a slightly delay on read operation, 56s which is thankfully within timeout window (60s).

I think that EOS releases are fine because they are targeted to AlmaLinux 9.3. It was our ignorance that we didn’t think of the difference between CentOS 9 stream and AlmaLinux, we just assumed that both are identical. And currently we only have a provisioning infrastructure just only for CentOS 9 stream. Our system administrators were informed on such issue.

Thanks.

Best regards,
Sang-Un

We can also try to deploy EOS on bare (minimal) CentOS 9 stream installation without any additional packages for our security, monitoring and other purposes.

While we discuss internally, we have come to the conclusion that the same issue could be arisen again when new release of AlmaLinux 9 with newer kernel is out. The hang could be invoked by filesystem level and might then be related to kernel.

I will keep you posted as soon as we identify or reproduce the issue.

Sang-Un