Many "file system marked as unavailable" logs

Hi all,

Due to high traffic, I recently set the log level to WARN on all nodes, and now a lot of these messages have come to light: “file system marked as unavailable”.
A typical pattern seems to be this: the filesystem is marked unavailable for all replicas of a file, and then an I/O error is sent to the client because no replica can be located and the file cannot be reconstructed?

210317 15:24:01 time=1615991041.405567 func=accessHeadReplicaMultipleGroup level=WARN  logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857             tident=<service> sec=      uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=238
210317 15:24:01 time=1615991041.405602 func=accessHeadReplicaMultipleGroup level=WARN  logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857             tident=<service> sec=      uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=42
210317 15:24:01 time=1615991041.405616 func=accessHeadReplicaMultipleGroup level=WARN  logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857             tident=<service> sec=      uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=182
210317 15:24:01 time=1615991041.543614 func=accessHeadReplicaMultipleGroup level=WARN  logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f487aafd700 source=GeoTreeEngine:1857             tident=<service> sec=      uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=154
210317 15:24:01 time=1615991041.543683 func=accessHeadReplicaMultipleGroup level=WARN  logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f496b6fd700 source=GeoTreeEngine:1857             tident=<service> sec=      uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=51
210317 15:24:01 time=1615991041.543720 func=accessHeadReplicaMultipleGroup level=WARN  logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f496b6fd700 source=GeoTreeEngine:1857             tident=<service> sec=      uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=219
210317 15:24:01 time=1615991041.543740 func=accessHeadReplicaMultipleGroup level=WARN  logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f496b6fd700 source=GeoTreeEngine:1857             tident=<service> sec=      uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=135
210317 15:24:01 time=1615991041.543795 func=accessHeadReplicaMultipleGroup level=WARN  logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857             tident=<service> sec=      uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=238
210317 15:24:01 time=1615991041.543823 func=accessHeadReplicaMultipleGroup level=WARN  logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857             tident=<service> sec=      uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=42
210317 15:24:01 time=1615991041.543837 func=accessHeadReplicaMultipleGroup level=WARN  logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857             tident=<service> sec=      uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=98
210317 15:24:01 time=1615991041.543853 func=accessHeadReplicaMultipleGroup level=WARN  logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857             tident=<service> sec=      uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=182
210317 15:24:01 time=1615991041.543874 func=Emsg                     level=ERROR logid=6b68fca2-872c-11eb-9685-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=XrdMgmOfsFile:3231             tident=suman.ch.8749:1417@[::ffff:172.24.77.154] sec=gsi   uid=43000 gid=43000 name=Suman Chatterjee geo="vbc" Unable to open file /eos/vbc/experiments/cms/store/user/chatterj/ZH-0j-v2/ZH-0j-v2/210222_092728/0001/AODSIM_output_1092.root; Input/output error
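
(To correlate the WARN lines with the subsequent client error, one option is to pull everything the MGM logged in that second and filter by level, e.g.:

[root@mgm-1 ~]# grep 'time=1615991041\.' /var/log/eos/mgm/xrdlog.mgm | grep -E 'level=(WARN|ERROR)'

— in the excerpt above, the last few WARNs and the Emsg share the same second and the same tid.)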

In our monitoring, we check the filesystem status on a per-minute basis, and we have not seen any network, node, or disk outages recently.
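
(For illustration, a minimal check of this kind could be as simple as grepping the filesystem list for anything not reported as online, along the lines of:

[root@mgm-1 ~]# eos fs ls | grep -ci offline

though the exact output columns depend on the EOS version.)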
Any hints on this?

Edit: at the same time, for the same user, I see a lot of “vector read errors”:

210317 15:24:01 time=1615991041.066417 func=ReadV                    level=ERROR logid=d46c0bd2-8726-11eb-a034-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007fa0dc996700 source=RainMetaLayout:778             tident=suman.ch.35910:1025@[::ffff:172.24.76.166] sec=unix  uid=0 gid=0 name=suman.chatterjee geo="" msg="vector read error" offset=1654920564, length=11044
210317 15:24:01 time=1615991041.066450 func=ReadV                    level=ERROR logid=d46c0bd2-8726-11eb-a034-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007fa0dc996700 source=RainMetaLayout:778             tident=suman.ch.35910:1025@[::ffff:172.24.76.166] sec=unix  uid=0 gid=0 name=suman.chatterjee geo="" msg="vector read error" offset=1655067658, length=108681
210317 15:24:01 time=1615991041.066466 func=ReadV                    level=ERROR logid=d46c0bd2-8726-11eb-a034-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007fa0dc996700 source=RainMetaLayout:778             tident=suman.ch.35910:1025@[::ffff:172.24.76.166] sec=unix  uid=0 gid=0 name=suman.chatterjee geo="" msg="vector read error" offset=1655216292, length=88256
210317 15:24:01 time=1615991041.066478 func=ReadV                    level=ERROR logid=d46c0bd2-8726-11eb-a034-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007fa0dc996700 source=RainMetaLayout:778             tident=suman.ch.35910:1025@[::ffff:172.24.76.166] sec=unix  uid=0 gid=0 name=suman.chatterjee geo="" msg="vector read error" offset=1655338418, length=38438
210317 15:24:01 time=1615991041.066489 func=ReadV                    level=ERROR logid=d46c0bd2-8726-11eb-a034-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007fa0dc996700 source=RainMetaLayout:778             tident=suman.ch.35910:1025@[::ffff:172.24.76.166] sec=unix  uid=0 gid=0 name=suman.chatterjee geo="" msg="vector read error" offset=1655432624, length=116243
210317 15:24:01 time=1615991041.066503 func=ReadV                    level=ERROR logid=d46c0bd2-8726-11eb-a034-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007fa0dc996700 source=RainMetaLayout:778             tident=suman.ch.35910:1025@[::ffff:172.24.76.166] sec=unix  uid=0 gid=0 name=suman.chatterjee geo="" msg="vector read error" offset=1655584012, length=8863

Edit 2: the number of these errors in the MGM log is very high, > 300k log lines:
[root@mgm-1 ~]# grep -c "file system marked as unavailable" /var/log/eos/mgm/xrdlog.mgm
362448
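
A per-fsid breakdown, in case that helps, can be produced with something like:

[root@mgm-1 ~]# grep 'file system marked as unavailable' /var/log/eos/mgm/xrdlog.mgm | grep -o 'fsid=[0-9]*' | sort | uniq -c | sort -rn | head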

Best,
Erich

Are these CMS jobs? Can you check if you see URLs with “tried=…” in the MGM log, or on the FST?

Hi Andreas,
indeed, these are CMS jobs; the specific CMSSW version I would have to find out. Grepping the logs for a specific log id from the same user, I see:

[root@mgm-1 ~]# grep logid=ad31e2ae-8752-11eb-a580-3868dd28d0c0 /var/log/eos/mgm/xrdlog.mgm
210317 19:57:52 time=1616007472.773562 func=open                     level=INFO  logid=ad31e2ae-8752-11eb-a580-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f49675fc700 source=XrdMgmOfsFile:500              tident=suman.ch.27850:802@[::ffff:172.24.76.252] sec=gsi   uid=43000 gid=43000 name=Suman Chatterjee geo="vbc" op=read path=/eos/vbc/experiments/cms/store/user/chatterj/ZH-0j-v2/ZH-0j-v2/210222_092728/0008/AODSIM_output_8827.root info=tried=fst-1.eos.grid.vbc.ac.at,fst-8.eos.grid.vbc.ac.at,fst-2.eos.grid.vbc.ac.at
210317 19:57:52 time=1616007472.774086 func=open                     level=INFO  logid=ad31e2ae-8752-11eb-a580-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f49675fc700 source=XrdMgmOfsFile:1037             tident=suman.ch.27850:802@[::ffff:172.24.76.252] sec=gsi   uid=43000 gid=43000 name=Suman Chatterjee geo="vbc" acl=0 r=0 w=0 wo=0 egroup=0 shared=0 mutable=1 facl=0
210317 19:57:52 time=1616007472.774885 func=open                     level=INFO  logid=ad31e2ae-8752-11eb-a580-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f49675fc700 source=XrdMgmOfsFile:2930             tident=suman.ch.27850:802@[::ffff:172.24.76.252] sec=gsi   uid=43000 gid=43000 name=Suman Chatterjee geo="vbc" op=read path=/eos/vbc/experiments/cms/store/user/chatterj/ZH-0j-v2/ZH-0j-v2/210222_092728/0008/AODSIM_output_8827.root info=tried=fst-1.eos.grid.vbc.ac.at,fst-8.eos.grid.vbc.ac.at,fst-2.eos.grid.vbc.ac.at target[0]=(__offline_fst-1.eos.grid.vbc.ac.at,1) target[1]=(fst-5.eos.grid.vbc.ac.at,113) target[2]=(__offline_fst-2.eos.grid.vbc.ac.at,29) target[3]=(fst-9.eos.grid.vbc.ac.at,225) target[4]=(fst-3.eos.grid.vbc.ac.at,57) target[5]=(fst-4.eos.grid.vbc.ac.at,85) target[6]=(__offline_fst-8.eos.grid.vbc.ac.at,197)  redirection=fst-5.eos.grid.vbc.ac.at?&cap.sym=<...>&cap.msg=<...>&mgm.logid=ad31e2ae-8752-11eb-a580-3868dd28d0c0&mgm.replicaindex=1&mgm.replicahead=1&mgm.id=0119aaac&mgm.mtime=1614123941 xrd_port=1095 http_port=9001
210317 19:57:52 time=1616007472.774916 func=open                     level=INFO  logid=ad31e2ae-8752-11eb-a580-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f49675fc700 source=XrdMgmOfsFile:2934             tident=suman.ch.27850:802@[::ffff:172.24.76.252] sec=gsi   uid=43000 gid=43000 name=Suman Chatterjee geo="vbc" info="redirection" hostport=fst-5.eos.grid.vbc.ac.at?&cap.sym=<...>&cap.msg=<...>&mgm.logid=ad31e2ae-8752-11eb-a580-3868dd28d0c0&mgm.replicaindex=1&mgm.replicahead=1&mgm.id=0119aaac&mgm.mtime=1614123941:1095
210317 19:57:52 time=1616007472.877784 func=open                     level=INFO  logid=ad41cb38-8752-11eb-b75a-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f4991bff700 source=XrdMgmOfsFile:500              tident=suman.ch.27850:802@[::ffff:172.24.76.252] sec=gsi   uid=43000 gid=43000 name=Suman Chatterjee geo="vbc" op=read path=/eos/vbc/experiments/cms/store/user/chatterj/ZH-0j-v2/ZH-0j-v2/210222_092728/0008/AODSIM_output_8827.root info=cap.msg=<...>&cap.sym=<...>&mgm.id=0119aaac&mgm.logid=ad31e2ae-8752-11eb-a580-3868dd28d0c0&mgm.mtime=1614123941&mgm.replicahead=1&mgm.replicaindex=1&tried=fst-1.eos.grid.vbc.ac.at,fst-8.eos.grid.vbc.ac.at,fst-2.eos.grid.vbc.ac.at,fst-5.eos.grid.vbc.ac.at,eos.grid.vbc.ac.at&triedrc=ioerr

Looks like it tries different FSTs and then eventually fails.
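
(To pull just the retry information out of those open lines, something like this works with the log format above:

[root@mgm-1 ~]# grep logid=ad31e2ae-8752-11eb-a580-3868dd28d0c0 /var/log/eos/mgm/xrdlog.mgm | grep -oE '(tried|triedrc)=[^& ]*'

and, if I read the target[] entries correctly, the number after each hostname is the fsid, so the __offline_ replicas can be cross-checked with e.g. eos fs status 197.)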
Best,
Erich

These vector read errors are just a result of the fact that some FSTs are excluded via this “tried=…” CGI and it has to run the reconstruction. If you just download this file, do you see any problem?
Can you try to take one file and follow all the open requests? There should first be an open without “tried=…”, and then it should append the FSTs one after the other in new open requests …
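
For the download test, an xrdcp through the MGM should be enough, e.g.:

xrdcp root://mgm-1.eos.grid.vbc.ac.at:1094//eos/vbc/experiments/cms/store/user/chatterj/ZH-0j-v2/ZH-0j-v2/210222_092728/0001/AODSIM_output_1092.root /tmp/

And to follow the opens for one of the affected files, something like:

grep 'AODSIM_output_1092.root' /var/log/eos/mgm/xrdlog.mgm | grep 'func=open'

should show the sequence, with the “tried=…” CGI growing from one open to the next (adjust path and log location as needed).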

Hi Andreas,

this is getting a bit out of hand with the logs here, so I’ve added this to the already-open SNOW ticket INC2738550. I will attach more logs there.

Best,
Erich