Hi all,
Due to traffic, I’ve recently set the log level to WARN for all nodes, now a lot of these messages came to light: “file system marked as unavailable”.
A typical pattern seems to be this: filesystem unavailable for all replicas of the file, then the i/o error is sent to the client as no replicas can be located and the file cannot be reconstructed?
210317 15:24:01 time=1615991041.405567 func=accessHeadReplicaMultipleGroup level=WARN logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857 tident=<service> sec= uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=238
210317 15:24:01 time=1615991041.405602 func=accessHeadReplicaMultipleGroup level=WARN logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857 tident=<service> sec= uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=42
210317 15:24:01 time=1615991041.405616 func=accessHeadReplicaMultipleGroup level=WARN logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857 tident=<service> sec= uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=182
210317 15:24:01 time=1615991041.543614 func=accessHeadReplicaMultipleGroup level=WARN logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f487aafd700 source=GeoTreeEngine:1857 tident=<service> sec= uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=154
210317 15:24:01 time=1615991041.543683 func=accessHeadReplicaMultipleGroup level=WARN logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f496b6fd700 source=GeoTreeEngine:1857 tident=<service> sec= uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=51
210317 15:24:01 time=1615991041.543720 func=accessHeadReplicaMultipleGroup level=WARN logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f496b6fd700 source=GeoTreeEngine:1857 tident=<service> sec= uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=219
210317 15:24:01 time=1615991041.543740 func=accessHeadReplicaMultipleGroup level=WARN logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f496b6fd700 source=GeoTreeEngine:1857 tident=<service> sec= uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=135
210317 15:24:01 time=1615991041.543795 func=accessHeadReplicaMultipleGroup level=WARN logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857 tident=<service> sec= uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=238
210317 15:24:01 time=1615991041.543823 func=accessHeadReplicaMultipleGroup level=WARN logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857 tident=<service> sec= uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=42
210317 15:24:01 time=1615991041.543837 func=accessHeadReplicaMultipleGroup level=WARN logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857 tident=<service> sec= uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=98
210317 15:24:01 time=1615991041.543853 func=accessHeadReplicaMultipleGroup level=WARN logid=f7ea1314-847a-11eb-8a39-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=GeoTreeEngine:1857 tident=<service> sec= uid=0 gid=0 name= geo="" msg="file system marked as unavailable" fsid=182
210317 15:24:01 time=1615991041.543874 func=Emsg level=ERROR logid=6b68fca2-872c-11eb-9685-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007f493a6fa700 source=XrdMgmOfsFile:3231 tident=suman.ch.8749:1417@[::ffff:172.24.77.154] sec=gsi uid=43000 gid=43000 name=Suman Chatterjee geo="vbc" Unable to open file /eos/vbc/experiments/cms/store/user/chatterj/ZH-0j-v2/ZH-0j-v2/210222_092728/0001/AODSIM_output_1092.root; Input/output error
In our monitoring, we check the filesystem status on a perr-minute basis - we have not seen any outages in network/node/disk for the recent times.
Any hints on this?
Edit: at the same time, for thee same user, I see a lot of “vector read errors”:
210317 15:24:01 time=1615991041.066417 func=ReadV level=ERROR logid=d46c0bd2-8726-11eb-a034-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007fa0dc996700 source=RainMetaLayout:778 tident=suman.ch.35910:1025@[::ffff:172.24.76.166] sec=unix uid=0 gid=0 name=suman.chatterjee geo="" msg="vector read error" offset=1654920564, length=11044
210317 15:24:01 time=1615991041.066450 func=ReadV level=ERROR logid=d46c0bd2-8726-11eb-a034-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007fa0dc996700 source=RainMetaLayout:778 tident=suman.ch.35910:1025@[::ffff:172.24.76.166] sec=unix uid=0 gid=0 name=suman.chatterjee geo="" msg="vector read error" offset=1655067658, length=108681
210317 15:24:01 time=1615991041.066466 func=ReadV level=ERROR logid=d46c0bd2-8726-11eb-a034-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007fa0dc996700 source=RainMetaLayout:778 tident=suman.ch.35910:1025@[::ffff:172.24.76.166] sec=unix uid=0 gid=0 name=suman.chatterjee geo="" msg="vector read error" offset=1655216292, length=88256
210317 15:24:01 time=1615991041.066478 func=ReadV level=ERROR logid=d46c0bd2-8726-11eb-a034-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007fa0dc996700 source=RainMetaLayout:778 tident=suman.ch.35910:1025@[::ffff:172.24.76.166] sec=unix uid=0 gid=0 name=suman.chatterjee geo="" msg="vector read error" offset=1655338418, length=38438
210317 15:24:01 time=1615991041.066489 func=ReadV level=ERROR logid=d46c0bd2-8726-11eb-a034-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007fa0dc996700 source=RainMetaLayout:778 tident=suman.ch.35910:1025@[::ffff:172.24.76.166] sec=unix uid=0 gid=0 name=suman.chatterjee geo="" msg="vector read error" offset=1655432624, length=116243
210317 15:24:01 time=1615991041.066503 func=ReadV level=ERROR logid=d46c0bd2-8726-11eb-a034-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007fa0dc996700 source=RainMetaLayout:778 tident=suman.ch.35910:1025@[::ffff:172.24.76.166] sec=unix uid=0 gid=0 name=suman.chatterjee geo="" msg="vector read error" offset=1655584012, length=8863
Edit2: the high number of errors in the mgm log is > 300k log lines:
[root@mgm-1 ~]# grep -c “file system marked as unavailable” /var/log/eos/mgm/xrdlog.mgm
362448
Best,
Erich