We have been transferring the data from our old DPM storage to our new EOS storage (4.8.4) and preparing everything to make it available as a storage element for WLCG workloads (CMS, ALICE & Belle). For more details see the post of my colleague.
Additionally we have set up fusex mount points (no automount) on our HPC system (around 160 nodes), which our users use to copy and access their data from interactive compute jobs.
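For context, each node runs a static eosxd mount configured roughly along these lines (config keys from memory, hostname and paths illustrative rather than our exact setup):

/etc/eos/fuse.vbc.conf:
{"name": "vbc", "hostport": "mgm-1.eos.grid.vbc.ac.at:1094", "remotemountdir": "/eos/vbc/", "localmountdir": "/eos/vbc/"}

mounted with something like: eosxd -ofsname=vbc /eos/vbc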
Recently we started to see two issues that might be related but we are not sure:
1.) FST marked as offline:
Recently we see some Site Availability Tests (SAM) that read a ROOT file via the xrootd endpoint from our compute nodes failing. The failures are not reproducible: the tests work most of the time, but sometimes they hit a read error on the EOS side, always for the same file. We made sure that the file is not corrupt by checking the checksum:
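The comparison was along these lines (the MGM hostname placeholder and the local path are illustrative):

eos fileinfo /eos/vbc/experiments/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root   # adler32 stored in the namespace
xrdcp -f root://<mgm>//eos/vbc/experiments/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root /tmp/sam-test.root
xrdadler32 /tmp/sam-test.root   # recomputed locally, matches the namespace value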
The tests access the exact same file and fail at different points in time while reading events from the ROOT file.
We then checked the logs and see the following error messages on the FST side:
200824 11:41:33 time=1598262093.975294 func=fileOpen level=ERROR logid=ff3291ea-e5ed-11ea-9094-3868dd2a6fb0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007f0644532700 source=XrdIo:280
tident=<service> sec= uid=0 gid=0 name= geo="" error= "open failed url=root://__offline_fst-7.eos.grid.vbc.ac.at:1095///eos/vbc/experiments/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSS
W_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root?cap.msg=3nbHSbN5UOFAk/eRkrrmP8ZgCi4h4ZJ27MpcXGvCJ7mMf5KALLfmlHQNpNcdXU+DlP2ws49B54c98UG26I8wT3uR0utbkA1GzkY4Pz+pd/uWO9yE+
Q7wj1EdWxKKmjeH4o+NsIwHiFLtOP/ID7vUJ9jN3MYy85wX8fJPZ/SwBnusv8uTychkSRZA3rh5+JBBpjVscNPe2DSCzJR3blRV3tRlB4zM7KJmlnSSi6JY4LYrrpjSsDMBMacCkFASlbAFXGJdEEELJd+zIyNDuOcwrp9TOrAPoDcNjeuRdHh8CR4yUZlq3nUo+jlGWSflx
ls177SbUL4tmHpYKlXT+/iUY21UJUg9FgRLFJ29RARndi1uO9NNVS4MunwL2ttUi1kU6FR1XaFLOFSrhAc53uz8RduH9hpI9G6RZhfKmRkSscnAM9kLQFiI8mtY8uByL34S5E9QRUfQiaKBGIcVUy+EwaL25AdeXEwA8QY12ovdNw6k8YyOhV/lmkx/VgU/YRcpdn1Mxg7as
PK92hLl0UIFRPRU/nfQBK5N+wIXos2aka0yrXW2k4sPH0UpFWdk/0NRh28fdIRbRvIfRbsy9DywR5EARGLsM6O38BCGw+He3gwY2uXhXg6nsm7v99TzrD11i1TTmnQzGJvgsYFswFM5jErhs+WkHVq4gCtRnLIe+lwJ9xvrNqUDfgq0duYbFW73hlmX302819sWiWlyeB/Ql
2Oo3axa152+onvhahlsXATAvP1p2kUfeRsrSIg2gVRCAQRzdH06ED8PMo/U+hI1Nd/zxBz2TGmPpX7yTv6VgYXK4HpPOFeUgIhW/bTFDhpi29aObImEIPXu/7ff7Jy7Q7Claky+OLReD3fEW0Zvq2T0tv6NWxnl6HZDsNqvczUDvM2jyQRj0uAUejyB0Bz5LXCTmYpCVlCUN
w+dFkHRhSg2MKLr/uMrjc+yMFVgCWQBCiCRS2u5mOhJvqb/E/QcYiiGRWP6aomhAcnw3MLBKkjDsb8V2IrLs0dcCwd6taIhDdZqX5oj2kod9E55CfgeyT5jDh24KnVtK/j6zXoZqL4+f7yEJcdGc86mnVmhyq5j/34U/kKjlu1rxMpO6mfYb9xx75w/2Bo4sF5+3Haoqjjcw
jrIMqUDCqseWgpk1NeOljbDis8AcZTXIBERNfJsuDNNl+cfNHNpPk8CKRIf6KzxrcnJCi0H/fP4KyU9JHwo+YqcJdwW1qCSqh7MQ9M/XweW5T9LBuCijZDCfyrBSPm2jooHHD10Ex4voT2KUv9ZQKoiNSUP35V771ONQszqB+aZWZBHirdKvqfoY17oZ1SRb4cXvMtx30Lc4
bnfLdfg4B/T0Zjq7yI=&cap.sym=TU8aiZKgP9JVwZSe2/tlKPrdWfk=&fst.blocksize=1048576&fst.readahead=true&fst.valid=1598262152&mgm.bookingsize=0&mgm.id=0024fca7&mgm.lid=543426626&mgm.logid=ff2fe846-e5ed-11ea-bbdb
-3868dd28d0c0&mgm.mtime=1597858863&mgm.path=/eos/vbc/experiments/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root&mgm.replicahead
=1&mgm.replicaindex=5&tried=fst-7.eos.grid.vbc.ac.at, errno=0, errc=101, msg=[FATAL] Invalid address"
200824 11:41:33 time=1598262093.975356 func=fileOpen level=WARN logid=ff3291ea-e5ed-11ea-9094-3868dd2a6fb0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007f0644532700 source=XrdIo:284
tident=<service> sec= uid=0 gid=0 name= geo="" msg="error encountered despite errno=0; setting errno=22"
200824 11:41:33 time=1598262093.975377 func=Open level=WARN logid=ff2fe846-e5ed-11ea-bbdb-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007f0644532700 source=RaidMetaLayou
t:311 tident=grid.cms.148:184@clip-c2-79.cbe.vbc.ac.at sec=unix uid=0 gid=0 name=grid.cms.pool002 geo="" msg="open failed on remote stripe: root://__offline_fst-7.eos.grid.vbc.ac.at:1095///eo
s/vbc/experiments/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root?&cap.msg=3nbHSbN5UOFAk/eRkrrmP8ZgCi4h4ZJ27MpcXGvCJ7mMf5KALLfml
HQNpNcdXU+DlP2ws49B54c98UG26I8wT3uR0utbkA1GzkY4Pz+pd/uWO9yE+Q7wj1EdWxKKmjeH4o+NsIwHiFLtOP/ID7vUJ9jN3MYy85wX8fJPZ/SwBnusv8uTychkSRZA3rh5+JBBpjVscNPe2DSCzJR3blRV3tRlB4zM7KJmlnSSi6JY4LYrrpjSsDMBMacCkFASlbAFX
GJdEEELJd+zIyNDuOcwrp9TOrAPoDcNjeuRdHh8CR4yUZlq3nUo+jlGWSflxls177SbUL4tmHpYKlXT+/iUY21UJUg9FgRLFJ29RARndi1uO9NNVS4MunwL2ttUi1kU6FR1XaFLOFSrhAc53uz8RduH9hpI9G6RZhfKmRkSscnAM9kLQFiI8mtY8uByL34S5E9QRUfQiaKBG
IcVUy+EwaL25AdeXEwA8QY12ovdNw6k8YyOhV/lmkx/VgU/YRcpdn1Mxg7asPK92hLl0UIFRPRU/nfQBK5N+wIXos2aka0yrXW2k4sPH0UpFWdk/0NRh28fdIRbRvIfRbsy9DywR5EARGLsM6O38BCGw+He3gwY2uXhXg6nsm7v99TzrD11i1TTmnQzGJvgsYFswFM5jErhs
+WkHVq4gCtRnLIe+lwJ9xvrNqUDfgq0duYbFW73hlmX302819sWiWlyeB/Ql2Oo3axa152+onvhahlsXATAvP1p2kUfeRsrSIg2gVRCAQRzdH06ED8PMo/U+hI1Nd/zxBz2TGmPpX7yTv6VgYXK4HpPOFeUgIhW/bTFDhpi29aObImEIPXu/7ff7Jy7Q7Claky+OLReD3fEW
0Zvq2T0tv6NWxnl6HZDsNqvczUDvM2jyQRj0uAUejyB0Bz5LXCTmYpCVlCUNw+dFkHRhSg2MKLr/uMrjc+yMFVgCWQBCiCRS2u5mOhJvqb/E/QcYiiGRWP6aomhAcnw3MLBKkjDsb8V2IrLs0dcCwd6taIhDdZqX5oj2kod9E55CfgeyT5jDh24KnVtK/j6zXoZqL4+f7yEJ
cdGc86mnVmhyq5j/34U/kKjlu1rxMpO6mfYb9xx75w/2Bo4sF5+3HaoqjjcwjrIMqUDCqseWgpk1NeOljbDis8AcZTXIBERNfJsuDNNl+cfNHNpPk8CKRIf6KzxrcnJCi0H/fP4KyU9JHwo+YqcJdwW1qCSqh7MQ9M/XweW5T9LBuCijZDCfyrBSPm2jooHHD10Ex4voT2KU
v9ZQKoiNSUP35V771ONQszqB+aZWZBHirdKvqfoY17oZ1SRb4cXvMtx30Lc4bnfLdfg4B/T0Zjq7yI=&cap.sym=TU8aiZKgP9JVwZSe2/tlKPrdWfk=&mgm.id=0024fca7&mgm.logid=ff2fe846-e5ed-11ea-bbdb-3868dd28d0c0&mgm.mtime=1597858863&mgm
.replicahead=1&mgm.replicaindex=5&tried=fst-7.eos.grid.vbc.ac.at&mgm.path=/eos/vbc/experiments/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-002
5905B85BA.root"
It seems that when the file is re-assembled from its stripes (we are using a RAID6 layout with 9 stripes), some of the FSTs are marked as offline (e.g. root://__offline_fst-7.eos.grid.vbc.ac.at in the log above) and the read fails. This happens randomly across all FSTs. We checked the FSTs themselves and couldn't see anything out of the ordinary; they seem to be running fine the entire time.
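A sketch of what we look at on the MGM while this happens (as far as we can tell, the affected node and its filesystems are reported as online/booted):

eos node ls        # per-FST node status and heartbeat
eos fs ls          # per-filesystem boot status, configstatus and active state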
What could be the reason why the FSTs are marked offline? We did check the QuarkDB status with quarkdb-health and it seems to be fine:
[root@mgm-1 mgm]# redis-cli -p 9999 quarkdb-health
1) NODE-HEALTH GREEN
2) NODE mgm-1.eos.grid.vbc.ac.at:9999
3) VERSION 0.4.2
4) ----------
5) GREEN >> SM-FREE-SPACE 781253386240 bytes (97.7653%)
6) GREEN >> SM-MANIFEST-TIMEDIFF 0 sec
7) GREEN >> PART-OF-QUORUM Yes | LEADER mgm-2.eos.grid.vbc.ac.at:9999
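We can also provide the raft view from each QuarkDB node if that helps, e.g.:

redis-cli -p 9999 raft-info    # term, leader, log size and replication status per node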
2.) Our users mostly use fusex to copy/access data on the EOS storage, but they report that about 50% of the time they receive errors such as "Input/output error" or "Protocol error".
We could partially reproduce this: when we run md5sum on a file over fusex on one node we receive an "Input/output error", while running it on a node with a fresh fusex mount point gives the correct md5 checksum.
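A sketch of how we compare the broken and the fresh client from the MGM side (assuming eos fusex ls is the right tool for this):

eos fusex ls       # connected eosxd clients with version, heartbeat and cap statistics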
We checked the MGM and FST logs and couldn't really see any errors.
The fuse.eos.log on the node where we get the error shows the following messages:
200824 14:46:02 t=1598273162.328177 f=HandleResponseWithHosts l=ERROR tid=00007fdbea5d9700 s=xrdclproxy:469 state=failed async open returned errmsg=[FATAL] Redirect limit has been reached
200824 14:46:02 t=1598273162.328245 f=prefetch l=ERROR ino:000250b5d0000000 s=data:854 pre-fetch failed error=[FATAL] Redirect limit has been reached
200824 14:46:02 t=1598273162.333197 f=TryRecovery l=ERROR ino:000250b5d0000000 s=data:1034 unrecoverable error - code=306 errNo=0
200824 14:46:02 t=1598273162.333219 f=peek_pread l=ERROR ino:000250b5d0000000 s=data:2447 sync remote-io recovery failed errno=71
200824 14:46:02 t=1598273162.333359 f=prefetch l=ERROR ino:000250b5d0000000 s=data:854 pre-fetch failed error=[FATAL] Redirect limit has been reached
# -------------------
# - recovery record -
# -------------------
# path := '/eos/vbc/incoming/incoming_natascha/store/mc/RunIISummer17MiniAOD/TT_TuneCUETP8M2T4_13TeV-powheg-pythia8/MINIAODSIM/92X_upgrade2017_realistic_v10_ext1-v2/50000/0A1CC839-2899-E711-8781-4C79BA180A45.root'
# fid := 2427741
# -[ 000 ] 200824 14:46:02 t=1598273162.328259 f=peek_pread l= ino:000250b5d0000000 s=data:2424 status='[FATAL] Redirect limit has been reached' hint='will TryRecovery'
# -[ 001 ] 200824 14:46:02 t=1598273162.333215 f=peek_pread l= ino:000250b5d0000000 s=data:2431 status='[FATAL] Redirect limit has been reached' errno='71' hint='failed TryRecovery'
On the node where it works we don't see any of those messages.
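Since the failures end with "[FATAL] Redirect limit has been reached", one thing we could try on an affected node is raising the XRootD client redirect limit (assuming eosxd honours the standard XrdCl environment variables; the default limit is 16):

export XRD_REDIRECTLIMIT=64    # set in the environment of the eosxd process before (re)mounting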
Additionally, in the FST log we sometimes see the following log messages:
200824 14:48:27 time=1598273307.984288 func=ProcessCapOpaque level=WARN logid=1b3545a8-e608-11ea-a066-3868dd28d968 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007f089b7fe700 source=XrdFstOfsFile
:2323 tident=daemon.8103:111@fst-3 sec=(null) uid=99 gid=99 name=(null) geo="" msg="MGM master seems to have changed - adjusting global config" old-manager="mgm-1.eos.grid.vbc.ac.at:1094" new-manager="mgm-3.eos.grid.vbc.ac.at:1094"
200824 14:48:44 time=1598273324.287918 func=ProcessCapOpaque level=WARN logid=24fa7f22-e608-11ea-80dd-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007f0644532700 source=XrdFstOfsFile:2323 tident=daemon.8103:111@fst-3 sec=(null) uid=99 gid=99 name=(null) geo="" msg="MGM master seems to have changed - adjusting global config" old-manager="mgm-3.eos.grid.vbc.ac.at:1094" new-
manager="mgm-1.eos.grid.vbc.ac.at:1094"
It looks like the master moves from mgm-1 to mgm-3 and 17 seconds later back to mgm-1, which might explain the fusex errors.
However, when we check the mgm-3 log we cannot see that it ever became master:
[root@mgm-3 mgm]# grep -c is_master=false xrdlog.mgm
6740
[root@mgm-3 mgm]# grep -c is_master=true xrdlog.mgm
0
[root@mgm-3 mgm]#
We also don't see any real load on mgm-1 or any other errors that would indicate that the master moved to another node.
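In case we are checking the wrong thing: we look at the master status roughly like this on each MGM:

eos ns | grep -i master    # prints is_master and the current master identity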
The fact that fusex is broken about 50% of the time makes it unusable for our users.
We would really appreciate any help or hints on how to further debug these issues.