We have been transferring the data from our old DPM storage to our new EOS storage (4.8.4) and preparing everything to make it available as a storage element for WLCG workloads (CMS, ALICE & Belle). For more details see the post of my colleague.
Additionally we have set up fusex mount points (no automount) on our HPC system (around 160 nodes), which our users use to copy and access their data from interactive compute jobs.
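For context, each node runs a static eosxd mount configured roughly along these lines (config keys from memory, hostname and paths illustrative rather than our exact setup):

/etc/eos/fuse.vbc.conf:
{"name": "vbc", "hostport": "mgm-1.eos.grid.vbc.ac.at:1094", "remotemountdir": "/eos/vbc/", "localmountdir": "/eos/vbc/"}

mounted with something like: eosxd -ofsname=vbc /eos/vbc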
Recently we started to see two issues that might be related but we are not sure:
1.) FST marked as offline:
Recently we see some Site Availability Tests (SAM) that read a ROOT file via the xrootd endpoint from our compute nodes failing. The failures are not reproducible: the tests work most of the time, but sometimes they hit a read error on the EOS side, always for the same file. We made sure that the file is not corrupt by checking the checksum:
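The comparison was along these lines (the MGM hostname placeholder and the local path are illustrative):

eos fileinfo /eos/vbc/experiments/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root   # adler32 stored in the namespace
xrdcp -f root://<mgm>//eos/vbc/experiments/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root /tmp/sam-test.root
xrdadler32 /tmp/sam-test.root   # recomputed locally, matches the namespace value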
The tests access the exact same file and fail at different points in time while reading events from the ROOT file.
We then checked the logs and see the following error messages on the FST side:
200824 11:41:33 time=1598262093.975294 func=fileOpen level=ERROR logid=ff3291ea-e5ed-11ea-9094-3868dd2a6fb0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007f0644532700 source=XrdIo:280
tident=<service> sec= uid=0 gid=0 name= geo="" error= "open failed url=root://__offline_fst-7.eos.grid.vbc.ac.at:1095///eos/vbc/experiments/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSS
W_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root?cap.msg=3nbHSbN5UOFAk/eRkrrmP8ZgCi4h4ZJ27MpcXGvCJ7mMf5KALLfmlHQNpNcdXU+DlP2ws49B54c98UG26I8wT3uR0utbkA1GzkY4Pz+pd/uWO9yE+
Q7wj1EdWxKKmjeH4o+NsIwHiFLtOP/ID7vUJ9jN3MYy85wX8fJPZ/SwBnusv8uTychkSRZA3rh5+JBBpjVscNPe2DSCzJR3blRV3tRlB4zM7KJmlnSSi6JY4LYrrpjSsDMBMacCkFASlbAFXGJdEEELJd+zIyNDuOcwrp9TOrAPoDcNjeuRdHh8CR4yUZlq3nUo+jlGWSflx
ls177SbUL4tmHpYKlXT+/iUY21UJUg9FgRLFJ29RARndi1uO9NNVS4MunwL2ttUi1kU6FR1XaFLOFSrhAc53uz8RduH9hpI9G6RZhfKmRkSscnAM9kLQFiI8mtY8uByL34S5E9QRUfQiaKBGIcVUy+EwaL25AdeXEwA8QY12ovdNw6k8YyOhV/lmkx/VgU/YRcpdn1Mxg7as
PK92hLl0UIFRPRU/nfQBK5N+wIXos2aka0yrXW2k4sPH0UpFWdk/0NRh28fdIRbRvIfRbsy9DywR5EARGLsM6O38BCGw+He3gwY2uXhXg6nsm7v99TzrD11i1TTmnQzGJvgsYFswFM5jErhs+WkHVq4gCtRnLIe+lwJ9xvrNqUDfgq0duYbFW73hlmX302819sWiWlyeB/Ql
2Oo3axa152+onvhahlsXATAvP1p2kUfeRsrSIg2gVRCAQRzdH06ED8PMo/U+hI1Nd/zxBz2TGmPpX7yTv6VgYXK4HpPOFeUgIhW/bTFDhpi29aObImEIPXu/7ff7Jy7Q7Claky+OLReD3fEW0Zvq2T0tv6NWxnl6HZDsNqvczUDvM2jyQRj0uAUejyB0Bz5LXCTmYpCVlCUN
w+dFkHRhSg2MKLr/uMrjc+yMFVgCWQBCiCRS2u5mOhJvqb/E/QcYiiGRWP6aomhAcnw3MLBKkjDsb8V2IrLs0dcCwd6taIhDdZqX5oj2kod9E55CfgeyT5jDh24KnVtK/j6zXoZqL4+f7yEJcdGc86mnVmhyq5j/34U/kKjlu1rxMpO6mfYb9xx75w/2Bo4sF5+3Haoqjjcw
jrIMqUDCqseWgpk1NeOljbDis8AcZTXIBERNfJsuDNNl+cfNHNpPk8CKRIf6KzxrcnJCi0H/fP4KyU9JHwo+YqcJdwW1qCSqh7MQ9M/XweW5T9LBuCijZDCfyrBSPm2jooHHD10Ex4voT2KUv9ZQKoiNSUP35V771ONQszqB+aZWZBHirdKvqfoY17oZ1SRb4cXvMtx30Lc4
bnfLdfg4B/T0Zjq7yI=&cap.sym=TU8aiZKgP9JVwZSe2/tlKPrdWfk=&fst.blocksize=1048576&fst.readahead=true&fst.valid=1598262152&mgm.bookingsize=0&mgm.id=0024fca7&mgm.lid=543426626&mgm.logid=ff2fe846-e5ed-11ea-bbdb
-3868dd28d0c0&mgm.mtime=1597858863&mgm.path=/eos/vbc/experiments/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root&mgm.replicahead
=1&mgm.replicaindex=5&tried=fst-7.eos.grid.vbc.ac.at, errno=0, errc=101, msg=[FATAL] Invalid address"
200824 11:41:33 time=1598262093.975356 func=fileOpen level=WARN logid=ff3291ea-e5ed-11ea-9094-3868dd2a6fb0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007f0644532700 source=XrdIo:284
tident=<service> sec= uid=0 gid=0 name= geo="" msg="error encountered despite errno=0; setting errno=22"
200824 11:41:33 time=1598262093.975377 func=Open level=WARN logid=ff2fe846-e5ed-11ea-bbdb-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007f0644532700 source=RaidMetaLayou
t:311 tident=grid.cms.148:184@clip-c2-79.cbe.vbc.ac.at sec=unix uid=0 gid=0 name=grid.cms.pool002 geo="" msg="open failed on remote stripe: root://__offline_fst-7.eos.grid.vbc.ac.at:1095///eo
s/vbc/experiments/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-0025905B85BA.root?&cap.msg=3nbHSbN5UOFAk/eRkrrmP8ZgCi4h4ZJ27MpcXGvCJ7mMf5KALLfml
HQNpNcdXU+DlP2ws49B54c98UG26I8wT3uR0utbkA1GzkY4Pz+pd/uWO9yE+Q7wj1EdWxKKmjeH4o+NsIwHiFLtOP/ID7vUJ9jN3MYy85wX8fJPZ/SwBnusv8uTychkSRZA3rh5+JBBpjVscNPe2DSCzJR3blRV3tRlB4zM7KJmlnSSi6JY4LYrrpjSsDMBMacCkFASlbAFX
GJdEEELJd+zIyNDuOcwrp9TOrAPoDcNjeuRdHh8CR4yUZlq3nUo+jlGWSflxls177SbUL4tmHpYKlXT+/iUY21UJUg9FgRLFJ29RARndi1uO9NNVS4MunwL2ttUi1kU6FR1XaFLOFSrhAc53uz8RduH9hpI9G6RZhfKmRkSscnAM9kLQFiI8mtY8uByL34S5E9QRUfQiaKBG
IcVUy+EwaL25AdeXEwA8QY12ovdNw6k8YyOhV/lmkx/VgU/YRcpdn1Mxg7asPK92hLl0UIFRPRU/nfQBK5N+wIXos2aka0yrXW2k4sPH0UpFWdk/0NRh28fdIRbRvIfRbsy9DywR5EARGLsM6O38BCGw+He3gwY2uXhXg6nsm7v99TzrD11i1TTmnQzGJvgsYFswFM5jErhs
+WkHVq4gCtRnLIe+lwJ9xvrNqUDfgq0duYbFW73hlmX302819sWiWlyeB/Ql2Oo3axa152+onvhahlsXATAvP1p2kUfeRsrSIg2gVRCAQRzdH06ED8PMo/U+hI1Nd/zxBz2TGmPpX7yTv6VgYXK4HpPOFeUgIhW/bTFDhpi29aObImEIPXu/7ff7Jy7Q7Claky+OLReD3fEW
0Zvq2T0tv6NWxnl6HZDsNqvczUDvM2jyQRj0uAUejyB0Bz5LXCTmYpCVlCUNw+dFkHRhSg2MKLr/uMrjc+yMFVgCWQBCiCRS2u5mOhJvqb/E/QcYiiGRWP6aomhAcnw3MLBKkjDsb8V2IrLs0dcCwd6taIhDdZqX5oj2kod9E55CfgeyT5jDh24KnVtK/j6zXoZqL4+f7yEJ
cdGc86mnVmhyq5j/34U/kKjlu1rxMpO6mfYb9xx75w/2Bo4sF5+3HaoqjjcwjrIMqUDCqseWgpk1NeOljbDis8AcZTXIBERNfJsuDNNl+cfNHNpPk8CKRIf6KzxrcnJCi0H/fP4KyU9JHwo+YqcJdwW1qCSqh7MQ9M/XweW5T9LBuCijZDCfyrBSPm2jooHHD10Ex4voT2KU
v9ZQKoiNSUP35V771ONQszqB+aZWZBHirdKvqfoY17oZ1SRb4cXvMtx30Lc4bnfLdfg4B/T0Zjq7yI=&cap.sym=TU8aiZKgP9JVwZSe2/tlKPrdWfk=&mgm.id=0024fca7&mgm.logid=ff2fe846-e5ed-11ea-bbdb-3868dd28d0c0&mgm.mtime=1597858863&mgm
.replicahead=1&mgm.replicaindex=5&tried=fst-7.eos.grid.vbc.ac.at&mgm.path=/eos/vbc/experiments/cms/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/10000/887C13FB-8B31-E711-BCE7-002
5905B85BA.root"
It seems that when the file is re-assembled from its stripes (we are using a RAID6 layout with 9 stripes), some of the FSTs are marked as offline (e.g. root://__offline_fst-7.eos.grid.vbc.ac.at in the log above) and the read fails. This happens randomly across all FSTs. We checked the FSTs themselves and couldn't see anything out of the ordinary; they seem to be running fine the entire time.
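A sketch of what we look at on the MGM while this happens (as far as we can tell, the affected node and its filesystems are reported as online/booted):

eos node ls        # per-FST node status and heartbeat
eos fs ls          # per-filesystem boot status, configstatus and active state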
What could be the reason why the FSTs are marked offline? We did check the QuarkDB status with quarkdb-health and it seems to be fine:
[root@mgm-1 mgm]# redis-cli -p 9999 quarkdb-health
1) NODE-HEALTH GREEN
2) NODE mgm-1.eos.grid.vbc.ac.at:9999
3) VERSION 0.4.2
4) ----------
5) GREEN >> SM-FREE-SPACE 781253386240 bytes (97.7653%)
6) GREEN >> SM-MANIFEST-TIMEDIFF 0 sec
7) GREEN >> PART-OF-QUORUM Yes | LEADER mgm-2.eos.grid.vbc.ac.at:9999
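We can also provide the raft view from each QuarkDB node if that helps, e.g.:

redis-cli -p 9999 raft-info    # term, leader, log size and replication status per node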
2.) Our users mostly use fusex to copy/access data on the EOS storage, but they report that about 50% of the time they receive errors such as "Input/output error" or "Protocol error".
We could partially reproduce this: when we run md5sum on a file over fusex on one node we receive an "Input/output error", while running it on a node with a fresh fusex mount point gives the correct md5 checksum.
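A sketch of how we compare the broken and the fresh client from the MGM side (assuming eos fusex ls is the right tool for this):

eos fusex ls       # connected eosxd clients with version, heartbeat and cap statistics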
We checked the MGM and FST logs and couldn't really see any errors.
The fuse.eos.log on the node where we get the error shows the following messages:
200824 14:46:02 t=1598273162.328177 f=HandleResponseWithHosts l=ERROR tid=00007fdbea5d9700 s=xrdclproxy:469 state=failed async open returned errmsg=[FATAL] Redirect limit has been reached
200824 14:46:02 t=1598273162.328245 f=prefetch l=ERROR ino:000250b5d0000000 s=data:854 pre-fetch failed error=[FATAL] Redirect limit has been reached
200824 14:46:02 t=1598273162.333197 f=TryRecovery l=ERROR ino:000250b5d0000000 s=data:1034 unrecoverable error - code=306 errNo=0
200824 14:46:02 t=1598273162.333219 f=peek_pread l=ERROR ino:000250b5d0000000 s=data:2447 sync remote-io recovery failed errno=71
200824 14:46:02 t=1598273162.333359 f=prefetch l=ERROR ino:000250b5d0000000 s=data:854 pre-fetch failed error=[FATAL] Redirect limit has been reached
# -------------------
# - recovery record -
# -------------------
# path := '/eos/vbc/incoming/incoming_natascha/store/mc/RunIISummer17MiniAOD/TT_TuneCUETP8M2T4_13TeV-powheg-pythia8/MINIAODSIM/92X_upgrade2017_realistic_v10_ext1-v2/50000/0A1CC839-2899-E711-8781-4C79BA180A45.root'
# fid := 2427741
# -[ 000 ] 200824 14:46:02 t=1598273162.328259 f=peek_pread l= ino:000250b5d0000000 s=data:2424 status='[FATAL] Redirect limit has been reached' hint='will TryRecovery'
# -[ 001 ] 200824 14:46:02 t=1598273162.333215 f=peek_pread l= ino:000250b5d0000000 s=data:2431 status='[FATAL] Redirect limit has been reached' errno='71' hint='failed TryRecovery'
On the node where it works we don't see any of those messages.
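Since the failures end with "[FATAL] Redirect limit has been reached", one thing we could try on an affected node is raising the XRootD client redirect limit (assuming eosxd honours the standard XrdCl environment variables; the default limit is 16):

export XRD_REDIRECTLIMIT=64    # set in the environment of the eosxd process before (re)mounting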
Additionally, in the FST log we sometimes see the following log messages:
200824 14:48:27 time=1598273307.984288 func=ProcessCapOpaque level=WARN logid=1b3545a8-e608-11ea-a066-3868dd28d968 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007f089b7fe700 source=XrdFstOfsFile
:2323 tident=daemon.8103:111@fst-3 sec=(null) uid=99 gid=99 name=(null) geo="" msg="MGM master seems to have changed - adjusting global config" old-manager="mgm-1.eos.grid.vbc.ac.at:1094" new-manager="mgm-3.eos.grid.vbc.ac.at:1094"
200824 14:48:44 time=1598273324.287918 func=ProcessCapOpaque level=WARN logid=24fa7f22-e608-11ea-80dd-3868dd28d0c0 unit=fst@fst-9.eos.grid.vbc.ac.at:1095 tid=00007f0644532700 source=XrdFstOfsFile:2323 tident=daemon.8103:111@fst-3 sec=(null) uid=99 gid=99 name=(null) geo="" msg="MGM master seems to have changed - adjusting global config" old-manager="mgm-3.eos.grid.vbc.ac.at:1094" new-
manager="mgm-1.eos.grid.vbc.ac.at:1094"
It looks like the master moves from mgm-1 to mgm-3 and 17 seconds later back to mgm-1, which might explain the fusex errors.
However, when we check the mgm-3 log we cannot see that it ever became master:
[root@mgm-3 mgm]# grep -c is_master=false xrdlog.mgm
6740
[root@mgm-3 mgm]# grep -c is_master=true xrdlog.mgm
0
[root@mgm-3 mgm]#
We also don't see any real load on mgm-1 or any other errors that would indicate that the master moved to another node.
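In case we are checking the wrong thing: we look at the master status roughly like this on each MGM:

eos ns | grep -i master    # prints is_master and the current master identity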
The fact that fusex is broken about 50% of the time makes it unusable for our users.
We would really appreciate any help or hints on how to further debug these issues.