I’ve been setting up EOS in standalone mode (i.e. only one instance of each service) for testing purposes, although in the long run I plan to add more FSTs. To try things out, I decided to copy a bunch of files to it, and I chose my entire ~/git folder for this.
I started seeing weird errors from eosxd3 about no space being left on device. At first I thought this was caused by the cache partition not being big enough, but that was not it:
240118 07:03:39 t=1705561419.778166 f=HandleResponse l=CRIT tid=00007f5d7dbfa700 s=xrdclproxy:922 write error '[ERROR] Operation expired'
---- high rate error messages suppressed ----
240118 07:03:39 t=1705561419.788031 f=flush_nolock l=ERROR ino:00004f6d10000000 s=data:373 write error error=[ERROR] Operation expired
240118 07:03:39 t=1705561419.788050 f=TryRecovery l=CRIT ino:00004f6d10000000 s=data:1103 triggering write recovery state = 2
240118 07:03:39 t=1705561419.788556 f=lookupNonLocalJail l=ALERT tid=00007f5d81bff700 s=SecurityChecker:212 Failed to openat file
---- high rate error messages suppressed ----
240118 07:03:39 t=1705561419.788702 f=lookupNonLocalJail l=ALERT tid=00007f5d81bff700 s=SecurityChecker:198 Failed to openat next child
240118 07:03:39 t=1705561419.949411 f=recover_write l=WARN ino:00004f6d10000000 s=data:1887 re-opening with repair flag for recovery root://AAAAAAAE@mgm.services-eos.svc.c.k3s.fsn.lama.tel:1094//fusex-open?eos.app=fuse&eos.bookingsize=0&eos.lfn=ino:4f6d10000000&fuse.exe=rsync&fuse.gid=1000&fuse.pid=3944602&fuse.uid=1000&fuse.ver=5.2.4&mgm.fusex=1&mgm.mtime=0&xrd.k5ccname=KEYRING:persistent:1000&xrd.wantprot=krb5,unix&xrdcl.secgid=1000&xrdcl.secuid=1000&eos.repair=1&eos.bookingsize=0
240118 07:03:40 t=1705561420.007886 f=HandleResponseWithHosts l=ERROR tid=00007f5d7b3f5700 s=xrdclproxy:565 state=failed async open returned errmsg=[ERROR] Error response: no space left on device
After further investigation, I’m getting the following in the MGM logs:
240117 11:59:25 time=1705492765.562852 func=IdMap level=INFO logid=static.............................. unit=mgm@mgm-0.mgm.services-eos.svc.c.k3s.fsn.lama.tel:1094 tid=00007fac5c3ff700 source=Mapping:1001 tident= sec=(null) uid=99 gid=99 name=- geo="" sec.prot=krb5 sec.name="risson" sec.host="[2a01:cb10:944:3200:f40:85f0:9fec:fa6a]" sec.vorg="" sec.grps="" sec.role="" sec.info="" sec.app="fuse" sec.tident="AAAAAAAE.685048:455@[2a01:cb10:944:3200:f40:85f0:9fec:fa6a]" vid.uid=2004 vid.gid=2004 sudo=0 gateway=0
240117 11:59:25 time=1705492765.562968 func=open level=INFO logid=dc16ad9e-b52f-11ee-b4be-fe8bf25c6639 unit=mgm@mgm-0.mgm.services-eos.svc.c.k3s.fsn.lama.tel:1094 tid=00007fac5c3ff700 source=XrdMgmOfsFile:587 tident=AAAAAAAE.685048:455@[2a01:cb10:944:3200:f40:85f0:9fec:fa6a] sec=krb5 uid=2004 gid=2004 name=risson geo="" msg="access by inode" ino=ino:3560000000 path=/eos/user/risson/bench/cern/dss/eos/.git/objects/pack/.pack-4aaab988483f984e09b8b83c8fded69bd59d6859.pack.iyyhWs
240117 11:59:25 time=1705492765.563031 func=open level=INFO logid=dc16ad9e-b52f-11ee-b4be-fe8bf25c6639 unit=mgm@mgm-0.mgm.services-eos.svc.c.k3s.fsn.lama.tel:1094 tid=00007fac5c3ff700 source=XrdMgmOfsFile:1108 tident=AAAAAAAE.685048:455@[2a01:cb10:944:3200:f40:85f0:9fec:fa6a] sec=krb5 uid=2004 gid=2004 name=risson geo="" acl=1 r=1 w=1 wo=0 egroup=0 shared=0 mutable=1 facl=0
240117 11:59:25 time=1705492765.563039 func=open level=INFO logid=dc16ad9e-b52f-11ee-b4be-fe8bf25c6639 unit=mgm@mgm-0.mgm.services-eos.svc.c.k3s.fsn.lama.tel:1094 tid=00007fac5c3ff700 source=XrdMgmOfsFile:1186 tident=AAAAAAAE.685048:455@[2a01:cb10:944:3200:f40:85f0:9fec:fa6a] sec=krb5 uid=2004 gid=2004 name=risson geo="" msg="client acting as directory owner" path="/eos/user/risson/bench/cern/dss/eos/.git/objects/pack/.pack-4aaab988483f984e09b8b83c8fded69bd59d6859.pack.iyyhWs" uid="2004=>2004" gid="2004=>2004"
240117 11:59:25 time=1705492765.563161 func=open level=INFO logid=dc16ad9e-b52f-11ee-b4be-fe8bf25c6639 unit=mgm@mgm-0.mgm.services-eos.svc.c.k3s.fsn.lama.tel:1094 tid=00007fac5c3ff700 source=XrdMgmOfsFile:2247 tident=AAAAAAAE.685048:455@[2a01:cb10:944:3200:f40:85f0:9fec:fa6a] sec=krb5 uid=2004 gid=2004 name=risson geo="" msg="file-recreation due to offline/full locations" path=/eos/user/risson/bench/cern/dss/eos/.git/objects/pack/.pack-4aaab988483f984e09b8b83c8fded69bd59d6859.pack.iyyhWs retc=28
240117 11:59:25 time=1705492765.563216 func=Emsg level=ERROR logid=dc16ad9e-b52f-11ee-b4be-fe8bf25c6639 unit=mgm@mgm-0.mgm.services-eos.svc.c.k3s.fsn.lama.tel:1094 tid=00007fac5c3ff700 source=XrdMgmOfsFile:3527 tident=AAAAAAAE.685048:455@[2a01:cb10:944:3200:f40:85f0:9fec:fa6a] sec=krb5 uid=2004 gid=2004 name=risson geo="" Unable to get free physical space /eos/user/risson/bench/cern/dss/eos/.git/objects/pack/.pack-4aaab988483f984e09b8b83c8fded69bd59d6859.pack.iyyhWs; No space left on device
Trying to reproduce further, I found that running the copy with rsync --max-size 100M did not trigger the issue. It turns out that copying any file bigger than 127M fails:
# Works
size=127; dd if=/dev/urandom of=/tmp/test bs=1M count=$size; rsync -avP /tmp/test /eos/user/risson/test
# Fails with the error above
size=128; dd if=/dev/urandom of=/tmp/test bs=1M count=$size; rsync -avP /tmp/test /eos/user/risson/test
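For reference, here is a sketch of how such a size boundary could be bisected automatically (not literally what I ran; it reuses the /tmp/test and /eos/user/risson/test paths from above):
lo=1; hi=128  # 1M known good, 128M known bad (from the tests above)
while [ $((hi - lo)) -gt 1 ]; do
  mid=$(( (lo + hi) / 2 ))
  dd if=/dev/urandom of=/tmp/test bs=1M count=$mid status=none
  if rsync -aP /tmp/test /eos/user/risson/test; then
    lo=$mid  # copy succeeded, so the threshold is higher
  else
    hi=$mid  # copy failed, so the threshold is at or below this size
  fi
done
echo "largest working size: ${lo}M, smallest failing size: ${hi}M"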
Now, this only happens when using eosxd3 (or eosxd, for that matter). Running eos cp directly on the MGM node works, but I don’t see anything weird in the FUSE config I’m using (details below).
# Works fine
[root@mgm-0]# size=200; dd if=/dev/urandom of=/tmp/test bs=1M count=$size; eos cp /tmp/test /eos/user/risson/bench/test
I checked a lot of topics on this forum, and from all the troubleshooting steps I’ve found, everything looks fine:
# eos space ls
│type │ name│ groupsize│ groupmod│ N(fs)│ N(fs-rw)│ sum(usedbytes)│ sum(capacity)│ capacity(rw)│ nom.capacity│sched.capacity│ usage│ quota│ balancing│ threshold│ converter│ ntx│ active│ wfe│ ntx│ active│ intergroup│
spaceview default 0 24 1 1 20.09 GB 52.52 GB 52.52 GB 1.00 TB 32.43 GB 2.01 off off 20 off 2 0 off 1 0 off
# eos fs ls
│host │port│ id│ path│ schedgroup│ geotag│ boot│ configstatus│ drain│ usage│ active│ health│
fst-0.fst.services-eos.svc.c.k3s.fsn.lama.tel 1095 1 /fst_storage default.0 fsn::k3s booted rw nodrain 38.07 online no smartctl
# eos fusex ls
client : eosxd horse 5.2.4 online Wed, 17 Jan 2024 10:38:09 GMT 4.41 7.57 8178302a-b524-11ee-91c3-50ebf6559d17 p=685048 caps=0 fds=0 static [vacant] act mount=/eos prot=4 app=fuse
client : eosxd k3s-1.fsn.lama.tel 5.2.4 online Wed, 17 Jan 2024 08:35:40 GMT 9.23 0.70 659d0300-b513-11ee-9b54-525400f55a2a p=1212146 caps=0 fds=1 static [vacant] act mount=/eos prot=4 app=fuse
I don’t have any specific quota or recycle configuration (yet).
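Since the MGM complains about “Unable to get free physical space”, I can also share more detailed space and filesystem status if that helps; something like the following (fsid 1 taken from the eos fs ls output above) should show the scheduling-related settings such as headroom:
# Detailed status, in case a scheduling setting (e.g. headroom) makes the
# single FST look full to the scheduler; fsid 1 is from the eos fs ls output above
eos space status default
eos fs status 1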
Fuse config:
{
  "name": "eos",
  "hostport": "mgm.services-eos.svc.c.k3s.fsn.lama.tel",
  "remotemountdir": "/eos",
  "localmountdir": "/eos",
  "auth": {
    "shared-mount": 1,
    "sss": 0,
    "gsi-first": 0,
    "krb5": 1,
    "oauth2": 0
  },
  "options": {
    "hide-versions": 0
  }
}
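(The config above is valid JSON; a quick way to double-check on the mount host, python3 being just one option:)
# sanity-check that the config file is well-formed JSON
python3 -m json.tool /etc/eos/fuse.eos.conf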
Mounted with /usr/bin/eosxd3 -f -ofsname=eos.
Now, all of this is running via either Docker or Kubernetes, using gitlab-registry.cern.ch/dss/eos/eos-ci:5.2.4.
The eosxd3 instance on which this is happening is run with:
docker run --rm -it --name eos-fuse \
  -e EOS_FUSE_KERNELCACHE=0 \
  -e EOS_FUSE_CACHE=0 \
  --entrypoint="" \
  -v /etc/krb5.conf:/etc/krb5.conf:ro \
  -v /persist/var/cache/eos:/var/cache/eos \
  --mount=type=bind,source=/eos,target=/eos,bind-propagation=shared \
  -v $PWD/fuse.eos.conf:/etc/eos/fuse.eos.conf \
  -v /dev/fuse:/dev/fuse \
  -e EOS_MGM_URL=root://mgm.services-eos.svc.c.k3s.fsn.lama.tel \
  --privileged --network=host --pid=host --uts=host --ipc=host \
  gitlab-registry.cern.ch/dss/eos/eos-ci:5.2.4 \
  /usr/bin/eosxd3 -f -ofsname=eos
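As a quick liveness check of the mount, since the bind mount is shared it should also be visible on the host; something like:
# basic liveness checks for the FUSE mount, from the host
stat /eos
df -h /eos
# ... or directly inside the container (name eos-fuse from the command above)
docker exec eos-fuse df -h /eos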
If you need to look at a specific MGM/FST/other configuration, it’s all public at k8s/services/eos/app · 20049b69682a0765a70bda06d06e1f6da5199aa8 · lama-corp / Lama Corp. Infrastructure / Infrastructure · GitLab
Thanks in advance for your help!