FUSEX: Cases of incompletely written files after successful copy

Dear @apeters,

Following our discussion at the recent EOS Workshop about files that are seen as correctly written by the user but turn out to be corrupted (usually meaning incomplete, i.e. either zero size or smaller than the expected length), I want to share some logs with you so we can understand how to avoid this.

The fusex client writes something like this:

240321 07:46:46 t=1711003606.852129 f=HandleResponse   l=CRIT  tid=00007f08cc7f4700 s=xrdclproxy:914           write error '[ERROR] Operation expired'
                 ---- high rate error messages suppressed ----
240321 07:46:50 t=1711003610.854069 f=HandleResponseWithHosts l=ERROR tid=00007f08cb7f2700 s=xrdclproxy:559           state=failed async open returned errmsg=[ERROR] Operation expired
240321 07:46:50 t=1711003610.854302 f=TryRecovery      l=CRIT  ino:80000000a7ea552c s=data:1103                triggering write recovery state = 5
240321 07:46:50 t=1711003610.857345 f=recover_write    l=CRIT  ino:80000000a7ea552c s=data:1769                failed to open local stagefile 
240321 07:46:50 t=1711003610.857399 f=flush_nolock     l=ERROR ino:80000000a7ea552c s=data:450                 remote open failed - returning 121
240321 07:46:50 t=1711003610.857464 f=flush            l=WARN  tid=00007f08977ff700 s=eosfuse:5373             invalidating ino=0x80000000a7ea552c after flush error
240321 07:46:50 t=1711003610.857527 f=TryRecovery      l=CRIT  ino:80000000a7ea5520 s=data:1103                triggering write recovery state = 3
240321 07:46:50 t=1711003610.859791 f=recover_write    l=CRIT  ino:80000000a7ea5520 s=data:1769                failed to open local stagefile 
240321 07:46:50 t=1711003610.860427 f=ioflush          l=WARN  tid=00007f08de7f3700 s=data:3323                re-issuing OpenAsync request after timeout - ino:80000000a7ea552c err-code:206
#      -------------------
#      - recovery record -
#      -------------------
#        path := '/MSK_DETFOO_B01.jp2'
#        fid  := 2817152300
#        -[ 000 ] 240321 07:46:50 t=1711003610.854256 f=flush_nolock     l= ino:80000000a7ea552c s=data:444                 status='XrdCl::Proxy::FAILED' hint='will TryRecovery'
#        -[ 001 ] 240321 07:46:50 t=1711003610.854329 f=recover_write    l= ino:80000000a7ea552c s=data:1671                hint='recover write'
#        -[ 002 ] 240321 07:46:50 t=1711003610.854343 f=recover_write    l= ino:80000000a7ea552c s=data:1708                hint='recover from file cache'
#        -[ 003 ] 240321 07:46:50 t=1711003610.857367 f=recover_write    l= ino:80000000a7ea552c s=data:1774                hint='failed to open local stagefile'
#        -[ 004 ] 240321 07:46:50 t=1711003610.857393 f=flush_nolock     l= ino:80000000a7ea552c s=data:449                 errno='121' hint='failed TryRecovery'

240321 07:47:02 t=1711003622.193859 f=ioflush          l=WARN  tid=00007f08de7f3700 s=data:3334                OpenAsync failed - trying recovery - ino:80000000a7ea5520 err-code:0
240321 07:47:02 t=1711003622.193908 f=ioflush          l=WARN  tid=00007f08de7f3700 s=data:3360                giving up OpenAsync request - ino:80000000a7ea5520 err-code:0
240321 07:47:02 t=1711003622.194129 f=ioflush          l=CRIT  tid=00007f08de7f3700 s=data:3389                ino:80000000a7ea5520 msg=file writing failed file-recovery= journal-recovery=/scratch_hdd/fusex_cache/journal/srs_downloader_Sentinel2_46/00A/80000000A7EA5520.jc.recover
#      -------------------
#      - recovery record -
#      -------------------
#        path := '/T36NWM_20240319T075631_B07.jp2'
#        fid  := 2817152288
#        -[ 000 ] 240321 07:46:50 t=1711003610.857548 f=recover_write    l= ino:80000000a7ea5520 s=data:1671                hint='recover write'
#        -[ 001 ] 240321 07:46:50 t=1711003610.857559 f=recover_write    l= ino:80000000a7ea5520 s=data:1708                hint='recover from file cache'
#        -[ 002 ] 240321 07:46:50 t=1711003610.859815 f=recover_write    l= ino:80000000a7ea5520 s=data:1774                hint='failed to open local stagefile'

240321 07:49:26 t=1711003766.917928 f=HandleResponse   l=CRIT  tid=00007f08cc7f4700 s=xrdclproxy:914           write error '[ERROR] Operation expired'
                 ---- high rate error messages suppressed ----
240321 07:49:27 t=1711003767.106613 f=TryRecovery      l=CRIT  ino:80000000a7ea5612 s=data:1103                triggering write recovery state = 3
240321 07:49:27 t=1711003767.108746 f=recover_write    l=CRIT  ino:80000000a7ea5612 s=data:1769                failed to open local stagefile 
240321 07:49:27 t=1711003767.249698 f=ioflush          l=WARN  tid=00007f08de7f3700 s=data:3334                OpenAsync failed - trying recovery - ino:80000000a7ea5612 err-code:0
240321 07:49:27 t=1711003767.249724 f=ioflush          l=WARN  tid=00007f08de7f3700 s=data:3360                giving up OpenAsync request - ino:80000000a7ea5612 err-code:0
240321 07:49:27 t=1711003767.249866 f=ioflush          l=CRIT  tid=00007f08de7f3700 s=data:3389                ino:80000000a7ea5612 msg=file writing failed file-recovery= journal-recovery=/scratch_hdd/fusex_cache/journal/srs_downloader_Sentinel2_46/00A/80000000A7EA5612.jc.recover
#      -------------------
#      - recovery record -
#      -------------------
#        path := ''
#        fid  := 2817152530
#        -[ 000 ] 240321 07:49:27 t=1711003767.106633 f=recover_write    l= ino:80000000a7ea5612 s=data:1671                hint='recover write'
#        -[ 001 ] 240321 07:49:27 t=1711003767.106651 f=recover_write    l= ino:80000000a7ea5612 s=data:1708                hint='recover from file cache'
#        -[ 002 ] 240321 07:49:27 t=1711003767.108798 f=recover_write    l= ino:80000000a7ea5612 s=data:1774                hint='failed to open local stagefile'

From the above log, one file ends up incorrect: a7ea552c has size 0, whereas the correct size after re-download is 15783.

# eos file info fxid:a7ea552c
  File: '/eos/jeodpp/proc/recycle/uid:47000/2024/03/21/0/....#:#MSK_DETFOO_B01.jp2.00000000a7ea552c'  Flags: 0644  Clock: 17bebae36ade04d5
  Size: 0
Status: fuse::missingcommmits
Modify: Thu Mar 21 07:45:50 2024 Timestamp: 1711003550.107999647
Change: Thu Mar 21 09:39:54 2024 Timestamp: 1711010394.360592784
Access: Thu Mar 21 07:45:50 2024 Timestamp: 1711003550.107999647
 Birth: Thu Mar 21 07:45:50 2024 Timestamp: 1711003550.107999647
# path="/eos/jeodpp/proc/recycle/uid:47000/2024/03/21/0/...#:#MSK_DETFOO_B01.jp2.00000000a7ea552c" fxid="a7ea552c" size="0" nrep="2" checksumtype="adler" checksum="0000000100000000000000000000000000000000000000000000000000000000"
nrep="00" fsid="2963" host="fst-018.cidsn.jrc.it:1095" fstpath="/data41/00044c73/a7ea552c" size="0" statsize="0" checksum="0000000100000000000000000000000000000000000000000000000000000000" diskchecksum="0000000000000000000000000000000000000000000000000000000000000000" error_label="none"
nrep="01" fsid="1043" host="s-jrciprjeos234p.cidsn.jrc.it:1095" fstpath="/data41/00044c73/a7ea552c" size="0" statsize="0" checksum="0000000100000000000000000000000000000000000000000000000000000000" diskchecksum="0000000000000000000000000000000000000000000000000000000000000000" error_label="none"

# eos attr ls fxid:a7ea552c
sys.eos.btime="1711003550.107999647"
sys.fusex.state="Cs|c|"

This happens on one specific host that continuously downloads many products. They are shipped as zip files, which are copied locally, extracted locally, and then copied to EOS with rsync, whose exit code is checked before the local data is removed. However, from time to time something seems to go wrong in the client, or with the FST, and the file is not written correctly. The process uses a user eosxd mount. The failure rate is quite low, around 3 files out of 10000, which the subsequent sanity check flags and which then have to be re-downloaded. But it would be good to be sure that a file is actually written correctly when the OS reports no error, without having to check afterwards.
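For context, the copy and sanity-check step is conceptually nothing more than an rsync into the eosxd mount followed by a size comparison against the local extraction directory. A minimal sketch of what we mean (paths and directory names below are placeholders, not our actual production script):

# Sketch of the copy + sanity-check step; SRC and DST are hypothetical paths.
SRC=/scratch/extracted_product        # local extraction directory (placeholder)
DST=/eos/some/target/dir              # target directory on the eosxd mount (placeholder)

rsync -a "$SRC/" "$DST/" || exit 1    # exit code is already checked today

# afterwards, compare sizes to catch files that ended up empty or truncated
find "$SRC" -type f | while read -r f; do
  rel=${f#"$SRC/"}
  src_size=$(stat -c %s "$f")
  dst_size=$(stat -c %s "$DST/$rel" 2>/dev/null || echo missing)
  [ "$src_size" = "$dst_size" ] || echo "MISMATCH $rel local=$src_size eos=$dst_size"
done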

The version in use is 5.1.30. It seems the same was already happening with the previous version, 5.1.21, but the error message was slightly different: 'unable to read file for recovery from local cache file' instead of 'failed to open local stagefile'.

We are planning to upgrade this host to a newer 5.2.x version; we'll see whether this still happens.