Dear @apeters,
Following our discussion at the recent EOS Workshop about files that appear correctly written to the user but turn out to be corrupted (which usually means incomplete, i.e. either 0 bytes or shorter than the expected length), I would like to share some logs with you so we can understand how to avoid this.
The fusex client logs something like this:
240321 07:46:46 t=1711003606.852129 f=HandleResponse l=CRIT tid=00007f08cc7f4700 s=xrdclproxy:914 write error '[ERROR] Operation expired'
---- high rate error messages suppressed ----
240321 07:46:50 t=1711003610.854069 f=HandleResponseWithHosts l=ERROR tid=00007f08cb7f2700 s=xrdclproxy:559 state=failed async open returned errmsg=[ERROR] Operation expired
240321 07:46:50 t=1711003610.854302 f=TryRecovery l=CRIT ino:80000000a7ea552c s=data:1103 triggering write recovery state = 5
240321 07:46:50 t=1711003610.857345 f=recover_write l=CRIT ino:80000000a7ea552c s=data:1769 failed to open local stagefile
240321 07:46:50 t=1711003610.857399 f=flush_nolock l=ERROR ino:80000000a7ea552c s=data:450 remote open failed - returning 121
240321 07:46:50 t=1711003610.857464 f=flush l=WARN tid=00007f08977ff700 s=eosfuse:5373 invalidating ino=0x80000000a7ea552c after flush error
240321 07:46:50 t=1711003610.857527 f=TryRecovery l=CRIT ino:80000000a7ea5520 s=data:1103 triggering write recovery state = 3
240321 07:46:50 t=1711003610.859791 f=recover_write l=CRIT ino:80000000a7ea5520 s=data:1769 failed to open local stagefile
240321 07:46:50 t=1711003610.860427 f=ioflush l=WARN tid=00007f08de7f3700 s=data:3323 re-issuing OpenAsync request after timeout - ino:80000000a7ea552c err-code:206
# -------------------
# - recovery record -
# -------------------
# path := '/MSK_DETFOO_B01.jp2'
# fid := 2817152300
# -[ 000 ] 240321 07:46:50 t=1711003610.854256 f=flush_nolock l= ino:80000000a7ea552c s=data:444 status='XrdCl::Proxy::FAILED' hint='will TryRecovery'
# -[ 001 ] 240321 07:46:50 t=1711003610.854329 f=recover_write l= ino:80000000a7ea552c s=data:1671 hint='recover write'
# -[ 002 ] 240321 07:46:50 t=1711003610.854343 f=recover_write l= ino:80000000a7ea552c s=data:1708 hint='recover from file cache'
# -[ 003 ] 240321 07:46:50 t=1711003610.857367 f=recover_write l= ino:80000000a7ea552c s=data:1774 hint='failed to open local stagefile'
# -[ 004 ] 240321 07:46:50 t=1711003610.857393 f=flush_nolock l= ino:80000000a7ea552c s=data:449 errno='121' hint='failed TryRecovery'
240321 07:47:02 t=1711003622.193859 f=ioflush l=WARN tid=00007f08de7f3700 s=data:3334 OpenAsync failed - trying recovery - ino:80000000a7ea5520 err-code:0
240321 07:47:02 t=1711003622.193908 f=ioflush l=WARN tid=00007f08de7f3700 s=data:3360 giving up OpenAsync request - ino:80000000a7ea5520 err-code:0
240321 07:47:02 t=1711003622.194129 f=ioflush l=CRIT tid=00007f08de7f3700 s=data:3389 ino:80000000a7ea5520 msg=file writing failed file-recovery= journal-recovery=/scratch_hdd/fusex_cache/journal/srs_downloader_Sentinel2_46/00A/80000000A7EA5520.jc.recover
# -------------------
# - recovery record -
# -------------------
# path := '/T36NWM_20240319T075631_B07.jp2'
# fid := 2817152288
# -[ 000 ] 240321 07:46:50 t=1711003610.857548 f=recover_write l= ino:80000000a7ea5520 s=data:1671 hint='recover write'
# -[ 001 ] 240321 07:46:50 t=1711003610.857559 f=recover_write l= ino:80000000a7ea5520 s=data:1708 hint='recover from file cache'
# -[ 002 ] 240321 07:46:50 t=1711003610.859815 f=recover_write l= ino:80000000a7ea5520 s=data:1774 hint='failed to open local stagefile'
240321 07:49:26 t=1711003766.917928 f=HandleResponse l=CRIT tid=00007f08cc7f4700 s=xrdclproxy:914 write error '[ERROR] Operation expired'
---- high rate error messages suppressed ----
240321 07:49:27 t=1711003767.106613 f=TryRecovery l=CRIT ino:80000000a7ea5612 s=data:1103 triggering write recovery state = 3
240321 07:49:27 t=1711003767.108746 f=recover_write l=CRIT ino:80000000a7ea5612 s=data:1769 failed to open local stagefile
240321 07:49:27 t=1711003767.249698 f=ioflush l=WARN tid=00007f08de7f3700 s=data:3334 OpenAsync failed - trying recovery - ino:80000000a7ea5612 err-code:0
240321 07:49:27 t=1711003767.249724 f=ioflush l=WARN tid=00007f08de7f3700 s=data:3360 giving up OpenAsync request - ino:80000000a7ea5612 err-code:0
240321 07:49:27 t=1711003767.249866 f=ioflush l=CRIT tid=00007f08de7f3700 s=data:3389 ino:80000000a7ea5612 msg=file writing failed file-recovery= journal-recovery=/scratch_hdd/fusex_cache/journal/srs_downloader_Sentinel2_46/00A/80000000A7EA5612.jc.recover
# -------------------
# - recovery record -
# -------------------
# path := ''
# fid := 2817152530
# -[ 000 ] 240321 07:49:27 t=1711003767.106633 f=recover_write l= ino:80000000a7ea5612 s=data:1671 hint='recover write'
# -[ 001 ] 240321 07:49:27 t=1711003767.106651 f=recover_write l= ino:80000000a7ea5612 s=data:1708 hint='recover from file cache'
# -[ 002 ] 240321 07:49:27 t=1711003767.108798 f=recover_write l= ino:80000000a7ea5612 s=data:1774 hint='failed to open local stagefile'
From the above log, one file ends up incorrect: a7ea552c has size 0, whereas the correct size after re-downloading is 15783 bytes.
# eos file info fxid:a7ea552c
File: '/eos/jeodpp/proc/recycle/uid:47000/2024/03/21/0/....#:#MSK_DETFOO_B01.jp2.00000000a7ea552c' Flags: 0644 Clock: 17bebae36ade04d5
Size: 0
Status: fuse::missingcommmits
Modify: Thu Mar 21 07:45:50 2024 Timestamp: 1711003550.107999647
Change: Thu Mar 21 09:39:54 2024 Timestamp: 1711010394.360592784
Access: Thu Mar 21 07:45:50 2024 Timestamp: 1711003550.107999647
Birth: Thu Mar 21 07:45:50 2024 Timestamp: 1711003550.107999647
# path="/eos/jeodpp/proc/recycle/uid:47000/2024/03/21/0/...#:#MSK_DETFOO_B01.jp2.00000000a7ea552c" fxid="a7ea552c" size="0" nrep="2" checksumtype="adler" checksum="0000000100000000000000000000000000000000000000000000000000000000"
nrep="00" fsid="2963" host="fst-018.cidsn.jrc.it:1095" fstpath="/data41/00044c73/a7ea552c" size="0" statsize="0" checksum="0000000100000000000000000000000000000000000000000000000000000000" diskchecksum="0000000000000000000000000000000000000000000000000000000000000000" error_label="none"
nrep="01" fsid="1043" host="s-jrciprjeos234p.cidsn.jrc.it:1095" fstpath="/data41/00044c73/a7ea552c" size="0" statsize="0" checksum="0000000100000000000000000000000000000000000000000000000000000000" diskchecksum="0000000000000000000000000000000000000000000000000000000000000000" error_label="none"
# eos attr ls fxid:a7ea552c
sys.eos.btime="1711003550.107999647"
sys.fusex.state="Cs|c|"
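Incidentally, the stored checksum (the "00000001" prefix in the checksum fields above) is consistent with the replicas being genuinely empty: 1 is the adler-32 value of zero bytes of input. A quick check with plain Python, nothing EOS-specific:

```python
import zlib

# adler-32 of an empty byte string is 1; formatted as 8 hex digits this is
# exactly the "00000001" prefix reported in the checksum fields above,
# i.e. both replicas really contain 0 bytes of data
print(f"{zlib.adler32(b''):08x}")  # -> 00000001
```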
This happens on one specific host that continuously downloads many products. The products are shipped as zip files, which are copied locally, extracted, and then copied to EOS with rsync; rsync's exit code is checked before the local data is removed. However, from time to time something seems to go wrong in the client, or on the FST, and a file is not written correctly. The process uses a per-user eosxd mount. The failure rate is quite low, around 3 files out of 10000, which our sanity check flags and which then have to be re-downloaded. Still, it would be good to be sure that a file has actually been written correctly whenever the OS reports no error, without having to check afterwards.
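As a stopgap until the flush error is propagated reliably, the sanity check can be folded into the copy step itself. A minimal sketch (a hypothetical helper, not our actual script): after rsync exits 0, re-stat every file on the eosxd mount and compare its size against the local copy before deleting anything:

```python
import os

def verify_tree(src: str, dst: str) -> list[str]:
    """Return relative paths whose copy under dst is missing or has a
    different size than the original under src (e.g. a 0-byte file left
    behind by a failed fusex flush)."""
    bad = []
    for root, _dirs, files in os.walk(src):
        for name in files:
            spath = os.path.join(root, name)
            rel = os.path.relpath(spath, src)
            try:
                dsize = os.stat(os.path.join(dst, rel)).st_size
            except FileNotFoundError:
                dsize = -1  # missing on the mount counts as bad
            if os.stat(spath).st_size != dsize:
                bad.append(rel)
    return sorted(bad)

# usage sketch: only remove the local data when the tree verifies cleanly
#   subprocess.run(["rsync", "-a", f"{src}/", f"{dst}/"], check=True)
#   assert not verify_tree(src, dst)
```

A size comparison catches exactly the failure mode above (0-size or short files); a checksum comparison would catch more, but at the cost of re-reading everything through the fuse mount.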
The version in use is 5.1.30. The same seems to have happened with the previous version, 5.1.21, but the error message was slightly different: 'unable to read file for recovery from local cache file' instead of 'failed to open local stagefile'.
We are planning to upgrade this host to a newer 5.2.x version; we will see whether this still happens.