Dear Elvin,
Thank you for taking a look and for your answer. Understood that the Read-Only filesystem error is expected, but it is indeed strange that the file is not repaired by fsck. Both collect and repair are enabled on our instance (though they are currently busy trying to repair our leftover files, which cannot be fixed; we still have to process them and decide what to do with them!)
It appears that the same thing just happened again in the last few days on another file. The good news is that this case is easy to analyse, because only one host accesses the file, daily, via a scheduled cron job. The cron job (user code, in Python, using standard libraries as far as I can tell) simply downloads some CSV data and processes it. The CSV file is opened in update mode, i.e. the file id never changes, but the file is fully rewritten with the newly downloaded data. It is at that download step that our issue occurs.
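The exact user code is not included here, but a minimal sketch of the rewrite-in-place pattern described above (hypothetical file name and data; the point is that opening with 'r+' keeps the same file id, unlike a create-and-replace) would look like:

```python
import os

def rewrite_in_place(path: str, new_data: str) -> None:
    """Rewrite an existing file with fresh data without replacing it,
    so the file id (inode on the fuse mount) stays the same."""
    # 'r+' opens the existing file for update: no new file is created,
    # hence the file id never changes.
    with open(path, "r+") as f:
        f.seek(0)
        f.write(new_data)
        f.truncate()  # drop leftover bytes if the new data is shorter

# Demo: the inode is preserved across the rewrite.
path = "/tmp/SE_timeseries_demo.csv"  # hypothetical local path
with open(path, "w") as f:
    f.write("old,data\n1,2\n")
ino_before = os.stat(path).st_ino
rewrite_in_place(path, "new,data\n3,4\n5,6\n")
ino_after = os.stat(path).st_ino
print(ino_before == ino_after)  # same file id, contents fully replaced
```

This is exactly the access pattern where the write recovery shown in the logs below kicks in: the client holds the file open for update while the full content is replaced.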
I have gathered all the logs for this case and can send them to you in full if you need them.
As a summary: the update of this file two days ago caused a synchronisation error on one FST replica for fxid=9c4f645b, reported as "resulted in a different file size on fsid=1000 - 1417514 vs 16384 - rejecting replica". The fuse client also has log entries for this file at that time:
240319 10:22:48 t=1710840168.104114 f=HandleResponse l=CRIT tid=00007ff455ff6700 s=xrdclproxy:914 write error '[ERROR] Error response: no such device'
240319 10:22:48 t=1710840168.214246 f=TryRecovery l=CRIT ino:800000009c4f645b s=data:1103 triggering write recovery state = 3
240319 10:22:48 t=1710840168.220821 f=try_ropen l=WARN ino:800000009c4f645b s=data:1339 recover read-open succesfull
240319 10:22:48 t=1710840168.220843 f=recover_write l=CRIT ino:800000009c4f645b s=data:2011 no local cache data for recovery
240319 10:22:48 t=1710840168.221020 f=ioflush l=WARN tid=00007ff464bf3700 s=data:3323 re-issuing OpenAsync request after timeout - ino:800000009c4f6477 err-code:206
240319 10:22:48 t=1710840168.349872 f=ioflush l=WARN tid=00007ff464bf3700 s=data:3334 OpenAsync failed - trying recovery - ino:800000009c4f645b err-code:0
240319 10:22:48 t=1710840168.349890 f=ioflush l=WARN tid=00007ff464bf3700 s=data:3360 giving up OpenAsync request - ino:800000009c4f645b err-code:0
240319 10:22:48 t=1710840168.350058 f=ioflush l=CRIT tid=00007ff464bf3700 s=data:3389 ino:800000009c4f645b msg=file writing failed file-recovery= journal-recovery=/var/cache/eos/fusex/cache/jeodpp/009/800000009C4F645B.jc.recover
# -------------------
# - recovery record -
# -------------------
# path := '/eos/jeodpp/.../..._timeseries.csv'
# fid := 2622448731
# -[ 000 ] 240319 10:22:48 t=1710840168.214278 f=recover_write l= ino:800000009c4f645b s=data:1671 hint='recover write'
# -[ 001 ] 240319 10:22:48 t=1710840168.214297 f=recover_write l= ino:800000009c4f645b s=data:1728 hint='recover from remote file'
# -[ 002 ] 240319 10:22:48 t=1710840168.214430 f=try_ropen l= ino:800000009c4f645b s=data:1325 hint='try read-open'
# -[ 003 ] 240319 10:22:48 t=1710840168.220857 f=recover_write l= ino:800000009c4f645b s=data:2016 hint='no local cache data for recovery'
So the file was left with only one replica. Note that the size 1417514 is the current size of the remaining replica. The file itself is valid and seems complete, the metadata match between the FST and the MGM, and a simple eos file adjustreplica command fixes it.
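For reference, the repair amounted to the following (assuming the usual fxid: addressing of the EOS CLI; the fxid is the one from the logs above):

```shell
# Inspect replica layout and metadata for the affected file
eos fileinfo fxid:9c4f645b

# Re-create the missing replica so the file is back to its nominal copy count
eos file adjustreplica fxid:9c4f645b
```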
The next runs of the process (yesterday and today) failed on this file with the Read-only file system error. However, the modification timestamp was today's. Some client-side log is also available:
240321 10:22:18 t=1711012938.179949 f=HandleResponseWithHosts l=ERROR tid=00007ff457ffa700 s=xrdclproxy:559 state=failed async open returned errmsg=[ERROR] Error response: read-only file system
240321 10:22:18 t=1711012938.180113 f=flush_nolock l=ERROR ino:800000009c4f645b s=data:363 file not open
240321 10:22:18 t=1711012938.180155 f=flush_nolock l=ERROR ino:800000009c4f645b s=data:373 write error error=[ERROR] Server responded with an error: [3025] Unable to access quota space /eos/jeodpp/data/projects/ARCTIC/crottil/ArcticPS/automatic_procedure/EEA/Data_meteo/SE_5_31124_2022_timeseries.csv; Read-only file system
240321 10:22:18 t=1711012938.180179 f=TryRecovery l=ERROR ino:800000009c4f645b s=data:1090 unrecoverable error - code=400 errNo=3025
240321 10:22:18 t=1711012938.180197 f=flush_nolock l=ERROR ino:800000009c4f645b s=data:405 journal-flushing recovery failed rc=30
240321 10:22:18 t=1711012938.181996 f=pwrite l=ERROR ino:800000009c4f645b s=data:2356 unrecoverable error - code=400 errNo=3025
240321 10:22:18 t=1711012938.182053 f=write l=ERROR tid=00007ff4263ff700 s=eosfuse:5015 io-error: inode=-9223372034232327077 size=8192 off=0 buf=140687089623120 errno=30
240321 10:22:18 t=1711012938.278968 f=ioflush l=WARN tid=00007ff464bf3700 s=data:3334 OpenAsync failed - trying recovery - ino:800000009c4f645b err-code:400
240321 10:22:18 t=1711012938.278987 f=ioflush l=WARN tid=00007ff464bf3700 s=data:3360 giving up OpenAsync request - ino:800000009c4f645b err-code:400
240321 10:22:18 t=1711012938.279223 f=ioflush l=CRIT tid=00007ff464bf3700 s=data:3389 ino:800000009c4f645b msg=file open failed file-recovery= journal-recovery=/var/cache/eos/fusex/cache/jeodpp/009/800000009C4F645B.jc.recover
# -------------------
# - recovery record -
# -------------------
# path := '/eos/jeodpp/.../..._timeseries.csv'
# fid := 2622448731
# -[ 000 ] 240321 10:22:18 t=1711012938.180139 f=flush_nolock l= ino:800000009c4f645b s=data:371 status='[ERROR] Error response: read-only file system' hint='will TryRecovery'
# -[ 001 ] 240321 10:22:18 t=1711012938.180190 f=flush_nolock l= ino:800000009c4f645b s=data:404 errno='30' hint='failed TryRecovery'
But the file id doesn't appear in any fsck category. Maybe the reason is that our scaninterval is high and won't trigger within one day? But my understanding was that modifying a file resets the scan timer, so the file should be checked on the next pass (which now occurs every 2 hours).
Versions: MGM 5.1.22, FST 5.1.20, EOSX 5.1.30.