Good morning,
We have since some time regular, but not very frequent, unexplained issues with some of our data visualization tools randomly reporting corrupted files when reading them with eosd or eosxd clients.
We had a recent case which allowed us to analyze more deeply, and it appeared that on this node and this mountpoint (eosx 4.5.15), some files were reporting a corrupted content (checkum different from the original file on eos). We understood that this content was basically from the os cache : having forced to empty the os cache, without restarting the client, the files became correct.
We found out that these files had a corresponding messages in the fusex log, looking like :
200130 15:15:22 t=1580393722.676843 f=recover_ropen l=WARN ino:800000004fe32fce s=data:984 recover read-open [1]
200130 15:15:22 t=1580393722.676884 f=recover_ropen l=WARN ino:800000004fe32fce s=data:1008 recover reopening file for read
200130 15:15:22 t=1580393722.676992 f=recover_ropen l=WARN ino:800000004fe32fce s=data:1023 applying exclusion list: tried=s-jrciprjeos237p.cidsn.jrc.it,
200130 15:15:22 t=1580393722.680771 f=recover_ropen l=WARN ino:800000004fe32fce s=data:1061 recover reopened file successfully
200130 15:15:22 t=1580393722.680834 f=recover_read l=WARN ino:800000004fe32fce s=data:1418 recover reopened file successfully to re-read
200130 15:41:45 t=1580395305.007258 f=worker l=WARN tid=00007f6d2c7fb700 s=EnvironmentReader:144 Reading /proc/22064/environ took 6ms (uid=0)
# -------------------
# - recovery record -
# -------------------
# path := '/eos/jeodpp/path/to/file/RO_S2_201910.tif'
# fid := 1340288974
# -[ 000 ] 200130 15:15:22 t=1580393722.676812 f=peek_pread l= ino:800000004fe32fce s=data:2299 status='[ERROR] Invalid session' hint='will TryRecovery'
# -[ 001 ] 200130 15:15:22 t=1580393722.676832 f=recover_read l= ino:800000004fe32fce s=data:1404 hint='recover read'
# -[ 002 ] 200130 15:15:22 t=1580393722.676839 f=recover_ropen l= ino:800000004fe32fce s=data:982 hint='recover read-open'
# -[ 003 ] 200130 15:15:22 t=1580393722.680848 f=peek_pread l= ino:800000004fe32fce s=data:2311 status='[ERROR] Invalid session' hint='success TryRecovery'
This node had 50 files concerned by such recovery records since the eosxd daemon start (3 months before), 4 of them were “corrupted” in memory, the other ones were sane.
What can these recovery events be sign of ? It seems that in some cases, OS cache is polluted with incorrect data, can this be explained, or could it be a bug ? Does someone observe also such situation ? Are there some known way to avoid it ? Is there a way to force the eosx client to invalidate all, or part of system cache ?