Fusex read errors remain in cache

franck-jrc · March 9, 2020, 10:17am

Good morning,

We have since some time regular, but not very frequent, unexplained issues with some of our data visualization tools randomly reporting corrupted files when reading them with eosd or eosxd clients.

We had a recent case which allowed us to analyze more deeply, and it appeared that on this node and this mountpoint (eosx 4.5.15), some files were reporting a corrupted content (checkum different from the original file on eos). We understood that this content was basically from the os cache : having forced to empty the os cache, without restarting the client, the files became correct.

We found out that these files had a corresponding messages in the fusex log, looking like :

200130 15:15:22 t=1580393722.676843 f=recover_ropen    l=WARN  ino:800000004fe32fce s=data:984                 recover read-open [1]
200130 15:15:22 t=1580393722.676884 f=recover_ropen    l=WARN  ino:800000004fe32fce s=data:1008                recover reopening file for read
200130 15:15:22 t=1580393722.676992 f=recover_ropen    l=WARN  ino:800000004fe32fce s=data:1023                applying exclusion list: tried=s-jrciprjeos237p.cidsn.jrc.it,
200130 15:15:22 t=1580393722.680771 f=recover_ropen    l=WARN  ino:800000004fe32fce s=data:1061                recover reopened file successfully
200130 15:15:22 t=1580393722.680834 f=recover_read     l=WARN  ino:800000004fe32fce s=data:1418                recover reopened file successfully to re-read
200130 15:41:45 t=1580395305.007258 f=worker           l=WARN  tid=00007f6d2c7fb700 s=EnvironmentReader:144    Reading /proc/22064/environ took 6ms (uid=0)
#      -------------------
#      - recovery record -
#      -------------------
#        path := '/eos/jeodpp/path/to/file/RO_S2_201910.tif'
#        fid  := 1340288974
#        -[ 000 ] 200130 15:15:22 t=1580393722.676812 f=peek_pread       l= ino:800000004fe32fce s=data:2299                status='[ERROR] Invalid session' hint='will TryRecovery'
#        -[ 001 ] 200130 15:15:22 t=1580393722.676832 f=recover_read     l= ino:800000004fe32fce s=data:1404                hint='recover read'
#        -[ 002 ] 200130 15:15:22 t=1580393722.676839 f=recover_ropen    l= ino:800000004fe32fce s=data:982                 hint='recover read-open'
#        -[ 003 ] 200130 15:15:22 t=1580393722.680848 f=peek_pread       l= ino:800000004fe32fce s=data:2311                status='[ERROR] Invalid session' hint='success TryRecovery'

This node had 50 files concerned by such recovery records since the eosxd daemon start (3 months before), 4 of them were “corrupted” in memory, the other ones were sane.

What can these recovery events be sign of ? It seems that in some cases, OS cache is polluted with incorrect data, can this be explained, or could it be a bug ? Does someone observe also such situation ? Are there some known way to avoid it ? Is there a way to force the eosx client to invalidate all, or part of system cache ?

CERN Accelerating science

Fusex read errors remain in cache