Following the tests that triggered this issue (https://eos-community.web.cern.ch/t/fsck-settings-communication-between-managers-and-fsts), I am now looking at our EOS storage for ALICE (as a Tier2).
We have 2.65 PB of storage, 2.17 PB used (81%), spread over 84 filesystems.
I will examine all filesystems, but for now I am looking at the first one.
An important note is that we are not doing RAIN: each filesystem corresponds to a partition in a (big) RAID6 volume, and we do not do replication. I suspect this will limit our ability to repair errors.
[root@naneosmgr02(EOSMASTER) ~]# eos fsck report -a | grep 'fsid=27'
timestamp=1621995118 fsid=27 tag="m_mem_sz_diff" count=1
timestamp=1621995118 fsid=27 tag="orphans_n" count=1
timestamp=1621995118 fsid=27 tag="rep_diff_n" count=315
timestamp=1621995118 fsid=27 tag="rep_missing_n" count=5702
timestamp=1621995118 fsid=27 tag="unreg_n" count=677
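As far as I can tell from the help, the report can also print the individual files behind each counter (the -i flag should add the file id and -l the logical path; I am not 100% sure of the exact syntax on our version, so take this as a guess):

[root@naneosmgr02(EOSMASTER) ~]# eos fsck report -a -i -l | grep 'fsid=27'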
The total number of entries in the local FST DB (LocalDB) for this filesystem is: 797951.
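(For completeness: I obtained this number by inspecting the LevelDB directly on the FST with something like the command below; the db path and the option names are from memory, so adjust as needed. "fst-host" is a placeholder for the actual FST.)

[root@fst-host ~]# eos-leveldb-inspect --dbpath /var/eos/md/fmd.0027.LevelDB --dump_entries | wc -l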
I looked at each case, one tag at a time, and here is what I found:
m_mem_sz_diff (count=1): the file looks OK on disk, but it has a 0 size and no checksum in the NS. How can I force the MGM to resynchronize? Shouldn't this be done automatically?
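If I read the docs correctly, a verify with commit flags should re-read the file on the FST and push the size/checksum back into the namespace, something like this (with the real path of course; I have not dared to run it yet, so please correct me if the flags are wrong):

[root@naneosmgr02(EOSMASTER) ~]# eos file verify /eos/path/to/the/file -checksum -commitchecksum -commitsize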
orphans_n (count=1): the file is in the .eosorphans directory. I suppose it is enough to simply delete it? Will the entry then disappear from the LocalDB?
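Before deleting it I would at least check that the namespace really has no trace of this file id (the hex fid from the .eosorphans file name, if I am not mistaken; the fxid below is just an example, not the real one):

[root@naneosmgr02(EOSMASTER) ~]# eos file info fxid:0017a6e2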
unreg_n (count=677): I looked at one file and found that it is actually on another filesystem. It does not exist on disk on the filesystem we are examining, but it still has an entry in the LocalDB… What should be done here?
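If the stale LocalDB entry is indeed the only problem, what I am considering is to force a resync of the local DB when booting the filesystem; if I understand the CLI correctly this would be (untested on my side):

[root@naneosmgr02(EOSMASTER) ~]# eos fs boot 27 --syncmdb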
rep_diff_n (count=315): for the file that I looked at, there are 2 replicas instead of one. Both copies on disk have the correct checksum. How should this be dealt with?
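My naive guess would be to let EOS restore the nominal layout, or to drop the extra replica by hand, assuming I am reading the CLI help correctly (fsid 42 is just a placeholder for the filesystem holding the extra copy):

[root@naneosmgr02(EOSMASTER) ~]# eos file adjustreplica /eos/path/to/the/file
[root@naneosmgr02(EOSMASTER) ~]# eos file drop /eos/path/to/the/file 42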
rep_missing_n (count=5702): this is roughly 0.7% of the files. I looked at the first one and found that it is on another filesystem on the same FST. The problem is that it is still in the LocalDB of the filesystem we are examining (but not on disk).
This is not necessarily the same situation for all 5702 files, and I cannot look at them one by one. So what can be done here?
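The only idea I have for a first triage is to dump the flagged fids and ask the MGM about each of them, along these lines (the fxid=… token in the report output is an assumption on my side, so the parsing may need adjusting):

# collect the hex fids reported as rep_missing_n for fsid 27
eos fsck report -a -i | grep 'fsid=27' | grep 'tag="rep_missing_n"' \
  | grep -o 'fxid=[0-9a-fA-F]*' | cut -d= -f2 | sort -u > fids_fs27.txt

# ask the namespace where each file really lives
while read fxid; do
  echo "== ${fxid} =="
  eos file info "fxid:${fxid}"
done < fids_fs27.txt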