FSCK: false orphan files reported


We still have some puzzling fsck reports on our main production instance, as mentioned in this other thread, but the specific case of the orphans category probably deserves its own topic.

The reports we observe first appeared after a version upgrade of the FST from 4.2.12 to 4.2.18. When the FST booted, it did a full resync (in fact, it seems that shutting it down with the systemctl command ended in a forced shutdown). After this reboot, orphan files started to show up.

We now have a lot of false orphan reports (millions, probably covering all files on the FST) coming from a single FST, and the number varies over time: fsck runs every 30 minutes, and from one run to the next the number of reported files ranges between 1M and 14M (the FST holds 19M files in total).

Extracting the list of files with fsck report takes a long time (50 s) when only 1M are present, but we managed to do it. The files we tested from the list are healthy, so the reports are probably incorrect.
Where could that come from?

How could we deal with this and remove the files from the report? We are not sure that an fsck repair --resync would not get the MGM stuck with such a large number of files, and since we also have issues with files in other fsck categories, we would prefer to avoid it. Is it possible to send resync requests for a subset of the files only, to see whether that settles it?
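
To make the "subset" idea concrete, here is a minimal sketch of how we would pick the files on our side. It assumes the orphan list has been exported to a text file with one file ID per line; the helper names and the file name in the usage note are hypothetical and not part of EOS. It deliberately stops before issuing any request, since the per-file resync command is exactly what we are asking about.

```python
import random


def load_fids(path):
    """Read one file ID per line, skipping blank lines (assumed export format)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def pick_subset(fids, sample_size, seed=0):
    """Return a reproducible random subset of file IDs, without duplicates."""
    unique = sorted(set(fids))
    if sample_size >= len(unique):
        return unique
    rng = random.Random(seed)  # fixed seed so the same subset can be re-checked later
    return sorted(rng.sample(unique, sample_size))


def batches(items, batch_size):
    """Yield consecutive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

For example, `pick_subset(load_fids("orphans.txt"), 1000)` would give a fixed sample of 1000 IDs, and `batches(...)` would let us submit them in small groups while watching the MGM load.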

At IHEP, eos fsck status also reported between 4M and 9M orphan files on one EOS instance. The instance holds 16M files in total. The EOS server and client are both at version 4.2.16.