FSCK reports and how to deal with them

I just did it; still nothing in rep_missing_n. I have some entries in d_mem_sz_diff, for files whose size was changed to 0. In this situation, running eos fsck repair --replace-damaged-replicas dropped the replica with the correct content and kept the one with zero size (because it matches the size in the MGM), so this is indeed where the danger lies, and I’ll avoid it in production.

But in the meantime I tried to upgrade the FST on this node. The outcome was that the number of used files reported by the FST decreased to what is probably the real number of files present, but the output of dumpmd on this FS still lists all the replicas that were manually deleted.

I’ll try to downgrade this FST again to see if booting shows some files…

About the files intentionally destroyed on the test instance, changing the FST version didn’t change anything: the files never appeared in any category, except sometimes for a few of them in d_mem_sz_diff. I managed to fix the files by listing them from the dumpmd of the FS (the removed replicas were still listed there, although the FST didn’t know about them) and running a bulk conversion on them. This took about 1 hour for 10K files.
So the fsck reports didn’t really help on this, but the case is quite extreme, I reckon.
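
For reference, this is roughly the loop I used for the bulk conversion (a sketch from memory; the --path option of dumpmd, the path= parsing and the replica:2 target layout are assumptions matching our 2-replica setup, so adapt them to your version):

# list the paths still registered on this FS and schedule a conversion for each of them
FSID=123   # illustrative filesystem id
eos -b fs dumpmd "$FSID" --path | sed -n 's/^path=//p' > /tmp/fs${FSID}_paths.txt
while read -r p; do
  eos -b file convert "$p" replica:2   # one conversion job per file
done < /tmp/fs${FSID}_paths.txt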

Hi Franck,
I did some tests with the latest version, deleting files, and they show up as missing once you do a boot which resyncs the disk contents; this happens after a dirty shutdown or when you do 'fs boot '.

Strange that I don’t observe the same. Maybe because of some left-over configuration? This Citrine instance hasn’t been installed from scratch, but upgraded from Aquamarine.

Another situation that has arisen lately on our production instance:

During our last FST upgrade campaign, one of the FSTs didn’t shut down correctly when stopping the service before the upgrade, so it did a full resync at boot. It ended up reporting millions of orphan files, although they correspond to existing and healthy files. The weird part is that the number of reported orphan files varies at each fsck run (every 30 minutes), randomly between 2M and 13M (this FST holds ~18M files on 24 FS).

I also observed this on the above-mentioned test instance. An eos fsck repair --resync command seems to solve it, although I was telling Elvin this morning that I was reluctant to launch it on the production instance with millions of files…

Hello,

Coming back to this subject: after some FST & MGM upgrades, some FS booting with full resync, and a run of eos fsck repair --resync (which lasted ~1 hour), the average number of reported orphan files (the value changes at each fsck run) climbed from 8M to 37M (out of 160M files in the namespace).

In the output of eos fsck report -a, some FS report almost as many orphan files as stat.usedfiles in the eos fs status output. The list of FS that report orphan files changes at each fsck run.
I could also extract once (the command takes around 1 minute) a list of files with eos fsck report -i -a --error orphans_n, and each file I check with file info is clearly correct.
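
For the record, the spot check is basically this (a sketch; the fxid= token in the report output and the fxid: addressing for fileinfo are assumptions about the exact format of our version):

# extract the reported orphans once and check a handful of them against the namespace
eos -b fsck report -i -a --error orphans_n > /tmp/orphans.txt
for fid in $(grep -o 'fxid=[0-9a-fA-F]*' /tmp/orphans.txt | cut -d= -f2 | head -20); do
  eos -b fileinfo "fxid:$fid"   # a healthy file shows its path and replicas
done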

So, to me, it is clearly an incorrect report. Could that come from bad communication between MGM and FST about the list of available files? I’m not sure how the orphan reporting works, but maybe I could have a look at this communication channel to get rid of these reports? Any idea how I could do that?

Could it, for instance, be caused by the same reason why we also don’t get the error messages from the FSTs on the MGM?

Hi Franck,
this might be because you probably didn’t boot your FSTs with the resync flag from disk. This should clear up once the scanner has run over all files … I will have another look tomorrow; I think we have to fix this now once and for all, because there seems to be something generally wrong in the FSCK in Citrine. It can also be that the boot procedure streaming the records times out because you have too many records …

Last hint: if you do a full boot, then at first all files are tagged as orphans until they have been found on disk and flagged. At least that is how it should be …

Cheers Andreas.

Hi Andreas,

Thank you for your answer.

On the contrary, it seems that the FS that report many orphan files are the ones that had a forced full resync boot after an unclean shutdown was detected.
Or do you maybe mean something else by “resync flag from disk”? Do you suggest that force booting these FS again with --syncmgm, while the FST is on, might solve it?

Our scaninterval is indeed quite long on our disks. However, after shortening it for one FS, many files were scanned, but the number of reported orphan files didn’t drop.
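
For reference, this is how I shortened it for that one FS (the fsid and the interval value are just examples):

eos -b fs config 123 scaninterval=86400   # rescan files that have not been scanned for more than one day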

Yes, I was suspecting something like that… but how could we detect if this happens?

OK, so this might explain this behaviour. Somehow on our instance the flagging of the files doesn’t happen… But isn’t that quite risky? If someone runs the fsck repair --unlink-orphans command, we might lose some files, no?

A side-effect we observed with this fsck report listing many files is that during each fsck check (every 30 minutes), the heartbeat of the FSTs increases to several seconds (> 60 seconds, observed in the node ls output while fsck runs), and this causes errors on the clients, which can’t access some files during this period, as if all FSTs were offline (the errors are “Numerical result out of range” in the FUSE log, and “Network is unreachable” in the MGM log).

Our users reported a lot of such errors, so we disabled the fsck runs for now.

Hi Andreas, would it make sense if I submitted a ticket to eos-support about that?

Yes!

I did some testing. The orphans probably appear after a ‘fs boot --syncmgm’ because the timeout for retrieving the MGM meta data stream is 10 minutes. Can it be that you have many millions of files on a single filesystem?

The logic behind it works as follows on a full sync:
  • clear the DB
  • put all the files found on disk into the DB and flag them all as orphans
  • flag all the files reported by the MGM as not orphans

When the MGM syncing part times out, you get a lot of fake orphans.
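
A quick way to get a feeling for whether you are close to that timeout (just a sketch; timing dumpmd from the CLI is only a rough proxy for the stream the FST pulls at boot time, and the fsid is illustrative):

FSID=123   # pick one of the filesystems that reports orphans
eos -b fs status "$FSID" | grep stat.usedfiles   # how many files the MGM accounts on this fs
time eos -b fs dumpmd "$FSID" > /dev/null        # how long the MGM needs to stream the records

If this gets anywhere near the 10 minutes for a single filesystem, the fake orphans are expected.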

I also noticed that the process in the FST is not spawned correctly: it uses ‘system’, which is very bad in a multi-threaded program because it uses fork and can cause an OOM. We have to fix that.

We should probably also make this timeout configurable or use a higher default.

Hi Andreas,

Not really. Each FS has on average 1M files, ranging roughly between 500K and 1.5M. From the last fsck export (before we disabled it, as mentioned above), almost all FS that report orphan files report more than 600K, so it could be that we reach the timeout…

But we also observe such reports on our test instance, currently on one FS out of 9, which reports only 1K files, and repair --resync does not clear them… nor does the regular scan on the FST (the test instance scans them daily…).

Can you briefly check, maybe the 1K are actually orphans?

They were not actual orphans (there was a path for them, and there was a replica entry for the FS reporting them as orphans), but I just rechecked and the resync action did eventually clear them, although not immediately. So on this small instance the orphan count dropped to 29, and those are actual orphans.

Regarding the production instance, during the last maintenance period we launched a resync on the roughly 10M files listed in the fsck report. This took around 1 hour, and in the meantime the fsck interface was not responding. But since we also restarted all the FSTs for the upgrade, the number of files eventually grew again to a larger number, and we couldn’t tell whether the previous ones had been cleared or not. Now we have even more files.

Hi Franck,

If you have the possibility, can you test the version produced by this pipeline:

and let me know if it helps with your issue? I’ve made some modifications to the scanner and it should now properly mark and move orphans.

Cheers,
Elvin

Hi Franck,

One more thought on this. I picked one fs which had around 42k orphans out of 51k files and did the following:

  • initial status:
eos fs status 3870 | grep orphans
stat.fsck.orphans_n              := 43262
eos fs ls 3870 --io
┌────────────────────────────────┬──────┬────────────────┬───────────────────┬──────────┬────────────┬────────────┬──────────┬──────────┬──────────┬──────┬──────┬────────────┬────────────┬────────────┬───────────┬──────────┬──────────────┬────────────┬──────┬─────────┐
│hostport                        │    id│      schedgroup│             geotag│  diskload│  diskr-MB/s│  diskw-MB/s│ eth-MiB/s│  ethi-MiB│  etho-MiB│ ropen│ wopen│  used-bytes│   max-bytes│  used-files│  max-files│   bal-shd│     drain-shd│   drainpull│  iops│       bw│
└────────────────────────────────┴──────┴────────────────┴───────────────────┴──────────┴────────────┴────────────┴──────────┴──────────┴──────────┴──────┴──────┴────────────┴────────────┴────────────┴───────────┴──────────┴──────────────┴────────────┴──────┴─────────┘
 p05496644h29153.cern.ch:1095       3870       default.28 9918::R::0002::WL02       0.01         0.75         0.01       1192          3        176      2      0      3.39 TB      4.00 TB      51.23 K    195.35 M          0              0          off     59    151 MB 
  • then did a boot of that fs with the --syncmgm flag on:
eos fs boot 3870 --syncmgm 
success: boot message send to p05496644h29153.cern.ch:/data19
  • and monitored the status on the FST:
 tail -f /var/log/eos/fst/xrdlog.fst | grep  "FmdDbMap"
180525 16:21:44 time=1527258104.459059 func=ResyncAllDisk            level=INFO  logid=CommonFmdDbMapHandler unit=fst@p05496644h29153.cern.ch:1095 tid=00007f105b7fb700 source=FmdDbMap:1054                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="synced files so far" nfiles=10000 fsid=3870
180525 16:21:55 time=1527258115.712853 func=ResyncAllDisk            level=INFO  logid=CommonFmdDbMapHandler unit=fst@p05496644h29153.cern.ch:1095 tid=00007f105b7fb700 source=FmdDbMap:1054                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="synced files so far" nfiles=20000 fsid=3870
180525 16:22:02 time=1527258122.234257 func=ResyncAllDisk            level=INFO  logid=CommonFmdDbMapHandler unit=fst@p05496644h29153.cern.ch:1095 tid=00007f105b7fb700 source=FmdDbMap:1054                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="synced files so far" nfiles=30000 fsid=3870
180525 16:22:07 time=1527258127.312786 func=ResyncAllDisk            level=INFO  logid=CommonFmdDbMapHandler unit=fst@p05496644h29153.cern.ch:1095 tid=00007f105b7fb700 source=FmdDbMap:1054                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="synced files so far" nfiles=40000 fsid=3870
180525 16:22:10 time=1527258130.257063 func=ResyncAllDisk            level=INFO  logid=CommonFmdDbMapHandler unit=fst@p05496644h29153.cern.ch:1095 tid=00007f105b7fb700 source=FmdDbMap:1054                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="synced files so far" nfiles=50000 fsid=3870
180525 16:22:23 time=1527258143.487375 func=ResyncAllMgm             level=INFO  logid=CommonFmdDbMapHandler unit=fst@p05496644h29153.cern.ch:1095 tid=00007f105b7fb700 source=FmdDbMap:1239                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="synced files so far" nfiles=10000 fsid=3870
180525 16:22:24 time=1527258144.255685 func=ResyncAllMgm             level=INFO  logid=CommonFmdDbMapHandler unit=fst@p05496644h29153.cern.ch:1095 tid=00007f105b7fb700 source=FmdDbMap:1239                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="synced files so far" nfiles=20000 fsid=3870
180525 16:22:24 time=1527258144.937825 func=ResyncAllMgm             level=INFO  logid=CommonFmdDbMapHandler unit=fst@p05496644h29153.cern.ch:1095 tid=00007f105b7fb700 source=FmdDbMap:1239                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="synced files so far" nfiles=30000 fsid=3870
180525 16:22:25 time=1527258145.667004 func=ResyncAllMgm             level=INFO  logid=CommonFmdDbMapHandler unit=fst@p05496644h29153.cern.ch:1095 tid=00007f105b7fb700 source=FmdDbMap:1239                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="synced files so far" nfiles=40000 fsid=3870
180525 16:22:26 time=1527258146.376248 func=ResyncAllMgm             level=INFO  logid=CommonFmdDbMapHandler unit=fst@p05496644h29153.cern.ch:1095 tid=00007f105b7fb700 source=FmdDbMap:1239                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="synced files so far" nfiles=50000 fsid=3870
180525 16:22:26 time=1527258146.484211 func=MarkCleanDB              level=INFO  logid=CommonFmdDbMapHandler unit=fst@p05496644h29153.cern.ch:1095 tid=00007f105b7fb700 source=FmdDbMap:488                   tident=<service> sec=      uid=0 gid=0 name= geo="" LevelDB DB mark clean for fsid=3870

After this was finished, the reporting of orphans was “more reasonable”:

eos fs status 3870 | grep orphans
stat.fsck.orphans_n              := 12

Bottom line, this might point to the scenario that Andreas suggested. There is now (already in 4.2.23) a timeout of 4 hours for the dumpmd command so this should go away. Together with the recent fixes such orphans should be handled properly.
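
If you want to see which of your file systems are affected before booting them one by one, something like this should do it (a sketch; the parsing of the monitoring output is an assumption, adjust it to your version):

# print the current orphan count for every filesystem
for fsid in $(eos -b fs ls -m | tr ' ' '\n' | grep '^id=' | cut -d= -f2); do
  eos -b fs status "$fsid" | grep 'stat.fsck.orphans_n' | sed "s/^/fsid=$fsid /"
done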

Cheers,
Elvin

Hi Elvin,

Thank you for spending some time on this.

What is the way to use this specific pipeline version? Is there a package repository for it? Unfortunately, I don’t know whether I will be able to test this unofficial release on our production instance, which is the one presenting the issue…

About the resync: in fact I realize that an FS takes much more time than yours to boot, between 30 minutes and one hour, and this sometimes happens when an FST starts, i.e. all FS boot at the same time, which certainly slows things down. So maybe booting them one by one sequentially, as sketched below, might improve things. But still, I might want to wait until we upgrade to v4.2.23, because the previous 10-minute timeout will for sure be exceeded.
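
Something like this is what I have in mind for the sequential boot (just a sketch; the fsids are illustrative, and I assume the fs status output exposes a stat.boot key that ends up as “booted”, which may differ on other versions):

for fsid in 101 102 103; do            # the filesystems of one FST, in order
  eos -b fs boot "$fsid" --syncmgm
  # wait until this filesystem has finished booting before starting the next one
  until eos -b fs status "$fsid" | grep 'stat.boot ' | grep -q booted; do
    sleep 60
  done
done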

Hi Franck,

The tests I’ve done were using 4.2.22. Therefore, I think you can go ahead and test it on one file system of one of the FSTs without doing any update. You no longer need the new version from the pipeline. Just follow the steps from my last post.

We’ve removed this full resync operation, so this should not happen in the future, but that will only be available in 4.2.24.

Cheers,
Elvin

OK, thank you for the clarifications; I will try to run a boot on some FS.

Just one question:

Will this orphan value change even if we don’t have fsck enabled?