FSCK reports and how to deal with them

franck-jrc · March 20, 2018, 2:29pm

Dear fellows,

After upgrading to citrine, eos fsck started reporting categories that were not there before, and we would like to know how best to handle them.

These are :

d_mem_sz_diff : most of them are RAIN files (apparently all the RAIN files in our namespace, mostly 12 stripes), but they seem sane: they can be read and are correctly reported by eos file check. How can we remove them from the list ? They might hide replica 2 files which really need repair. Running eos file verify on a RAIN file silently does nothing, and we once issued an eos fsck repair --resync command which hung, so we had to restart the MGM (maybe only because of the number of files ?)

rep_missing_n : we never had this error reported with the aquamarine version. We understand that some replicas are missing, but we have a lot of them (50K). Is it safe to run eos fsck repair --drop-missing-replicas ? Even on 50K files ?

In general, how should we handle reports which contain so many files without overloading the MGM ?
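One generic way to keep the load bounded is to work through the reported files in small batches with a pause in between. The Python sketch below only illustrates that idea and is not an EOS feature: `handle_one` is a hypothetical callback (for example a wrapper around whatever per-file command is chosen), and the batch size and pause are arbitrary illustrative values.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_throttled(
    fids: Iterable[str],
    handle_one: Callable[[str], None],  # hypothetical per-file action
    batch_size: int = 100,              # assumed value, tune for your instance
    pause_s: float = 5.0,               # assumed value, tune for your instance
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call handle_one on every file id, pausing after each batch.

    Returns the number of ids processed.
    """
    done = 0
    for chunk in batched(fids, batch_size):
        for fid in chunk:
            handle_one(fid)
            done += 1
        sleep(pause_s)  # leave the MGM some idle time between batches
    return done
```

The `sleep` parameter is injectable only so that the pacing can be tested without actually waiting.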

Thank you

Here is the current content of eos fsck report :


```none
mgm# eos fsck report
timestamp=1521554746 tag="d_cx_diff" n=2 shadow_fsid=
timestamp=1521554746 tag="d_mem_sz_diff" n=588187 shadow_fsid=
timestamp=1521554746 tag="file_offline" n=4 shadow_fsid=
timestamp=1521554746 tag="orphans_n" n=668 shadow_fsid=
timestamp=1521554746 tag="rep_diff_n" n=54 shadow_fsid=
timestamp=1521554746 tag="rep_missing_n" n=45721 shadow_fsid=
timestamp=1521554746 tag="rep_offline" n=0 shadow_fsid=
timestamp=1521554746 tag="unreg_n" n=154 shadow_fsid=
timestamp=1521554746 tag="zero_replica" n=93 shadow_fsid=
```

franck-jrc · March 20, 2018, 2:57pm

Another strange behaviour compared to the previous version is that the reported number of files in the different categories varies with time, oscillating around an average. Would that mean that the report only covers files that are temporarily in a transitional state (being balanced, created, or deleted) ?

rep_diff_n, for instance, is around 50 files, but when querying them with eos file check, most of them are healthy and have the expected number of replicas.
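To put numbers on this oscillation, two saved outputs of eos fsck report can be compared tag by tag. The sketch below is a minimal Python helper that assumes the line format shown in the report output earlier in this thread (`tag="..." n=...`); the function names are made up for this example.

```python
import re
from typing import Dict

# Matches lines such as: timestamp=1521554746 tag="rep_diff_n" n=54 shadow_fsid=
REPORT_LINE = re.compile(r'tag="(?P<tag>[^"]+)"\s+n=(?P<n>\d+)')


def parse_report(text: str) -> Dict[str, int]:
    """Map each fsck tag to its count; lines that do not match are skipped."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        m = REPORT_LINE.search(line)
        if m:
            counts[m.group("tag")] = int(m.group("n"))
    return counts


def diff_reports(old: Dict[str, int], new: Dict[str, int]) -> Dict[str, int]:
    """Return new minus old for every tag whose count changed."""
    changes: Dict[str, int] = {}
    for tag in sorted(set(old) | set(new)):
        delta = new.get(tag, 0) - old.get(tag, 0)
        if delta != 0:
            changes[tag] = delta
    return changes
```

Saving one report per fsck run and diffing consecutive ones shows which categories are stable and which ones only fluctuate.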

franck-jrc · March 20, 2018, 3:11pm

Some files with missing replicas really seem to have become unhealthy after a balancing operation : the target FST might have crashed (we had such cases in the past), so the file is absent from the disk, but the previous replica has already been removed. It is difficult to diagnose the situation because, with no file present, we can’t know when the balancing occurred and therefore where to look in the logs.
Could this be possible ?

franck-jrc · March 23, 2018, 10:04am

Good morning,

We need to understand more about how the rep_missing_n report works.

On a test instance, I did an extreme disaster simulation : I deleted all files in the 00000*/ folders on one FS, then restarted the FST, booted the FS with resyncmgm, etc., but the files never appeared in this fsck category, although they are indeed absent from the disk (with the statsize displayed as an unsigned integer instead of -1, as previously)

path="fid:8845" fid="0000228d" size="1016" nrep="2" checksumtype="adler" checksum="217d5fc800000000000000000000000000000000"
nrep="00" fsid="3" host="s-jrciprcids92v.cidsn.jrc.it:1095" fstpath="/data03/00000000/0000228d" size="1016" statsize="1016" checksum="217d5fc800000000000000000000000000000000"
nrep="01" fsid="6" host="s-jrciprcids93v.cidsn.jrc.it:1095" fstpath="/data03/00000000/0000228d" size="1016" statsize="18446744073709551615" checksum="217d5fc800000000000000000000000000000000"

The file can be read, until at some point, just by reading it, the missing replica is created as a 0-size file and its size in the MGM becomes 0. The real content of the file then becomes unavailable until we manually drop the bad replica and run a file verify -checksum -commitsize -commitchecksum command. That is OK for one file, but there are many of them. I wanted to test the --drop-missing-replicas repair command on this test instance, but the files are not reported.
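Since fsck does not list these files, one workaround is to scan saved file check output for replicas whose statsize is 18446744073709551615, which is -1 printed as an unsigned 64-bit integer. The following Python sketch assumes the `key="value"` line format of the output shown above; the function names are invented for this example, and it only detects such replicas without repairing anything.

```python
import re
from typing import Dict, List

KV = re.compile(r'(\w+)="([^"]*)"')
MISSING_STATSIZE = 2**64 - 1  # -1 when printed as an unsigned 64-bit integer


def parse_kv(line: str) -> Dict[str, str]:
    """Turn one line of key="value" pairs into a dict."""
    return dict(KV.findall(line))


def missing_replicas(check_output: str) -> List[Dict[str, str]]:
    """Return the replica records whose statsize is the unsigned form of -1."""
    missing = []
    for line in check_output.splitlines():
        rec = parse_kv(line)
        # Replica lines carry an fsid; the per-file summary line does not.
        if "fsid" not in rec:
            continue
        statsize = rec.get("statsize", "")
        if statsize.isdigit() and int(statsize) == MISSING_STATSIZE:
            missing.append(rec)
    return missing
```

Each returned record keeps the `fsid` and `fstpath` fields, so the affected replicas can be listed per filesystem before deciding what to do with them.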

What would be the best option to deal with them ?

apeters · March 23, 2018, 2:05pm

Hi Franck, which version is that? I thought this very dangerous behaviour was fixed by a commit from Elvin. You can see these files in the fsck only if you do a full boot via fs boot or a dirty db boot.

esindril · March 23, 2018, 2:07pm

Which eos version were you running?

esindril · March 23, 2018, 2:10pm

The fix was added in 4.2.13:
https://gitlab.cern.ch/dss/eos/commit/88de823b6120e8a6a25111b7d3cf27ae7baf5945

franck-jrc · March 23, 2018, 3:29pm

Hi, thanks for your answers.

Indeed, you are right, we run an older version, 4.2.12. So we should upgrade quite soon, I suppose.

In production they appeared after a dirty db boot. On the test instance, I can’t get them to appear, even with fs boot --syncmgm

apeters · March 23, 2018, 3:42pm

Can you do a boot without “--syncmgm” …

franck-jrc · March 23, 2018, 4:04pm

I just did it, still nothing in rep_missing_n. I have some entries in d_mem_sz_diff, for files that had their sizes changed to 0. And, in this situation, running eos fsck repair --replace-damaged-replicas dropped the replica with the correct content and kept the one with zero size (because it matches the size in the MGM), so this is indeed where the danger lies, and I’ll avoid it in production.

But in the meantime I tried to upgrade the FST on this node. As a result, the number of used files reported by the FST decreased to what is probably the real number of files present, but the output of dumpmd on this FS still reports all the replicas that were manually deleted.

I’ll try to downgrade this FST again to see if booting shows some files…

franck-jrc · March 28, 2018, 9:16am

About the files intentionally destroyed on the test instance : changing the FST version didn’t change anything; the files never appeared in any category, except sometimes a few of them in d_mem_sz_diff. I managed to fix the files by listing them from the dumpmd of the FS (the removed replicas were still listed there, although the FST didn’t know about them) and running a bulk conversion on them. This took about 1 hour for 10K files.
So the fsck reports didn’t really help here, but the case is quite extreme, I reckon.

apeters · March 28, 2018, 1:25pm

Hi Franck,
I did some tests with the latest version, deleting files, and they show up as missing once you do a boot which resyncs the disk contents. This happens after a dirty shutdown or when you do 'fs boot '.

franck-jrc · March 29, 2018, 2:14pm

Strange that I don’t observe the same. Maybe because of some leftover configuration ? This citrine instance wasn’t installed from scratch, but upgraded from aquamarine.

franck-jrc · March 29, 2018, 2:19pm

Another situation that has arisen lately on our production instance :

During our last FST upgrade campaign, one FST didn’t shut down correctly when the service was stopped before the upgrade, so it did a full resync at boot. It ended up reporting millions of orphan files, although they correspond to existing, healthy files. The weird thing is that the number of reported orphan files varies with each fsck run (every 30 minutes), randomly between 2M and 13M (this FST holds ~18M files on 24 FS).

I observed this also on the above-mentioned test instance. An eos fsck repair --resync command seems to solve this, although I was telling Elvin this morning that I was reluctant to launch it on the production instance with millions of files…

franck-jrc · April 30, 2018, 3:55pm

Hello,

Coming back to this subject : after some FST & MGM upgrades, some FS boots with full resync, and a run of eos fsck repair --resync (which lasted ~1 hour), the average number of reported orphan files (the value changes at each fsck run) climbed from 8M to 37M (out of 160M files in the namespace).

In the output of eos fsck report -a, some FS report almost the same number of files as stat.usedfiles in the eos fs status output. The list of FS that report orphan files changes at each fsck run.
I could also extract once (the command takes around 1 minute) a list of files with eos fsck report -i -a --error orphans_n, and each file I tested with file info is clearly correct.
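A rough way to single out the filesystems with implausible orphan counts is to compare each count with the number of used files on the same filesystem. The Python sketch below assumes both numbers have already been collected per fsid into plain dictionaries (no parsing of EOS output is attempted here), and the 0.9 threshold is an arbitrary illustrative choice.

```python
from typing import Dict, List, Tuple


def suspicious_orphan_counts(
    orphans: Dict[int, int],     # fsid -> reported orphan count (collected by hand)
    used_files: Dict[int, int],  # fsid -> stat.usedfiles (collected by hand)
    threshold: float = 0.9,      # assumed cut-off, not an EOS setting
) -> List[Tuple[int, float]]:
    """Return (fsid, ratio) for filesystems where orphans/usedfiles >= threshold."""
    flagged = []
    for fsid, n_orphans in sorted(orphans.items()):
        used = used_files.get(fsid, 0)
        if used <= 0:
            continue  # no usable reference value for this filesystem
        ratio = n_orphans / used
        if ratio >= threshold:
            flagged.append((fsid, ratio))
    return flagged
```

A filesystem where nearly every file is reported as orphan is more likely a reporting problem than real data loss, which is what the file info checks above suggest.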

So, to me, it is clearly an incorrect report. Could it come from bad communication between the MGM and the FSTs about the list of available files ? I’m not sure how the orphan reporting works, but maybe I could look into this communication channel to get rid of these reports ? Any idea how I could do that ?

Could it be for instance caused by the same reason that we also don’t get the error messages on the MGM from the FSTs ?

apeters · May 2, 2018, 6:36pm

Hi Franck,
this might be because you probably didn’t boot your FSTs with the resync flag from disk. This should clear up once the scanner has run over all files … I will have another look tomorrow. I think we have to fix this now once and for all, because there seems to be something generally wrong in the FSCK in Citrine. It can also be that the boot procedure streaming the records times out because you have too many records …

Last hint: if you do a full boot, then first all files are tagged as orphans until they have been found on disk and flagged. At least that is how it should be …

Cheers Andreas.

franck-jrc · May 4, 2018, 9:04am

Hi Andreas,

Thank you for your answer.

On the contrary, it seems that the FS that report many orphan files are the ones that had a forced full resync boot after an unclean shutdown was detected.
Or do you mean something else by “resync flag from disk” ? Do you suggest that force booting these FS with --syncmgm again, while the FST is on, might solve it ?

Our scaninterval is indeed quite long on our disks. However, after shortening it for one FS, many files were scanned, but the number of reported orphan files didn’t drop.

Yes, I was suspecting something like that… but how could we detect if this happens ?

Ok, so this might explain this behaviour. Somehow on our instance the flagging of the files doesn’t happen… But isn’t that quite risky ? If someone runs the fsck repair --unlink-orphans command, we might lose some files, no ?

franck-jrc · May 4, 2018, 1:53pm

A side effect we observed with this fsck report listing so many files is that during each fsck check (every 30 minutes), the heartbeat of the FSTs increases to more than 60 seconds (observed in the node ls output while fsck runs). This causes errors on clients, which can’t access some files during this period, as if all FSTs were offline (the errors are Numerical result out of range in the fuse log, and Network is unreachable in the MGM log).

Our users reported a lot of such errors, so we disabled the fsck runs for now.

franck-jrc · May 15, 2018, 5:14pm

Hi Andreas, would it make sense if I submitted a ticket to eos-support about that ?

apeters · May 16, 2018, 7:57am

Yes!
