How to audit and repair storage with fsck

Hello,

Following all the tests that triggered this issue: https://eos-community.web.cern.ch/t/fsck-settings-communication-between-managers-and-fsts, I am now looking at our EOS storage for Alice (as a Tier 2).

We have 2.65 PB of storage, 2.17 PB used (81%), and 84 filesystems.

I will examine all filesystems, but for now I am looking at the first one.
An important note: we are not using RAIN. Each filesystem corresponds to a partition in a (big) RAID6 volume and we do not replicate files. I suspect this will limit our ability to repair errors.

[root@naneosmgr02(EOSMASTER) ~]#eos fsck report -a | grep 'fsid=27'
timestamp=1621995118 fsid=27 tag="m_mem_sz_diff" count=1
timestamp=1621995118 fsid=27 tag="orphans_n" count=1
timestamp=1621995118 fsid=27 tag="rep_diff_n" count=315
timestamp=1621995118 fsid=27 tag="rep_missing_n" count=5702
timestamp=1621995118 fsid=27 tag="unreg_n" count=677

The total number of entries in the local FST DB (LocalDB) for this filesystem is 797951.

I looked at each case and I found:

m_mem_sz_diff (1)

The file looks OK on disk but has a size of 0 and no checksum in the NS.
How can I force the MGM to resynchronize? Shouldn't this be done automatically?

orphans_n (1)

The file is in the .eosorphans directory. I suppose it is enough to simply delete it? Will the entry disappear from the LocalDB?

unreg_n (677)

I looked at one file and found that it actually lives on another filesystem. It does not exist on disk on the filesystem we are examining but still has an entry in the LocalDB… What should be done here?

rep_diff_n (315)

For the file I looked at, there are 2 replicas instead of one. Both files on disk have the correct checksum. How should this be dealt with?

rep_missing_n (5702)

This is roughly 0.7% of the files. I looked at the first file and found that it is on another filesystem of the same FST. The problem is that it is still in the LocalDB of the one we are examining (although not on disk).
This is not necessarily the case for all 5702 files, and I cannot look at them one by one. So what can be done here?

Thank you

JM

Hi JM,

Let me address each case separately.

m_mem_sz_diff
This means that the MGM never received a commit from the FST for that file, which is why you have no size and no checksum. You could trigger an eos file verify -commitsize / -checksum -commitchecksum, but you have no guarantee that what you have on disk is the actual file the user wanted to update. So I would just consider this file lost. By the way, do you have fusex enabled on this instance?
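For reference, the two verify invocations would look roughly like this (the path is just a placeholder, and I am assuming the flags behave as described above):

# Ask the FST to verify the replica and commit the on-disk size to the namespace
eos file verify <path> -commitsize
# Recompute the checksum on the FST and commit it to the namespace
eos file verify <path> -checksum -commitchecksum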

orphans_n
Starting with 4.8.50, which is not released yet, there will be a command eos fsck clean_orphans [--fsid <val>] that you can use to properly clean up the orphan files in your instance. Deleting the files manually from disk will not update the contents of the local DB; this new tool takes care of both.
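Once 4.8.50 is out, usage should look roughly like this (take the exact syntax as an assumption until the release is available):

# Clean up orphan entries on a single filesystem ...
eos fsck clean_orphans --fsid 27
# ... or on all filesystems of the instance
eos fsck clean_orphans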

unreg_n
Try using the eos fsck repair --fxid <val> --fsid <val_fs> --error unreg_n command on one of these files and let's see what the outcome is; the entry should be cleaned up from the local DB.
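To pick a candidate, something along these lines should work (the fxid is a placeholder taken from the report output):

# List the unreg_n errors for filesystem 27, including the affected fxids
eos fsck report -a -i --error unreg_n | grep 'fsid=27'
# Repair one of the reported files
eos fsck repair --fxid <fxid> --fsid 27 --error unreg_n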

rep_diff_n
The same as before

rep_missing_n
Does this mean the file is currently not accessible? Could this be the result of some eos fs mv --force command? Can you paste an example so that I can better understand the situation?

Thanks,
Elvin

Hello Elvin,
Thank you very much,
m_mem_sz_diff
We are not using fusex. I suppose nothing of this sort is enabled. I tried the file verify but it does not work (nothing changes).
orphans_n
OK, let’s wait for version 4.8.50…
unreg_n
That worked
rep_diff_n
That worked as well
rep_missing_n
The file is accessible (I found it in the AliEn File Catalog and copied it locally)… It is located on another filesystem of the same server (28) but is still in the LocalDB of filesystem 27, while no longer on that disk. Since the MGM info is correct, this is just a matter of cleaning the LocalDB…
The problem is that I have over 5k rep_missing_n errors reported by fsck and I cannot check all of them manually to decide what to do… What about removing these entries from the LocalDB? Then either the replica in the namespace points to another filesystem and everything is fine, or the file is lost anyway and I would have to make a list and report it to the VO… (How can I make the list of filenames?)

Thank you

JM

Hi JM,

For the m_mem_sz_diff, did you run the file verify twice, once with -commitsize and once with -checksum -commitchecksum? Do you see any errors on the FST concerning this verify operation?

For the rep_missing_n case, can you provide more info? Pick one such file and paste the output of:
eos file info <fpath> --fullpath
eos file check <fpath>
eos fsck report -a -i --error rep_missing_n | grep <fxid_val>
Once I have this info, I can advise you better.

Thanks,
Elvin

Hi Elvin,
m_mem_sz_diff
I am not sure I had run the two commands separately before, but I have just done it the correct way. There is no error message, and I could see nothing related to the verify in the FST log. The result is the same: no changes, the size and the checksum are not updated, and the number of replicas is still 0:

EOS Console [root://localhost] |/> fileinfo fid:115631077
File: '/eos/alice/grid/13/03803/16bf62a2-2d33-11eb-b489-1356836248bf' Flags: 0664 Clock: 16832fb06f947df4
Size: 0
Modify: Mon Nov 23 03:25:04 2020 Timestamp: 1606098304.529134816
Change: Mon Nov 23 03:25:04 2020 Timestamp: 1606098304.504584355
Birth: Mon Nov 23 03:25:04 2020 Timestamp: 1606098304.504584355
CUid: 222 CGid: 222 Fxid: 06e463e5 Fid: 115631077 Pid: 27386 Pxid: 00006afa
XStype: adler XS: 00 00 00 00 ETAGs: "31039480882266112:00000000"
Layout: plain Stripes: 1 Blocksize: 4k LayoutId: 00100002 Redundancy: d0::t0
#Rep: 0

As you can see, #Rep is 0, so here we have not succeeded in re-registering in the namespace a replica that exists on a filesystem…

rep_missing_n

[root@naneosmgr01(EOSMASTER) ~]#eos fileinfo fxid:0000f9ab
File: '/eos/alice/grid/13/48828/39ace77a-cbbb-11e3-a248-b30e3865f5cb' Flags: 0000 Clock: 16841b8c0bea91f3
Size: 9350
Modify: Thu Apr 24 16:17:55 2014 Timestamp: 1398349075.143492000
Change: Thu Apr 24 16:17:54 2014 Timestamp: 1398349074.966134451
Birth: Thu Jan 1 01:00:00 1970 Timestamp: 0.000000000
CUid: 222 CGid: 222 Fxid: 0000f9ab Fid: 63915 Pid: 266 Pxid: 0000010a
XStype: adler XS: 1e 8a 7e 47 ETAGs: "17157052170240:1e8a7e47"
Layout: plain Stripes: 1 Blocksize: 4k LayoutId: 00100002 Redundancy: d1::t0
#Rep: 1
┌───┬──────┬───────────────────┬────────────┬────────┬────────┬──────────────┬─────────┬────────┬────────────────┐
│no.│ fs-id│               host│  schedgroup│    path│    boot│  configstatus│    drain│  active│          geotag│
└───┴──────┴───────────────────┴────────────┴────────┴────────┴──────────────┴─────────┴────────┴────────────────┘
   0     28   nanxrd01.in2p3.fr    default.0   /data2   booted             rw   nodrain   online   Subatech::H002

EOS Console [root://localhost] |/> file check /eos/alice/grid/13/48828/39ace77a-cbbb-11e3-a248-b30e3865f5cb --fullpath
path="/eos/alice/grid/13/48828/39ace77a-cbbb-11e3-a248-b30e3865f5cb" fxid="0000f9ab" size="9350" nrep="1" checksumtype="adler" checksum="1e8a7e4700000000000000000000000000000000"
nrep="00" fsid="28" host="nanxrd01.in2p3.fr:1095" fstpath="/data2/00000006/0000f9ab" size="9350" statsize="9350" checksum="1e8a7e4700000000000000000000000000000000" diskchecksum="1e8a7e4700000000000000000000000000000000" error_label="none"

[root@naneosmgr01(EOSMASTER) ~]#eos fsck report -a -i --error rep_missing_n | grep '0000f9ab'
timestamp=1622465747 fsid=27 tag="rep_missing_n" count=5701 fxid=0000f9ab, 00013e2c, 00014ec9, 000379a1, […]

About this second case: in the example above, the namespace has correctly registered that the replica is on filesystem 28, while the entry is still present in the LocalDB of filesystem 27. No idea how this happened.

As I said, I do not know whether all the rep_missing_n errors on this filesystem 27 are of the same kind, and there are too many of them to analyse one by one… What automatic procedure (script) could we imagine?
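For instance, I imagine something like this rough, untested sketch to at least build the list of affected paths, assuming the report line keeps the format shown above:

# Extract the fxid list for filesystem 27 from the fsck report and resolve
# each fxid to its namespace path (the 'File:' line of 'eos fileinfo').
eos fsck report -a -i --error rep_missing_n | grep 'fsid=27' \
  | grep -o 'fxid=.*' | sed -e 's/^fxid=//' -e 's/, */\n/g' \
  | while read -r fxid; do
      eos fileinfo "fxid:${fxid}" | grep 'File:' || echo "fxid:${fxid} not found in namespace"
    done > rep_missing_files_fs27.txt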

Thank you

JM

Hi JM,

Ah, OK, now I understand. I thought you did have the replica registered in the namespace. Then this should be a simple case for eos fsck repair --fxid <fxid> --fsid <fsid_val> --error unreg_n. Usually the same file can fall into several categories, and I bet this one is also in the unreg_n one. The result should be that, after the repair action, the namespace entry has the missing replica attached.

For the second case, also running eos fsck repair --fxid <fxid> --fsid 27 --error rep_missing_n should fix the issue. This should leave the namespace entry untouched and just clean the local DB on filesystem 27.
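If that works on a couple of files, a simple loop over the reported fxids should take care of the rest. A minimal sketch, along the same lines as your listing sketch above and assuming the same report format (please test it on a few entries first):

# Run the rep_missing_n repair for every fxid reported on filesystem 27
eos fsck report -a -i --error rep_missing_n | grep 'fsid=27' \
  | grep -o 'fxid=.*' | sed -e 's/^fxid=//' -e 's/, */\n/g' \
  | while read -r fxid; do
      echo "repairing fxid=${fxid}"
      eos fsck repair --fxid "${fxid}" --fsid 27 --error rep_missing_n
    done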

Cheers,
Elvin

Elvin,

rep_missing_n
OK, the fsck repair command has deleted the duplicate entry from the filesystem's LocalDB:

fsck repair --fxid 0000f9ab --fsid 27 --error rep_missing_n
msg="repair successful"

eos-leveldb-inspect --dbpath /var/eos/md/fmd.0027.LevelDB/ --fid 63915
error: fid 63915 not found in the DB

m_mem_sz_diff

In this case, performing the fsck repair gives an error but the entry is deleted from the LocalDB of the filesystem, which is OK:

EOS Console [root://localhost] |/> fsck repair --fxid 06e463e5 --fsid 27 --error unreg_n
msg="repair job failed"
msg="repair job failed"

eos-leveldb-inspect --dbpath /var/eos/md/fmd.0027.LevelDB/ --fid 115631077
error: fid 115631077 not found in the DB

There are now no more errors in the fsck report for this filesystem, but we have a file with 0 replicas that should be considered lost and reported to the VO, right?

Regarding doing things automatically, can the automatic repair be used? Should I try to activate it?
It is currently not enabled:

EOS Console [root://localhost] |/> fsck stat
Info: collection thread status → enabled
Info: repair thread status → disabled

Thanks

JM

Hi JM,

Yes, the files with 0 replicas are lost. If you enable fsck repair, these cases should indeed be fixed. It's just a matter of doing eos fsck config toggle-repair.
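For completeness, enabling it and checking the thread status is just:

# Toggle the repair thread on and verify that both threads are enabled
eos fsck config toggle-repair
eos fsck stat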

Cheers,
Elvin

Thank you Elvin,

I have activated fsck repair. Almost immediately, something happened on the FSTs: the command eos-leveldb-inspect --fsck now reports only orphans_n!

I suppose this will be reflected in the result of eos fsck stat after the next scan…

Of course, this does not mean the NS is consistent with the contents of the VO file catalog; I have written to Alice to see whether something can be done at that level.
Thanks again !
JM