Understanding consistency checks: autorepair, fsck, syncmgm

Hello,

I am trying to understand some of the features of EOS in terms of consistency and detection of errors.
There is:
autorepair and scaninterval (http://eos-docs.web.cern.ch/eos-docs/configuration/autorepair.html)
eos fsck
eos file check
eos fs boot --syncmgm

I would need to understand what exactly each of these tools does and what it needs in terms of communication (network ports) between the EOS managers and the disk servers. I have a remote disk server (test platform) for which the command “eos file check” hangs.

Currently, I observe frequent d_mem_sz_diff errors for raid6 chunks that do not seem to be real, and I do not know how to reset them.

And last, I would like to know the procedure to replace a failing disk in a RAIN setup. Is it enough to drain it, and how do I put in a new disk as a replacement?

Thank you

JM

Hi Jean Michel,

Just a word of caution, when you update to the new namespace with QuarkDB you will have to disable the FSCK functionality as this overwhelms the QuarkDB backend when it runs. I am now in the process of rewriting the fsck part and this will be available soon. The new fsck will work with the new namespace and will provide improved automatic repair functionality.

Having said that, let me detail what each of the commands you listed does. By default, a scanner thread runs on each FST every 4 hours and collects files that haven’t been scanned in the last 7 days (scaninterval). If autorepair is enabled and this scan thread detects a simple checksum error or size mismatch for a replica, it triggers the autorepair functionality, which tries to create a new correct replica and drop the old broken one.
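
If you want to check or change this behaviour, both knobs live in the space configuration. A minimal sketch, assuming the keys described in the autorepair documentation linked above (the values are only illustrative):

eos space config default space.autorepair=on          # enable autorepair for the 'default' space
eos space config default space.scaninterval=604800    # rescan files not checked in the last 7 days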

The fsck functionality collects all errors from the FSTs - checksum errors, size mismatches, missing replicas, orphan replicas etc. - and displays them to the administrator in the fsck report. At the moment it does not do any automatic repair; one needs to do this manually with fsck repair --tags.
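
To see what has been collected so far you can use the fsck subcommands mentioned later in this thread, for example:

eos fsck stat      # summary counters per error category
eos fsck report    # detailed list of affected files and filesystems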

The eos file check command collects the information about a file from several sources, namely the namespace and the local database on each FST holding one of the stripes/replicas.

The fs boot --syncmgm command will try to bring the local DB on the FSTs in sync with what the MGM knows. This functionality is very heavy on the namespace and I don’t recommend using it, as it can seriously affect the performance of the instance. We plan in the near future to drop the local DB and completely remove this functionality. The new fsck will take care of slowly checking the consistency between the MGM (namespace) and the actual files on disk on the FSTs.

The accounting for RAIN files is half-broken in the current fsck, but this will be properly fixed in the rewrite of this functionality.

The eos file check command can hang if one of the FSTs holding a replica of the file is not responding or is in a strange state, in which case the XRootD client trying to connect to that FST blocks. You can run the command with XRD_LOGLEVEL=Dump to see in more detail where it is hanging.
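
For example (the file path here is just a placeholder):

XRD_LOGLEVEL=Dump eos file check /eos/mytest/somefile.dat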

Draining RAIN files works properly only with the central draining. And yes, you just need to drain the disk and put in a replacement; no other operation is needed.

Cheers,
Elvin

Elvin,

First, thank you for the long message of explanations. I have a few questions about your answers:

Just a word of caution, when you update to the new namespace with QuarkDB you will have to disable the FSCK functionality

=> This is on a test platform, but on the production EOS I will need to know how to disable fsck. How do you do it?

The new fsck will work with the new namespace and will provide improved automatic repair functionality

Should we consider delaying the migration of the namespace to QuarkDB until this is fixed
(i.e. until the fixed EOS is in the stable repository)?

By default, there is a scanner thread that runs on each FST, every 4 hours

I read in the help pages that the default is 14 days; maybe that is no longer true. I set it to 1 hour.

it will trigger the autorepair functionality which tries to create a new correct replica and drop the old broken one
Is this true also for RAID6 chunks? How can I test it?

The fsck functionality collects all errors from the FSTs

Errors that have been detected by the scandir on the FST? How can I correct d_mem_sz_diff errors? I have several of them reported and I do not even know if they are real. How can I assess the result of the repair? (I also have some d_cx_diff errors.)

The fs boot --syncmgm will try to bring in sync the local db on the FSTs with what the MGM knows.
You mean the MGM namespace will be updated with what is found on the FST?

The accounting for RAIN files is half-broken in the current fsck
Then I should maybe stop trying to understand the output of fsck report for files having raid6 as a replica type, right?

And finally:

Yes, you just need to drain the disk and put in a replacement. No other operation is needed.
I would have to format the disk with xfs for example, mount it on the disk server and register it with eosfstregister? Can I reuse the same fs id?

Thank you

JM

Hi Jean-Michel,

To disable fsck you run:
eos fsck disable
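
You can verify the change afterwards with the stat subcommand, e.g.:

eos fsck stat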

Should we consider delaying the migration of the namespace to QuarkDB until this is fixed?
It’s hard for me to answer this question. It depends on your constraints. If you reach the limits of the memory of the MGM, or the reboot time is unacceptable for you, then you should move.

I read in the help pages that the default is 14 days; maybe that is no longer true. I set it to 1 hour.
I don’t recommend setting the scaninterval value to one hour, since then the scanner will actually rescan all the files at each run, which is very heavy for the disks and the FSTs. I would leave the default of 7 days. Here is the code which sets it to this value:
https://gitlab.cern.ch/dss/eos/blob/master/mgm/FsView.cc#L873
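
If you want to set it back explicitly, the interval can be configured per space or per filesystem; a sketch with illustrative values (604800 seconds = 7 days, fsid is illustrative):

eos space config default space.scaninterval=604800
eos fs config 2 scaninterval=604800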

Is this true also for RAID6 chunks? How can I test it?
The current fsck does not do proper accounting for RAIN errors so you should not rely on it for RAIN layouts.

Errors that have been detected by the scandir on the FST? How can I correct d_mem_sz_diff errors? I have several of them reported and I do not even know if they are real. How can I assess the result of the repair? (I also have some d_cx_diff errors.)
Yes, fsck reports errors detected by the scandir on the FST. Since EOS 4.4.45 there is a new tool that helps you inspect the local contents of the LevelDB on the FSTs and also the flags that mark the different errors reported by FSCK. For example you can do:

[esindril@esdss000 build_ninja]$ sudo eos fileinfo /eos/dev/replica/test.dat
  File: '/eos/dev/replica/test.dat'  Flags: 0640
  Size: 3314
Modify: Fri Jul 26 14:55:50 2019 Timestamp: 1564145750.692870000
Change: Fri Jul 26 14:55:50 2019 Timestamp: 1564145750.670573117
Birth : Fri Jul 26 14:55:50 2019 Timestamp: 1564145750.670573117
  CUid: 58602 CGid: 1028  Fxid: 0000b16b Fid: 45419    Pid: 11   Pxid: 0000000b
XStype: adler    XS: 74 d7 7c 3a    ETAGs: "12192069976064:74d77c3a"
Layout: replica Stripes: 2 Blocksize: 4k LayoutId: 00100112
  #Rep: 2
┌───┬──────┬────────────────────────┬────────────────┬────────────────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                    host│      schedgroup│                        path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴────────────────────────┴────────────────┴────────────────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0        2         esdss000.cern.ch        default.0 /home/esindril/Eos_data/fst2     booted             rw      nodrain   online                    elvin 
 1        3         esdss000.cern.ch        default.0 /home/esindril/Eos_data/fst3     booted             rw      nodrain   online                    elvin 

*******
[esindril@esdss000 build_ninja]$ sudo eos-leveldb-inspect --dbpath /var/eos/md/fmd.0002.LevelDB --fid 45419
fxid=b16b id=45419 cid=11 fsid=2 ctime=1564145750 ctime_ns=681283000 mtime=1564145750 mtime_ns=692870000 atime=1564145750 atime_ns=692870000 size=3314 disksize=3314 mgmsize=281474976710641 lid=0x100112 uid=58602 gid=1028 filecxerror=0x0 blockcxerror=0x0 layouterror=0x0 checksum=74d77c3a diskchecksum=74d77c3a mgmchecksum=none locations=none 

I think this tool will also work if you just copy it onto one of the FSTs; you don’t necessarily need to update the EOS version. With this you can understand where the error comes from. To be honest, the easiest would be to drop the broken replica and trigger an adjust replica (for replica layouts) or to rewrite the entire file (for RAIN layouts) using “eos file convert --rewrite”. Again, many of these will either change or go away with the new fsck implementation, which will handle all these operations on its own in the background.
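
As a rough sketch of that manual repair (the paths and fsid below are placeholders, and this assumes the remaining replica is the healthy one):

eos file drop /eos/dev/replica/test.dat 3          # drop the broken replica held on fsid 3
eos file adjustreplica /eos/dev/replica/test.dat   # recreate the missing replica from the good one
eos file convert --rewrite /eos/dev/rain/test.dat  # RAIN layouts: rewrite the whole file instead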

You mean the MGM namespace will be updated with what is found on the FST?
No, the other way around. All the info from the MGM is dumped (essentially an fs dumpmd happens during this operation) and the info in the local DB of the FST (see the previous answer) is updated, so that the scanner can spot inconsistencies between what is in the namespace and what is on disk.

Then I should maybe stop trying to understand the output of fsck report for files having raid6 as a replica type, right?
Yes, exactly.

I would have to format the disk with xfs for example, mount it on the disk server and register it with eosfstregister? Can I reuse the same fs id?
Yes, you could reuse the same fsid, but there is no advantage to this. Just make sure you drain it properly and remove it using the fs rm command; then you can add it back with the same id if you wish.
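
A minimal sketch of the sequence (fsid 68 is a placeholder):

eos fs config 68 configstatus=drain   # start draining the failing filesystem
eos fs ls                             # watch the drain column until it shows 'drained'
eos fs rm 68                          # remove the drained filesystem from the configuration
# replace the disk, create the xfs filesystem, mount it on the FST,
# then register it again with eosfstregister (or manually, see further down in this thread)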

Cheers,
Elvin

Hi Elvin,

Thank you very much for the precise answers.

As for migrating our production EOS to using QuarkDB, the issues are:

a) according to “Config in quarkDB for master/slave(s)” I would need EOS version 4.4.47 in order for the manager failover to work correctly with QuarkDB. Right?

b) having updated to 4.4.47 and migrated to QDB, I would have to disable fsck (which I believe is not currently enabled in our production storage). With a compacted namespace, we have no problems booting our managers, so it may be better to wait. This also depends on your release plans (4.4.47 in the stable repo and the fixed fsck available).

About damaged filesystems:

If I am not wrong, the way it works at CERN is that if smartctl detects a disk failure, the fs is automatically drained. Is this right?

I have a fs that I manually drained and that is currently “empty,drained”. I moved it to a “spare” group. Is there an automatic procedure that would use this available fs in a group that has a failed one, or is it a manual procedure? In that case, is it enough to move it into the group which needs it and let the balancer do its job?

Thanks

JM

Hi Jean-Michel,

a) Yes, that is the minimum version.

b) Fixed fsck functionality will probably be in 4.5.5 or later.

If I am not wrong, the way it works at CERN is that if smartctl detects a disk failure, the fs is automatically drained. Is this right?
Yes, by and large it works like this, but we don’t rely on smartctl. If a file system is in bootfailure mode, it will trigger an automatic draining of that file system. We also have our own internal scrubbing mechanism for the disks, which can detect issues with the underlying disk.

For the last question, there is no automatic procedure to do this; you have to put the disk back into the group where you need it. Once you move it in, it will accept new files and balance other files from disks in the same group.
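
For example, assuming the spare filesystem has fsid 68 and the group that lost a disk is default.7 (both placeholders):

eos fs mv 68 default.7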

You’re welcome!
Elvin

Hello,

I have to admit that I have not yet understood all the consistency checks that we can use or look at in EOS.
I am currently playing with a small test cluster: I deleted one of the replicas of a file in a directory with attr set default=replica, and I am trying to look at the different tools.

After several days, it seems that the MGM is still not aware that I removed one replica:

 [root@eosmgr01 ~]$ eos fileinfo //eos/testmirror/bidon
  File: '/eos/testmirror/bidon'  Flags: 0644
  Size: 23
Modify: Thu Aug 20 14:17:11 2020 Timestamp: 1597925831.704535000
Change: Thu Aug 20 14:17:11 2020 Timestamp: 1597925831.548517884
Birth : Thu Aug 20 14:17:11 2020 Timestamp: 1597925831.548517884
  CUid: 99 CGid: 99  Fxid: 00000011 Fid: 17    Pid: 16   Pxid: 00000010
XStype: adler    XS: 58 a8 06 f2    ETAGs: "4563402752:58a806f2"
Layout: replica Stripes: 2 Blocksize: 4k LayoutId: 00100112
  #Rep: 2
┌───┬──────┬────────────────────────┬────────────────┬────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                    host│      schedgroup│            path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴────────────────────────┴────────────────┴────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0        2        eosmgr01.in2p3.fr        default.1          /data02     booted             rw      nodrain   online                  CCIN2P3 
 1        6        eosmgr03.in2p3.fr        default.1          /data02     booted             rw      nodrain   online                  CCIN2P3 

I tried to understand with different tools and I would need to summarize what one can use.

On the manager we have

  • fsck (enable, repair, stat)
  • inspector
  • health
  • file info, file check, file verify

Could someone state once more what each tool does and which variables drive its behaviour? (Are the space variables scaninterval, scanrate and scan_disk_interval for fsck? How are the variables scan_ns_interval and scan_ns_rate used?)

On the FST we have

  • The disk scanner, which (according to a previous answer) will try to match what is actually on the disk with what the MGM thinks there should be, and report inconsistencies.
  • The tool eos-leveldb-inspect to look at the FST database.

Looking at the FST:

[root@eosmgr03 ~]$ eos-leveldb-inspect --dbpath /var/eos/md/fmd.0006.LevelDB --dump_ids
fid(dec) :
17

[root@eosmgr03 ~]$ eos-leveldb-inspect --dbpath /var/eos/md/fmd.0006.LevelDB --fid 00000017
fxid=11 id=17 cid=16 fsid=6 ctime=1597925831 ctime_ns=548517884 mtime=1597925831 mtime_ns=704535000 atime=1598265072 atime_ns=668975000 size=0 disksize=0 mgmsize=23 lid=0x100112 uid=99 gid=99 filecxerror=0x0 blockcxerror=0x0 layouterror=0x0 checksum=58a806f2 diskchecksum=none mgmchecksum=58a806f2 locations=2,6,

[root@eosmgr03 ~]$ eos-leveldb-inspect --dbpath /var/eos/md/fmd.0006.LevelDB --verbose_fsck
Num entries in DB[mem_n]: 1
Num. files synced from disk[d_sync_n]: 1
Num, files synced from MGM[m_sync_n]: 1
Disk/referece size missmatch[d_mem_sz_diff]: 0
MGM/reference size missmatch[m_mem_sz_diff]: 1
00000011
Disk/reference checksum missmatch[d_cx_diff]: 0
MGM/reference checksum missmatch[m_cx_diff]: 0
Num. of orphans[orphans_n]: 0
Num. of unregistered replicas[unreg_n]: 0
Files with num. replica missmatch[rep_diff_n]: 0
Files missing on disk[rep_missing_n]: 0

The file with the deleted replica appears only under m_mem_sz_diff.

… and the MGM still lists both replicas as online!

How can I use the different tools to understand the current situation?
What is missing so that the MGM understands that one of the replicas has been lost?

Thank you, and sorry for this long post; I hope it can also help others to understand.
Maybe make a documentation page out of all this?

JM

Hi JM,

Below you can find a presentation that should cover all the points you raised. Autorepair does not exist anymore in recent versions (>= 4.8.0), so some of this is simpler. Please let me know if you still have questions after reading the attached file.

https://cernbox.cern.ch/index.php/s/ul6Rpb4mdYbRsva

Cheers,
Elvin

Hi Elvin,
Thank you very much.
I was trying to compare what is displayed in this presentation with what I can see on my storage, but I believe the EOS version is not recent enough (my production storage is 4.5.6). I will have to perform an update.
However, I have a test cluster running EOS version 4.7.7. What I see there is closer to your presentation, but I think there are still a couple of issues:

  • eos fs status was showing only scaninterval. I had to add the parameters scan_disk_interval and scan_ns_interval. (Shouldn’t they be there already?)

  • In the FST log I see that a piece is missing:

200909 08:42:08 time=1599633728.244151 func=RunDiskScan level=NOTE logid=2aa5f450-d7c2-11ea-a635-003048de1e44 unit=fst@nanxrd15.in2p3.fr:1095 tid=00007fa260feb700 source=ScanDir:454 tident= sec= uid=0 gid=0 name= geo="" msg="no qclient present, skipping disk scan"
200909 08:45:04 time=1599633904.403491 func=RunNsScan level=NOTE logid=2aa5f450-d7c2-11ea-a635-003048de1e44 unit=fst@nanxrd15.in2p3.fr:1095 tid=00007fa2607ea700 source=ScanDir:207 tident= sec= uid=0 gid=0 name= geo="" msg="no qclient present, skipping ns scan"

What is this qclient (something related to QDB?)
If that is the case, I may have trouble adding it to my production storage file servers, since they are still running SL6; I am not sure a piece of code related to QDB is available on this platform…

JM

Hi JM,

The FSTs need to be able to connect to QDB to have the full fsck functionality working properly. This is to ensure that the namespace view is consistent with what is on disk. Therefore you need to add the following two configuration parameters to all the FST daemons:

fstofs.qdbcluster localhost:7777                                                                                                                                                                                                                                                        
fstofs.qdbpassword_file /etc/eos.keytab
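
After adding these two lines to the FST configuration file (typically /etc/xrd.cf.fst), the FST daemon needs a restart to pick them up, e.g.:

systemctl restart eos@fst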

The scan_*_interval parameters do not show up if the defaults are used. These are the defaults:
https://gitlab.cern.ch/dss/eos/-/blob/master/fst/ScanDir.cc#L109
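
If you prefer to have them explicit, they can also be set per filesystem; a sketch using the keys you mentioned (fsid and values are illustrative, in seconds):

eos fs config 2 scan_disk_interval=14400    # disk scan every 4 hours
eos fs config 2 scan_ns_interval=259200     # namespace scan interval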

SLC6 is indeed a problem, but note that the end of life for this OS is November, so it might be a good idea to move to CentOS7. Nevertheless, qclient is built into the EOS libraries, so there should be no problem adding this to your SLC6 setup.

Cheers,
Elvin

Hi Elvin,
My production cluster has been updated to EOS version 4.7.7.
May I try to sum up what I have understood and the questions I still have:

FST-disk-scan
Will read the disk (additions only?) and update the local LevelDB with what it finds. It can only generate errors such as d_mem_sz_diff and d_cx_diff. Enabled at FS level with “scan_disk_interval”. Right?

FST-ns-scan
Will read the NS (via qclient?) and update the local LevelDB with what it finds? It can generate m_mem_sz_diff, m_cx_diff, unreg_n, rep_missing_n, rep_diff_n, orphans_n?

MGM-fsck-collect
Collects all errors found by the FST scans (disk and ns)?

MGM-fsck-repair
Will try to repair some errors. What can the repair thread actually do? What can be fixed this way?

I have also questions about the presentation https://cernbox.cern.ch/index.php/s/ul6Rpb4mdYbRsva

Slide 7: what does the color code mean?
Slide 8: can we describe each process more precisely (what info is read, what can be modified)?
Slide 9: layout errors: I suppose they can only result from the FST-ns-scan?
Slide 11: by “reference” you mean the LevelDB?

Is the activation of fsck-collect on the MGM costly, or does it hammer QDB?

Is the automatic draining in case of errors activated by default? We have 20 TB filesystems, and I am not sure we want a drain to start on its own… Can it be disabled?

What exactly is the space inspector?

Sorry if this is long and messy; I am only trying to figure out the whole thing.
Thank you
JM

Hi JM,

FST-disk-scan - yes, it will read every file once a week and update the information concerning what it finds on disk.

FST-ns-scan - yes, it looks from the namespace side, tries to detect any inconsistencies and marks any errors accordingly.

MGM-fsck-collect - yes, exactly!

MGM-fsck-repair - it will try to repair all errors detected in the collection stage. It can fix almost any type of error, provided enough information is available to take the right decision. If nothing can be fixed, the file (and its metadata) is left untouched. This is what we call a “lost file”.
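
In the reworked fsck these two stages are switched on separately; assuming the toggle subcommands of the new fsck CLI, this would look like:

eos fsck toggle-collect   # start the error collection thread on the MGM
eos fsck toggle-repair    # start the background repair thread
eos fsck stat             # show what is enabled and the current error counters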

The colors on slide 7:
blue - metadata info updated only once when the file is created or updated
green - metadata updated each time the FST scanner runs over that particular file
red - metadata that indicates some sort of corruption detected

Slide 8: this is exactly what we discussed above. The physical files on disk are read, the size/checksum are computed and the local LevelDB is updated. Then, comparing this info with the initial metadata, we decide whether a corruption happened in the meantime. Also, comparing with what is stored in the namespace (the ns-scan), we decide if some files are missing or orphaned, or if some other corruption is present.

Slide 11: yes, exactly, this means the size field from the local LevelDB. It is referred to in the same way on slide 7, where all the stored info is explained.

Fsck was specifically reimplemented not to affect the performance of QDB. There is indeed some more traffic against QDB, but this by no means affects the overall performance of the system.

Yes, automatic draining in case of IO errors is active and has always been. Note that this happens only for IO errors picked up by the scrubber thread, which looks at disk IO; it does not happen if you simply have some files with various types of errors.

The space inspector functionality is meant to track leftover atomic files and clean them up. It has no relation whatsoever to the fsck functionality.
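
For completeness, the inspector is also driven by a space-level switch; a sketch assuming the keys from the file inspector documentation:

eos space config default space.inspector=on   # enable the file inspector
eos space inspector                           # show the current inspector report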

Cheers,
Elvin

Hello Elvin,

Thank you for your patience and for re-explaining fsck and all the disk scans in detail.
Coming back to reusing filesystems: I am currently draining servers in order to reinstall them with CentOS 7 (they are still on SL6). I would like to reuse the FS ids, and I need to clean the existing filesystems in an appropriate manner. Could you confirm: I should remove all files in the filesystem except .eosfsid and .eosfsuuid, plus wipe the local DB /var/eos/md/fmd.0068.LevelDB?
Then issue an “eosfstregister”?
Thank you

JM

Hi JM,

Yes, those are the right steps. For eosfstregister you need to pass the -i option to register a filesystem with an existing .eosfsid/.eosfsuuid; therefore, I would rather go for the manual registration using the eos fs add -m command. It’s quite simple to use, and the help output explains everything. In the end eosfstregister does the same thing.
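
A sketch of the manual registration, assuming the syntax shown by the eos fs add help (the uuid comes from the .eosfsuuid file, everything else is a placeholder):

eos fs add -m <fsid> <uuid-from-.eosfsuuid> <fst-host>:1095 /data68 default.7 rw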

Cheers,
Elvin