Monitoring fsck activity on FSTs

Hello,
Watching fsck activity on the FSTs, I see the localDB checks against files (func=RunDiskScan) run every 12 hours or so, which is surprising because scan_disk_interval is not defined and should therefore be the default of 4h.
But I almost never see the string "func=RunNsScan" (1), which I would expect to appear every scan_ns_interval (3 days by default). Is there something wrong? How can I check this?
Thank you
JM
(1) For example, on one FST there is no occurrence in July:
zgrep -in "Func=Run" /var/log/eos/fst/xrdlog.fst-202107*.gz | grep -v RunDiskScan
=> nothing!

Hi JM,

Could you paste the config output of one such file system so that we see exactly what scan interval values are set?
eos fs status <fsid> | grep scan

For the RunNsScan function, it is normal that you don’t see any logs as there is nothing printed in that function. Try checking for AccountMissing as this prints a line that contains the following: scanning ... attached namespace entries - this is called from the RunNsScan and it’s a good indicator if the namespace scanning is actually working.

For the RunDiskScan function, do you see the following message (per file system) every 12 hours?
[ScanDir] Directory ...

Thanks,
Elvin

Hi Elvin,

[root@naneosmgr01(EOSMASTER) ~]# eos fs status 75 | grep scan
scaninterval := 604800

Looking for AccountMissing:

on server nanxrd13 :

[root@nanxrd13 ~]# grep 'AccountMissing.*scanning' /var/log/eos/fst/xrdlog.fst
210730 03:32:34 time=1627608754.236564 func=AccountMissing level=INFO logid=914475e4-a1d2-11eb-bfaf-f8f21e3b4c60 unit=fst@nanxrd13.in2p3.fr:1095 tid=00007f4d303f9700 source=ScanDir:243 tident= sec= uid=0 gid=0 name= geo="" msg="scanning 465457 attached namespace entries"
210730 12:30:48 time=1627641048.176522 func=AccountMissing level=INFO logid=91442bd4-a1d2-11eb-bfaf-f8f21e3b4c60 unit=fst@nanxrd13.in2p3.fr:1095 tid=00007f4d31bfc700 source=ScanDir:243 tident= sec= uid=0 gid=0 name= geo="" msg="scanning 453241 attached namespace entries"

So: 2 occurrences, about 9 hours apart.

As for the [ScanDir] Directory:

[root@nanxrd13 ~]# zgrep '\[ScanDir\] Directory.*data1' /var/log/eos/fst/xrdlog.fst-20210730.gz
210729 12:38:26 time=1627555106.654795 func=RunDiskScan level=NOTE logid=9143d9e0-a1d2-11eb-bfaf-f8f21e3b4c60 unit=fst@nanxrd13.in2p3.fr:1095 tid=00007f4d41fff700 source=ScanDir:504 tident= sec= uid=0 gid=0 name= geo="" [ScanDir] Directory: /data1 files=466021 scanduration=24895 [s] scansize=1627288536708 [Bytes] [ 1.62729e+06 MB ] scannedfiles=43612 corruptedfiles=0 hwcorrupted=0 skippedfiles=422390
210729 22:55:20 time=1627592120.726841 func=RunDiskScan level=NOTE logid=9143d9e0-a1d2-11eb-bfaf-f8f21e3b4c60 unit=fst@nanxrd13.in2p3.fr:1095 tid=00007f4d41fff700 source=ScanDir:504 tident= sec= uid=0 gid=0 name= geo="" [ScanDir] Directory: /data1 files=465624 scanduration=22609 [s] scansize=1515222755529 [Bytes] [ 1.51522e+06 MB ] scannedfiles=41686 corruptedfiles=0 hwcorrupted=0 skippedfiles=423919

So: roughly 10h between the two scans of the same filesystem…
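
The gap can also be computed from the `time=` and `scanduration=` fields of these lines rather than read off the timestamps. The following is a minimal sketch, assuming the input has already been filtered down to the `[ScanDir] Directory` lines of a single filesystem; `scan_gaps` is a hypothetical helper name, not an EOS tool.

```shell
# Hypothetical helper: reads "[ScanDir] Directory" log lines for ONE
# filesystem on stdin and prints, for each consecutive pair of lines:
#   end_to_end = seconds between the two completion timestamps
#   idle       = seconds between the end of the previous scan and the
#                start of the next one (completion time minus scanduration)
scan_gaps() {
  awk '
    {
      t = ""; d = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^time=/)         { t = substr($i, 6) }    # epoch seconds
        if ($i ~ /^scanduration=/) { d = substr($i, 14) }   # seconds
      }
      if (t == "" || d == "") next          # skip lines without both fields
      if (seen) printf "end_to_end=%d idle=%d\n", t - prev, t - d - prev
      prev = t; seen = 1
    }'
}

# Example usage (path taken from the command above):
# zgrep '\[ScanDir\] Directory.*data1' /var/log/eos/fst/xrdlog.fst-20210730.gz | scan_gaps
```

Fed the two /data1 lines above, this prints `end_to_end=37014 idle=14405`: about 10.3 h between completions, of which roughly 4 h is idle time and the rest is the second scan itself (scanduration=22609 s). That would be consistent with the 4h default being counted from the end of one scan to the start of the next, but it is only an inference from two log lines, not a confirmed description of how the scan thread is scheduled.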

Thank you

JM

Also: how does the manager's fsck (which runs every 30 min) find out what the situation is for a particular filesystem? Does it query the FST's LocalDB (via the FST daemon)? If yes, is this done every 30 min for all filesystems? (I currently have errors reported for a filesystem and I cannot see them in the corresponding LocalDB.)

Thank you

JM

Hi JM,

Ok, so we confirm that the RunNsScan runs as expected - the two log lines don’t necessarily come from the same file system.
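
One way to see this in the log is to group the AccountMissing lines by their `logid=` field: the two lines pasted above carry different logid and tid values, while the two /data1 RunDiskScan lines share the same logid. The sketch below assumes that each scanner thread keeps its own logid, which the logs suggest but which is not confirmed here; `count_by_logid` is a hypothetical helper name.

```shell
# Hypothetical helper: reads log lines on stdin and prints
# "<count> <logid>" for every distinct logid= value, sorted by logid.
# ASSUMPTION: one logid corresponds to one scanner thread / file system.
count_by_logid() {
  awk '
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^logid=/) n[substr($i, 7)]++   # strip the "logid=" prefix
    }
    END { for (k in n) print n[k], k }' | sort -k2,2
}

# Example usage (log path taken from the commands above):
# grep 'AccountMissing.*scanning' /var/log/eos/fst/xrdlog.fst | count_by_logid
```

If the assumption holds, the interval of the namespace scan for a given file system should be measured between lines with the same logid, not between any two consecutive AccountMissing lines.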

What version of eos are you running on the FSTs/MGM? This is related to understanding the frequency of the RunDiskScan thread.

When it comes to the info that you have about fsck errors at the MGM the process is a bit more complicated: the MGM indeed queries the FSTs every 30 min for info about inconsistencies. But the info on the FST is cached to avoid long lookups in the local leveldb and this info is refreshed from time to time, especially when there is a disk scanning happening. So you can have a mismatch between what the MGM reports and what is actually the situation on the FST for up to scan_disk_interval seconds (4h by default). This was done to avoid overloads on the FSTs due to long scans of the localdb.

Cheers,
Elvin

Elvin,

We are running EOS v4.8.40 both on the MGM and FST.

Thank you for explaining the cache that sits between the MGM fsck report and the current situation in the LocalDB. I was either not aware of this or had forgotten about it.

JM