Watching fsck activity on the FSTs, I see the LocalDB checks against files (func=RunDiskScan) every 12 hours or so, which is surprising because scan_disk_interval is not defined and should therefore take its default value of 4h.
But I almost never see (1) the string "func=RunNsScan", which I would expect to appear every scan_ns_interval (3 days by default). Is something wrong? How can I check this?
(1) For example, on one FST there is not a single occurrence in July:
zgrep -in 'Func=Run' /var/log/eos/fst/xrdlog.fst-202107*.gz | grep -v RunDiskScan
Could you paste the config output of one such file system so that we see exactly what scan interval values are set?
eos fs status <fsid> | grep scan
As for the RunNsScan function, it is normal that you don't see any logs, as nothing is printed in that function. Try checking for AccountMissing instead: it prints a line containing "scanning ... attached namespace entries". AccountMissing is called from RunNsScan, so this line is a good indicator of whether the namespace scanning is actually working.
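To look further back than the live log, a sketch like the following pulls the AccountMissing "scanning" lines out of both the current log and the rotated, gzipped copies. It assumes the log locations used elsewhere in this thread (/var/log/eos/fst/xrdlog.fst and xrdlog.fst-YYYYMMDD.gz); the function name ns_scan_lines is hypothetical and not part of EOS.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of EOS): print the namespace-scan lines emitted
# by AccountMissing from the rotated (gzipped) FST logs and the live log.
# Assumes the /var/log/eos/fst layout shown in this thread.
ns_scan_lines() {
  local dir=${1:-/var/log/eos/fst}
  local f
  # Rotated logs sort chronologically by their YYYYMMDD suffix; the live log goes last.
  for f in "$dir"/xrdlog.fst-*.gz "$dir"/xrdlog.fst; do
    [ -e "$f" ] || continue   # skip a glob pattern that matched nothing
    # zgrep also reads uncompressed files, so one command covers both cases.
    zgrep -h 'func=AccountMissing.*scanning [0-9]* attached namespace entries' "$f"
  done
}
```

Usage on an FST: ns_scan_lines | wc -l gives the number of namespace-scan passes recorded in the retained logs.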
As for the RunDiskScan function, do you see the following message (per file system) every 12 hours?
[ScanDir] Directory ...
[root@naneosmgr01(EOSMASTER) ~]# eos fs status 75 | grep scan
scaninterval := 604800
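The scan values are printed in seconds. A small conversion helper makes them easier to compare with the defaults quoted in this thread; human_secs is a hypothetical name, not an EOS command.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of EOS): turn an interval in seconds, as printed
# by `eos fs status`, into days/hours/minutes/seconds.
human_secs() {
  local s=$1
  printf '%dd %dh %dm %ds\n' $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60)) $((s % 60))
}
```

For example, human_secs 604800 prints 7d 0h 0m 0s, i.e. one week. For comparison, the 4h default of scan_disk_interval is 14400 s and the 3-day default of scan_ns_interval is 259200 s.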
On server nanxrd13:
[root@nanxrd13 ~]# grep 'AccountMissing.*scanning' /var/log/eos/fst/xrdlog.fst
210730 03:32:34 time=1627608754.236564 func=AccountMissing level=INFO logid=914475e4-a1d2-11eb-bfaf-f8f21e3b4c60 firstname.lastname@example.org:1095 tid=00007f4d303f9700 source=ScanDir:243 tident= sec= uid=0 gid=0 name= geo="" msg="scanning 465457 attached namespace entries"
210730 12:30:48 time=1627641048.176522 func=AccountMissing level=INFO logid=91442bd4-a1d2-11eb-bfaf-f8f21e3b4c60 email@example.com:1095 tid=00007f4d31bfc700 source=ScanDir:243 tident= sec= uid=0 gid=0 name= geo="" msg="scanning 453241 attached namespace entries"
So: 2 occurrences, about 9 hours apart.
As for the RunDiskScan check:
[root@nanxrd13 ~]# zgrep '\[ScanDir\] Directory.*data1' /var/log/eos/fst/xrdlog.fst-20210730.gz
210729 12:38:26 time=1627555106.654795 func=RunDiskScan level=NOTE logid=9143d9e0-a1d2-11eb-bfaf-f8f21e3b4c60 firstname.lastname@example.org:1095 tid=00007f4d41fff700 source=ScanDir:504 tident= sec= uid=0 gid=0 name= geo="" [ScanDir] Directory: /data1 files=466021 scanduration=24895 [s] scansize=1627288536708 [Bytes] [ 1.62729e+06 MB ] scannedfiles=43612 corruptedfiles=0 hwcorrupted=0 skippedfiles=422390
210729 22:55:20 time=1627592120.726841 func=RunDiskScan level=NOTE logid=9143d9e0-a1d2-11eb-bfaf-f8f21e3b4c60 email@example.com:1095 tid=00007f4d41fff700 source=ScanDir:504 tident= sec= uid=0 gid=0 name= geo="" [ScanDir] Directory: /data1 files=465624 scanduration=22609 [s] scansize=1515222755529 [Bytes] [ 1.51522e+06 MB ] scannedfiles=41686 corruptedfiles=0 hwcorrupted=0 skippedfiles=423919
So: roughly 10 hours between the two runs on the same file system.
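Rather than comparing timestamps by hand, the gaps can be computed from the time= field of each log line. The sketch below groups lines by logid, assuming (not confirmed by EOS documentation) that the logid stays constant for a given ScanDir instance, i.e. per file system. That assumption is consistent with the excerpts above: both /data1 RunDiskScan lines carry the same logid, while the two AccountMissing lines carry different ones. The function name scan_gaps is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of EOS): read FST log lines on stdin and, for
# each logid, print the gap in hours between consecutive lines.
# Assumption: one logid corresponds to one ScanDir, i.e. one file system.
scan_gaps() {
  awk '
  {
    t = ""; id = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^time=/)  t  = substr($i, 6)   # epoch seconds after "time="
      if ($i ~ /^logid=/) id = substr($i, 7)   # value after "logid="
    }
    if (t == "" || id == "") next              # not an EOS log line
    if (id in last)
      printf "%s %.2f h\n", id, (t - last[id]) / 3600
    last[id] = t
  }'
}
```

Input lines must be in chronological order, e.g. zgrep -h '\[ScanDir\] Directory' /var/log/eos/fst/xrdlog.fst-*.gz | scan_gaps. Applied to the two /data1 lines above, it prints a single gap of 10.28 h for logid 9143d9e0-a1d2-11eb-bfaf-f8f21e3b4c60.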
Also: how does the manager's fsck (which runs every 30 min) find out the situation of a particular file system? Does it query the FSTs' LocalDB (via the FST daemon)? If so, is this done every 30 min for all file systems? (I currently have errors reported for a file system that I cannot see in the corresponding LocalDB.)
OK, so this confirms that RunNsScan runs as expected; the two log lines don't necessarily come from the same file system.
What version of EOS are you running on the FSTs/MGM? This will help us understand the frequency of the RunDiskScan thread.
As for the fsck errors reported by the MGM, the process is a bit more complicated. The MGM does query the FSTs every 30 min for information about inconsistencies, but on the FST side this information is cached to avoid long lookups in the local LevelDB. The cache is refreshed from time to time, in particular when a disk scan runs. As a result, what the MGM reports can differ from the actual situation on the FST for up to scan_disk_interval seconds (4h by default). This was done to avoid overloading the FSTs with long scans of the LocalDB.
We are running EOS v4.8.40 both on the MGM and FST.
Thank you for explaining the cache that sits between the MGM fsck report and the current state of the LocalDB; I was either not aware of it or had forgotten about it.