On a test cluster, I am trying to observe the fsck and integrity mechanisms on the file servers.
I have activated fsck collection on the managers and defined short intervals for both the NS scans and the DISK scans:
eos fsck stat
210506 11:11:30 16395 secgsi_ParseCAlist: nothing to parse
Info: collection thread status → enabled
Info: repair thread status → disabled
210506 11:11:03 1620292263.778981 Start error collection
210506 11:11:03 1620292263.779017 Filesystems to check: 2
210506 11:11:13 1620292273.799774 Finished error collection
210506 11:11:13 1620292273.799803 Next run in 30 minutes
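For reference, collection was switched on beforehand with the toggle command of the reworked fsck (assuming the 4.8-style interface), which flips the "collection thread status" line shown above:
eos fsck toggle-collect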
eos space status default | grep scan
210506 11:12:18 16444 secgsi_ParseCAlist: nothing to parse
scan_disk_interval := 600
scan_ns_interval := 259200
scan_ns_rate := 50
scaninterval := 3600
scanrate := 100
When the fsck collection is supposed to start, I observe the following in the logs:
Hi Elvin
Thank you.
This is a test cluster: only 2 files!
Manager is running EOS version 4.8.40
Server is running EOS version 4.7.7 (can be easily updated)
Yes, /etc/xrd.cf.fst has the right fstofs.qdb directives and I can reach QDB from the server (tested with redis-cli).
JM
I think I know what the problem is. The settings that you are printing at the space level only apply to newly added file systems. If you check one individual file system, you will most likely see that the scan interval values were not updated:
eos fs status <fsid> | grep scan
To update the scan values of existing file systems in a space, you need to use this command:
eos space config default fs.scan_disk_interval=600
and so on for all the keys you want to modify.
You can reuse the eos fs status command to confirm the new values were applied.
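Spelled out for all the keys from your space output (same values as above, adjust as needed):
eos space config default fs.scan_disk_interval=600
eos space config default fs.scan_ns_interval=259200
eos space config default fs.scan_ns_rate=50
eos space config default fs.scaninterval=3600
eos space config default fs.scanrate=100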
Elvin,
That’s it. Defining fs.scan_disk_interval and fs.scaninterval triggered execution on the FST.
Note that no scan variables existed before in the output of eos fs status.
Now I am going to observe how the corruption of one file gets reported.
JM
210506 14:42:12 372 FstOfs_FSctl: daemon.23576:32@cceostest03 Unable to execute FSctl command /; invalid argument
… at the time of the fsck collection on the manager…
Sorry, I realize this gets printed on the FSTs. It could be that you are running an older version on the FSTs that does not implement the new way of querying for fsck replies. This is not a problem; everything should still work.
Hello Elvin,
The "Unable to execute FSctl" messages are no longer issued after updating the FST to EOS 4.8.40, the same version as the manager.
My original goal is to perform an audit on our production cluster, but before that I am trying to get familiar with all these mechanisms on a test cluster with only a few files that I can intentionally corrupt.
I have read your very good presentation again: https://indico.cern.ch/event/862873/contributions/3724435/attachments/1979998/3297217/2020_Eos_workshop_fsck.pdf and I have a few questions about it, if you do not mind. Do you want me to ask them here? In a new topic?
Thank you
JM
Thank you Elvin,
Based on the output of eos-leveldb-inspect --fid, I see the set of variables that are kept in the local LevelDB on the FST for each filesystem (the invocation I use is sketched after these questions). The set does not exactly correspond to what is listed in the presentation; I suppose there have been some changes. There are several things:
1. I do not see a checktime (timestamp of the last scan, updated by the scanner), so I wonder how ScanDir decides whether to recheck the checksum of a file or to skip it (what is the criterion for rechecking?).
2. As for the *time variables, I understand that Unix epoch is used for the files on disk, but I do not see how to interpret the *time_ns values, as they do not resemble epoch values.
3. And if ScanDir only considers recently added files, what is the process that rescans everything on the FST? I have not seen it in action…
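For context, the inspection command I am using looks roughly like this (I am quoting the --dbpath flag and the fmd.<fsid>.LevelDB path pattern from memory; the fsid and fid are illustrative):
eos-leveldb-inspect --dbpath /var/eos/md/fmd.0001.LevelDB --fid 1234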
The checktime field is in the LevelDB database but it is not printed by the eos-leveldb-inspect tool; I will fix that. The value stored in the LevelDB is the same as the extended attribute attached to the raw file. For example:
The criterion for rechecking a file uses the scaninterval parameter, which is usually set to 7 days. So the scanner will re-scan a file if it has not been scanned in the last 7 days.
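Coming back to the attribute from the previous point: you can dump what is attached to the raw replica on the FST with getfattr (the path below is only an illustration):
getfattr -d /data/fst1/000000a1/0001abcd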
The time values are similar to what is stored in the usual struct timespec (see https://en.cppreference.com/w/c/chrono/timespec).
The ctime value is the number of seconds since the epoch, and ctime_ns carries the nanosecond part. So in practical terms the first value alone is enough to get an idea of when the file was touched.
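As a sanity check, the 1620292263.778981 stamps in your fsck log above follow the same seconds(.fraction) idea, and the seconds part converts directly:
date -d @1620292263    # Thu May  6 11:11:03 CEST 2021, matching the log line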
By default the scanner runs every 4 hours (this is configurable via the scan_* parameters) and will re-scan any file that has not been scanned within the scaninterval window I mentioned in point 1.
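For reference, scaninterval is expressed in seconds, so the usual 7-day setting from point 1 corresponds to:
eos space config default fs.scaninterval=604800    # 7 days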
However, it looks like the minor modification that I made to the test file (I changed 1 byte) goes unnoticed. The checksum has changed: it was 58a806f2 before, it is now 59d40701:
Sorry for the late reply, I was sick last week. Let's take it from the bottom up. The extended attribute user.XrdCks.adler32 is added by the xrdadler32 tool when it computes the checksum for the first time, so this is just an artifact that you can ignore.
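If you want to recompute the checksum by hand, you can point that tool directly at the raw file (path illustrative); it prints the current adler32 next to the file name:
xrdadler32 /data/fst1/000000a1/0001abcd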
It does indeed look strange that the file is scanned but the information in the local database is not updated. Can you also post the output of eos fileinfo for this file? Also, it's a bit strange that you don't have all the extended attributes that I would expect. For example, for a replica-2 file I have the following:
The managers and the server are both running EOS 4.8.40.
I believe my way of testing was not good. By editing the file to change one character with vi, I actually created a new file and the extended attributes were lost. Only some of them are recreated at the next scan, but it looks like this disturbs the process.
So I made a new test, ensuring that the inode of the file did not change (overwriting it in place with cp), and followed the whole process. It worked perfectly, as expected.
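For anyone repeating the test, the in-place corruption looks roughly like this (paths are illustrative; the point is that cp rewrites the data blocks of the existing inode, while vi writes a new file and so drops the extended attributes):
f=/data/fst1/000000a1/0001abcd      # raw replica on the FST (illustrative path)
cp "$f" /tmp/replica.copy           # take a scratch copy
printf 'X' | dd of=/tmp/replica.copy bs=1 seek=100 conv=notrunc   # flip one byte
cp /tmp/replica.copy "$f"           # copy back in place: same inode, xattrs preserved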
Now that I understand better those processes, I am going to look at the status of our production EOS.