FST: scaninterval not honored any more?

Hi,

After the general upgrade of our instance to 4.5.x yesterday, we observe a very high read rate on all disks. It seems to be due to the disk scan, which scans every file on the disks. As they are quite full, no scan has completed since yesterday.

Indeed, on the much smaller test instance as well, we can see that each run of ScanDir scans every file (skippedfiles=0). So it doesn’t seem to be a case of a first run checking everything and subsequent runs honoring the scaninterval.

Is this a known issue? Is there a way to avoid this huge scan?

Hi Franck,

Can you make sure you don’t have the force scan option enabled on your disks?
It is enabled by the presence of a file called .eosscan at the mountpoint of each filesystem on the FSTs. If you have it, remove it and the forced scan will be disabled.
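
For example, something like this on each FST should show whether the flag file exists and remove it (adjust /data* to your actual mountpoints):

  # list any force-scan flag files on the mountpoints
  ls -l /data*/.eosscan 2>/dev/null

  # removing them disables the forced scan
  rm -f /data*/.eosscan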

What was the previous version of eos running on the FSTs? Now they are all running 4.5.15, right?

Cheers,
Elvin

In case this is not the forced scan … which setting do you have for scaninterval on your filesystems?
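
You can check it per filesystem with something like this (fsid 23 is just a placeholder):

  eos fs status 23 | grep -i scaninterval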

Thank you for your answers.

No, there is no /data*/.eosscan file on any FST.

Previous version was 4.4.23 for our FSTs.

The scaninterval setting is 7776000 (90 days) for some disks, 1209600 (14 days) for others.

I think there is a bug in computing the rescan timestamp for the files. I will push a fix that you can test on one of your FSTs. In the meantime, if the high read rate is a problem, you can disable scanning by setting the scaninterval to 0.
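
For example (fsid 23 is a placeholder):

  # disable the scanner on one filesystem
  eos fs config 23 scaninterval=0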

I confirm this was broken since 4.4.45. I will now tag 4.5.16.
Thanks for reporting this!


Thank you. We saw the new tag and are now upgrading the FSTs.

I’m sorry, but from first observations it seems the issue is not solved in this new version.

We still observe “skippedfiles=0” on the test instance, which has already finished some scans of its very small disks, and on production the throughput has also gone up on some disks of the patched FSTs.

Maybe the fix isn’t complete?

Did you see two consecutive scans on the same disk with 0 skipped files?

Yes, but the first one was still with the previous version; it hasn’t run long enough yet to have two consecutive scans with the new version. We can wait.

The eos.timestamp attributes are correctly set, but they are in microseconds. Maybe the scaninterval should also be interpreted in microseconds?

It’s normal for the first scan with 4.5.16 to scan all the files, since the timestamps were “wrongly” set by the old version, but you shouldn’t see another full scan at the next round of the scanner (in about 4 hours).

The scaninterval should remain set in seconds and the timestamps on the files should be in microseconds; 4.5.16 should account properly for both. This problem does not exist on the 4.6.y branch, since the code there was completely rewritten.
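
As a sketch of the intended comparison (the xattr name user.eos.timestamp and the file path below are assumptions on my side; the point is only that the interval has to be converted to microseconds before comparing):

  f=/data03/00000001/000a1b2c            # placeholder replica path
  ts_us=$(getfattr --only-values -n user.eos.timestamp "$f")
  now_us=$(( $(date +%s) * 1000000 ))
  interval_us=$(( 1209600 * 1000000 ))   # scaninterval stays in seconds, converted here

  if (( now_us - ts_us < interval_us )); then
    echo skip      # scanned recently enough
  else
    echo rescan
  fi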

Cheers,
Elvin

OK, sorry, we will wait for the second scan. On the production instance this will take a long time, so we deactivated the scan as you suggested, to avoid having all disks scanning at the same time.

Unfortunately, consecutive runs with skippedfiles=0 on the same disk still appear.

191129 17:06:17 time=1575043577.524311 func=Run                      level=NOTE  logid=ab6104fe-12ab-11ea-9f83-9a9ab1a9ef93 unit=fst@s-jrciprcids93v.cidsn.jrc.it:1095 tid=00007f39257fd700 source=ScanDir:211                    tident=<service> sec=      uid=0 gid=0 name= geo="" [ScanDir] Directory: /data03 files=8806 scanduration=100 [s] scansize=4342352506 [Bytes] [ 4342.35 MB ] scannedfiles=8773 corruptedfiles=0 hwcorrupted=0 skippedfiles=0
191129 21:07:58 time=1575058078.333186 func=Run                      level=NOTE  logid=ab6104fe-12ab-11ea-9f83-9a9ab1a9ef93 unit=fst@s-jrciprcids93v.cidsn.jrc.it:1095 tid=00007f39257fd700 source=ScanDir:211                    tident=<service> sec=      uid=0 gid=0 name= geo="" [ScanDir] Directory: /data03 files=8806 scanduration=100 [s] scansize=4342352506 [Bytes] [ 4342.35 MB ] scannedfiles=8773 corruptedfiles=0 hwcorrupted=0 skippedfiles=0
191130 01:09:38 time=1575072578.754925 func=Run                      level=NOTE  logid=ab6104fe-12ab-11ea-9f83-9a9ab1a9ef93 unit=fst@s-jrciprcids93v.cidsn.jrc.it:1095 tid=00007f39257fd700 source=ScanDir:211                    tident=<service> sec=      uid=0 gid=0 name= geo="" [ScanDir] Directory: /data03 files=8806 scanduration=100 [s] scansize=4342352506 [Bytes] [ 4342.35 MB ] scannedfiles=8773 corruptedfiles=0 hwcorrupted=0 skippedfiles=0
191130 05:11:19 time=1575087079.908896 func=Run                      level=NOTE  logid=ab6104fe-12ab-11ea-9f83-9a9ab1a9ef93 unit=fst@s-jrciprcids93v.cidsn.jrc.it:1095 tid=00007f39257fd700 source=ScanDir:211                    tident=<service> sec=      uid=0 gid=0 name= geo="" [ScanDir] Directory: /data03 files=8806 scanduration=101 [s] scansize=4342352506 [Bytes] [ 4342.35 MB ] scannedfiles=8773 corruptedfiles=0 hwcorrupted=0 skippedfiles=0
191130 09:13:00 time=1575101580.753145 func=Run                      level=NOTE  logid=ab6104fe-12ab-11ea-9f83-9a9ab1a9ef93 unit=fst@s-jrciprcids93v.cidsn.jrc.it:1095 tid=00007f39257fd700 source=ScanDir:211                    tident=<service> sec=      uid=0 gid=0 name= geo="" [ScanDir] Directory: /data03 files=8806 scanduration=100 [s] scansize=4342352506 [Bytes] [ 4342.35 MB ] scannedfiles=8773 corruptedfiles=0 hwcorrupted=0 skippedfiles=0
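
For reference, we extract these counters from the FST log roughly like this (the log path is the default on our nodes; adjust if yours differs):

  grep '\[ScanDir\]' /var/log/eos/fst/xrdlog.fst | grep -o 'skippedfiles=[0-9]*' | tail -n 5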

In addition, it seems that setting scaninterval=0 doesn’t disable the scan unless the FST is started after the value has been set.

I saw that the fix indeed changes how the last-check timestamp is written, which was probably wrong before. As of now, the timestamps on the files seem to correctly correspond to the time of the last check.

From what I observe of the FSTs’ behavior, and after a quick look at the code, I suspect there is still a problem in the decision whether a file needs to be re-scanned, in the DoRescan function, which seems to always return true, although I could not understand at which level this happens.

Do you also observe these full rescans of all disks on your instances, or is it specific to us?

Hi Franck,

There was indeed an outstanding issue with the scanning. This is now fixed in 4.5.17. I tested it and it works as expected. The full re-scan behavior is not specific to your instance; we also see it, now that we are looking for it …
Please update to this version; it should be available in the repos in about 30 minutes.

Cheers,
Elvin

Thank you Elvin, we upgraded our FSTs to this version and now observe non-zero skippedfiles values.

After several days, we can confirm that version 4.5.17 does solve the issue.

One downside we discovered during this episode: setting scaninterval to 0 while the FST is running does not disable the scan; instead, every file is scanned on every run. The way to disable the scan is to set the scaninterval to 0 before (re)starting the corresponding node.
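
In practice, the sequence that works for us is something like this (placeholder fsid; the service name may differ depending on your setup):

  eos fs config 23 scaninterval=0      # set the value first
  systemctl restart eos@fst            # then restart the FST so the new value is honored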

In addition, the scaninterval value needs to be set on every single disk with eos fs config; we can’t set it at node level with eos node config.
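
So to change it for a whole node, we loop over its filesystems, roughly like this (the hostname is a placeholder; we parse the monitoring output of eos fs ls):

  for id in $(eos fs ls -m | grep 'host=fst01.example.com' | sed -n 's/.*\bid=\([0-9]\+\).*/\1/p'); do
    eos fs config "$id" scaninterval=1209600
  done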