FST opserror and more information about scrubbing algorithm

GeonmoRyu · August 23, 2024, 1:07am

Hello, everyone.

On the EOS instance we run, we have noticed that about once or twice a month one FS goes into opserror and is drained automatically.

Most of the time when we get this error, we don’t find anything wrong when we look at the disk.

Is opserror detected by writing “scrub.rewriteXX” files to the FST disk volume and flagging the file system when a write fails?

I would also like to know how many errors it takes before the file system is flagged as opserror, and whether we can change that setting if we think it is too sensitive.

Regards,

– Geonmo

esindril · August 23, 2024, 9:41am

Hi Geonmo,

You should really check the FST logs to see exactly which errors the Scrub thread is printing. We have found this mechanism to be very reliable in our services, and we have never had any false-positive reports. It takes one EIO error to switch the file system into drain mode.

Cheers,
Elvin
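
To make the "one EIO is enough" rule concrete, here is a minimal sketch of a write-and-read-back probe of the kind discussed in this thread. It is not the EOS Scrub thread's code. The function name `probe_write`, the file name `scrub.probe`, and the 4 KiB pattern size are all assumptions made for illustration, and the real scrubber may well use different files, sizes, and checks.

```python
import errno
import os


def probe_write(directory, name="scrub.probe", size=4096):
    """Write a known pattern, force it to disk, read it back.

    Returns (ok, reason). Hypothetical helper, not part of EOS.
    """
    path = os.path.join(directory, name)
    pattern = b"\xaa" * size
    try:
        with open(path, "wb") as f:
            f.write(pattern)
            f.flush()
            os.fsync(f.fileno())  # push the data down to the device
        with open(path, "rb") as f:
            data = f.read()
    except OSError as exc:
        # A single EIO is treated as fatal here, mirroring the behaviour
        # described above; other errno values are reported by name.
        if exc.errno == errno.EIO:
            return False, "EIO"
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    if data != pattern:
        return False, "mismatch"
    return True, "ok"
```

Running something like `probe_write("/path/to/mountpoint")` on a suspect mount point (the path is a placeholder) can help tell whether the disk returns I/O errors intermittently, but the FST log remains the authoritative record of what the Scrub thread actually saw.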
