Automatic replica deletion and recovery after disk bad blocks on non-RAID controllers

Hello, EOS community people!

On our site, we’re running a 16-stripe qrain layout on EOS.

The disks are connected directly to a SAS controller, with no RAID controller in between.

The disk arrays are Dell ME484 enclosures.

When I looked at the OS logs on the disk servers, I noticed that certain disks were not handling bad blocks cleanly when they occurred.

I don’t know whether it is EOS’s disk scanner or some other module that checks the disks, but at certain times it would hit an unreadable block and generate the error below.


Jun 4 15:20:12 jbod-mgmt-09 kernel: sd 12:0:294:0: [sdgg] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
Jun 4 15:20:12 jbod-mgmt-09 kernel: sd 12:0:294:0: [sdgg] tag#0 Sense Key : Medium Error [current]
Jun 4 15:20:12 jbod-mgmt-09 kernel: sd 12:0:294:0: [sdgg] tag#0 Add. Sense: Unrecovered read error
Jun 4 15:20:12 jbod-mgmt-09 kernel: sd 12:0:294:0: [sdgg] tag#0 CDB: Read(32)
Jun 4 15:20:12 jbod-mgmt-09 kernel: sd 12:0:294:0: [sdgg] tag#0 CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 02
Jun 4 15:20:12 jbod-mgmt-09 kernel: sd 12:0:294:0: [sdgg] tag#0 CDB[10]: 2e 36 40 f8 2e 36 40 f8 00 00 00 00 00 00 00 08
Jun 4 15:20:12 jbod-mgmt-09 kernel: blk_update_request: critical medium error, dev sdgg, sector 9365242104
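For reference, the same problem can be inspected from the drive’s own side with SMART (this assumes smartmontools is installed; the device name is taken from the kernel log above and will differ on other hosts):

```shell
# Inspect the SMART attributes of the failing drive.
# Current_Pending_Sector > 0 means the drive has unreadable sectors that
# will only be remapped to spares once those sectors are rewritten.
smartctl -A /dev/sdgg | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'

# Overall health verdict plus the drive's internal error log:
smartctl -H -l error /dev/sdgg
```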

We asked our Dell vendor about this, and they told us that when a disk develops a bad block, a RAID controller reads the corresponding data and parity blocks, rewrites the data to good blocks, and discards the old ones; if you don’t have a RAID controller, your software is expected to provide that functionality.

Presumably, the policy is to use the spare blocks on the disk itself, which would mean discarding the data on the bad block and rebuilding a new replica from another replica.

However, EOS seems to keep retrying the read without creating a new replica, even when the read fails because of a bad block.

I’m wondering whether there is an option to create a new replica of the block when the read fails repeatedly; perhaps such an option exists and I simply haven’t enabled it.

– Geonmo

Hello, everyone.

I’m asking again because I think my previous question was not clear enough and it didn’t get answered.

I was wondering if CERN EOS has an automatic procedure to delete the corrupted stripe and recover data from other stripes when a read operation fails.

At the disk firmware level, when a bad block occurs during a write, the firmware is supposed to mark the block and remap it to a spare, so writes should not be a problem. But when a bad block appears in previously written data, a read seems to simply fail with an error.

I’m wondering what configuration is needed in EOS for this recovery to happen automatically.

I assume that the fsck thread in EOS does this. Am I correct that when a stripe is determined to be corrupted, a record is made of that stripe and the fsck thread is triggered to handle it?

I’m asking because this is normally handled at a very low level by a hardware RAID controller, so I’m not sure whether the equivalent behavior is disabled by default in EOS.

Regards,
– Geonmo

Hi Geonmo,

In EOS for RAIN file layouts, every stripe file containing data also has a corresponding block-checksum file which contains the checksum of the blocks of data contained in the stripe file. Therefore, any corruption of the data will be detected during read and this piece of data will be recovered on the fly from the other stripes. Every read operation of raw data is check-summed and verified against the block-checksum information which is stored during the initial write operation.

Therefore, it cannot happen that we return broken data to the client when there is corruption on a disk read. In the worst case, there are so many corruptions that we cannot reliably recover the data, and then we return an error to the client.
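As an illustration of the verify-on-read mechanism described above, here is a minimal Python sketch, assuming 4 KiB blocks and adler32 checksums; the function names are hypothetical and this is not EOS’s actual code:

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size for this sketch

def write_with_blockxs(data: bytes):
    """Store data plus a per-block adler32 checksum, as done at write time."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    checksums = [zlib.adler32(b) for b in blocks]
    return blocks, checksums

def read_block(blocks, checksums, idx):
    """Verify the stored checksum before returning data; on a mismatch the
    real system would reconstruct this block from the other stripes."""
    block = blocks[idx]
    if zlib.adler32(block) != checksums[idx]:
        raise IOError(f"block {idx} corrupted; recover from other stripes")
    return block

blocks, xs = write_with_blockxs(b"A" * 10000)
assert read_block(blocks, xs, 0) == b"A" * BLOCK_SIZE

# Simulate a medium error wiping block 1: the checksum mismatch is
# detected on read, so corrupted data is never handed to the client.
blocks[1] = b"\x00" * len(blocks[1])
try:
    read_block(blocks, xs, 1)
except IOError as err:
    print(err)
```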

Concerning the recovery of such broken stripe files, this is done by the FSCK engine, you are right. To have this properly working you need to have both fsck collection and repair enabled. EOS does not care or know anything about the underlying HW and it assumes these are just JBODs, so the RAID controller behavior is irrelevant as far as EOS is concerned.

Cheers,
Elvin


Hello, Elvin.

Thanks for the great answer. Is there any way to have the fsck repair thread activated automatically? I guess I need to change the configuration, but I’m having a hard time finding the right setting.

Hi Geonmo,

You can check the status of fsck by doing eos fsck stat and then you can use the toggle-collect and toggle-repair options.
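Putting the commands from this reply together, a typical session might look like the following (exact subcommand syntax can vary between EOS versions, so check eos fsck stat output on your instance first):

```shell
# Show the current fsck status and whether collection/repair are active
eos fsck stat

# Enable periodic collection of inconsistencies (error accounting)
eos fsck toggle-collect

# Enable automatic repair of the collected inconsistencies
eos fsck toggle-repair

# Verify that both collection and repair are now enabled
eos fsck stat
```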

Cheers,
Elvin