Automatic replica deletion and recovery after disk bad blocks without a RAID controller

Hello, EOS community people!

On our site, we’re running a 16-striped qrain system using EOS.

The disks are attached directly to a SAS controller, with no RAID controller in between.

The disk arrays are ME484 disk enclosures from DELL.

When I looked at the OS logs of the disk servers, I noticed that bad blocks on certain disks were not being fully handled when they occurred.

I don’t know whether it is the scandisk feature of EOS or some other module that checks the disks, but from time to time a read hits an unreadable block and produces the errors below.


Jun 4 15:20:12 jbod-mgmt-09 kernel: sd 12:0:294:0: [sdgg] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
Jun 4 15:20:12 jbod-mgmt-09 kernel: sd 12:0:294:0: [sdgg] tag#0 Sense Key : Medium Error [current]
Jun 4 15:20:12 jbod-mgmt-09 kernel: sd 12:0:294:0: [sdgg] tag#0 Add. Sense: Unrecovered read error
Jun 4 15:20:12 jbod-mgmt-09 kernel: sd 12:0:294:0: [sdgg] tag#0 CDB: Read(32)
Jun 4 15:20:12 jbod-mgmt-09 kernel: sd 12:0:294:0: [sdgg] tag#0 CDB[00]: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 02
Jun 4 15:20:12 jbod-mgmt-09 kernel: sd 12:0:294:0: [sdgg] tag#0 CDB[10]: 2e 36 40 f8 2e 36 40 f8 00 00 00 00 00 00 00 08
Jun 4 15:20:12 jbod-mgmt-09 kernel: blk_update_request: critical medium error, dev sdgg, sector 9365242104
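
For anyone who wants to check which disks and sectors are affected, here is a minimal Python sketch that pulls the device and sector out of log lines like the ones above. It is not part of EOS; the function names (find_medium_errors, read32_lba) are made up for illustration, and it assumes the kernel messages look exactly like the ones pasted here. It also decodes the Read(32) CDB, assuming the standard layout (LBA in bytes 12 to 19, transfer length in bytes 28 to 31), which for this log gives the same block number as the sector in the blk_update_request line.

```python
import re

# Matches the kernel's block-layer message for an unrecoverable read.
MEDIUM_ERROR = re.compile(
    r"critical medium error, dev (?P<dev>\w+), sector (?P<sector>\d+)"
)

# Two of the log lines from this post, used as sample input.
SAMPLE_LOG = [
    "Jun 4 15:20:12 jbod-mgmt-09 kernel: sd 12:0:294:0: [sdgg] tag#0 "
    "Add. Sense: Unrecovered read error",
    "Jun 4 15:20:12 jbod-mgmt-09 kernel: blk_update_request: "
    "critical medium error, dev sdgg, sector 9365242104",
]

# The 32 CDB bytes from the CDB[00] and CDB[10] lines of the same log.
SAMPLE_CDB = bytes.fromhex(
    "7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 02"
    "2e 36 40 f8 2e 36 40 f8 00 00 00 00 00 00 00 08"
)


def find_medium_errors(lines):
    """Return a list of (device, sector) for every medium error line."""
    hits = []
    for line in lines:
        match = MEDIUM_ERROR.search(line)
        if match:
            hits.append((match.group("dev"), int(match.group("sector"))))
    return hits


def read32_lba(cdb):
    """Logical block address of a READ(32) CDB (bytes 12..19, big-endian)."""
    return int.from_bytes(cdb[12:20], "big")


def read32_length(cdb):
    """Transfer length in blocks of a READ(32) CDB (bytes 28..31)."""
    return int.from_bytes(cdb[28:32], "big")


if __name__ == "__main__":
    for dev, sector in find_medium_errors(SAMPLE_LOG):
        print(f"/dev/{dev}: unreadable sector {sector}")
    print("CDB start block:", read32_lba(SAMPLE_CDB))
    print("CDB length (blocks):", read32_length(SAMPLE_CDB))
```

In practice the input would be the output of journalctl -k or /var/log/messages instead of SAMPLE_LOG.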

We asked our DELL vendor about this. They told us that when a disk develops a bad block, a RAID controller rebuilds the affected data from the remaining data blocks and parity blocks, stores it in good blocks, and deletes the old data. Without a RAID controller, they said, the software has to provide that functionality.

Presumably the idea is to let the disk use its own spare blocks, which would require erasing the data on the bad block and creating a new replica from another replica.

However, EOS seems to keep trying to read the block, without creating a new replica, even when the read fails because of a bad block.

Is there an option that makes EOS create a new replica of the block once the read has failed several times? It may simply be that I did not know about such an option and never enabled it.

– Geonmo