During the EOS workshop this year, an issue was mentioned involving XFS kernel crashes. I mentioned that we had been experiencing them quite frequently but that draining had resolved the issue. Shortly after the workshop we decided to stop the drain and flipped the FSIDs back to ‘rw’ mode. In the last couple of days we have begun experiencing the issue anew. I wonder if we could get some help finding the underlying file in question. While it’s clear which FSIDs (multiple) are at issue, it would be great to be able to figure out the files/dirs that are triggering this. The crash dump has a daddr, but I’m not quite sure how to translate that into anything useful. Any help would be appreciated:
Apr 6 07:49:27 alicefst01.lbl.gov kernel: sd 10:0:18:0: [sdu] tag#1705 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
Apr 6 07:49:27 alicefst01.lbl.gov kernel: sd 10:0:18:0: [sdu] tag#1705 Sense Key : Medium Error [current] [descriptor]
Apr 6 07:49:27 alicefst01.lbl.gov kernel: sd 10:0:18:0: [sdu] tag#1705 Add. Sense: Unrecovered read error
Apr 6 07:49:27 alicefst01.lbl.gov kernel: sd 10:0:18:0: [sdu] tag#1705 CDB: Read(16) 88 00 00 00 00 04 80 31 df a8 00 00 00 08 00 00
Apr 6 07:49:27 alicefst01.lbl.gov kernel: blk_update_request: critical medium error, dev sdu, sector 19330621352 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
Apr 6 07:49:27 alicefst01.lbl.gov kernel: XFS (sdu): metadata I/O error in "xfs_da_read_buf+0xd3/0x120 [xfs]" at daddr 0x48031dfa8 len 8 error 61
[…]
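For what it's worth, here is a rough sketch of how one might map that daddr back to an inode and path with xfs_db. This assumes a 4 KiB filesystem block size (worth confirming with `xfs_info` first) — daddr is in 512-byte sectors, so fsblock = daddr / (blocksize / 512), i.e. daddr / 8 here. Note the daddr 0x48031dfa8 is exactly the failing sector 19330621352 from the log. The device path and mountpoint below are taken from the log / hypothetical:

```shell
#!/bin/sh
# Map the daddr from the XFS metadata I/O error back to a file.
# Assumption: 4 KiB block size => 8 sectors (512 B each) per fs block.

DADDR=0x48031dfa8              # from "metadata I/O error ... at daddr"
FSBLOCK=$(( DADDR / 8 ))       # sector -> filesystem block number
echo "daddr $DADDR = sector $(( DADDR )) = fsblock $FSBLOCK"

DEV=/dev/sdu                   # failing device from the log
# Run against an unmounted (or read-only) filesystem. `blockget -n`
# builds the block-ownership map and can take a long time on a big FS;
# `blockuse -n` then reports the owning inode (and name, where known).
if [ -b "$DEV" ]; then
    xfs_db -r "$DEV" <<EOF
convert daddr $DADDR fsblock
blockget -n
fsblock $FSBLOCK
blockuse -n
EOF
fi

# With the filesystem mounted again, resolve the reported inode to a path:
#   find /mountpoint -inum <inode> 2>/dev/null
```

No idea how practical `blockget` is at your filesystem sizes, but it is the documented route from a raw block number to an owner inode.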