Data consistency problem

As you are well aware, KISTI operates a Tapeless CDS system.

It is configured with a QRAIN layout using 16 stripes.

Recently, issues were identified in some data during the copying process.

A total of approximately 1,700 errors were detected. Upon examining some files, the following problem was found:

The data transfer completed with a successful return code (0), but only 9 stripes exist, making the data unreadable.

First, is there any way to run a consistency check on the entire storage?

And is there a way to recover data where only 9 stripes exist?

eos file info /eos/gsdc/grid/02/12354/28b49c5b-dc7c-11ec-85c6-3cecef04a87a
  File: '/eos/gsdc/grid/02/12354/28b49c5b-dc7c-11ec-85c6-3cecef04a87a'  Flags: 0600
  Size: 134170616
Status: locations::incomplete
Modify: Mon Jun 13 08:34:30 2022 Timestamp: 1655109270.656796000
Change: Mon Jun 13 08:33:28 2022 Timestamp: 1655109208.218447406
Access: Thu Jan  1 00:00:00 1970 Timestamp: 0.000000000
 Birth: Mon Jun 13 08:33:28 2022 Timestamp: 1655109208.218447406
  CUid: 10367 CGid: 1395 Fxid: 013b0280 Fid: 20644480 Pid: 21926 Pxid: 000055a6
XStype: adler    XS: 95 b0 c2 be    ETAGs: "5541710402682880:95b0c2be"
Layout: qrain Stripes: 16 Blocksize: 1M LayoutId: 40640f52 Redundancy: d0::t0
  #Rep: 9
┌───┬──────┬────────────────────────┬────────────────┬─────────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                    host│      schedgroup│                 path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴────────────────────────┴────────────────┴─────────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0       53   jbod-mgmt-01.sdfarm.kr       default.52 /jbod/box_01_disk_052     booted             rw      nodrain   online         kisti::gsdc::g01
 1     1313   jbod-mgmt-08.sdfarm.kr       default.52 /jbod/box_16_disk_052     booted             rw      nodrain   online         kisti::gsdc::g03
 2     1145   jbod-mgmt-07.sdfarm.kr       default.52 /jbod/box_14_disk_052     booted             rw      nodrain   online         kisti::gsdc::g03
 3      137   jbod-mgmt-01.sdfarm.kr       default.52 /jbod/box_02_disk_052     booted             rw      nodrain   online         kisti::gsdc::g01
 4     1061   jbod-mgmt-07.sdfarm.kr       default.52 /jbod/box_13_disk_052     booted             rw      nodrain   online         kisti::gsdc::g03
 5      893   jbod-mgmt-06.sdfarm.kr       default.52 /jbod/box_11_disk_052     booted             rw      nodrain   online         kisti::gsdc::g02
 6    23052   jbod-mgmt-12.sdfarm.kr       default.52 /jbod/box_23_disk_052     booted             rw      nodrain   online         kisti::gsdc::e01
 7    22052   jbod-mgmt-11.sdfarm.kr       default.52 /jbod/box_22_disk_052     booted             rw      nodrain   online         kisti::gsdc::e01
 8    21052   jbod-mgmt-11.sdfarm.kr       default.52 /jbod/box_21_disk_052     booted             rw      nodrain   online         kisti::gsdc::e01

Hi Bunmgyun,

Yes, these types of problems can be detected by the file system consistency checks in EOS (fsck), which work in coordination with the scanning of errors. The scanning and fixing of RAIN file layouts has been modified recently to cover more scenarios, so I would recommend that you run one of the latest versions to benefit from all the improvements. This means EOS 5.3.25 or 5.3.26, which are the versions that we are currently upgrading to at CERN.

You can find more details about fsck configuration at this location:

It’s important to make sure that the scan_rain_interval parameter is present in your configuration so that the scanner thread on the FST actually verifies the RAIN files. Once RAIN errors are detected, they are summarized as stripe_err inconsistencies in the fsck status report.

If issues are caught early on (while there are still enough correct stripes to recover the file), you can use the fsck repair functionality to repair such files. You can also enable fsck repair for only a particular category of errors, which makes it easy to monitor the progress.
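
To make "enough correct stripes" a bit more concrete, here is a minimal Python sketch that reads the text printed by `eos file info` and compares the number of registered stripes (`#Rep`) with the number the layout expects. The parity counts in `PARITY_STRIPES` are an assumption on my side and are not stated in this thread (qrain is taken to have 4 parity stripes), and the helper name `stripe_summary` is only illustrative. Please also keep in mind that `#Rep` counts registered locations only, so it says nothing about whether the stripes that are present are themselves intact.

```python
import re

# ASSUMPTION (not from this thread): parity stripes per RAIN layout type.
PARITY_STRIPES = {"raid6": 2, "archive": 3, "qrain": 4}


def stripe_summary(file_info: str) -> dict:
    """Compare expected vs. registered stripes in `eos file info` text output."""
    layout = re.search(r"Layout:\s*(\S+)\s+Stripes:\s*(\d+)", file_info)
    present = re.search(r"#Rep:\s*(\d+)", file_info)
    if layout is None or present is None:
        raise ValueError("could not find 'Layout: ... Stripes:' or '#Rep:'")

    name = layout.group(1)
    expected = int(layout.group(2))
    found = int(present.group(1))
    parity = PARITY_STRIPES.get(name)

    # A RAIN file can only be rebuilt if at least the data stripes
    # (expected - parity) are still available. Unknown layout -> None.
    recoverable = None if parity is None else found >= expected - parity

    return {
        "layout": name,
        "expected": expected,
        "present": found,
        "missing": expected - found,
        "recoverable": recoverable,
    }


# The relevant lines from the file info output posted above.
SAMPLE = """\
Layout: qrain Stripes: 16 Blocksize: 1M LayoutId: 40640f52 Redundancy: d0::t0
  #Rep: 9
"""

summary = stripe_summary(SAMPLE)
print(summary)
```

Under that assumption, a 16-stripe qrain file needs at least 12 stripes to be rebuilt, so a file with only 9 registered stripes would be reported as not recoverable by this check.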

Let me know if you need any additional information on this topic.

Cheers,

Elvin