EOS corrupted files drain failed

Hi,
We have an EOS cluster (5.1.26-1) with a QuarkDB namespace, running on Scientific Linux release 7.9 (Nitrogen). When we tried to drain one of the nodes (eos-f074), the drain ends with the status “failed”:

eos-m01:~ # eos fs ls -d
┌────────────────────────┬────┬──────┬────────────────────────────────┬────────────┬────────────┬────────────┬────────────┬───────────┬────────────┐
│host                    │port│    id│                            path│       drain│    progress│       files│  bytes-left│   timeleft│      failed│
└────────────────────────┴────┴──────┴────────────────────────────────┴────────────┴────────────┴────────────┴────────────┴───────────┴────────────┘
 eos-f074.jinr.ru         1095    297                           /e/p01       failed          100            0    109.85 GB           0          417
 eos-f074.jinr.ru         1095    298                           /e/p02       failed          100            0    116.76 GB           0          446
 eos-f074.jinr.ru         1095    299                           /e/p03       failed          100            0    111.11 GB           0          390
 eos-f074.jinr.ru         1095    300                           /e/p04       failed          100            0    114.81 GB           0          453

We then tried to get a list of the damaged files, but only one file shows up, even though the command (eos fs ls -d) reports hundreds of failures.

eos-m01:~ # for i in `seq -w 297 300` ; do  eos fs status $i -l | grep status | grep -v configstatus; done
status=atrisk  path=/eos/baikalgvd/reco/2019/data/recMay23/v5/cluster3/recomon_2019_cl3_run486_scl_nu_DATA.root

How can we retrieve a list of all the corrupted files?

And a second question: could this bug fix somehow be related to our problem?

https://gitlab.cern.ch/dss/eos/-/commit/ecb4f6ab1d523fe00dead577a1fc3962b685a196

Hi Ivan,

Unless you are using RAIN files, the mentioned bug does not affect you in any way. And even if you do use RAIN files, it happens only in very particular situations, when copying the file takes more than 30 minutes while the system expects it to be faster than that.

You should be able to get the files left on the drained file systems by using the command:
eos fs dumpmd <fsid>.
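
For example, something along these lines should leave you with one list per failed file system. I am assuming the --path option here, which restricts the dump to the file paths; please check eos fs dumpmd --help on your version for the exact flags:

for fsid in 297 298 299 300; do
    # dump the metadata of all files still registered on this file system,
    # keeping only the paths, one output list per file system
    eos fs dumpmd $fsid --path > /tmp/fs${fsid}_files.txt
done
wc -l /tmp/fs*_files.txt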

Cheers,
Elvin

Thank you, Elvin!

Elvin,
We suspect that we lost files during a bulk conversion of large files from replica:2 to qrain:12. During the conversion we had a network problem: the bandwidth was insufficient for the mass transfers between FSTs located in different physical segments. The problem was not detected immediately and some of the files were corrupted. As far as we can tell, this happened on version 5.1.19.
Now we do not know how to determine which files are corrupted. Is the only way to read the files back to a local disk and check the adler32, or can we check this within EOS?

Hi Ivan,

I assume you are now referring mainly to the RAIN files in your instance. In this case, some types of corruption are detected by fsck, but unfortunately fsck currently cannot detect full checksum errors for RAIN files. We do plan to add such functionality, but it just doesn’t exist at the moment, since the scanning process responsible for detecting corruption is local to the FST. For RAIN files we would need to read in all the stripes to verify the full checksum.

Therefore, indeed you need to read the files that you suspect are corrupted to local disk and compare the checksum with the value registered in the namespace. You can probably reduce the number of files that you need to read by looking only at the files that were written during that time interval.
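
If it helps, a rough sketch of that comparison loop could look like the following. It assumes a plain list of suspect EOS paths in a file called suspects.txt (hypothetical name), that the xrootd client tools (xrdcp, xrdadler32) are installed, and that the xs= field of eos fileinfo -m holds the namespace checksum; adjust the MGM URL to your instance:

MGM=root://eos-m01.jinr.ru     # assumed MGM alias, adjust to your instance
while read -r p; do
    # checksum registered in the namespace for this file
    ns_xs=$(eos fileinfo "$p" -m | tr ' ' '\n' | grep '^xs=' | cut -d= -f2)
    # read the file back and recompute the adler32 locally
    xrdcp -f -s "$MGM/$p" /tmp/check.dat
    disk_xs=$(xrdadler32 /tmp/check.dat | awk '{print $1}')
    [ "$ns_xs" = "$disk_xs" ] || echo "MISMATCH: $p ns=$ns_xs disk=$disk_xs"
done < suspects.txt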

For this you can use the report logs collected on the MGM node, in /var/eos/report, which contain one log line per successful file close operation. You can narrow this down further by grepping only for sec.app=eos/converter entries, i.e. only files that were written by the converter. This should give you a more reasonable number of files to check.
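
As a rough illustration of that filtering (the exact file layout under /var/eos/report and the field separators may differ on your version, so treat this as a sketch to adapt):

# on the MGM: collect the paths of files closed by the converter;
# restrict the directory to the days of the conversion window to speed this up
grep -rh 'sec.app=eos/converter' /var/eos/report/ \
    | tr '& ' '\n\n' | grep '^path=' | cut -d= -f2 \
    | sort -u > suspects.txt
wc -l < suspects.txt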

Hope it helps,
Elvin

Elvin,
Thanks for your reply!