How to correctly finalize a stalling drain?

Good morning,

We had the fairly rare problem of a disk dying. The drain process started automatically and went well, except for a very few files (14 out of 600K).

We found out that most of these files are 2-replica files that can't be duplicated because the good replica has an outdated .xsmap file (these still seem to be created during balancing even with the Citrine version; maybe I can open another thread for it). Removing the .xsmap file resolves the issue, and the drain can continue for these files.

root# eos fs ls -d
jeos222 1095 375 /data07 draining 99 6 4.28 TiB 99999999999 0 0

In addition, there is one RAIN file which has one replica on the failing disk, but the file doesn't seem to get repaired. What would be the best way to repair it, or at least to make the drain forget about it?

And there are 3 bad files that were located only on this disk. Here again, how can we tell the system to ignore these files?

After this, will the FS automatically change to the empty state, so that we can unregister it, replace the disk, and register it anew?

Hi Franck,

For the RAIN file: have you tried to repair the file manually (eos file adjustreplica <path>)? It should recreate the failed stripe on another disk (and delete the bad one). If that doesn't work, you can also try to drop the stripe manually with eos file drop <path> <fsid_of_the_bad_stripe> and then run eos file adjustreplica <path> to restore the file to its nominal number of stripes.
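As a rough sketch of that sequence, here is a small shell wrapper. The function names (run, repair_rain_file), the DRY_RUN variable and the example path are my own placeholders, not part of EOS; only the two eos file commands come from the reply above. By default it only prints what it would run.

```shell
#!/bin/sh
# DRY_RUN=1 (default in this sketch) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# repair_rain_file <path> [fsid_of_the_bad_stripe]
# With a path only: ask EOS to restore the nominal number of stripes.
# With an fsid too: drop the stripe on that filesystem first, then adjust.
repair_rain_file() {
    path="$1"
    bad_fsid="${2:-}"
    if [ -n "$bad_fsid" ]; then
        run eos file drop "$path" "$bad_fsid" || return 1
    fi
    run eos file adjustreplica "$path"
}
```

For example, repair_rain_file /eos/example/file would try adjustreplica alone, and repair_rain_file /eos/example/file 375 would drop the stripe on filesystem 375 first (assuming 375 is the fsid of the draining disk). Set DRY_RUN=0 once the printed commands look right.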

For the 3 bad files, I think you can just remove them from the EOS namespace with eos rm <path> (the files will be lost).
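If the lost files are collected in a list, the removal could be scripted along these lines. This is only a sketch: remove_lost_files and DRY_RUN are hypothetical names of mine, the list format (one path per line, blank lines and # comments ignored) is an assumption, and the only EOS command used is the eos rm mentioned above. It prints the commands by default since the removal is irreversible.

```shell
#!/bin/sh
# DRY_RUN=1 (default in this sketch) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# remove_lost_files: reads one EOS path per line on stdin.
remove_lost_files() {
    while IFS= read -r path; do
        # Skip blank lines and lines starting with '#'.
        case "$path" in
            ''|'#'*) continue ;;
        esac
        if [ "$DRY_RUN" = "1" ]; then
            echo "eos rm $path"
        else
            eos rm "$path" || echo "failed to remove: $path" >&2
        fi
    done
}
```

Usage would be something like remove_lost_files < lost_files.txt, where lost_files.txt is a hand-checked list of the paths to give up on.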

And yes, once that is done the file system should end up empty and you can proceed with the replacement.
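To check where the drain stands before pulling the disk, the output of eos fs ls -d can be filtered. The sketch below assumes the column layout suggested by the sample line in the question (third field is the filesystem id, fifth is the drain status); drain_status is a hypothetical helper name, and the layout may differ between EOS versions.

```shell
#!/bin/sh
# drain_status <fsid>: reads `eos fs ls -d` output on stdin and prints the
# drain status field of the line whose third field matches the given fsid.
# Prints nothing if the filesystem does not appear in the listing.
drain_status() {
    awk -v id="$1" '$3 == id { print $5 }'
}
```

For the filesystem in this thread that would be eos fs ls -d | drain_status 375, which prints draining for the line shown in the question.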

I hope it helps,

Roberto

Hi Roberto,

Thanks for your help.
The adjustreplica command did the trick (in fact, it triggered a conversion job which successfully recreated the file from the incomplete one, with a brand new file id in a new scheduling group), so there was no need to drop the replica.

As for the other bad files, yes, we will remove them.

Thanks again.

Franck