Cancel drain status?

Good morning,

Is it possible to cancel a drain operation on filesystems?
For some as-yet-unknown reason*, all FSs on one FST went into drain state because of “filesystem seems to be not mounted anymore” or “filesystem probe error detected”, but from the OS point of view the disks seem accessible, so there was probably a false positive at some point, and it would be nice to cancel all this useless draining…

Thank you

  • I think I found the cause of the error detection: too many open files in the xrootd process. So restarting the eos daemon should settle everything, but how do I cancel the drain?

Hi Frank,
you can change the configstatus of the fs to ‘rw’ and the drain process for that fs should stop. But if your fs continues to report the boot status as opserror, the drain process will start again.
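In practice the commands look roughly like this (the fsid 42 is just a placeholder, check yours with fs ls first):

```shell
# List filesystems with their configstatus and boot status
eos fs ls

# Set the configstatus back to 'rw'; the drain for that fs should then stop
# (fsid 42 is a placeholder)
eos fs config 42 configstatus=rw
```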

Hi Andrea,

Thank you for your answer. I restarted the fst daemon (after having upgraded it) so that the open files were reset, and changed the configstatus of the fs to off, then back to rw.
But now they are booting forever: the fst log says “finished boot procedure”, but the MGM shows a booting state, and they regularly “start boot procedure” again. Restarting eos didn’t solve it. What could I do?
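For the record, the sequence I used was roughly the following (the service name and the fsid 42 are placeholders for our setup):

```shell
# Restart the FST daemon so the leaked file descriptors are released
systemctl restart eos@fst

# Take the filesystem offline, then back to rw (fsid 42 is a placeholder)
eos fs config 42 configstatus=off
eos fs config 42 configstatus=rw

# If it stays in 'booting' on the MGM, an explicit boot can be requested
eos fs boot 42
```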

It turns out that it is the daemon that restarts regularly… what could be the cause?

Edit: The daemon was segfaulting and systemd was restarting it. I downgraded everything back to the previous versions (EOS 4.2.20, XRootD 4.8.1) and it now seems to work, and the drain state has been reset to nodrain, so we should be OK.

Thanks again

Well, it is not normal that it was segfaulting :-)
Which versions of EOS and XRootD did you upgrade to?

The most recent ones, EOS 4.2.28 and XRootD 4.8.4.

Coming back to the topic of cancelling a drain: is it necessary to fully reboot the fs? Could some files have been drained elsewhere but still be present on the disk?

When the drain process is stopped, all pending file drains are completed (this means transferring the file and deleting the source), so there should be no leftovers.
Regarding the segfaults, could you please open a ticket, attaching the stacktrace and the core file if you have them?

When running fs status -r on the previously faulty fs, I see many pending deletions that do not decrease after a couple of hours (files were not deleted during the drain process, since the fs were seen as “opserror” at that time). Is there a way to force these deletions?
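For reference, this is roughly how I look at the counters (fsid 42 is a placeholder):

```shell
# Detailed status of one filesystem; this is where the pending
# deletions show up (fsid 42 is a placeholder)
eos fs status -r 42

# Namespace statistics, which also expose deletion counters and rates
eos ns stat
```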

Unfortunately, eos didn’t produce any stacktrace; the only traces are segfault lines in /var/log/messages. Do you think those would be of any help?
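In case it helps others: on a systemd host, captured cores (if any) can be checked like this — assuming systemd-coredump is in use, which I am not sure of on our nodes:

```shell
# Check that core dumps are not limited to size 0
ulimit -c

# List cores captured by systemd-coredump, if enabled
coredumpctl list

# Details (including a backtrace, if debuginfo is installed)
# for cores matching the 'eos' executable name
coredumpctl info eos
```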

I don’t think it’s possible to force the deletions. Are you still seeing those pending?

Yes, I can still see pending deletions on the FSs that were drained (between 100K and 200K per FS), seemingly corresponding to the drained files.
In fact, I can also see non-zero pending-deletion counts on other FSs, but much smaller (a few dozen).

The code performing the deletions asynchronously seems OK to me: it just loops over the FSs, checks which files are to be deleted, and performs the deletion. Let me check here at CERN whether we have the same problem and no one has noticed so far…

Yes, this seems strange. Could there be some condition under which deletions don’t occur? Or maybe a maximum deletion rate?
I can see that some fs today have a larger number of pending deletions, but another one has three times fewer (60K instead of 170K).
So based on these numbers it seems that deletions are in fact occurring, just slowly (60 Hz on average based on the ns stat output; is that a reasonable rate?)
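As a back-of-the-envelope check on those numbers (170K is the largest backlog I saw on one FS, 60 Hz the average rate from ns stat, and the one-day interval between my two readings is my assumption):

```python
# Rough estimate of how long clearing the backlog should take.
backlog = 170_000   # largest pending-deletion count seen on one FS
rate_hz = 60        # average deletion rate reported by 'ns stat'
print(f"{backlog / rate_hz / 3600:.1f} h to clear one FS at 60 Hz")  # 0.8 h

# One count dropped from ~170K to ~60K, over roughly a day by my reading,
# which would be an effective per-FS rate of only a little over 1 Hz:
print(f"{(170_000 - 60_000) / 86_400:.2f} Hz effective")  # 1.27 Hz
```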

I’m checking the code and I see that each FST requests the files to remove from the MGM every 5 minutes, so if there are a lot of files to drop it may take a while. I’m sorry, but I don’t have much knowledge of this part, and both Andreas and Elvin are on holiday this week.

Hi Andrea, thank you for having checked. It really does seem to proceed with the deletions, and today probably all the leftover files were removed, so no need to check further.
Probably the number of files left by the cancelled drain was too large for the standard drop rate, so it just took time.