Draining impacts availability, and the drain cannot be stopped

Hello,
Yesterday afternoon I decided to drain one filesystem on a node that has four.
The drain rate is rather slow, as Pete observed in “Eos drain performance”.
The Alice Monalisa tests started to fail at the same time, mostly with timeouts. A drain operation should not have such an impact on the performance of the storage, and I cannot explain it.
To confirm that the drain was the cause, I decided to stop it by issuing the command:

eos fs config 67 configstatus=rw

… but this command does not return…
=> What can be wrong with our storage? (BTW, this is the production storage for Alice.)

Thanks

JM

Hi JM,

I’ll share below some info on the steps we took to configure gdb for eos 4.5.20 while troubleshooting an issue a while back with Andreas. In our case, our version of eos required devtoolset-8, though with SL6 and your version devtoolset-7 may work – I’m not sure.

The steps below almost certainly need tweaking: I summarized them after the fact, and there may be some inaccuracies. They may still be of some use and may help gdb illuminate what is going on with your issue.

It might be helpful to the community if a “debugging how-to” could be made as a wiki post or sticky thread, to help site admins quickly configure a proper debug environment for collecting the information needed to resolve issues. I think some of the nuances of installing devtoolset (which versions for what), the paths to use, and the processes to connect to are not particularly obvious.

Cheers,
Pete

Setup of GDB environment and eos/xrootd debug example

  • Information here extracted from emails, eos community forum and CERN service now ticket.
  • Debugging newer eos/xrootd requires devtoolset-8
  • The below is a rough guide only and needs further tweaking

Installing devtoolset-8 (on CentOS 7)

yum install devtoolset-8
scl --help
scl enable devtoolset-8 -- bash
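
Once the collection is enabled, it may be worth confirming that the gdb on the PATH really is the 8.x one. A minimal sketch, assuming the first line of `gdb --version` contains a dotted version number; the helper name gdb_major is made up for illustration:

```shell
# Extract the major version from the first line of `gdb --version` output.
# $1 = that first line, passed in as a string.
gdb_major() {
    local ver
    # Take the first "N.N" token found in the line.
    ver=$(printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+' | head -n 1)
    # Keep only the part before the first dot.
    printf '%s\n' "${ver%%.*}"
}
```

Usage would be something like: gdb_major "$(gdb --version | head -n 1)", which should print 8 inside the devtoolset-8 shell.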

Install debug symbols

yum install eos-xrootd-debuginfo-4.10.1-1.el7.cern.x86_64 # debuginfo-install does not find this package, but yum can install a specific version
yum install yum-utils
debuginfo-install glibc
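
Since the debuginfo package has to be requested by exact version, the name can be derived from what is already installed. A rough sketch, assuming the debuginfo package follows the same naming pattern as the eos-xrootd example above; the helper name debuginfo_pkg is hypothetical:

```shell
# Build the debuginfo package name matching an installed package.
# $1 = package name, $2 = full name-version-release.arch string (as printed by `rpm -q`)
debuginfo_pkg() {
    local name="$1" nvra="$2"
    # Strip the package name prefix, then re-insert it with "-debuginfo" appended.
    printf '%s-debuginfo%s\n' "$name" "${nvra#"$name"}"
}
```

Usage would be something like: yum install "$(debuginfo_pkg eos-xrootd "$(rpm -q eos-xrootd)")"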

Gdb of xrootd example

  • Ref CERN service now ticket
  • EOS brings its own xrootd version on EL7, e.g.:
    • eos-xrootd-4.10.1-1.el7.cern.x86_64
    • Runs /opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -s /tmp/xrootd.fst.pid -Rdaemon
    • So you have to use the xrootd from /opt/eos/xrootd/bin/xrootd in the GDB statement
  • Per Andreas and Elvin, eos 4.5.20 has to be run with xrootd-4.10.1

source /opt/rh/devtoolset-8/enable
gdb --version # should report version 8.x
# Or specify the full gdb path; 28224 is the PID of the running xrootd process
/opt/rh/devtoolset-8/root/usr/bin/gdb /opt/eos/xrootd/bin/xrootd 28224 <<< "thread apply all bt" >> /root/mgm.gdb.out
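
The same backtrace dump can also be done non-interactively with gdb's -batch and -ex options instead of feeding the command on stdin. This is only a sketch: the paths are the ones quoted above, and the helper names gdb_bt_cmd and dump_backtraces are made up for illustration.

```shell
#!/bin/bash
# Paths as quoted in this thread; adjust for your installation.
GDB=/opt/rh/devtoolset-8/root/usr/bin/gdb
XROOTD=/opt/eos/xrootd/bin/xrootd

# Print (do not run) a gdb command that attaches to the given PID,
# dumps a backtrace of every thread, then exits.
gdb_bt_cmd() {
    local pid="$1"
    printf '%s -batch -ex "thread apply all bt" %s %s\n' "$GDB" "$XROOTD" "$pid"
}

# Run that command and append its output (stdout and stderr) to a file.
# $1 = PID of the running xrootd process, $2 = output file
dump_backtraces() {
    local pid="$1" out="$2"
    "$GDB" -batch -ex "thread apply all bt" "$XROOTD" "$pid" >> "$out" 2>&1
}
```

Usage would be something like: dump_backtraces 28224 /root/fst.gdb.out, where 28224 stands in for the PID of your own xrootd process. Printing the command first with gdb_bt_cmd 28224 is a cheap way to check the paths before attaching to a production daemon.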

Hello,
I have news on this issue.
It looks like this was triggered by the missing line “mgmofs.centraldrain true” in /etc/xrd.cf.mgm (as stated in “Config in quarkDB for master/slave(s)”).
Draining is now progressing and apparently not impacting other activities.
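
For anyone wanting to check their own MGM configuration for this line, a quick test could look like the following. It is a sketch only: the helper name has_central_drain is hypothetical, and it just looks for an active (uncommented) line in the file it is given.

```shell
# Return 0 if the file contains an uncommented "mgmofs.centraldrain true" line.
# $1 = path to the MGM configuration file
has_central_drain() {
    grep -Eq '^[[:space:]]*mgmofs\.centraldrain[[:space:]]+true([[:space:]]|$)' "$1"
}
```

Usage would be something like: has_central_drain /etc/xrd.cf.mgm && echo "central drain enabled" || echo "line missing"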
Sorry, this topic (draining vs central draining) was not clear to me.
Thank you, Pete, for the suggestion about debugging; for me this was a last resort (and I am not much used to doing it).
What remains unexplained is how this small configuration mistake could have had the big impact I observed.
JM

Hi JM,

Central draining has been the only mechanism since 4.8.0, if I am not mistaken, so this option is no longer necessary if you use a recent enough version.

The old draining puts a lot of pressure on the namespace and also on the FSTs, which is why you experienced this strange behavior. The new QDB namespace only works properly with the new central draining.

Cheers,
Elvin