XrdZMQ::client looping since

For a while now but especially since our upgrade to 5.2.x, we’ve had a situation where an instance will just seemingly lose the mgm with constantly spewing messages like:

# XrdZMQ::client looping since 24480.00 seconds ...

# XrdZMQ::client looping since 24960.00 seconds ...

# XrdZMQ::client looping since 25160.00 seconds ...

# XrdZMQ::client looping since 24520.00 seconds ...

# XrdZMQ::client looping since 24480.00 seconds ...

the mgm process is still running but can’t really respond to admin commands (fs ls, etc). I can’t quite tell the state of the mq process. systemd says it’s fine, logs show sporadic messages like:
240911 12:47:13 3922 XrdProtocol: anon.0:29@li179-96.members.linode.com terminated matching protocol not found

xrootd@quarkdb on the mgm seems to have always crashed in this state and simply restarting the mgm shows a ton of FSIDs in bootfailure (with a mix of failed, draining and nodrain). Neither restarting the fst daemon nor issuing fs boot commands brings individual fsids back online. Only a reboot of the server in question does.

Oddly, only 2 out of our 3 FST servers seems to get impacted by this.

This is happening with some real regularity. We stay up maybe a handful of days before the issue reoccurs.

Hi John,

Could you please let me know what versions are you using for the following packages?
eos-server, eos-xrootd, xrootd-server, alicetokenacc.

If the xrootd@quarkdb crashes do you get any stacktraces or coredumps? Is there anything in the logs?

Thanks,
Elvin

[root@alicemgm0.lbl.gov ~]# rpm -qa | grep -E 'eos-server|eos-xrootd|xrootd-server|alicetokenacc'

xrootd-server-libs-5.7.0-2.el8.x86_64

alicetokenacc-2.3.0-1.el8.x86_64

eos-xrootd-5.6.10-1.el8.x86_64

eos-server-5.2.24-1.el8.x86_64

[root@alicemgm0.lbl.gov ~]#

digging through logs but nothing stands out immediately.

Hi John,

When you end up in this situation can you get a stracktrace of one of the authentication daemons? This would be one of the MGM processes with low memory consumption and higher PID. For example in one of our instances:

 ps aux | grep mgm
root      463563  0.0  0.0   6408  2252 pts/2    S+   11:56   0:00 grep --color=auto mgm
daemon    651079  178 61.1 488493936 241224660 ? SLsl Jul03 184875:46 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon    651249  2.5  2.5 11726340 10017988 ?   Sl   Jul03 2691:45 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon    651252  2.5  2.5 11726340 10018004 ?   Sl   Jul03 2693:16 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon    651255  2.5  2.5 11726340 10017664 ?   Sl   Jul03 2691:20 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon    651258  2.5  2.5 11726340 10018364 ?   Sl   Jul03 2692:16 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon    651261  2.5  2.5 11726340 10018340 ?   Sl   Jul03 2691:08 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon    651264  2.5  2.5 11726340 10017660 ?   Sl   Jul03 2692:26 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon    651267  2.5  2.5 11726340 10019140 ?   Sl   Jul03 2691:28 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon    651270  2.5  2.5 11726340 10019116 ?   Rl   Jul03 2691:10 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon

In this case, pick the last process and send me the output of the following command:
eu-stack -p <pid>

Thanks,
Elvin

[root@aliceafmgm0.lbl.gov ~]# ps aux | grep mgm
root 355621 0.0 0.0 229108 1192 pts/0 S+ 16:51 0:00 grep --color=auto mgm
daemon 2909082 2.4 1.9 14898264 2536080 ? SLsl Aug15 1600:05 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon 2909227 0.0 0.0 1107348 29928 ? Sl Aug15 0:10 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon 2909230 0.0 0.0 1107344 30120 ? Sl Aug15 0:11 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon 2909233 0.0 0.0 1149244 72060 ? Sl Aug15 10:36 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon 2909236 0.0 0.0 1107332 30168 ? Sl Aug15 0:11 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon 2909239 0.0 0.0 1107320 30172 ? Sl Aug15 0:10 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon 2909242 0.0 0.0 1181920 104760 ? Sl Aug15 18:47 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon 2909245 0.0 0.0 1107336 30140 ? Sl Aug15 0:10 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
daemon 2909248 0.0 0.0 1107328 30120 ? Sl Aug15 0:11 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -Rdaemon
[root@aliceafmgm0.lbl.gov ~]# eu-stack -p 2909248
PID 2909248 - process
TID 2909248:
#0 0x00007fbf5dd6f251 __poll
#1 0x00007fbf532b860e zmq::signaler_t::wait(int) const
#2 0x00007fbf53294cc4 zmq::mailbox_t::recv(zmq::command_t*, int)
#3 0x00007fbf532ba1da zmq::socket_base_t::process_commands(int, bool)
#4 0x00007fbf532bb0ee zmq::socket_base_t::recv(zmq::msg_t*, int)
#5 0x00007fbf532d9f49 s_recvmsg(zmq::socket_base_t*, zmq_msg_t*, int)
#6 0x00007fbf1d11de98 XrdZMQ::RunServer()
#7 0x00007fbf1d11b3ee XrdAliceTokenAcc::Init()
#8 0x00007fbf1d11b6ec XrdAccAuthorizeObject
#9 0x00007fbf56650050 XrdMgmOfs::Configure(XrdSysError&)
#10 0x00007fbf566b7d3d XrdSfsGetFileSystem
#11 0x00007fbf566b7e06 XrdSfsGetFileSystem2
#12 0x00007fbf5f2dc779 XrdXrootdloadFileSystem(XrdSysError*, XrdSfsFileSystem*, char const*, char const*, XrdOucEnv*)
#13 0x00007fbf5f2d22aa XrdXrootdProtocol::ConfigFS(XrdOucEnv&, char const*)
#14 0x00007fbf5f2d65c4 XrdXrootdProtocol::Configure(char*, XrdProtocol_Config*)
#15 0x00007fbf5f2e578f XrdgetProtocol
#16 0x000000000040f4e8 XrdProtLoad::Load(char const*, char const*, char*, XrdProtocol_Config*, bool)
#17 0x000000000040b96e XrdConfig::Setup(char*, char*)
#18 0x000000000040d502 XrdConfig::Configure(int, char**)
#19 0x000000000040776c main
#20 0x00007fbf5dc85d85 __libc_start_main
#21 0x000000000040795e _start
TID 2909252:
#0 0x00007fbf5dd7a197 epoll_wait
#1 0x00007fbf5328bb1c zmq::epoll_t::loop() [clone .part.11]
#2 0x00007fbf532ccbef thread_routine
#3 0x00007fbf5e0181ca start_thread
#4 0x00007fbf5dc84e73 __clone
TID 2909253:
#0 0x00007fbf5dd7a197 epoll_wait
#1 0x00007fbf5328bb1c zmq::epoll_t::loop() [clone .part.11]
#2 0x00007fbf532ccbef thread_routine
#3 0x00007fbf5e0181ca start_thread
#4 0x00007fbf5dc84e73 __clone
[root@aliceafmgm0.lbl.gov ~]#

Hi John,

Unfortunately, this does not indicate anything wrong. Did you get the same symptom in the MGM logs? If it happens again then please take also an eu-stack of the MGM process.

Thanks,
Elvin

https://flexo-external.lbl.gov/856679.log

Hi John,

This already is better, it seems all the 16 worker threads on the MGM are busy waiting for messages from the XRootD authentication daemons and blocked in the following call:

TID 856883:
#0  0x00007f15673ccac1 __poll
#1  0x00007f155f6f060e zmq::signaler_t::wait(int) const
#2  0x00007f155f6cccc4 zmq::mailbox_t::recv(zmq::command_t*, int)
#3  0x00007f155f6f21da zmq::socket_base_t::process_commands(int, bool)
#4  0x00007f155f6f30ee zmq::socket_base_t::recv(zmq::msg_t*, int)
#5  0x00007f155f711f49 s_recvmsg(zmq::socket_base_t*, zmq_msg_t*, int)
#6  0x00007f1563e5fb13 eos::mgm::ZMQ::Worker::work()
#7  0x00007f1567eecb23 execute_native_thread_routine
#8  0x00007f15676781ca start_thread
#9  0x00007f15672d38d3 __clone

What is the trace from the authentication daemons? Is it identical to the trace from the 1st of Oct?

At this point, it’s clear the problem comes either from a communication issue between the MGM and the authentication daemons or something blocks inside the authentication daemons making them not reply to MGM requests.

Thanks,
Elvin

Sorry, some ignorance here, what processes are the auth processes in an ALICE instance?

Hi John,

The authorization processes are the xrootd daemons with higher pids. In your setup, you should have 9 xrootd processes running with the first one being the real MGM (you should see higher CPU consumption for this one) and another 8 processes which similar (lower) load and resource consumption - these are the authentication daemons.

Cheers,
Elvin

https://flexo-external.lbl.gov/logs.tar.gz

Hi John,

Thanks for the stacktraces, I had a look at them and the situation is quite strange. The MGM is stuck on sending requests to the AUTH daemons, with traces like the following:

TID 1206719:
#0  0x00007f15676821b0 __nanosleep
#1  0x00007f15001dd935 XrdZMQ::Send(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)
#2  0x00007f15001da8c8 XrdAliceTokenAcc::Access(XrdSecEntity const*, char const*, Access_Operation, XrdOucEnv*)
#3  0x00007f1563cbd342 XrdMgmOfsFile::open(eos::common::VirtualIdentity*, char const*, int, unsigned int, XrdSecEntity const*, char const*)
#4  0x00007f1563cd0de8 XrdMgmOfsFile::open(char const*, int, unsigned int, XrdSecEntity const*, char const*)
#5  0x00007f1568959088 XrdXrootdProtocol::do_Open()
#6  0x00007f1568949c1c XrdXrootdProtocol::Process2()
#7  0x00007f156868f08d XrdLinkXeq::DoIt()
#8  0x00007f1568692387 XrdScheduler::Run()
#9  0x00007f15686924a9 XrdStartWorking(void*)
#10 0x00007f1568620fc7 XrdSysThread_Xeq
#11 0x00007f15676781ca start_thread
#12 0x00007f15672d38d3 __clone

While the AUTH daemons are waiting for requests from the MGM, with the worker thread in this state:

TID 856813:
#0  0x00007f15673ccac1 __poll
#1  0x00007f155f6f060e zmq::signaler_t::wait(int) const
#2  0x00007f155f6cccc4 zmq::mailbox_t::recv(zmq::command_t*, int)
#3  0x00007f155f6f21da zmq::socket_base_t::process_commands(int, bool)
#4  0x00007f155f6f30ee zmq::socket_base_t::recv(zmq::msg_t*, int)
#5  0x00007f155f711f49 s_recvmsg(zmq::socket_base_t*, zmq_msg_t*, int)
#6  0x00007f15001dde78 XrdZMQ::RunServer()
#7  0x00007f15001db3ce XrdAliceTokenAcc::Init()
#8  0x00007f15001db6cc XrdAccAuthorizeObject
#9  0x00007f1563cad6f7 XrdMgmOfs::Configure(XrdSysError&)
#10 0x00007f1563d0f54d XrdSfsGetFileSystem
#11 0x00007f1563d0f616 XrdSfsGetFileSystem2
#12 0x00007f156893fd39 XrdXrootdloadFileSystem(XrdSysError*, XrdSfsFileSystem*, char const*, char const*, XrdOucEnv*)
#13 0x00007f156893582a XrdXrootdProtocol::ConfigFS(XrdOucEnv&, char const*)
#14 0x00007f1568939be4 XrdXrootdProtocol::Configure(char*, XrdProtocol_Config*)
#15 0x00007f156894901f XrdgetProtocol
#16 0x000000000040dcc8 XrdProtLoad::Load(char const*, char const*, char*, XrdProtocol_Config*, bool)
#17 0x000000000040a02e XrdConfig::Setup(char*, char*)
#18 0x000000000040bce6 XrdConfig::Configure(int, char**)
#19 0x0000000000405cb8 main
#20 0x00007f15672d47e5 __libc_start_main
#21 0x0000000000405ece _start

There is nothing wrong as far as I can see in the EOS layer so my only assumption is that there is some issue with the communication between the MGM and AUTH daemon coming either from ZMQ or from the IPC communication on this Rocky 8 OS.

Either way, I am not sure we can do anything about this from the EOS side, therefore my only recommendation would be to try out a similar setup like ours where we have not seen this issue. Namely, move only the MGM machine to ALMA9 and run in this setup.

Cheers,
Elvin

I’m having some trouble writing a response to this without coming off like a jerk. Moving away from our relatively newly installed rocky8 is not a feasible action for us. In my mind, EOS includes all of the dependency packages including the ZMQ stuff. We’re trying to operate this T2 for ALICE and we took all the OS requirements to heart as we decided our path forward from centos 7 a year+ ago. To me, this sounds like the EOS team saying el8 is not supported (but not actually documenting that or ceasing to release packages for el8).

I’m open to suggestions here but I guess I’m trying to understand what party to talk to about solving this issue in the el8 context.