Hi John,
Thanks for the stack traces. I had a look at them and the situation is quite strange. The MGM is stuck sending requests to the AUTH daemons, with threads sleeping inside XrdZMQ::Send, as in the following trace:
TID 1206719:
#0 0x00007f15676821b0 __nanosleep
#1 0x00007f15001dd935 XrdZMQ::Send(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)
#2 0x00007f15001da8c8 XrdAliceTokenAcc::Access(XrdSecEntity const*, char const*, Access_Operation, XrdOucEnv*)
#3 0x00007f1563cbd342 XrdMgmOfsFile::open(eos::common::VirtualIdentity*, char const*, int, unsigned int, XrdSecEntity const*, char const*)
#4 0x00007f1563cd0de8 XrdMgmOfsFile::open(char const*, int, unsigned int, XrdSecEntity const*, char const*)
#5 0x00007f1568959088 XrdXrootdProtocol::do_Open()
#6 0x00007f1568949c1c XrdXrootdProtocol::Process2()
#7 0x00007f156868f08d XrdLinkXeq::DoIt()
#8 0x00007f1568692387 XrdScheduler::Run()
#9 0x00007f15686924a9 XrdStartWorking(void*)
#10 0x00007f1568620fc7 XrdSysThread_Xeq
#11 0x00007f15676781ca start_thread
#12 0x00007f15672d38d3 __clone
Meanwhile, the AUTH daemons are waiting for requests from the MGM, with the worker thread blocked in the ZMQ receive call:
TID 856813:
#0 0x00007f15673ccac1 __poll
#1 0x00007f155f6f060e zmq::signaler_t::wait(int) const
#2 0x00007f155f6cccc4 zmq::mailbox_t::recv(zmq::command_t*, int)
#3 0x00007f155f6f21da zmq::socket_base_t::process_commands(int, bool)
#4 0x00007f155f6f30ee zmq::socket_base_t::recv(zmq::msg_t*, int)
#5 0x00007f155f711f49 s_recvmsg(zmq::socket_base_t*, zmq_msg_t*, int)
#6 0x00007f15001dde78 XrdZMQ::RunServer()
#7 0x00007f15001db3ce XrdAliceTokenAcc::Init()
#8 0x00007f15001db6cc XrdAccAuthorizeObject
#9 0x00007f1563cad6f7 XrdMgmOfs::Configure(XrdSysError&)
#10 0x00007f1563d0f54d XrdSfsGetFileSystem
#11 0x00007f1563d0f616 XrdSfsGetFileSystem2
#12 0x00007f156893fd39 XrdXrootdloadFileSystem(XrdSysError*, XrdSfsFileSystem*, char const*, char const*, XrdOucEnv*)
#13 0x00007f156893582a XrdXrootdProtocol::ConfigFS(XrdOucEnv&, char const*)
#14 0x00007f1568939be4 XrdXrootdProtocol::Configure(char*, XrdProtocol_Config*)
#15 0x00007f156894901f XrdgetProtocol
#16 0x000000000040dcc8 XrdProtLoad::Load(char const*, char const*, char*, XrdProtocol_Config*, bool)
#17 0x000000000040a02e XrdConfig::Setup(char*, char*)
#18 0x000000000040bce6 XrdConfig::Configure(int, char**)
#19 0x0000000000405cb8 main
#20 0x00007f15672d47e5 __libc_start_main
#21 0x0000000000405ece _start
As far as I can see, there is nothing wrong in the EOS layer, so my only assumption is that the problem lies in the communication between the MGM and the AUTH daemons, coming either from ZMQ or from the IPC layer on this Rocky 8 OS.
Either way, I am not sure we can do anything about this from the EOS side. My only recommendation is therefore to try a setup similar to ours, where we have not seen this issue: namely, move only the MGM machine to Alma 9 and run in that configuration.
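If you collect more stack traces, it may help to check how many threads are in each of the two states above. Below is a minimal sketch of how that could be done, assuming the dumps keep the same "TID <n>:" / "#<k> <address> <symbol>" layout as in this message. The function names (parse_threads, threads_containing) and the shortened SAMPLE dump are only illustrative and are not part of EOS or XRootD.

```python
import re

# Assumed layout, taken from the traces quoted in this message:
#   "TID <number>:" starts a thread, "#<k> 0x<addr> <symbol>" is one frame.
TID_RE = re.compile(r"^TID (\d+):")
FRAME_RE = re.compile(r"^#\d+\s+0x[0-9a-fA-F]+\s+(.*)$")


def parse_threads(dump):
    """Return {tid: [symbol, ...]} with frames in the order they were printed."""
    threads = {}
    current = None
    for raw in dump.splitlines():
        line = raw.strip()
        tid_match = TID_RE.match(line)
        if tid_match:
            current = int(tid_match.group(1))
            threads[current] = []
            continue
        frame_match = FRAME_RE.match(line)
        if frame_match and current is not None:
            threads[current].append(frame_match.group(1))
    return threads


def threads_containing(threads, needle):
    """Return the sorted thread ids whose stack has a frame containing needle."""
    return sorted(
        tid for tid, frames in threads.items()
        if any(needle in frame for frame in frames)
    )


# Shortened copy of the two traces above, used only to exercise the parser.
SAMPLE = """TID 1206719:
#0 0x00007f15676821b0 __nanosleep
#1 0x00007f15001dd935 XrdZMQ::Send(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)
#2 0x00007f15001da8c8 XrdAliceTokenAcc::Access(XrdSecEntity const*, char const*, Access_Operation, XrdOucEnv*)
TID 856813:
#0 0x00007f15673ccac1 __poll
#1 0x00007f155f6f060e zmq::signaler_t::wait(int) const
#6 0x00007f15001dde78 XrdZMQ::RunServer()
"""

if __name__ == "__main__":
    parsed = parse_threads(SAMPLE)
    print("stuck in Send:", threads_containing(parsed, "XrdZMQ::Send"))
    print("waiting in RunServer:", threads_containing(parsed, "XrdZMQ::RunServer"))
```

Running this on a full dump from the MGM and from each AUTH daemon would show whether all request threads are blocked in XrdZMQ::Send while every worker sits in XrdZMQ::RunServer, which is the pattern I saw in the traces you sent.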
Cheers,
Elvin