Hello,
I recently performed two master/slave transitions in order to update EOS to v4.5.6, and each time the third operation hung and the manager crashed. Operations were apparently not affected, but I prefer to report it. Here is what happens:
a) first stage:
[root@naneosmgr02(EOSMASTER) ~]#eos -b ns | grep Replic
ALL Replication mode=master-rw state=master-rw master=naneosmgr02.in2p3.fr configdir=/var/eos/config/naneosmgr02.in2p3.fr/ config=default mgm:naneosmgr01.in2p3.fr=ok mgm:mode=slave-ro mq:naneosmgr01.in2p3.fr:1097=ok
[root@naneosmgr02(EOSMASTER) ~]# eos -b ns master naneosmgr01.in2p3.fr
success: current master will step down
[root@naneosmgr02(EOSMASTER) ~]#eos -b ns | grep Replic
ALL Replication mode=slave-ro state=master-ro master=naneosmgr01.in2p3.fr configdir=/var/eos/config/naneosmgr01.in2p3.fr/ config=default mgm:naneosmgr01.in2p3.fr=ok mgm:mode=slave-ro mq:naneosmgr01.in2p3.fr:1097=ok
b) second stage:
[root@naneosmgr01(EOSSLAVE) ~]#eos -b ns master naneosmgr01.in2p3.fr
success: current master will step down
[root@naneosmgr01(EOSSLAVE) ~]#eos -b ns | grep Replic
ALL Replication mode=master-rw state=master-rw master=naneosmgr01.in2p3.fr configdir=/var/eos/config/naneosmgr01.in2p3.fr/ config=default mgm:naneosmgr02.in2p3.fr=ok mgm:mode=slave-ro mq:naneosmgr02.in2p3.fr:1097=ok
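For anyone who wants to compare these status lines field by field, here is a minimal Python sketch that splits an "ALL Replication ..." line into key/value pairs. The function names are hypothetical (not part of EOS), and reading a mode/state mismatch as "transition not yet settled" is only my assumption, not documented behaviour.

```python
def parse_replication_line(line):
    """Split an 'ALL Replication ...' status line into a dict of key=value fields."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens without '=', such as 'ALL' and 'Replication'
            fields[key] = value
    return fields


def mode_matches_state(fields):
    """Assumption: a transition is settled once 'mode' and 'state' agree."""
    return fields.get("mode") == fields.get("state")


# Status lines copied from the first stage above (before and after the command).
before = parse_replication_line(
    "ALL Replication mode=master-rw state=master-rw master=naneosmgr02.in2p3.fr "
    "configdir=/var/eos/config/naneosmgr02.in2p3.fr/ config=default "
    "mgm:naneosmgr01.in2p3.fr=ok mgm:mode=slave-ro mq:naneosmgr01.in2p3.fr:1097=ok"
)
after = parse_replication_line(
    "ALL Replication mode=slave-ro state=master-ro master=naneosmgr01.in2p3.fr "
    "configdir=/var/eos/config/naneosmgr01.in2p3.fr/ config=default "
    "mgm:naneosmgr01.in2p3.fr=ok mgm:mode=slave-ro mq:naneosmgr01.in2p3.fr:1097=ok"
)

print(before["mode"], before["state"], mode_matches_state(before))
print(after["mode"], after["state"], mode_matches_state(after))
```

On the first-stage output, the line taken after the command has mode=slave-ro but state=master-ro, so the check reports a mismatch there.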
c) third stage:
[root@naneosmgr02(EOSMASTER) ~]# eos -b ns master naneosmgr01.in2p3.fr
=> At this point, the command does not return, and the log shows evidence of a crash:
more /var/log/eos/mgm/xrdlog
[...]
#########################################################################
# -----------------------------------------------------------------------
# Responsible thread =>
# -----------------------------------------------------------------------
# Thread 11 (Thread 0x7f80fd8fc700 (LWP 168334)):
#########################################################################
#4 <signal handler called>
#5 0x00007f84c575c113 in eos::InMemNamespaceGroup::~InMemNamespaceGroup() ()
from /usr/lib64/libEosNsInMemory.so
#6 0x00007f84c575cf79 in eos::InMemNamespaceGroup::~InMemNamespaceGroup() ()
from /usr/lib64/libEosNsInMemory.so
#7 0x00007f84f9073db8 in eos::mgm::Master::BootNamespace() ()
from /lib64/libXrdEosMgm.so
#8 0x00007f84f907357e in eos::mgm::Master::MasterRO2Slave() ()
from /lib64/libXrdEosMgm.so
#9 0x00007f84f907baf2 in eos::mgm::Master::Activate(std::string&, std::string&, int) ()
from /lib64/libXrdEosMgm.so
#10 0x00007f84f907bf1f in eos::mgm::Master::ApplyMasterConfig(std::string&, std::string&, eos::mgm::IMaster::Transition::Type) ()
from /lib64/libXrdEosMgm.so
#11 0x00007f84f906eb73 in eos::mgm::Master::SetMasterId(std::string const&, int, std::string&) ()
from /lib64/libXrdEosMgm.so
#12 0x00007f84f8dd3c76 in eos::mgm::NsCmd::MasterSubcmd(eos::console::NsProto_MasterProto const&, eos::console::ReplyProto&) ()
from /lib64/libXrdEosMgm.so
#13 0x00007f84f8ddd010 in eos::mgm::NsCmd::ProcessRequest() ()
from /lib64/libXrdEosMgm.so
#14 0x00007f84f8d7d748 in eos::mgm::IProcCommand::LaunchJob() ()
from /lib64/libXrdEosMgm.so
#15 0x00007f84f8d7e158 in eos::mgm::IProcCommand::open(char const*, char const*, eos::common::VirtualIdentity&, XrdOucErrInfo*) ()
from /lib64/libXrdEosMgm.so
#16 0x00007f84f8ebe272 in XrdMgmOfsFile::open(char const*, int, unsigned int, XrdSecEntity const*, char const*) ()
from /lib64/libXrdEosMgm.so
#17 0x00007f84fd49d776 in XrdXrootdProtocol::do_Open() ()
from /opt/eos/xrootd/lib64/libXrdServer.so.2
#18 0x00007f84fd215549 in XrdLink::DoIt() ()
from /opt/eos/xrootd/lib64/libXrdUtils.so.2
#19 0x00007f84fd2188ff in XrdScheduler::Run() ()
from /opt/eos/xrootd/lib64/libXrdUtils.so.2
#20 0x00007f84fd218a49 in XrdStartWorking(void*) ()
from /opt/eos/xrootd/lib64/libXrdUtils.so.2
#21 0x00007f84fd1de7f7 in XrdSysThread_Xeq ()
from /opt/eos/xrootd/lib64/libXrdUtils.so.2
#22 0x00007f84fcd92dd5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007f84fc094ead in clone () from /lib64/libc.so.6
To recover from this situation, I restarted the EOS services on the new slave:
[root@naneosmgr02(EOSSLAVE) ~]#systemctl restart eos
Once the namespace had booted, things were back to normal.
This happened on both transitions.
JM