Master/slave transition problems with EOS v4.5.6

Hello,

I have recently performed two master/slave transitions in order to update EOS to v4.5.6, and each time the third operation hung and the MGM crashed. Operations did not seem to be affected, but I prefer to report it. Here is what happens:

a) first stage:

[root@naneosmgr02(EOSMASTER) ~]#eos -b ns | grep Replic
ALL      Replication                      mode=master-rw state=master-rw master=naneosmgr02.in2p3.fr configdir=/var/eos/config/naneosmgr02.in2p3.fr/ config=default mgm:naneosmgr01.in2p3.fr=ok mgm:mode=slave-ro mq:naneosmgr01.in2p3.fr:1097=ok
[root@naneosmgr02(EOSMASTER) ~]# eos -b ns master naneosmgr01.in2p3.fr
success: current master will step down
[root@naneosmgr02(EOSMASTER) ~]#eos -b ns | grep Replic
ALL      Replication                      mode=slave-ro state=master-ro master=naneosmgr01.in2p3.fr configdir=/var/eos/config/naneosmgr01.in2p3.fr/ config=default mgm:naneosmgr01.in2p3.fr=ok mgm:mode=slave-ro mq:naneosmgr01.in2p3.fr:1097=ok

b) second stage:

[root@naneosmgr01(EOSSLAVE) ~]#eos -b ns master naneosmgr01.in2p3.fr
success: current master will step down
[root@naneosmgr01(EOSSLAVE) ~]#eos -b ns | grep Replic
ALL      Replication                      mode=master-rw state=master-rw master=naneosmgr01.in2p3.fr configdir=/var/eos/config/naneosmgr01.in2p3.fr/ config=default mgm:naneosmgr02.in2p3.fr=ok mgm:mode=slave-ro mq:naneosmgr02.in2p3.fr:1097=ok

c) third stage:

[root@naneosmgr02(EOSMASTER) ~]# eos -b ns master naneosmgr01.in2p3.fr

=> At this point the command does not return, and there is evidence of a crash in the log:

more /var/log/eos/mgm/xrdlog 
[...]
#########################################################################
# -----------------------------------------------------------------------
# Responsible thread =>
# -----------------------------------------------------------------------
# Thread 11 (Thread 0x7f80fd8fc700 (LWP 168334)):
#########################################################################
#4  <signal handler called>
#5  0x00007f84c575c113 in eos::InMemNamespaceGroup::~InMemNamespaceGroup() ()
   from /usr/lib64/libEosNsInMemory.so
#6  0x00007f84c575cf79 in eos::InMemNamespaceGroup::~InMemNamespaceGroup() ()
   from /usr/lib64/libEosNsInMemory.so
#7  0x00007f84f9073db8 in eos::mgm::Master::BootNamespace() ()
   from /lib64/libXrdEosMgm.so
#8  0x00007f84f907357e in eos::mgm::Master::MasterRO2Slave() ()
   from /lib64/libXrdEosMgm.so
#9  0x00007f84f907baf2 in eos::mgm::Master::Activate(std::string&, std::string&, int) ()
   from /lib64/libXrdEosMgm.so
#10 0x00007f84f907bf1f in eos::mgm::Master::ApplyMasterConfig(std::string&, std::string&, eos::mgm::IMaster::Transition::Type) ()
   from /lib64/libXrdEosMgm.so
#11 0x00007f84f906eb73 in eos::mgm::Master::SetMasterId(std::string const&, int, std::string&) ()
   from /lib64/libXrdEosMgm.so
#12 0x00007f84f8dd3c76 in eos::mgm::NsCmd::MasterSubcmd(eos::console::NsProto_MasterProto const&, eos::console::ReplyProto&) ()
   from /lib64/libXrdEosMgm.so
#13 0x00007f84f8ddd010 in eos::mgm::NsCmd::ProcessRequest() ()
   from /lib64/libXrdEosMgm.so
#14 0x00007f84f8d7d748 in eos::mgm::IProcCommand::LaunchJob() ()
   from /lib64/libXrdEosMgm.so
#15 0x00007f84f8d7e158 in eos::mgm::IProcCommand::open(char const*, char const*, eos::common::VirtualIdentity&, XrdOucErrInfo*) ()
   from /lib64/libXrdEosMgm.so
#16 0x00007f84f8ebe272 in XrdMgmOfsFile::open(char const*, int, unsigned int, XrdSecEntity const*, char const*) ()
   from /lib64/libXrdEosMgm.so
#17 0x00007f84fd49d776 in XrdXrootdProtocol::do_Open() ()
   from /opt/eos/xrootd/lib64/libXrdServer.so.2
#18 0x00007f84fd215549 in XrdLink::DoIt() ()
   from /opt/eos/xrootd/lib64/libXrdUtils.so.2
#19 0x00007f84fd2188ff in XrdScheduler::Run() ()
   from /opt/eos/xrootd/lib64/libXrdUtils.so.2
#20 0x00007f84fd218a49 in XrdStartWorking(void*) ()
   from /opt/eos/xrootd/lib64/libXrdUtils.so.2
#21 0x00007f84fd1de7f7 in XrdSysThread_Xeq ()
   from /opt/eos/xrootd/lib64/libXrdUtils.so.2
#22 0x00007f84fcd92dd5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007f84fc094ead in clone () from /lib64/libc.so.6

To recover from this situation, I restarted the EOS services on the new slave:

[root@naneosmgr02(EOSSLAVE) ~]#systemctl restart eos

After the namespace booted, things were back to normal.

That happened twice.

JM

Hi Jean Michel,

I believe you are still using the old HA configuration with the QDB namespace, which does not work properly and which we don’t plan to support.
Do you have the EOS_USE_QDB_MASTER=1 env variable in your /etc/sysconfig/eos_env?

Cheers,
Elvin

Thanks Elvin. I have not yet moved the namespace to QuarkDB; the EOS update was a requirement. But I do not have EOS_USE_QDB_MASTER=0 in /etc/sysconfig/eos_env. Maybe this is the problem; I suppose I should add it until I have completed the transition to QuarkDB.

JM

Hi Jean Michel,

Wait, so you are using the old namespace … then we have a bug there. I’ll have a closer look and let you know.

Thanks,
Elvin

Hi Jean Michel,

The issue you reported is now fixed by the following commit:
https://gitlab.cern.ch/dss/eos/commit/3f196b40467665b2077c121cf7b3526da431de1b

Some refactoring was done to the NS component, and it led to a double deletion of some namespace objects during the master/slave fail-over.

Thanks for reporting.
Cheers,
Elvin

Thank you very much Elvin. For me the impact was not big; I will probably wait until this fix is released in a new version in the stable repo, since I do not perform master/slave transitions often.

JM