MGM gets stuck when I switch from master to slave

Hi,
I am trying to set up an EOS failover cluster with a QuarkDB backend.
I have 3 QuarkDB machines (b00, b01, b02) and 2 EOS MGM nodes (m01, m02).

The master node has the following lines in /etc/xrd.cf.mgm:

EOS_MGM_HOST=m01.test.ru
EOS_MGM_HOST_TARGET=m02.test.ru
EOS_INSTANCE_NAME=eostest
EOS_MGM_MASTER1=m01.test.ru
EOS_MGM_MASTER2=m02.test.ru
#EOS_MGM_ALIAS=eos.test.ru
#EOS_PSS_MGM=$EOS_MGM_ALIAS:1094
EOS_BROKER_URL=root://eos.test.ru:1097//eos/

The slave node has the following lines in /etc/xrd.cf.mgm:

EOS_MGM_HOST=m02.test.ru
EOS_MGM_HOST_TARGET=m01.test.ru
EOS_INSTANCE_NAME=eostest
EOS_MGM_MASTER1=m01.test.ru
EOS_MGM_MASTER2=m02.test.ru
#EOS_MGM_ALIAS=eos.test.ru
#EOS_PSS_MGM=$EOS_MGM_ALIAS:1094
EOS_BROKER_URL=root://eos.test.ru:1097//eos/
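The two snippets above appear to mirror each other: each node names itself in EOS_MGM_HOST and its peer in EOS_MGM_HOST_TARGET, while the instance name and the master list are shared. A minimal Python sketch that checks this, assuming the settings are plain KEY=VALUE lines as shown (parse_env and mirror_problems are hypothetical helpers, not part of EOS):

```python
"""Sanity-check that two MGM config snippets mirror each other (illustrative sketch)."""

M01 = """\
EOS_MGM_HOST=m01.test.ru
EOS_MGM_HOST_TARGET=m02.test.ru
EOS_INSTANCE_NAME=eostest
EOS_MGM_MASTER1=m01.test.ru
EOS_MGM_MASTER2=m02.test.ru
#EOS_MGM_ALIAS=eos.test.ru
EOS_BROKER_URL=root://eos.test.ru:1097//eos/
"""

M02 = """\
EOS_MGM_HOST=m02.test.ru
EOS_MGM_HOST_TARGET=m01.test.ru
EOS_INSTANCE_NAME=eostest
EOS_MGM_MASTER1=m01.test.ru
EOS_MGM_MASTER2=m02.test.ru
#EOS_MGM_ALIAS=eos.test.ru
EOS_BROKER_URL=root://eos.test.ru:1097//eos/
"""


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            env[key.strip()] = value.strip()
    return env


def mirror_problems(a: dict, b: dict) -> list:
    """List inconsistencies between two parsed MGM configs (empty list = consistent)."""
    problems = []
    # Each node should point at the other one as its target.
    if a.get("EOS_MGM_HOST_TARGET") != b.get("EOS_MGM_HOST"):
        problems.append("node A: EOS_MGM_HOST_TARGET is not node B's host")
    if b.get("EOS_MGM_HOST_TARGET") != a.get("EOS_MGM_HOST"):
        problems.append("node B: EOS_MGM_HOST_TARGET is not node A's host")
    # Instance-wide settings should be identical on both nodes.
    for key in ("EOS_INSTANCE_NAME", "EOS_MGM_MASTER1", "EOS_MGM_MASTER2"):
        if a.get(key) != b.get(key):
            problems.append(f"{key} differs between the nodes")
    return problems


if __name__ == "__main__":
    issues = mirror_problems(parse_env(M01), parse_env(M02))
    print(issues or "the two configs mirror each other")
```

This only compares strings. It does not check whether EOS_BROKER_URL is correct, which is a separate question.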

I start EOS in the following order:

m01:# systemctl start eos@master
m01:# systemctl start eos@sync
m01:# systemctl start eos@mq
m01:# systemctl start eos@mgm

m02:# systemctl start eos@master
m02:# systemctl start eos@sync
m02:# systemctl start eos@mq
m02:# systemctl start eos@mgm

After that I receive these errors:

---- high rate error messages suppressed ----
181024 16:32:28 time=1540387948.217943 func=Supervisor level=CRIT logid=27b61c8a-d791-11e8-b374-000af7e02290 unit=mgm@eos.test.ru:1094 tid=00007fc507dfc700 source=Master:412 tident= sec= uid=0 gid=0 name= geo="" msg="dual RW master setup detected"
---- high rate error messages suppressed ----
181024 16:32:34 time=1540387954.233800 func=Supervisor level=CRIT logid=27b61c8a-d791-11e8-b374-000af7e02290 unit=mgm@eos.test.ru:1094 tid=00007fc507dfc700 source=Master:412 tident= sec= uid=0 gid=0 name= geo="" msg="dual RW master setup detected"

I fixed this problem by changing the broker URL line on each node:
m01:
EOS_BROKER_URL=root://m01.test.ru:1097//eos/
m02:
EOS_BROKER_URL=root://m02.test.ru:1097//eos/

But what is the broker_url option? I can’t find information about it in the documentation. If I have the alias eos.test.ru, what should the broker_url line contain?

But then I try to switch the master:

m01:~ # eos -b ns master m02.test.ru
configdir=/var/eos/config/m02.test.ru/ activating master=m02.test.rusuccess: <m02.test.ru> is now the master

m02:~ # eos -b ns master m02.test.ru

In the MGM log on m02:

[QCLIENT - INFO - processRedirection:377] redirecting to b02.test.ru:7777
[QCLIENT - INFO - processRedirection:377] redirecting to b02.test.ru:7777
[QCLIENT - INFO - processRedirection:377] redirecting to b02.test.ru:7777
[QCLIENT - INFO - processRedirection:377] redirecting to b02.test.ru:7777

181029 10:57:47 time=1540799867.150357 func=Slave2Master level=CRIT logid=e4eff2ce-db4e-11e8-b22c-000af7e0a0ea unit=mgm@m02.test.ru:1094 tid=00007fd3d07ff700 source=Master:1335 tident= sec= $
terminate called after throwing an instance of ‘std::logic_error’
what(): basic_string::_S_construct null not valid
error: received signal 6:
/lib64/libXrdEosMgm.so(_Z20xrdmgmofs_stacktracei+0x44)[0x7fd3ce29d874]
/lib64/libc.so.6(+0x36280)[0x7fd3d3a3c280]
/lib64/libc.so.6(gsignal+0x37)[0x7fd3d3a3c207]
/lib64/libc.so.6(abort+0x148)[0x7fd3d3a3d8f8]
/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165)[0x7fd3d434b7d5]
/lib64/libstdc++.so.6(+0x5e746)[0x7fd3d4349746]
/lib64/libstdc++.so.6(+0x5e773)[0x7fd3d4349773]
/lib64/libstdc++.so.6(+0x5e993)[0x7fd3d4349993]
/lib64/libstdc++.so.6(_ZSt19__throw_logic_errorPKc+0x77)[0x7fd3d439e597]
/lib64/libstdc++.so.6(_ZNSs12_S_constructIPKcEEPcT_S3_RKSaIcESt20forward_iterator_tag+0xa1)[0x7fd3d43aa3c1]
#########################################################################

stack trace exec=xrootd pid=14234 what=‘thread apply all bt’

#########################################################################

It got stuck. The EOS MGM is still running, but these actions can’t be performed.

Status on each MGM node:
m01

ALL Replication mode=slave-ro state=slave-ro master=m02.test.ru configdir=/var/eos/config/m02.test.ru/ config=default active=true mgm:m02.test.ru=ok mgm:mode=slave-ro mq:m02.test.ru:1097=ok

m02

ALL Replication mode=slave-ro state=slave-ro master=m01.test.ru configdir=/var/eos/config/m01.test.ru/ config=default active=true mgm:m01.test.ru=ok mgm:mode=slave-ro mq:m01.test.ru:1097=o

Before this action, m01 was the master.
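The two status lines above show the deadlock: neither node is in a master mode, and each one names the other as master. A small Python sketch that parses such "ALL Replication ..." lines into key/value pairs and reports this situation, assuming the space-separated key=value layout shown above (parse_replication and diagnose are hypothetical helpers; the sample lines are trimmed to the fields the sketch uses):

```python
"""Diagnose legacy (in-memory) master/slave status lines from `eos ns` (sketch)."""

M01_STATUS = "ALL Replication mode=slave-ro state=slave-ro master=m02.test.ru"
M02_STATUS = "ALL Replication mode=slave-ro state=slave-ro master=m01.test.ru"


def parse_replication(line: str) -> dict:
    """Turn 'ALL Replication key=value ...' into a dict; bare words are ignored."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def diagnose(status_by_node: dict) -> str:
    """Summarise which node(s), if any, report a master mode."""
    parsed = {node: parse_replication(line) for node, line in status_by_node.items()}
    masters = sorted(n for n, f in parsed.items() if f.get("mode", "").startswith("master"))
    if len(masters) == 1:
        return f"ok: {masters[0]} is the only master"
    if masters:
        return "dual master: " + ", ".join(masters)
    # Nobody is master: show whom each node believes the master to be.
    follows = [f"{n} follows {f.get('master', '?')}" for n, f in sorted(parsed.items())]
    return "no master: " + "; ".join(follows)


if __name__ == "__main__":
    print(diagnose({"m01.test.ru": M01_STATUS, "m02.test.ru": M02_STATUS}))
```

For the two nodes above this reports that there is no master, and that each node follows the other.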

How can I switch the master to slave? What’s wrong?

Hi Ivan,

There are still some preparatory steps that you need to take in order to have the full master-slave setup working with the new QuarkDB backend. This is still not very well tested at the moment, and some polishing will go into it in the following weeks.

Having said that, what you first need to do is export the local configuration that now sits in /var/eos/config/<hostname>/default.eoscf to QuarkDB. For this you only need one MGM up and running. But before starting the MGM, you need to set EOS_USE_QDB_MASTER=1 in your /etc/sysconfig/eos_env. This variable is needed to use the Master-Slave implementation that sits on top of QuarkDB, which is different from the Master-Slave implementation used with the legacy in-memory namespace.
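Setting the variable can be scripted so that re-running it does not add duplicate lines. A Python sketch, assuming /etc/sysconfig/eos_env is a plain KEY=VALUE file (set_env_var is a hypothetical helper; it returns the new text and leaves reading and writing the file to the caller):

```python
"""Idempotently set a KEY=VALUE line in a sysconfig-style file body (sketch)."""
import re


def set_env_var(text: str, key: str, value: str) -> str:
    """Return `text` with `key` set to `value`.

    The first uncommented `key=...` line is replaced in place; otherwise the
    assignment is appended. Commented-out lines are left untouched.
    """
    line = f"{key}={value}"
    pattern = re.compile(rf"^[ \t]*{re.escape(key)}=.*$", re.MULTILINE)
    if pattern.search(text):
        # A callable replacement avoids re treating backslashes in `value` as escapes.
        return pattern.sub(lambda _m: line, text, count=1)
    if text and not text.endswith("\n"):
        text += "\n"
    return text + line + "\n"


if __name__ == "__main__":
    before = "EOS_MGM_HOST=m01.test.ru\nEOS_INSTANCE_NAME=eostest\n"
    print(set_env_var(before, "EOS_USE_QDB_MASTER", "1"), end="")
```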

After having done this, you start up only one MGM and you can use the eos config export command to export the local configuration to QuarkDB. Once this is done, you need to stop the MGM and modify the /etc/xrd.cf.mgm config file and change this line:
mgmofs.cfgtype file
to this
mgmofs.cfgtype quarkdb
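This is a one-line change. If you manage both MGMs with scripts, a Python sketch of the edit could look like the following. It assumes the directive sits on its own line in /etc/xrd.cf.mgm and that commented-out lines start with '#'. CFGTYPE and switch_cfgtype are hypothetical names. Per the steps above, apply it only after the configuration has been exported and the MGM stopped.

```python
"""Switch the mgmofs.cfgtype directive in an xrd.cf.mgm body (sketch)."""
import re

# Matches an uncommented "mgmofs.cfgtype <value>" line; group 1 keeps the
# directive and its separating whitespace so only the value is replaced.
CFGTYPE = re.compile(r"^([ \t]*mgmofs\.cfgtype[ \t]+)\S+", re.MULTILINE)


def switch_cfgtype(config: str, new_type: str = "quarkdb") -> str:
    """Return `config` with every mgmofs.cfgtype value set to `new_type`.

    Raises ValueError if no directive is found, so a missing line is not
    silently ignored.
    """
    new_config, count = CFGTYPE.subn(lambda m: m.group(1) + new_type, config)
    if count == 0:
        raise ValueError("no mgmofs.cfgtype directive found")
    return new_config


if __name__ == "__main__":
    before = ("mgmofs.nslib /usr/lib64/libEosNsInMemory.so\n"
              "mgmofs.cfgtype file\n")
    print(switch_cfgtype(before), end="")
```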

Now you can start the MGM, and you should see something like this in the output of “eos ns”:

# ------------------------------------------------------------------------------------
ALL      Replication                      mode=master-rw state=master-rw master=esdss000.cern.ch configdir=/var/eos/config/esdss000.cern.ch/ config=default
# ------------------------------------------------------------------------------------

Make the same modification in the other MGM’s /etc/xrd.cf.mgm file and then start it up. The second one should become a slave. You should now have a working master-slave setup. But please keep in mind that there might still be some things to improve, so this is not yet production ready.

Cheers,
Elvin

One small correction to the previous post. The output of the eos ns command should have a line like the following:

# ------------------------------------------------------------------------------------
ALL      Replication                      is_master=true master_id=esdss000.cern.ch:1094
# ------------------------------------------------------------------------------------

Sorry, I copy-pasted the wrong thing before.
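With the QuarkDB-based implementation, the Replication line reduces to is_master and master_id. A Python sketch that parses it and checks a pair of MGMs, assuming a healthy pair has exactly one node reporting is_master=true and both nodes reporting the same master_id. The slave line below is a hypothetical example, not output quoted in this thread. parse_qdb_replication and check_pair are hypothetical helpers.

```python
"""Check QuarkDB master/slave status lines from `eos ns` (sketch)."""

MASTER_LINE = "ALL      Replication                      is_master=true master_id=esdss000.cern.ch:1094"
# Hypothetical slave output, assumed to follow the same format.
SLAVE_LINE = "ALL      Replication                      is_master=false master_id=esdss000.cern.ch:1094"


def parse_qdb_replication(line: str) -> tuple:
    """Return (is_master, master_id) parsed from a Replication line."""
    fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
    return fields.get("is_master") == "true", fields.get("master_id", "")


def check_pair(lines_by_node: dict) -> list:
    """List problems with a set of MGM status lines (empty list = looks healthy)."""
    parsed = {node: parse_qdb_replication(line) for node, line in lines_by_node.items()}
    problems = []
    claimants = sorted(node for node, (is_master, _) in parsed.items() if is_master)
    if len(claimants) != 1:
        problems.append(f"expected exactly one master, found {claimants or 'none'}")
    master_ids = {master_id for _, master_id in parsed.values()}
    if len(master_ids) != 1:
        problems.append(f"nodes disagree on master_id: {sorted(master_ids)}")
    return problems


if __name__ == "__main__":
    print(check_pair({"mgm-a": MASTER_LINE, "mgm-b": SLAVE_LINE}) or "looks healthy")
```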

Cheers,
Elvin

Thank you for the answer!

I added EOS_USE_QDB_MASTER=1 to /etc/sysconfig/eos_env and set mgmofs.cfgtype file together with mgmofs.nslib /usr/lib64/libEosNsInMemory.so.

Then I ran:

eos config export -f /var/eos/config/m01.test.ru/default.eoscf
error: this command is available only with ConfigEngine type ‘quarkdb’ (errc=22) (Invalid argument)

In the log file:

181030 11:32:26 time=1540888346.458450 func=MakeResult level=ERROR logid=static… unit=mgm@m01.test.ru:1094 tid=00007fc10dfff700 source=ProcCommand:667 tident= sec=(null) uid=99 gid=99 name=- geo="" error: this command is available only with ConfigEngine type ‘quarkdb’ (errno=22)
181030 11:32:26 17418 XrootdXeq: root.17815:98@localhost.localdomain disc 0:00:00

If I set mgmofs.cfgtype quarkdb, the file /var/eos/config/m01.test.ru/default.eoscf does not exist.

Hi Ivan,

You should export the old configuration only if you are “converting” an in-memory instance to QuarkDB. If you start fresh, with a new MGM and everything, then you don’t have a default.eoscf file. In this case you don’t need to do anything: the MGM will use QuarkDB directly to save the new config changes that you add.
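The rule above comes down to one check: export only if the old default.eoscf exists. A Python sketch, assuming the /var/eos/config/<hostname>/default.eoscf layout mentioned earlier in the thread and that the directory name equals the host's fully qualified name as returned by socket.getfqdn(). needs_config_export is a hypothetical helper.

```python
"""Decide whether a legacy file-based MGM config needs exporting (sketch)."""
import socket
from pathlib import Path


def needs_config_export(hostname: str, config_root: str = "/var/eos/config") -> bool:
    """True if <config_root>/<hostname>/default.eoscf exists, i.e. there is an
    old file-based configuration to convert; False for a fresh instance."""
    return (Path(config_root) / hostname / "default.eoscf").is_file()


if __name__ == "__main__":
    host = socket.getfqdn()
    if needs_config_export(host):
        print(f"{host}: legacy config found, export it to QuarkDB before switching cfgtype")
    else:
        print(f"{host}: no default.eoscf, nothing to export (fresh instance)")
```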

Cheers,
Elvin

Hi Elvin,
Yes, I’m using a fresh configuration. I’m testing switching from master to slave as described in the documentation:
http://eos-docs.web.cern.ch/eos-docs/configuration/master.html

switch the master MGM to RO mode

eosdevsrv1:# eos -b ns master eosdevsrv2.cern.ch

switch the slave MGM to master mode

eosdevsrv2:# eos -b ns master eosdevsrv2.cern.ch

switch the RO mode master MGM to slave mode

eosdevsrv1:# eos -b ns master eosdevsrv2.cern.ch

m01:~ # eos -b ns master m02.test.ru
configdir=/var/eos/config/m02.test.ru/ activating master=m02.jinr.rusuccess: <eos-m02.test.ru> is now the master

m02:~ # eos -b ns master m02.test.ru
^C

And after that it got stuck…

m01:~ # eos -b ns | grep Repl
ALL Replication mode=slave-ro state=slave-ro master=m02.test.ru configdir=/var/eos/config/m02.test.ru/ config=default active=true mgm:m02.test.ru=ok mgm:mode=slave-ro mq:m02.test.ru:1097=ok

m02:~ # eos -b ns | grep Repl
ALL Replication mode=slave-ro state=slave-ro master=m01.jinr.ru configdir=/var/eos/config/m01.test.ru/ config=default active=true mgm:m01.test.ru=ok mgm:mode=slave-ro mq:m01.test.ru:1097=ok

And a second question about EOS_BROKER_URL: which value should it have, the server name or the alias?
Currently it has the following values:
m01:
EOS_BROKER_URL=root://m01.test.ru:1097//eos/
m02:
EOS_BROKER_URL=root://m02.test.ru:1097//eos/

Hi Ivan,

For the new master-slave QuarkDB backend you should use this page as documentation:
http://eos-docs.web.cern.ch/eos-docs/configuration/master_quarkdb.html

There is another restriction for this setup, namely, the MGM and its corresponding MQ need to be on the same machine. Your EOS_BROKER_URLs seem correct to me.
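Under that restriction, each MGM's EOS_BROKER_URL names its own host, which is what the two values above do. A Python sketch that builds such a URL from the local host name and checks an existing one, assuming the root://<host>:1097//eos/ format used in this thread. broker_url_for and broker_is_local are hypothetical helpers.

```python
"""Build and check EOS_BROKER_URL for a co-located MQ (sketch)."""
import socket
from urllib.parse import urlsplit


def broker_url_for(host: str, port: int = 1097) -> str:
    """EOS_BROKER_URL pointing at the MQ on `host`, in the format used above."""
    return f"root://{host}:{port}//eos/"


def broker_is_local(broker_url: str, mgm_host: str) -> bool:
    """True if the URL's host name equals the MGM's own host name.

    Plain string comparison, no DNS lookup: a shared alias such as
    eos.test.ru is reported as not local even if it resolves to this machine.
    """
    return (urlsplit(broker_url).hostname or "") == mgm_host.lower()


if __name__ == "__main__":
    me = socket.getfqdn()
    print(f"EOS_BROKER_URL={broker_url_for(me)}")
```

Skipping DNS resolution is the conservative choice here. The shared alias was exactly the setting that had to be replaced with per-host URLs earlier in the thread.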

Cheers,
Elvin

Hi Elvin,

In your example, the master shows master mode in the Replication line. But when I run

systemctl start eos@mgm
systemctl start eos@sync
systemctl start eos@mq
systemctl start eossync

I get:

ALL Replication mode=slave-ro state=slave-ro

Why is it mode=slave-ro? How can I switch it to master?

Also, when I added:

mgmofs.qdbpassword_file /etc/eos.keytab

the MGM log is spammed with:

qclient: HmacAuthHandshake failed with error ERR no password is set
qclient: HmacAuthHandshake failed with error ERR no password is set
qclient: HmacAuthHandshake failed with error ERR no password is set

Is it a problem with eos.keytab?

m01:~ # ll /etc/eos.keytab
-r-------- 1 daemon daemon 135 Oct 15 13:58 /etc/eos.keytab

Hi Ivan,

I guess I’ve misunderstood you. What exactly are you trying to test?

  1. For the in-memory namespace with a master-slave setup, this link should point you in the right direction:
    http://eos-docs.web.cern.ch/eos-docs/configuration/master.html

  2. For the namespace in QuarkDB, along with the master-slave implementation that also uses QuarkDB, this link should help:
    http://eos-docs.web.cern.ch/eos-docs/configuration/master_quarkdb.html

Let me know if this helps.

Cheers,
Elvin

Hi Elvin,

I was trying to test the QuarkDB backend in a master-slave configuration for EOS.
I read the documentation. With the in-memory namespace everything worked, but with the QuarkDB backend I received errors from the EOS service.
I thought it was a mistake in my configuration, but then you wrote that

this is not yet production ready

I decided to test it anyway :). Unfortunately nothing worked, but thank you for your help. I will wait for new releases.

Elvin,
I remembered one more question: which eos-server release do you use with QuarkDB?