Master/Slave Configuration

I’m working on setting up a Citrine instance in a test environment. I have 4 nodes. Two will be MGM nodes in a master/slave configuration. Two will be FSTs. I have one MGM and the two FSTs installed. I’m using eos-4.2.15-1.

MGM Master: cmseos-itbmgm01
MGM Slave: cmseos-itbmgm02
FSTs: cmseos-itbfst04, cmseos-itbmgm05

During the first MGM install, I was getting the following error:

180302 15:16:52 time=1520025412.490847 func=BootNamespace level=CRIT logid=06fcce94-1e5f-11e8-8fff-0025901bede4 unit=mgm@cmseos-itbmgm01.fnal.gov:1094 tid=00007f91b4235880 source=Master:1984 tident=<service> sec= uid=0 gid=0 name= geo="" initialization returned ec=14 File does not exist and Create flag is absent: /var/eos/md/directories.cmseos-itbmgm02.fnal.gov.mdlog

Obviously, something was switched in the config. I eventually got the Master MGM up and running.

Now I am having a similar error with the slave MGM.

180308 09:09:34 time=1520521774.047156 func=BootNamespace            level=NOTE  logid=b586e0da-22e2-11e8-a57f-0025902b341e unit=mgm@cmseos-itbmgm02.fnal.gov:1094 tid=00007f42ea2098c0 source=Master:1925                    tident=<service> sec=      uid=0 gid=0 name= geo="" eos directory view configure started as slave
180308 09:09:34 time=1520521774.047390 func=BootNamespace            level=CRIT  logid=b586e0da-22e2-11e8-a57f-0025902b341e unit=mgm@cmseos-itbmgm02.fnal.gov:1094 tid=00007f42ea2098c0 source=Master:1981                    tident=<service> sec=      uid=0 gid=0 name= geo="" eos view initialization failed after 0 seconds
180308 09:09:34 time=1520521774.047435 func=BootNamespace            level=CRIT  logid=b586e0da-22e2-11e8-a57f-0025902b341e unit=mgm@cmseos-itbmgm02.fnal.gov:1094 tid=00007f42ea2098c0 source=Master:1984                    tident=<service> sec=      uid=0 gid=0 name= geo="" initialization returned ec=14 File does not exist and Create flag is absent: /var/eos/md/directories.cmseos-itbmgm01.fnal.gov.mdlog
180308 09:09:34 12634 XrootdConfig: Unable to create file system object via libXrdEosMgm.so
180308 09:09:34 12634 XrootdConfig: Unable to load file system.
------ xrootd protocol initialization failed.
180308 09:09:34 12634 XrdProtocol: Protocol xrootd could not be loaded
------ xrootd mgm@cmseos-itbmgm02.fnal.gov:-1 initialization failed.

So I have some questions about /etc/sysconfig/eos_env (or /etc/sysconfig/eos in Aquamarine). Here is a partial look at what mine currently look like:

cmseos-itbmgm01 (master):

XRD_ROLES="mq sync mgm "

#----------------------------------------
# EOS Configuration
#----------------------------------------

EOS_INSTANCE_NAME=eositb
EOS_AUTOLOAD_CONFIG="default"
EOS_BROKER_URL=root://localhost:1097//eos/
EOS_MGM_MASTER1=cmseos-itbmgm01.fnal.gov
EOS_MGM_MASTER2=cmseos-itbmgm02.fnal.gov
EOS_MGM_ALIAS=cmseos-itb.fnal.gov
EOS_MGM_HOST=cmseos-itbmgm01.fnal.gov
EOS_MGM_HOST_TARGET=cmseos-itbmgm02.fnal.gov

cmseos-itbmgm02 (slave):

XRD_ROLES="mq sync mgm "

#----------------------------------------
# EOS Configuration
#----------------------------------------

EOS_INSTANCE_NAME=eositb
EOS_AUTOLOAD_CONFIG="default"
EOS_BROKER_URL=root://cmseos-itbmgm01.fnal.gov:1097//eos/
EOS_MGM_MASTER1=cmseos-itbmgm01.fnal.gov
EOS_MGM_MASTER2=cmseos-itbmgm02.fnal.gov
EOS_MGM_ALIAS=cmseos-itb.fnal.gov
EOS_MGM_HOST=cmseos-itbmgm02.fnal.gov
EOS_MGM_HOST_TARGET=cmseos-itbmgm01.fnal.gov

Obviously something is wrong, but I’m not sure what to switch.

What should these look like in a master/slave setup?

Should master/slave be avoided and a master/master config be preferred?

Does anything in the config have to change if you do a master-slave switch?

Do both the master and slave need to run the ‘sync’ role?

If master/slave is still supported in SL7 and Citrine, what is the proper way to do a master/slave setup and a master/slave swap with systemd?

Hi Dan,

The error you see is normal the first time an instance is created.
By default the MGM starts in slave mode (unless told otherwise).

To set the MGM to be a master (and remove this error) you need to do:
systemctl start eos@master

As for the slave, you might need to compact the namespace once,
to add the follow flag to the namespace so that the slave is able to follow.

Before doing that, could you just try forcing the slave mode?
systemctl start eos@slave
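
As far as I understand, eos@master and eos@slave do little more than manage marker files that the MGM and MQ check at boot, roughly like this (the exact file names are my assumption from the master/slave documentation):

  # what 'systemctl start eos@master' effectively does (file names assumed)
  touch /var/eos/eos.mq.master   # MQ comes up as master
  touch /var/eos/eos.mgm.rw      # MGM boots as read-write master
  # 'systemctl start eos@slave' removes them again
  rm -f /var/eos/eos.mq.master /var/eos/eos.mgm.rw

Which mode the MGM actually came up in is visible in the “Replication” line of eos -b ns, as shown further down in this thread.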

Cheers,
Luca

Correct on the first point. I noticed it was coming up as a slave (as it will by default), so I ran systemctl start eos@master after shutting it down, and it seemed fine after that.

I have run systemctl start eos@slave on the second node a couple of times and it did not seem to help.

Can you explain what the follow flag is?

I ran a compact and it didn’t make a difference, same error on startup of the slave MGM.
The mq and sync services are up and running on the slave, it’s just the MGM that won’t start.

Before you ask, there are correct iptables rules in place and I even dropped iptables on both systems temporarily to make sure that wasn’t the issue.

I finally got it going. I did a compact and then restarted eossync on MGM1. I restarted it on MGM2 as well and then did systemctl restart eos@mgm on MGM2 and it started up. I have no idea why this worked.
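
In other words, roughly this sequence (a reconstruction of the steps described above; the compact invocation is the one shown later in this thread):

  [MGM1]# eos -b ns compact on 1 86400 all   # trigger a namespace compaction
  [MGM1]# systemctl restart eossync
  [MGM2]# systemctl restart eossync
  [MGM2]# systemctl restart eos@mgm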

Was eossync running OK after the compact (meaning the new file was already reaching the slave)?

The follow flag in the namespace is a tag in the namespace file which marks the end of the “compacted” part of the namespace and the beginning of the redo-log part.

Basically, once a slave starts, it will read everything up to the compaction point to build the namespace baseline, and then it will go into “follow mode” to track, record by record, what happened.
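
For reference, compaction is enabled from the CLI; this is the invocation used later in this thread (my reading of the parameters: start after 1 second, repeat every 86400 seconds, and compact both files and directories):

  eos -b ns compact on 1 86400 all

Once a compaction has run, the follow flag marks the compaction point, and eossync then has to ship the new namespace files to the slave before it can boot from them.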

Dear Luca,

I’m working on setting up Citrine EOS (ver 4.2.26) on CentOS 7 (kernel 3.10.0-862.3.3). Using recommendations from [https://github.com/cern-eos/eos/blob/master/doc/configuration/master.rst] and advice from this topic, I managed to successfully start a “master-slave” combination. And it worked fine until I tried to set a quota. Setting a quota immediately crashes the MGM on the slave server and it refuses to start. But when I remove the quota node (‘quota rmnode’ at the master) the slave MGM can start again. More details are as follows.

I have two servers: master fs01 and slave fs02. In DNS I have an “eos” entry as a round robin for the “fs01” and “fs02” servers. It is used as both the “alias” and the “broker”. Here I follow the section “Configuration of an MGM/MQ master/slave pair” on the page https://github.com/cern-eos/eos/blob/master/doc/configuration/master.rst: “The simplest configuration uses an alias to point round-robin to master and slave machine e.g. configure a static alias eosdev.cern.ch resolving to eosdevsrv1.cern.ch and eosdevsrv2.cern.ch. This name can be used in the FST configuration to define the broker URL and can be used by clients to talk to the instance independent of read or write access.”
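
A quick sanity check that the alias really resolves round-robin to both hosts (the addresses below are placeholders, not my real ones):

  $ host eos.hpc.utfsm.cl
  eos.hpc.utfsm.cl has address 192.0.2.1   # fs01 (example address)
  eos.hpc.utfsm.cl has address 192.0.2.2   # fs02 (example address)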

So at the master fs01 I have in /etc/sysconfig/eos_env:

XRD_ROLES="mq mgm sync fst"
EOS_MGM_HOST=fs01.hpc.utfsm.cl
EOS_MGM_HOST_TARGET=fs02.hpc.utfsm.cl
EOS_BROKER_URL=root://eos.hpc.utfsm.cl:1097//eos/
EOS_MGM_MASTER1=fs01.hpc.utfsm.cl
EOS_MGM_MASTER2=fs02.hpc.utfsm.cl
EOS_MGM_ALIAS=eos.hpc.utfsm.cl
EOS_INSTANCE_NAME=eoshpc

And at the slave fs02 in /etc/sysconfig/eos_env:

XRD_ROLES="mq mgm sync fst"
EOS_MGM_HOST=fs02.hpc.utfsm.cl
EOS_MGM_HOST_TARGET=fs01.hpc.utfsm.cl
EOS_BROKER_URL=root://eos.hpc.utfsm.cl:1097//eos/
EOS_MGM_MASTER1=fs01.hpc.utfsm.cl
EOS_MGM_MASTER2=fs02.hpc.utfsm.cl
EOS_MGM_ALIAS=eos.hpc.utfsm.cl
EOS_INSTANCE_NAME=eoshpc

At the master I started

  systemctl start eos@master
  systemctl start eos
  systemctl start eossync

and at the slave

  systemctl start eos@slave
  systemctl start eos
  systemctl start eossync

Both instances started successfully and I have

  [fs01]# eos -b ns
  # ------------------------------------------------------------------------------------
  # Namespace Statistics
  # ------------------------------------------------------------------------------------
  ALL      Files                            6 [booted] (0s)
  ALL      Directories                      16
  # ------------------------------------------------------------------------------------
  ALL      Compactification                 status=off waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
  # ------------------------------------------------------------------------------------
  ALL      Replication                      mode=master-rw state=master-rw master=fs01.hpc.utfsm.cl configdir=/var/eos/config/fs01.hpc.utfsm.cl/ config=default active=true mgm:fs02.hpc.utfsm.cl=ok mgm:mode=slave-ro mq:fs02.hpc.utfsm.cl:1097=ok
  ...

  [fs02]# eos -b ns
  # ------------------------------------------------------------------------------------
  # Namespace Statistics
  # ------------------------------------------------------------------------------------
  ALL      Files                            6 [booted] (5s)
  ALL      Directories                      16
  # ------------------------------------------------------------------------------------
  ALL      Compactification                 status=off waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
  # ------------------------------------------------------------------------------------
  ALL      Replication                      mode=slave-ro state=slave-ro master=fs01.hpc.utfsm.cl configdir=/var/eos/config/fs01.hpc.utfsm.cl/ config=default active=true mgm:fs01.hpc.utfsm.cl=ok mgm:mode=master-rw mq:fs01.hpc.utfsm.cl:1097=ok
  ...

It was possible, for example, to create a directory for user “yupi” which is accessible from both the master and the slave servers

  [fs01]# eos -b ls -l /eos/hpc/user/y/yupi
  -rw-r--r--   1 yupi     utfsm               6 Jun 26 12:08 test

  [fs02]# eos -b ls -l /eos/hpc/user/y/yupi
  -rw-r--r--   1 yupi     utfsm               6 Jun 26 12:08 test

and from a client machine too

  [yupi@wn37] $ ls -l /eos/hpc/user/y/yupi
  total 0
  -rw-r--r-- 1 yupi utfsm 6 Jun 26 12:08 test

But when I try to set a quota node on “/eos/hpc/user”

  [fs01]# eos -b space quota default on
  [fs01]# date
  Tue Jun 26 15:08:23 -04 2018
  [fs01]# eos -b quota set -u yupi -v 10G /eos/hpc/user
  success: updated volume quota for uid=1009 for node /eos/hpc/user/

the slave MGM immediately crashes and I see the following in /var/log/eos/mgm/xrdlog.mgm on the slave node:

   180626 15:08:23 time=1530040103.678415 func=FsConfigListener         level=INFO  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@fs02.hpc.utfsm.cl:1094 tid=00007f7b3f9f8700 source=FsConfigListener:164           tident=<single-exec> sec=      uid=0 gid=0 name= geo="" Call SetConfig quota:/eos/hpc/user/:uid=1009:userbytes 10000000000
   180626 15:08:23 time=1530040103.678537 func=SpaceQuota               level=INFO  logid=static.............................. unit=mgm@fs02.hpc.utfsm.cl:1094 tid=00007f7b3f9f8700 source=Quota:79                       tident= sec=(null) uid=99 gid=99 name=- geo="" No ns quota found for path=/eos/hpc/user/
   180626 15:08:23 time=1530040103.678616 func=SpaceQuota               level=CRIT  logid=static.............................. unit=mgm@fs02.hpc.utfsm.cl:1094 tid=00007f7b3f9f8700 source=Quota:91                       tident= sec=(null) uid=99 gid=99 name=- geo="" Cannot register quota node /eos/hpc/user/, errmsg=Unable to write the record data at offset 0xf6e3d4; Bad file descriptor
   error: received signal 11:
   /lib64/libXrdEosMgm.so(_Z20xrdmgmofs_stacktracei+0x44)[0x7f7b658eb0a4]
   /lib64/libc.so.6(+0x362f0)[0x7f7b6a55f2f0]
   /lib64/libXrdEosMgm.so(_ZN3eos3mgm5Quota6CreateERKSs+0x176)[0x7f7b65882746]
   /lib64/libXrdEosMgm.so(_ZN3eos3mgm13IConfigEngine15ApplyEachConfigEPKcP12XrdOucStringPv+0xab3)[0x7f7b6571c293]
   /lib64/libXrdEosMgm.so(_ZN9XrdMgmOfs16FsConfigListenerEv+0x1f02)[0x7f7b6590c402]
   /lib64/libXrdEosMgm.so(_ZN9XrdMgmOfs24StartMgmFsConfigListenerEPv+0x9)[0x7f7b6590f109]
   /lib64/libXrdUtils.so.2(XrdSysThread_Xeq+0x37)[0x7f7b6b768197]
   /lib64/libpthread.so.0(+0x7e25)[0x7f7b6b324e25]
   /lib64/libc.so.6(clone+0x6d)[0x7f7b6a627bad]
   #########################################################################
   # stack trace exec=xrootd pid=17208 what='thread apply all bt'
   #########################################################################

The MGM tries to restart but always fails:

  [fs02]# systemctl status eos@mgm
  ● eos@mgm.service - EOS mgm
     Loaded: loaded (/usr/lib/systemd/system/eos@.service; disabled; vendor preset: disabled)
     Active: activating (auto-restart) (Result: signal) since Tue 2018-06-26 15:12:35 -04; 640ms ago
    Process: 24064 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
    Process: 23961 ExecStart=/usr/bin/xrootd -n %i -c /etc/xrd.cf.%i -l /var/log/eos/xrdlog.%i -s /tmp/xrootd.%i.pid -Rdaemon (code=killed, signal=SEGV)
    Process: 23927 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
   Main PID: 23961 (code=killed, signal=SEGV)
  Jun 26 15:12:35 fs02.hpc.utfsm.cl systemd[1]: Unit eos@mgm.service entered failed state.
  Jun 26 15:12:35 fs02.hpc.utfsm.cl systemd[1]: eos@mgm.service failed.

But as soon as I remove the quota node:

  [fs01]# date
  Tue Jun 26 15:14:58 -04 2018
  [fs01]# eos -b quota rmnode -p /eos/hpc/user
  Do you really want to delete the quota node under path /eos/hpc/user ?
  Confirm the deletion by typing => 7377813624
                                 => 7377813624
  Deletion confirmed
  success: removed space quota for /eos/hpc/user/

the slave MGM automatically starts normally again:

  [root@fs02 mgm]# systemctl status eos@mgm
  ● eos@mgm.service - EOS mgm
     Loaded: loaded (/usr/lib/systemd/system/eos@.service; disabled; vendor preset: disabled)
     Active: active (running) since Tue 2018-06-26 15:15:14 -04; 16s ago
    Process: 28257 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
    Process: 28274 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
   Main PID: 28307 (xrootd)
     CGroup: /system.slice/system-eos.slice/eos@mgm.service
             ├─28307 /usr/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaem...
             ├─28624 /usr/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaem...
             └─28667 eos -b console log _MGMID_
  ...
  Jun 26 15:15:14 fs02.hpc.utfsm.cl xrootd[28307]: Register objects provide by NsInMemoryPlugin ...
  Jun 26 15:15:16 fs02.hpc.utfsm.cl xrootd[28307]: ==> new Broker root://eos.hpc.utfsm.cl:1097//eos/fs02.hpc.utfsm...log=1
  Jun 26 15:15:16 fs02.hpc.utfsm.cl xrootd[28307]: ==> new Broker root://daemon@eos.hpc.utfsm.cl:1097//eos/fs02.hp...log=0

So what was done wrong? It may be that I somehow misunderstood what the EOS “alias” and “broker” hosts mean and used them in the wrong way. But without quota settings everything works just fine. Could you please provide some hint on how to resolve this problem?

With the best wishes,

Yuri Ivanov

Hi Luca and all,

I’m running eos 4.3.12-1.el7. Using the recommendations here in this topic and those in https://github.com/cern-eos/eos/blob/master/doc/configuration/master.rst, I have been trying without success for many days to run the master/slave configuration. The slave refuses to run: only sync and mq run, the MGM stops running, and in the logs I see “File does not exist and Create flag is absent: /var/eos/md/directories.np02eos1.cern.ch.mdlog”:

@np02eos2.cern.ch:1094 tid=00007f56063d4780 source=Master:1985                    tident=<service> sec=      uid=0 gid=0 name= geo="" initialization returned ec=14 File does not exist and Create flag is absent: /var/eos/md/directories.np02eos1.cern.ch.mdlog
181113 08:30:16 13382 XrootdConfig: Unable to create file system object via libXrdEosMgm.so

Here is my configuration (namespace in memory), with a RR DNS alias np02eos.cern.ch which points to the two MGMs:

master np02eos1 /etc/sysconfig/eos_env:

XRD_ROLES="mq mgm sync"
EOS_MGM_HOST=np02eos1.cern.ch
EOS_MGM_HOST_TARGET=np02eos2.cern.ch
EOS_INSTANCE_NAME=np02eos
EOS_AUTOLOAD_CONFIG=default
# EOS_BROKER_URL=root://localhost:1097//eos/
EOS_BROKER_URL=root://$EOS_MGM_HOST:1097//eos/
EOS_GEOTAG="np02-daq"
EOS_MGM_MASTER1=np02eos1.cern.ch
EOS_MGM_MASTER2=np02eos2.cern.ch
EOS_MGM_ALIAS=np02eos.cern.ch

slave np02eos2 /etc/sysconfig/eos_env:

XRD_ROLES="mq mgm sync"
EOS_MGM_HOST=np02eos2.cern.ch
EOS_MGM_HOST_TARGET=np02eos1.cern.ch
EOS_INSTANCE_NAME=np02eos
EOS_AUTOLOAD_CONFIG=default
EOS_BROKER_URL=root://$EOS_MGM_HOST:1097//eos/
EOS_GEOTAG="np02-daq"
EOS_MGM_MASTER1=np02eos1.cern.ch
EOS_MGM_MASTER2=np02eos2.cern.ch
EOS_MGM_ALIAS=np02eos.cern.ch

I tried both EOS_BROKER_URL=root://$EOS_MGM_HOST:1097//eos/ and EOS_BROKER_URL=root://localhost:1097//eos/.

I compacted the namespace on the master MGM, as described:

eos -b ns compact on 1 86400 all

The command eos -b ns displays:

# ------------------------------------------------------------------------------------
# Namespace Statistics
# ------------------------------------------------------------------------------------
ALL      Files                            5 [booted] (0s)
ALL      Directories                      13
ALL      Total boot time                  0 s
# ------------------------------------------------------------------------------------
ALL      Compactification                 status=wait waitstart=86269 interval=86400 ratio-file=1.9:1 ratio-dir=1.6:1
# ------------------------------------------------------------------------------------
ALL      Replication                      mode=master-rw state=master-rw master=np02eos1.cern.ch configdir=/var/eos/config/np02eos1.cern.ch/ config=default active=true mgm:np02eos2.cern.ch=down mq:np02eos2.cern.ch:1097=ok

I have the same problem when I invert the roles, for example if np02eos2 is the master and np02eos1 is the slave, with the commands:

On the new master np02eos2:

[root@np02eos2 ~]# systemctl stop eos@*
[root@np02eos2 ~]# systemctl start eos@master
[root@np02eos2 ~]# systemctl start eos
[root@np02eos2 ~]# systemctl status eos@*|grep -E "Active|service - EOS"
● eos@mgm.service - EOS mgm
   Active: active (running) since Tue 2018-11-13 09:38:04 CET; 3min 20s ago
● eos@mq.service - EOS mq
   Active: active (running) since Tue 2018-11-13 09:38:04 CET; 3min 20s ago
● eos@sync.service - EOS sync
   Active: active (running) since Tue 2018-11-13 09:38:04 CET; 3min 20s ago

[root@np02eos2 ~]# eos -b ns compact on 1 86400 all

... waiting until Compactification                 status=wait

[root@np02eos2 ~]# systemctl restart eos@sync

On the new slave np02eos1:

[root@np02eos1 ~]# systemctl stop eos@*
[root@np02eos1 ~]# systemctl start eos@slave
[root@np02eos1 ~]# systemctl start eosslave
[root@np02eos1 ~]# systemctl start eos
[root@np02eos1 ~]# systemctl status eos@*|grep -E "Active|service - EOS"
● eos@sync.service - EOS sync
   Active: active (running) since Tue 2018-11-13 09:38:21 CET; 5min ago
● eos@mq.service - EOS mq
   Active: active (running) since Tue 2018-11-13 09:38:21 CET; 5min ago
● eos@mgm.service - EOS mgm
   Active: failed (Result: exit-code) since Tue 2018-11-13 09:38:33 CET; 4min 49s ago

and the new slave np02eos1 MGM logs show:

181113 09:38:33 time=1542098313.393817 func=BootNamespace            level=ALERT logid=81234e76-e71f-11e8-b662-00221965c090 unit=mgm@np02eos1.cern.ch:1094 tid=00007fa6fb8cb780 source=Master:1864                    tident=<service> sec=      uid=0 gid=0 name= geo="" msg="preset the expected namespace size to optimize RAM usage via EOS_NS_DIR_SIZE && EOS_NS_FILE_SIZE in /etc/sysconfig/eos"
181113 09:38:33 time=1542098313.393917 func=BootNamespace            level=NOTE  logid=81234e76-e71f-11e8-b662-00221965c090 unit=mgm@np02eos1.cern.ch:1094 tid=00007fa6fb8cb780 source=Master:1926                    tident=<service> sec=      uid=0 gid=0 name= geo="" eos directory view configure started as slave
181113 09:38:33 time=1542098313.394133 func=BootNamespace            level=CRIT  logid=81234e76-e71f-11e8-b662-00221965c090 unit=mgm@np02eos1.cern.ch:1094 tid=00007fa6fb8cb780 source=Master:1982                    tident=<service> sec=      uid=0 gid=0 name= geo="" eos view initialization failed after 0 seconds
181113 09:38:33 time=1542098313.394155 func=BootNamespace            level=CRIT  logid=81234e76-e71f-11e8-b662-00221965c090 unit=mgm@np02eos1.cern.ch:1094 tid=00007fa6fb8cb780 source=Master:1985                    tident=<service> sec=      uid=0 gid=0 name= geo="" initialization returned ec=14 File does not exist and Create flag is absent: /var/eos/md/directories.np02eos2.cern.ch.mdlog
181113 09:38:33 32638 XrootdConfig: Unable to create file system object via libXrdEosMgm.so
181113 09:38:33 32638 XrootdConfig: Unable to load file system.
------ xrootd protocol initialization failed.
181113 09:38:33 32638 XrdProtocol: Protocol xrootd could not be loaded
------ xrootd mgm@np02eos1.cern.ch:-1 initialization failed.

I did two fresh installs without success, and the firewall is disabled. I don’t understand what is wrong in my configuration. I’m looking for your advice.
Thanks,
Denis

The error you see is because you don’t have the ‘eossync’ service running, which replicates the namespace files from the master to the slave.
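
A quick way to verify this on the slave (the directories.* file name follows the pattern from the error message above; a matching files.*.mdlog is expected alongside it):

  # on both master and slave
  systemctl status eossync
  systemctl start eossync
  # on the slave, the namespace files replicated from the master should appear as
  ls -l /var/eos/md/directories.*.mdlog /var/eos/md/files.*.mdlog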

Thanks a lot Andreas, this was not clear to me.
Cheers,
Denis