Master/Slave Configuration

Dear Luca,

I’m working on setting up EOS Citrine (v4.2.26) on CentOS 7 (kernel 3.10.0-862.3.3). Following the recommendations in https://github.com/cern-eos/eos/blob/master/doc/configuration/master.rst and the advice in this topic, I managed to start a “master-slave” pair successfully, and it worked fine until I tried to set a quota. Setting a quota immediately crashes the MGM on the slave server, which then refuses to start. But when I remove the quota node (‘quota rmnode’ on the master), the slave MGM can start again. More details follow.

I have two servers: master fs01 and slave fs02. In DNS I have an “eos” entry as a round robin over fs01 and fs02; it is used as both the alias and the broker. Here I follow the section “Configuration of an MGM/MQ master/slave pair” on the page https://github.com/cern-eos/eos/blob/master/doc/configuration/master.rst: “The simplest configuration uses an alias to point round-robin to master and slave machine e.g. configure a static alias eosdev.cern.ch resolving to eosdevsrv1.cern.ch and eosdevsrv2.cern.ch. This name can be used in the FST configuration to define the broker URL and can be used by clients to talk to the instance independent of read or write access.”
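
As a sanity check, it may be worth verifying that the alias really covers both machines. A minimal sketch (the IP addresses below are placeholders; on a live system `resolved` would come from something like `dig +short eos.hpc.utfsm.cl` or `getent ahosts eos.hpc.utfsm.cl`):

```shell
# Placeholder resolution result; in practice, capture the output of e.g.
#   dig +short eos.hpc.utfsm.cl
resolved="10.0.0.1
10.0.0.2"

# Hypothetical addresses of fs01 and fs02 that the alias should cover
for ip in 10.0.0.1 10.0.0.2; do
  if ! printf '%s\n' "$resolved" | grep -Fxq "$ip"; then
    echo "alias is missing $ip" >&2
    exit 1
  fi
done
echo "alias resolves to both MGM hosts"
```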

So I have at master fs01 /etc/sysconfig/eos_env:

XRD_ROLES="mq mgm sync fst"
EOS_MGM_HOST=fs01.hpc.utfsm.cl
EOS_MGM_HOST_TARGET=fs02.hpc.utfsm.cl
EOS_BROKER_URL=root://eos.hpc.utfsm.cl:1097//eos/
EOS_MGM_MASTER1=fs01.hpc.utfsm.cl
EOS_MGM_MASTER2=fs02.hpc.utfsm.cl
EOS_MGM_ALIAS=eos.hpc.utfsm.cl
EOS_INSTANCE_NAME=eoshpc

And at slave fs02 /etc/sysconfig/eos_env:

XRD_ROLES="mq mgm sync fst"
EOS_MGM_HOST=fs02.hpc.utfsm.cl
EOS_MGM_HOST_TARGET=fs01.hpc.utfsm.cl
EOS_BROKER_URL=root://eos.hpc.utfsm.cl:1097//eos/
EOS_MGM_MASTER1=fs01.hpc.utfsm.cl
EOS_MGM_MASTER2=fs02.hpc.utfsm.cl
EOS_MGM_ALIAS=eos.hpc.utfsm.cl
EOS_INSTANCE_NAME=eoshpc

On the master I started

  systemctl start eos@master
  systemctl start eos
  systemctl start eossync

and on the slave

  systemctl start eos@slave
  systemctl start eos
  systemctl start eossync

Both instances started successfully, and I see:

  [fs01]# eos -b ns
  # ------------------------------------------------------------------------------------
  # Namespace Statistics
  # ------------------------------------------------------------------------------------
  ALL      Files                            6 [booted] (0s)
  ALL      Directories                      16
  # ------------------------------------------------------------------------------------
  ALL      Compactification                 status=off waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
  # ------------------------------------------------------------------------------------
  ALL      Replication                      mode=master-rw state=master-rw master=fs01.hpc.utfsm.cl configdir=/var/eos/config/fs01.hpc.utfsm.cl/ config=default active=true mgm:fs02.hpc.utfsm.cl=ok mgm:mode=slave-ro mq:fs02.hpc.utfsm.cl:1097=ok
  ...

  [fs02]# eos -b ns
  # ------------------------------------------------------------------------------------
  # Namespace Statistics
  # ------------------------------------------------------------------------------------
  ALL      Files                            6 [booted] (5s)
  ALL      Directories                      16
  # ------------------------------------------------------------------------------------
  ALL      Compactification                 status=off waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
  # ------------------------------------------------------------------------------------
  ALL      Replication                      mode=slave-ro state=slave-ro master=fs01.hpc.utfsm.cl configdir=/var/eos/config/fs01.hpc.utfsm.cl/ config=default active=true mgm:fs01.hpc.utfsm.cl=ok mgm:mode=master-rw mq:fs01.hpc.utfsm.cl:1097=ok
  ...

It was possible, for example, to create a directory for user “yupi”, which is accessible from both the master and the slave servers

  [fs01]# eos -b ls -l /eos/hpc/user/y/yupi
  -rw-r--r--   1 yupi     utfsm               6 Jun 26 12:08 test

  [fs02]# eos -b ls -l /eos/hpc/user/y/yupi
  -rw-r--r--   1 yupi     utfsm               6 Jun 26 12:08 test

and from a client machine as well:

  [yupi@wn37] $ ls -l /eos/hpc/user/y/yupi
  total 0
  -rw-r--r-- 1 yupi utfsm 6 Jun 26 12:08 test

But when I try to set a quota on the node “/eos/hpc/user”

  [fs01]# eos -b space quota default on
  [fs01]# date
  Tue Jun 26 15:08:23 -04 2018
  [fs01]# eos -b quota set -u yupi -v 10G /eos/hpc/user
  success: updated volume quota for uid=1009 for node /eos/hpc/user/

the slave MGM immediately crashes, and in /var/log/eos/mgm/xrdlog.mgm on the slave node I see:

   180626 15:08:23 time=1530040103.678415 func=FsConfigListener         level=INFO  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@fs02.hpc.utfsm.cl:1094 tid=00007f7b3f9f8700 source=FsConfigListener:164           tident=<single-exec> sec=      uid=0 gid=0 name= geo="" Call SetConfig quota:/eos/hpc/user/:uid=1009:userbytes 10000000000
   180626 15:08:23 time=1530040103.678537 func=SpaceQuota               level=INFO  logid=static.............................. unit=mgm@fs02.hpc.utfsm.cl:1094 tid=00007f7b3f9f8700 source=Quota:79                       tident= sec=(null) uid=99 gid=99 name=- geo="" No ns quota found for path=/eos/hpc/user/
   180626 15:08:23 time=1530040103.678616 func=SpaceQuota               level=CRIT  logid=static.............................. unit=mgm@fs02.hpc.utfsm.cl:1094 tid=00007f7b3f9f8700 source=Quota:91                       tident= sec=(null) uid=99 gid=99 name=- geo="" Cannot register quota node /eos/hpc/user/, errmsg=Unable to write the record data at offset 0xf6e3d4; Bad file descriptor
   error: received signal 11:
   /lib64/libXrdEosMgm.so(_Z20xrdmgmofs_stacktracei+0x44)[0x7f7b658eb0a4]
   /lib64/libc.so.6(+0x362f0)[0x7f7b6a55f2f0]
   /lib64/libXrdEosMgm.so(_ZN3eos3mgm5Quota6CreateERKSs+0x176)[0x7f7b65882746]
   /lib64/libXrdEosMgm.so(_ZN3eos3mgm13IConfigEngine15ApplyEachConfigEPKcP12XrdOucStringPv+0xab3)[0x7f7b6571c293]
   /lib64/libXrdEosMgm.so(_ZN9XrdMgmOfs16FsConfigListenerEv+0x1f02)[0x7f7b6590c402]
   /lib64/libXrdEosMgm.so(_ZN9XrdMgmOfs24StartMgmFsConfigListenerEPv+0x9)[0x7f7b6590f109]
   /lib64/libXrdUtils.so.2(XrdSysThread_Xeq+0x37)[0x7f7b6b768197]
   /lib64/libpthread.so.0(+0x7e25)[0x7f7b6b324e25]
   /lib64/libc.so.6(clone+0x6d)[0x7f7b6a627bad]
   #########################################################################
   # stack trace exec=xrootd pid=17208 what='thread apply all bt'
   #########################################################################

The MGM then tries to restart but always fails:

  [fs02]# systemctl status eos@mgm
  ● eos@mgm.service - EOS mgm
     Loaded: loaded (/usr/lib/systemd/system/eos@.service; disabled; vendor preset: disabled)
     Active: activating (auto-restart) (Result: signal) since Tue 2018-06-26 15:12:35 -04; 640ms ago
    Process: 24064 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
    Process: 23961 ExecStart=/usr/bin/xrootd -n %i -c /etc/xrd.cf.%i -l /var/log/eos/xrdlog.%i -s /tmp/xrootd.%i.pid -Rdaemon (code=killed, signal=SEGV)
    Process: 23927 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
   Main PID: 23961 (code=killed, signal=SEGV)
  Jun 26 15:12:35 fs02.hpc.utfsm.cl systemd[1]: Unit eos@mgm.service entered failed state.
  Jun 26 15:12:35 fs02.hpc.utfsm.cl systemd[1]: eos@mgm.service failed.
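
Since the trace has to be fished out of a long log each time the MGM respawns, a tiny helper can grab everything from the last crash marker onward (a sketch; `last_crash` is a hypothetical name, and the marker string matches the `error: received signal` line shown above):

```shell
# Sketch of a helper that prints an MGM log from its last crash marker
# ("error: received signal") to the end of the file.
last_crash() {
  awk '/error: received signal/ { n = NR }   # remember the last marker line
       { l[NR] = $0 }                        # buffer the whole file
       END { if (n) for (i = n; i <= NR; i++) print l[i] }' "$1"
}

# On the slave this would be invoked as:
#   last_crash /var/log/eos/mgm/xrdlog.mgm
```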

But as soon as I remove the quota node:

  [fs01]# date
  Tue Jun 26 15:14:58 -04 2018
  [fs01]# eos -b quota rmnode -p /eos/hpc/user
  Do you really want to delete the quota node under path /eos/hpc/user ?
  Confirm the deletion by typing => 7377813624
                                 => 7377813624
  Deletion confirmed
  success: removed space quota for /eos/hpc/user/

the slave MGM automatically starts normally again:

  [root@fs02 mgm]# systemctl status eos@mgm
  ● eos@mgm.service - EOS mgm
     Loaded: loaded (/usr/lib/systemd/system/eos@.service; disabled; vendor preset: disabled)
     Active: active (running) since Tue 2018-06-26 15:15:14 -04; 16s ago
    Process: 28257 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
    Process: 28274 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
   Main PID: 28307 (xrootd)
     CGroup: /system.slice/system-eos.slice/eos@mgm.service
             ├─28307 /usr/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaem...
             ├─28624 /usr/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaem...
             └─28667 eos -b console log _MGMID_
  ...
  Jun 26 15:15:14 fs02.hpc.utfsm.cl xrootd[28307]: Register objects provide by NsInMemoryPlugin ...
  Jun 26 15:15:16 fs02.hpc.utfsm.cl xrootd[28307]: ==> new Broker root://eos.hpc.utfsm.cl:1097//eos/fs02.hpc.utfsm...log=1
  Jun 26 15:15:16 fs02.hpc.utfsm.cl xrootd[28307]: ==> new Broker root://daemon@eos.hpc.utfsm.cl:1097//eos/fs02.hp...log=0

So, what did I do wrong? Perhaps I somehow misunderstood what the EOS “alias” and “broker” hosts mean and am using them in the wrong way. But without quota settings everything works just fine. Could you please provide a hint on how to resolve this problem?

With the best wishes,

Yuri Ivanov