Dear Luca,
I’m working on setting up EOS Citrine (version 4.2.26) on CentOS 7 (kernel 3.10.0-862.3.3). Following the recommendations in [https://github.com/cern-eos/eos/blob/master/doc/configuration/master.rst] and the advice from this topic, I managed to successfully start a “master-slave” combination, and it worked fine until I tried to set a quota. Setting a quota immediately crashes the MGM on the slave server, which then refuses to start. But when I remove the quota node (‘quota rmnode’ on the master), the slave MGM can start again. More details follow.
I have two servers: the master, fs01, and the slave, fs02. In DNS I have an “eos” entry as a round-robin alias for the “fs01” and “fs02” servers; it is used as both the “alias” and the “broker”. Here I follow the section “Configuration of an MGM/MQ master/slave pair” on the page https://github.com/cern-eos/eos/blob/master/doc/configuration/master.rst: “The simplest configuration uses an alias to point round-robin to master and slave machine e.g. configure a static alias eosdev.cern.ch resolving to eosdevsrv1.cern.ch and eosdevsrv2.cern.ch. This name can be used in the FST configuration to define the broker URL and can be used by clients to talk to the instance independent of read or write access.”
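For completeness, the round-robin entry is just a pair of A records for the same name; a hypothetical BIND zone fragment would look like the following (the IP addresses are placeholders, not the real ones from my setup):

```
; hypothetical zone fragment for hpc.utfsm.cl
; (IP addresses are placeholders)
eos   IN  A  192.0.2.11   ; fs01
eos   IN  A  192.0.2.12   ; fs02
```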
So on the master fs01 I have in /etc/sysconfig/eos_env:
XRD_ROLES="mq mgm sync fst"
EOS_MGM_HOST=fs01.hpc.utfsm.cl
EOS_MGM_HOST_TARGET=fs02.hpc.utfsm.cl
EOS_BROKER_URL=root://eos.hpc.utfsm.cl:1097//eos/
EOS_MGM_MASTER1=fs01.hpc.utfsm.cl
EOS_MGM_MASTER2=fs02.hpc.utfsm.cl
EOS_MGM_ALIAS=eos.hpc.utfsm.cl
EOS_INSTANCE_NAME=eoshpc
And on the slave fs02 in /etc/sysconfig/eos_env:
XRD_ROLES="mq mgm sync fst"
EOS_MGM_HOST=fs02.hpc.utfsm.cl
EOS_MGM_HOST_TARGET=fs01.hpc.utfsm.cl
EOS_BROKER_URL=root://eos.hpc.utfsm.cl:1097//eos/
EOS_MGM_MASTER1=fs01.hpc.utfsm.cl
EOS_MGM_MASTER2=fs02.hpc.utfsm.cl
EOS_MGM_ALIAS=eos.hpc.utfsm.cl
EOS_INSTANCE_NAME=eoshpc
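To make the symmetry explicit: the two env files are identical except for the EOS_MGM_HOST / EOS_MGM_HOST_TARGET pair, which is swapped between the nodes. A minimal sketch (writing only the differing lines to temporary files; the /tmp paths are arbitrary) shows the expected diff:

```shell
# Sketch: the only lines differing between the two eos_env files
# are EOS_MGM_HOST and EOS_MGM_HOST_TARGET, swapped between nodes.
cat > /tmp/eos_env.fs01 <<'EOF'
EOS_MGM_HOST=fs01.hpc.utfsm.cl
EOS_MGM_HOST_TARGET=fs02.hpc.utfsm.cl
EOF
cat > /tmp/eos_env.fs02 <<'EOF'
EOS_MGM_HOST=fs02.hpc.utfsm.cl
EOS_MGM_HOST_TARGET=fs01.hpc.utfsm.cl
EOF
# diff exits non-zero because the files differ; that is expected here
diff /tmp/eos_env.fs01 /tmp/eos_env.fs02 || true
```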
On the master I started
systemctl start eos@master
systemctl start eos
systemctl start eossync
and on the slave
systemctl start eos@slave
systemctl start eos
systemctl start eossync
Both instances started successfully, and I see:
[fs01]# eos -b ns
# ------------------------------------------------------------------------------------
# Namespace Statistics
# ------------------------------------------------------------------------------------
ALL Files 6 [booted] (0s)
ALL Directories 16
# ------------------------------------------------------------------------------------
ALL Compactification status=off waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
# ------------------------------------------------------------------------------------
ALL Replication mode=master-rw state=master-rw master=fs01.hpc.utfsm.cl configdir=/var/eos/config/fs01.hpc.utfsm.cl/ config=default active=true mgm:fs02.hpc.utfsm.cl=ok mgm:mode=slave-ro mq:fs02.hpc.utfsm.cl:1097=ok
...
[fs02]# eos -b ns
# ------------------------------------------------------------------------------------
# Namespace Statistics
# ------------------------------------------------------------------------------------
ALL Files 6 [booted] (5s)
ALL Directories 16
# ------------------------------------------------------------------------------------
ALL Compactification status=off waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
# ------------------------------------------------------------------------------------
ALL Replication mode=slave-ro state=slave-ro master=fs01.hpc.utfsm.cl configdir=/var/eos/config/fs01.hpc.utfsm.cl/ config=default active=true mgm:fs01.hpc.utfsm.cl=ok mgm:mode=master-rw mq:fs01.hpc.utfsm.cl:1097=ok
...
It was possible, for example, to create a directory for user “yupi”, which is accessible from both the master and the slave server
[fs01]# eos -b ls -l /eos/hpc/user/y/yupi
-rw-r--r-- 1 yupi utfsm 6 Jun 26 12:08 test
[fs02]# eos -b ls -l /eos/hpc/user/y/yupi
-rw-r--r-- 1 yupi utfsm 6 Jun 26 12:08 test
and from a client machine too
[yupi@wn37] $ ls -l /eos/hpc/user/y/yupi
total 0
-rw-r--r-- 1 yupi utfsm 6 Jun 26 12:08 test
But when I try to set a quota node on “/eos/hpc/user”
[fs01]# eos -b space quota default on
[fs01]# date
Tue Jun 26 15:08:23 -04 2018
[fs01]# eos -b quota set -u yupi -v 10G /eos/hpc/user
success: updated volume quota for uid=1009 for node /eos/hpc/user/
the slave MGM immediately crashes, and in /var/log/eos/mgm/xrdlog.mgm on the slave node I see:
180626 15:08:23 time=1530040103.678415 func=FsConfigListener level=INFO logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@fs02.hpc.utfsm.cl:1094 tid=00007f7b3f9f8700 source=FsConfigListener:164 tident=<single-exec> sec= uid=0 gid=0 name= geo="" Call SetConfig quota:/eos/hpc/user/:uid=1009:userbytes 10000000000
180626 15:08:23 time=1530040103.678537 func=SpaceQuota level=INFO logid=static.............................. unit=mgm@fs02.hpc.utfsm.cl:1094 tid=00007f7b3f9f8700 source=Quota:79 tident= sec=(null) uid=99 gid=99 name=- geo="" No ns quota found for path=/eos/hpc/user/
180626 15:08:23 time=1530040103.678616 func=SpaceQuota level=CRIT logid=static.............................. unit=mgm@fs02.hpc.utfsm.cl:1094 tid=00007f7b3f9f8700 source=Quota:91 tident= sec=(null) uid=99 gid=99 name=- geo="" Cannot register quota node /eos/hpc/user/, errmsg=Unable to write the record data at offset 0xf6e3d4; Bad file descriptor
error: received signal 11:
/lib64/libXrdEosMgm.so(_Z20xrdmgmofs_stacktracei+0x44)[0x7f7b658eb0a4]
/lib64/libc.so.6(+0x362f0)[0x7f7b6a55f2f0]
/lib64/libXrdEosMgm.so(_ZN3eos3mgm5Quota6CreateERKSs+0x176)[0x7f7b65882746]
/lib64/libXrdEosMgm.so(_ZN3eos3mgm13IConfigEngine15ApplyEachConfigEPKcP12XrdOucStringPv+0xab3)[0x7f7b6571c293]
/lib64/libXrdEosMgm.so(_ZN9XrdMgmOfs16FsConfigListenerEv+0x1f02)[0x7f7b6590c402]
/lib64/libXrdEosMgm.so(_ZN9XrdMgmOfs24StartMgmFsConfigListenerEPv+0x9)[0x7f7b6590f109]
/lib64/libXrdUtils.so.2(XrdSysThread_Xeq+0x37)[0x7f7b6b768197]
/lib64/libpthread.so.0(+0x7e25)[0x7f7b6b324e25]
/lib64/libc.so.6(clone+0x6d)[0x7f7b6a627bad]
#########################################################################
# stack trace exec=xrootd pid=17208 what='thread apply all bt'
#########################################################################
The MGM tries to restart but keeps failing:
[fs02]# systemctl status eos@mgm
● eos@mgm.service - EOS mgm
Loaded: loaded (/usr/lib/systemd/system/eos@.service; disabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: signal) since Tue 2018-06-26 15:12:35 -04; 640ms ago
Process: 24064 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
Process: 23961 ExecStart=/usr/bin/xrootd -n %i -c /etc/xrd.cf.%i -l /var/log/eos/xrdlog.%i -s /tmp/xrootd.%i.pid -Rdaemon (code=killed, signal=SEGV)
Process: 23927 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
Main PID: 23961 (code=killed, signal=SEGV)
Jun 26 15:12:35 fs02.hpc.utfsm.cl systemd[1]: Unit eos@mgm.service entered failed state.
Jun 26 15:12:35 fs02.hpc.utfsm.cl systemd[1]: eos@mgm.service failed.
But as soon as I remove the quota node:
[fs01]# date
Tue Jun 26 15:14:58 -04 2018
[fs01]# eos -b quota rmnode -p /eos/hpc/user
Do you really want to delete the quota node under path /eos/hpc/user ?
Confirm the deletion by typing => 7377813624
=> 7377813624
Deletion confirmed
success: removed space quota for /eos/hpc/user/
the slave MGM automatically starts normally again:
[root@fs02 mgm]# systemctl status eos@mgm
● eos@mgm.service - EOS mgm
Loaded: loaded (/usr/lib/systemd/system/eos@.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2018-06-26 15:15:14 -04; 16s ago
Process: 28257 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
Process: 28274 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
Main PID: 28307 (xrootd)
CGroup: /system.slice/system-eos.slice/eos@mgm.service
├─28307 /usr/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaem...
├─28624 /usr/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaem...
└─28667 eos -b console log _MGMID_
...
Jun 26 15:15:14 fs02.hpc.utfsm.cl xrootd[28307]: Register objects provide by NsInMemoryPlugin ...
Jun 26 15:15:16 fs02.hpc.utfsm.cl xrootd[28307]: ==> new Broker root://eos.hpc.utfsm.cl:1097//eos/fs02.hpc.utfsm...log=1
Jun 26 15:15:16 fs02.hpc.utfsm.cl xrootd[28307]: ==> new Broker root://daemon@eos.hpc.utfsm.cl:1097//eos/fs02.hp...log=0
So what did I do wrong? It may be that I somehow misunderstood what the EOS “alias” and “broker” hosts mean and am using them in a wrong way. But without quota settings everything works just fine. Could you please provide a hint on how to resolve this problem?
With best wishes,
Yuri Ivanov