Slave MGM crashes when quota node enabled

yupi · June 28, 2018, 3:05pm

I’m setting up EOS Citrine (version 4.2.26) on CentOS 7. I managed to start a master-slave combination successfully, and it worked fine until I tried to set a quota. Setting a quota immediately crashes the MGM on the slave server, and it then refuses to start. When I remove the quota node (‘quota rmnode’ on the master), the slave MGM can start again. All details of this problem are in my message in the Master/Slave Configuration topic. Has anybody encountered this or a similar situation?

With the best wishes,

Yuri Ivanov

dszkola · October 9, 2018, 8:15pm

Was this ever fixed? I’m now seeing this exact same thing. Here is what the log shows on the slave server:

181009 15:12:29 time=1539115949.179154 func=SpaceQuota level=CRIT logid=static… unit=mgm@cmseos-itbmgm02.fnal.gov:1094 tid=00007fcf4ec648c0 source=Quota:68 tident= sec=(null) uid=99 gid=99 name=- geo="" Cannot create quota directory /eos/uscms/store/user/

EOS_SERVER_VERSION=4.2.28 EOS_SERVER_RELEASE=1
xrootd-server-4.8.4-1.el7.x86_64
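
For anyone comparing such entries across servers, the log line above can be split into its key=value fields and the trailing message. This is a minimal Python sketch, assuming the layout seen in this one line (space-separated key=value pairs followed by free text); the function name `parse_mgm_log_line` is hypothetical and not part of EOS, and the elided `logid=static…` value is kept as posted.

```python
import re

# Sample line copied from the post above; the "logid=static…" elision is kept as-is.
LINE = ('181009 15:12:29 time=1539115949.179154 func=SpaceQuota level=CRIT '
        'logid=static… unit=mgm@cmseos-itbmgm02.fnal.gov:1094 '
        'tid=00007fcf4ec648c0 source=Quota:68 tident= sec=(null) uid=99 gid=99 '
        'name=- geo="" Cannot create quota directory /eos/uscms/store/user/')


def parse_mgm_log_line(line):
    """Return (fields, message) for one log line.

    Assumption: every key=value pair precedes the free-text message and
    the message itself contains no '=' character.
    """
    fields = {}
    last_end = 0
    for match in re.finditer(r'(\w+)=(\S*)', line):
        fields[match.group(1)] = match.group(2)
        last_end = match.end()
    # Whatever follows the last key=value pair is the human-readable message.
    message = line[last_end:].strip()
    return fields, message


fields, message = parse_mgm_log_line(LINE)
print(fields["level"], fields["func"], fields["source"])
print(message)
```

Filtering on `level` and `func` this way makes it easier to pick the quota-related CRIT entries out of a long MGM log.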

–
Dan Szkola
FNAL

yupi · October 10, 2018, 1:47pm

Probably not. At least, I got no reply on this issue. By the way, I later found that the EOS documentation on Master/Slave configuration now refers only to the “BERYLL release” (see http://eos-docs.web.cern.ch/eos-docs/configuration/master.html). So one can guess that this mode is not intended for Citrine. Moreover, it may be that for Citrine only the Master/Slave setup with QuarkDB works (see http://eos-docs.web.cern.ch/eos-docs/configuration/master_quarkdb.html). But I didn’t test it: for the moment we have suspended all work on EOS.

With the best wishes,

Yuri Ivanov

crystal · October 11, 2018, 12:57am

I think this has been fixed since 4.3.0, according to the changelog (https://gitlab.cern.ch/dss/eos/blob/dev/doc/releases/citrine-release.rst)

From memory, manually compacting the mdlogs on the slave might fix this issue, but upgrading to 4.3.x should certainly stop that occurring.
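
Checking whether an installed release is at or past the version mentioned above can be scripted. This is a minimal Python sketch, assuming plain dotted numeric version strings; the 4.3.0 threshold is only the changelog reading given above, not a confirmed fix version, and the helper names `parse_version` and `at_least` are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '4.2.28' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def at_least(installed, threshold="4.3.0"):
    """True if `installed` is the same as or newer than `threshold`.

    Assumption: the default threshold is taken from this thread only.
    """
    return parse_version(installed) >= parse_version(threshold)


# Versions mentioned in this thread.
for version in ("4.2.26", "4.2.28", "4.3.12"):
    print(version, at_least(version))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "4.10.0" sorting before "4.3.0".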

Master/slave setup definitely works with Citrine; I guess that documentation could use some updating. We don’t run slave MGMs right now, but we were doing so for a while (AARNet Citrine Upgrade Site Report).

QuarkDB isn’t necessary for Citrine either - we’re not using that yet in production, but I do have it running on a test cluster.

yupi · October 11, 2018, 2:31pm

Great! Thanks for your clarifications and comments! Hopefully my next experience with EOS will be more successful.

By the way, you wrote that you are not using Master/Slave mode anymore. Do you use several masters instead? Is there any documentation on configuring such a mode?

With the best wishes,

Yuri Ivanov

dszkola · October 11, 2018, 2:38pm

Sure enough, compacting the namespace seems to have alleviated the issue. After restarting EOS services on the slave MGM, it stayed up and running instead of crashing.

We’re in the middle of moving our EOS nodes to SL7, and then we will move to Citrine. I’m concerned about the 4.2.x releases at this point, as there seem to be quite a few issues, and I don’t think 4.3.x is for production use yet, unless I missed an announcement. So I’m stuck wondering which release we will use.

crystal · October 12, 2018, 10:51am

@yupi We’re currently running just a single master as we’re working on hardware upgrades for the slave mgms - I don’t believe a multi-master mgm setup is possible.

Glad that fixed up the issue @dszkola ! I’m interested in knowing which Citrine version(s) are being run in production at CERN currently, as well - @esindril would you happen to have this info? Thanks!

apeters · October 12, 2018, 11:09am

We currently run 4.3.12 on the four LHC production instances.

yupi · October 12, 2018, 2:56pm

@crystal Concerning multi-master mode: if it is not possible, why does the sample EOS configuration file eos_env.example contain the following lines?

# The fully qualified hostname of MGM master1
EOS_MGM_MASTER1=eosdevsrv1.cern.ch
# The fully qualified hostname of MGM master2
EOS_MGM_MASTER2=eosdevsrv2.cern.ch
# The alias which selects master 1 or 2
EOS_MGM_ALIAS=eosdev.cern.ch

Such a configuration with two (or more) masters is required to provide HA (high availability) mode. I would guess that in theory it is possible with QuarkDB, but I didn’t test it. There are some rumors about an HA mode in EOS, but I have seen no documentation.
