AARNet Citrine Upgrade Site Report


(Crystal) #1

Hello!

A couple weeks ago, on May 12, we upgraded our production EOS instance from Beryl-Aquamarine 0.3.268 to Citrine 4.2.22. Here’s a short report on our process!

Background

Our production environment:

  • 1 master MQ/MGM in Melbourne, Australia
  • 2 slave MQ/MGMs
    • 22ms away by network in Brisbane, Australia
    • 43ms away by network in Perth, Australia
  • 12 FSTs, equally spread with four in each of the three sites
    • Each FST has 44 disks
  • 104 million files, >1PB of data

Additionally, we run ownCloud and Minio S3 gateways on top of our EOS storage via the eosd fuse driver.

EOS Upgrade Preparation

At AARNet, we use Docker for our EOS deployment, and Rancher for orchestration. Each Rancher node is of the “Cattle” type, and we bind each container to the host’s IP stack. Each of the xrootd processes is split into its own container: for Beryl-Aquamarine, these were the MQ, MGM and FST.
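As a rough illustration of that layout (image names and tags below are made up for this sketch, not our actual repository paths), starting one container per xrootd role on the host network looks something like:

```shell
# Illustrative only: image names here are assumptions.
# Each xrootd role (MQ, MGM, FST) runs in its own container,
# bound to the host's IP stack via --net=host.
run_eos_container() {
  role="$1"
  tag="$2"
  # Printed rather than executed (dry run).
  echo "docker run -d --net=host --name eos-${role} example/eos-${role}:${tag}"
}

for role in mq mgm fst; do
  run_eos_container "$role" "0.3.268"
done
```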

Only the MGM and FST containers required a config upgrade (in xrd.cf.mgm and xrd.cf.fst respectively):

  • Added xrootd.fslib -2 libXrdEosFst.so to xrd.cf.fst and removed unnecessary authlib config lines
  • The MGM namespace plugin must now be specified explicitly in xrd.cf.mgm: mgmofs.nslib /usr/lib64/libEosNsInMemory.so
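For reference, the relevant config fragments look roughly like this (a sketch with surrounding directives omitted):

```
# xrd.cf.fst (Citrine)
xrootd.fslib -2 libXrdEosFst.so

# xrd.cf.mgm (Citrine)
mgmofs.nslib /usr/lib64/libEosNsInMemory.so
```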

The first issue we ran into was the Citrine MGM being unable to start: the code tried to launch the sync processes via systemctl, which doesn't work inside Docker containers. To work around this, we added a check for an EOS_START_SYNC_SEPARATELY environment variable and split the SYNC and EOSSYNC processes out into their own containers.
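In pseudo-form, the startup check we added behaves roughly like this (function and service names are illustrative, not the actual patch):

```shell
# Illustrative sketch of the startup check, not the actual EOS code.
start_sync_services() {
  if [ -n "${EOS_START_SYNC_SEPARATELY}" ]; then
    # Sync runs in its own container; don't touch systemctl,
    # which is unavailable inside our Docker containers.
    echo "skipping sync startup: handled by separate containers"
    return 0
  fi
  # Bare-metal path: let systemd manage the sync services.
  systemctl start eossync
}

EOS_START_SYNC_SEPARATELY=1
start_sync_services
```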

The second was a strange bug in xrootd: we were seeing authentication failures between the containers due to an "IP mismatched" error. This was reported and fixed extremely quickly by the xrootd people (thanks Andrew and Michal!)

Production EOS Upgrade Process

  1. Stopped all ownCloud, Minio, and related containers
    • This stopped all traffic to EOS
  2. Set all the FST filesystems to read-only
  3. Turned all nodes off, then stopped the FST containers
  4. Stopped all the MQ and MGM containers
  5. Upgraded master MQ and MGM containers to use new Citrine Docker image
  6. Created SYNC containers on both the slave MGM servers
  7. Created EOSSYNC containers on the master MGM server
  8. Verified the metadata was syncing
  9. Upgraded the slave MQ and MGM containers to use new Citrine Docker image
  10. Upgraded the FST containers to use new Citrine Docker image
  11. Waited for the FST filesystems to boot, then set them back to read-write
  12. Restarted eosd on all app stack servers
  13. Restarted all ownCloud, Minio and related containers
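Steps 2 and 11 boil down to flipping the configstatus on every filesystem. A dry-run sketch (we print the eos commands instead of executing them; in production the fsid list would come from the MGM rather than a hard-coded range, the 1..504 here just matches our filesystem count):

```shell
# Dry-run sketch: emit the `eos fs config` commands for steps 2 and 11
# instead of executing them. The real fsid list would come from
# `eos fs ls -m`, not a hard-coded range.
set_fs_configstatus() {
  status="$1"   # "ro" before the upgrade, "rw" after
  for fsid in $(seq 1 504); do
    echo "eos fs config ${fsid} configstatus=${status}"
  done
}

set_fs_configstatus ro | head -n 3
```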

All in all, it was an extremely easy process, mostly involving lots of button pressing in Rancher to stop and restart the containers; most of the work was in building the Docker images to upgrade to. That said, it was still pretty lengthy: we started at 10 PM on 12 May and finished up at approximately 3 PM on 13 May. Most of this time was spent slowly restarting the FSTs post-upgrade and waiting for the (504 total) filesystems to finish booting completely. It probably could have gone faster, but we didn't want to risk stalling the master MGM by booting the FSTs too quickly. Everything was butter-smooth until we brought the ownCloud and S3 containers back online… (suspense)

Issues

  • ownCloud reported a lack of quota/space for all users because PHP's disk_free_space() function returned 0
    • This prevented new writes even though there was plenty of free space available in EOS
    • Directly after the MGM/MQ/FST upgrade, we were still running the Beryl-Aquamarine 0.3.268 eosd clients
    • The behaviour persisted even after upgrading all the eosd clients to 4.2.22
    • df displays the correct mount info (space etc.) most of the time, but occasionally shows 0 as well
    • We're still investigating this one and have submitted a ticket
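To keep an eye on the intermittent zero readings, a small check along these lines can be run against the FUSE mount (the mount point and wording are illustrative; this just reports the same symptom that disk_free_space() surfaced):

```shell
# Illustrative helper: report whether a mount shows zero free space,
# the symptom PHP's disk_free_space() surfaced on our eosd mounts.
check_free_space() {
  mount="$1"
  # POSIX df output: 4th column of the data row is free blocks (KB).
  free_kb=$(df -P "$mount" | awk 'NR==2 {print $4}')
  if [ "${free_kb:-0}" -eq 0 ]; then
    echo "ZERO: ${mount} reports 0 blocks free"
  else
    echo "OK: ${mount} reports ${free_kb} KB free"
  fi
}

check_free_space /
```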
  • An MGM stall due to a deadlock, which caused everything else to crash
    • We started seeing CloudStor crashes at very specific times, and narrowed the cause down to a curl run against the MGM
    • This was also reported and fixed extremely quickly (thanks Andreas!)