AARNet Citrine Upgrade Site Report


(Crystal) #1

Hello!

A couple weeks ago, on May 12, we upgraded our production EOS instance from Beryl-Aquamarine 0.3.268 to Citrine 4.2.22. Here’s a short report on our process!

Background

Our production environment:

  • 1 master MQ/MGM in Melbourne, Australia
  • 2 slave MQ/MGMs
    • 22ms away by network in Brisbane, Australia
    • 43ms away by network in Perth, Australia
  • 12 FSTs, equally spread with four in each of the three sites
    • Each FST has 44 disks
  • 104 million files, >1PB of data

Additionally, we run ownCloud and Minio S3 gateways on top of our EOS storage via the eosd fuse driver.

EOS Upgrade Preparation

At AARNet, we use Docker for our EOS deployment, and Rancher for orchestration. Each Rancher node is of the “Cattle” type, and we bind each container to the host’s IP stack. Each of the xrootd processes is split into its own container: for Beryl-Aquamarine, these were the MQ, MGM and FST.
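As a rough illustration of that layout (image names and tags below are made up for this sketch, not our actual repository paths), starting one container per xrootd role on the host network looks something like:

```shell
# Illustrative only: image names here are assumptions.
# Each xrootd role (MQ, MGM, FST) runs in its own container,
# bound to the host's IP stack via --net=host.
run_eos_container() {
  role="$1"
  tag="$2"
  # Printed rather than executed (dry run).
  echo "docker run -d --net=host --name eos-${role} example/eos-${role}:${tag}"
}

for role in mq mgm fst; do
  run_eos_container "$role" "0.3.268"
done
```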

Only the MGM and FST containers required a config upgrade (in xrd.cf.mgm and xrd.cf.fst respectively):

  • Added xrootd.fslib -2 libXrdEosFst.so to xrd.cf.fst and removed unnecessary authlib config lines
  • The MGM namespace plugin must now be specified explicitly in xrd.cf.mgm: mgmofs.nslib /usr/lib64/libEosNsInMemory.so
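For reference, the relevant config fragments look roughly like this (a sketch with surrounding directives omitted):

```
# xrd.cf.fst (Citrine)
xrootd.fslib -2 libXrdEosFst.so

# xrd.cf.mgm (Citrine)
mgmofs.nslib /usr/lib64/libEosNsInMemory.so
```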

The first issue we ran into was the Citrine MGM being unable to start: the code tried to launch the sync processes via systemctl, which doesn't work inside Docker containers. To work around this, we added a check for an EOS_START_SYNC_SEPARATELY environment variable and split the SYNC and EOSSYNC processes out into their own containers.
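In pseudo-form, the startup check we added behaves roughly like this (function and service names are illustrative, not the actual patch):

```shell
# Illustrative sketch of the startup check, not the actual EOS code.
start_sync_services() {
  if [ -n "${EOS_START_SYNC_SEPARATELY}" ]; then
    # Sync runs in its own container; don't touch systemctl,
    # which is unavailable inside our Docker containers.
    echo "skipping sync startup: handled by separate containers"
    return 0
  fi
  # Bare-metal path: let systemd manage the sync services.
  systemctl start eossync
}

EOS_START_SYNC_SEPARATELY=1
start_sync_services
```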

The second was a strange bug in xrootd: we were seeing authentication failures between the containers due to an "IP mismatched" error. This was reported and fixed extremely quickly by the xrootd people (thanks Andrew and Michal!)

Production EOS Upgrade Process

  1. Stopped all ownCloud, Minio, and related containers
    • This stopped all traffic to EOS
  2. Set all the FST filesystems to read-only
  3. Turned all nodes off, then stopped the FST containers
  4. Stopped all the MQ and MGM containers
  5. Upgraded master MQ and MGM containers to use new Citrine Docker image
  6. Created SYNC containers on both the slave MGM servers
  7. Created EOSSYNC containers on the master MGM server
  8. Verified the metadata was syncing
  9. Upgraded the slave MQ and MGM containers to use new Citrine Docker image
  10. Upgraded the FST containers to use new Citrine Docker image
  11. Waited for the FST filesystems to boot, then set them back to read-write
  12. Restarted eosd on all app stack servers
  13. Restarted all ownCloud, Minio and related containers
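Steps 2 and 11 boil down to flipping the configstatus on every filesystem. A dry-run sketch (we print the eos commands instead of executing them; in production the fsid list would come from the MGM rather than a hard-coded range, the 1..504 here just matches our filesystem count):

```shell
# Dry-run sketch: emit the `eos fs config` commands for steps 2 and 11
# instead of executing them. The real fsid list would come from
# `eos fs ls -m`, not a hard-coded range.
set_fs_configstatus() {
  status="$1"   # "ro" before the upgrade, "rw" after
  for fsid in $(seq 1 504); do
    echo "eos fs config ${fsid} configstatus=${status}"
  done
}

set_fs_configstatus ro | head -n 3
```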

All in all, it was an extremely easy process, mostly involving lots of button pressing in Rancher to stop and restart the containers; most of the work was in building the Docker images to upgrade to. That said, it was still pretty lengthy: we started at 10 PM on 12 May and finished up at approximately 3 PM on 13 May. Most of this time was spent slowly restarting the FSTs post-upgrade and waiting for the (504 total) filesystems to finish booting completely. It probably could have gone faster, but we didn't want to risk stalling the master MGM by booting the FSTs too quickly. Everything was butter-smooth until we brought the ownCloud and S3 containers back online… (suspense)

Issues

  • ownCloud reported a lack of quota/space for all users because PHP's disk_free_space() function returned 0
    • This prevented new writes even though there was plenty of free space available in EOS
    • Directly after the MGM/MQ/FST upgrade, we were still running the Beryl-Aquamarine 0.3.268 eosd clients
    • The behaviour persisted even after upgrading all the eosd clients to 4.2.22
    • df displays the correct mount info (space etc.) most of the time, but occasionally shows 0 as well
    • We're still investigating this one and have submitted a ticket
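To keep an eye on the intermittent zero readings, a small check along these lines can be run against the FUSE mount (the mount point and wording are illustrative; this just reports the same symptom that disk_free_space() surfaced):

```shell
# Illustrative helper: report whether a mount shows zero free space,
# the symptom PHP's disk_free_space() surfaced on our eosd mounts.
check_free_space() {
  mount="$1"
  # POSIX df output: 4th column of the data row is free blocks (KB).
  free_kb=$(df -P "$mount" | awk 'NR==2 {print $4}')
  if [ "${free_kb:-0}" -eq 0 ]; then
    echo "ZERO: ${mount} reports 0 blocks free"
  else
    echo "OK: ${mount} reports ${free_kb} KB free"
  fi
}

check_free_space /
```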
  • An MGM stall due to a deadlock, which caused everything else to crash
    • We started seeing CloudStor crashes at very specific times, and narrowed the cause down to a curl run against the MGM
    • This was also reported and fixed extremely quickly (thanks Andreas!)