Dear fellows,
Trying to start discussion on this nice new website.
At JRC, we are preparing the Aquamarine to Citrine upgrade for our main eos instance next week.
From the tests we did, we extracted a short page of steps and configuration changes for the process, which we want to share with you in case you are planning the upgrade (see below the post).
In addition, if you have already done these steps, any comments or experience you can share might help us.
We tested a full offline upgrade procedure on a separate instance : stop everything, then upgrade and start MGM, then FSTs one by one, and it really seems straightforward.
But since our main instance is quite large, with 2PB of data, 17 FSTs and 130M files, this might take a while, and we were thinking of attempting an online upgrade: switching the instance to read-only during the process, so that users can still access the data, and starting the FSTs afterwards, with at least 3 FSTs booting at a time. How does this sound as a solution ?
We tried some scenarios, and it seems that the MGM and FSTs can correctly collaborate, at least for some time, with mismatched versions. We had the impression that it is best to first upgrade the MGM, then the FSTs one by one, but does someone have arguments against this ?
Also, what would be the best way to turn the instance read-only ? Turn all FSTs read-only, or add some access rule ?
The JRC team.
Upgrade from Aquamarine to Citrine
The procedure is almost as easy as a minor version upgrade, but all components should be upgraded in one shot. CERN advises upgrading all servers at the same time (because of some changes in the communication protocol), but a partial upgrade works in tests (safeguard by setting the instance read-only ?)
Procedure
Run compaction for faster boot.
Set global stall and stop all servers (FST + MGM), or decide to change servers one by one (by setting read-only on the whole instance to avoid errors, if possible)
Current advised version for eos server is 4.2.20 (after first citrine upgrade to 4.2.12, this 4.2.20 version solved various bugs).
For safety, back up all /var/eos folders !!
Change configuration (see Configuration changes)
Change eos repos (see yum.repo file)
Run MGM upgrade (see Upgrade MGM)
Run FST upgrade (see Upgrade FSTs)
Some changes in using the instance (see Changes in new version)
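The backup step in the procedure above could be scripted roughly as follows. This is only a sketch: backup_eos_dir is a hypothetical helper name, the /root default destination is an assumption, and on a large namespace the archive may be big and slow to write, so check free space first.

```shell
#!/usr/bin/env bash
# Sketch: archive an EOS state directory before the upgrade.
# backup_eos_dir is a hypothetical helper, not part of EOS.
# Usage: backup_eos_dir [SOURCE_DIR] [DEST_DIR]
backup_eos_dir() {
    local src="${1:-/var/eos}"      # directory to save
    local dest_dir="${2:-/root}"    # assumed destination, adapt to your site
    local archive
    archive="$dest_dir/eos-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
    # Store paths relative to the parent directory of $src
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" \
        && echo "$archive"
}
```

Calling backup_eos_dir without arguments on a stopped server would archive /var/eos and print the archive path.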
Configuration changes
All servers
The eos configuration is moved from /etc/sysconfig/eos to /etc/sysconfig/eos_env. The export directive needs to be removed, as well as the service alias definition. Conversion to remove export directives seems to work this way :
cat /etc/sysconfig/eos | sed -e 's/export //' > /etc/sysconfig/eos_env
If present, these lines need to be removed (they cause a warning in the eos log at startup) :
# ------------------------------------------------------------------
# Service Script aliasing for EL7 machines
# ------------------------------------------------------------------
which systemctl >& /dev/null
if [ $? -eq 0 ]; then
alias service="service --skip-redirect"
fi
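Both changes (dropping export and dropping the alias block) could be combined in one pass. The following is a sketch only, with convert_eos_sysconfig as a hypothetical helper name. It assumes the alias block looks exactly like the lines quoted above; if the closing fi were missing, the range deletion would run to the end of the file, so compare the result with the original before using it.

```shell
#!/usr/bin/env bash
# Sketch: derive an eos_env-style file from the old sysconfig file.
# convert_eos_sysconfig is a hypothetical helper, not an EOS tool.
# Usage: convert_eos_sysconfig SOURCE_FILE DEST_FILE
convert_eos_sysconfig() {
    local src="$1" dst="$2"
    # 1. delete the "which systemctl" probe line
    # 2. delete the if ... fi block that defines the service alias
    # 3. strip a leading "export " from the remaining assignments
    sed -e '/^which systemctl/d' \
        -e '/^if \[ \$? -eq 0 ]; then$/,/^[[:space:]]*fi[[:space:]]*$/d' \
        -e 's/^\([[:space:]]*\)export[[:space:]]\{1,\}/\1/' \
        "$src" > "$dst"
}
```

For example, convert_eos_sysconfig /etc/sysconfig/eos /tmp/eos_env writes a candidate file that can be reviewed with diff before being copied to /etc/sysconfig/eos_env.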
MGM
Necessary lines to be added in MGM’s /etc/xrd.cf.mgm
#-------------------------------------------------------------------------------
# Set the namespace plugin implementation
#-------------------------------------------------------------------------------
mgmofs.nslib /usr/lib64/libEosNsInMemory.so
Also in the configuration, the synchronization service now needs the local and remote MGM hosts to be defined explicitly (note: the configuration is different on each MGM), as the EOS_MGM_MASTER1/2 values are not sufficient any more. In /etc/sysconfig/eos_env, add :
EOS_MGM_HOST=fqdn.of.local.host.domain
EOS_MGM_HOST_TARGET=fqdn.of.remote.host.domain
FST
The FSTs need to be geotagged, otherwise no write can be scheduled there. So on all FSTs, in /etc/sysconfig/eos, add the value (any non-empty value works; it can be set to the rack or switch to allow for future expansion) :
export EOS_GEOTAG='JRC_DC'
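Since an empty geotag blocks write scheduling, a quick check on each FST before starting the service may help. This is a sketch under assumptions: check_geotag is a hypothetical helper, and it only inspects the text of the configuration file, not what the running FST actually reports.

```shell
#!/usr/bin/env bash
# Sketch: verify that a config file assigns a non-empty EOS_GEOTAG.
# check_geotag is a hypothetical helper, not an EOS command.
# Usage: check_geotag CONFIG_FILE
check_geotag() {
    local file="$1" value
    # Take the last assignment; tolerate an optional "export " and quotes
    value=$(sed -n 's/^[[:space:]]*\(export[[:space:]]\{1,\}\)\{0,1\}EOS_GEOTAG=//p' "$file" \
            | tail -n 1 | tr -d "\"'")
    if [ -z "$value" ]; then
        echo "EOS_GEOTAG missing or empty in $file" >&2
        return 1
    fi
    echo "EOS_GEOTAG=$value"
}
```

The function returns a non-zero status when the tag is missing or empty, so it can be used as a guard in a larger script.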
The default xrd.cf.fst file changed a bit (this is done automatically by the upgrade if the file was not modified) :
- xrootd.fslib libXrdEosFst.so needs to become xrootd.fslib -2 libXrdEosFst.so
- comment or remove these lines :
#ofs.authlib libXrdEosAuth.so
#ofs.authorize
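If the file was modified locally and the upgrade does not replace it, the two changes could be applied with sed. A sketch only: patch_xrd_cf_fst is a hypothetical helper name, and it writes to a separate output file so the result can be reviewed before replacing the real configuration.

```shell
#!/usr/bin/env bash
# Sketch: apply the Citrine changes to a locally modified xrd.cf.fst.
# patch_xrd_cf_fst is a hypothetical helper, not an EOS tool.
# Usage: patch_xrd_cf_fst SOURCE_FILE DEST_FILE
patch_xrd_cf_fst() {
    local src="$1" dst="$2"
    # 1. insert "-2" between xrootd.fslib and the library name
    # 2. comment out the ofs.authlib directive
    # 3. comment out the ofs.authorize directive
    # Lines that are already patched or commented are left untouched.
    sed -e 's/^\([[:space:]]*xrootd\.fslib\)[[:space:]]\{1,\}libXrdEosFst\.so/\1 -2 libXrdEosFst.so/' \
        -e 's/^\([[:space:]]*\)\(ofs\.authlib[[:space:]]\)/\1#\2/' \
        -e 's/^\([[:space:]]*\)\(ofs\.authorize\)/\1#\2/' \
        "$src" > "$dst"
}
```

Running it a second time on its own output should change nothing, which makes it safe to re-run across all FSTs.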
yum.repo file
https://eos-docs.web.cern.ch/eos-docs/quickstart/setup_repo.html#eos-citrine
/etc/yum.repos.d/eos.repo
[eos-citrine]
name=EOS 4.0 Version
baseurl=https://storage-ci.web.cern.ch/storage-ci/eos/citrine/tag/el-7/x86_64/
gpgcheck=0
enabled=1
[eos-dep]
name=EOS 4.0 Dependencies
baseurl=https://storage-ci.web.cern.ch/storage-ci/eos/citrine-depend/el-7/x86_64/
gpgcheck=0
enabled=1
(remove the obsolete /etc/yum.repos.d/eos-dep.repo, if any)
Also use the xrootd-stable repository :
[xrootd-stable]
name=XRootD Stable repository
baseurl=http://xrootd.org/binaries/stable/slc/7/$basearch
gpgcheck=1
enabled=1
protect=0
gpgkey=http://xrootd.cern.ch/sw/releases/RPM-GPG-KEY.txt
Upgrade MGM
Set global stall
eos access set stall 1000 w
OR
Switch namespace in read only :
eos access set stall 1000 w
<== this is currently not supported so not an option
Set all nodes readonly. For each node:
eos node config node.fqdn configstatus=ro
Stop the old service : service --skip-redirect eos stop
To latest version :
yum upgrade "eos-*"
To specific version (e.g 4.2.12) :
yum upgrade eos-server-4.2.12-1.el7.cern eos-client-4.2.12-1.el7.cern eos-debuginfo-4.2.12-1.el7.cern
Start eos : systemctl start eos
The server should boot normally
Check synchronisation : systemctl status eossync@*
Also update the slave
Check synchronisation, etc…
Before upgrading the FSTs, switch off the fsck system, which would otherwise cause the MGM to uselessly count all missing files. Since all FSTs are currently down, this represents all the files in the namespace.
eos fsck disable
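The per-node read-only step above could be looped over all nodes. This is a sketch: set_nodes_readonly and the EOS_CMD variable are hypothetical names, the node names are passed to eos exactly as given, and by default the function only prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: set a list of nodes read-only, one "eos node config" call each.
# EOS_CMD is a hypothetical switch: it defaults to "echo eos" (dry run);
# set EOS_CMD=eos to really execute the commands on the MGM.
EOS_CMD="${EOS_CMD:-echo eos}"

# Usage: set_nodes_readonly NODE [NODE...]
set_nodes_readonly() {
    local node
    for node in "$@"; do
        # Intentionally unquoted so "echo eos" splits into two words
        $EOS_CMD node config "$node" configstatus=ro
    done
}
```

For example, set_nodes_readonly fst01.example.org fst02.example.org prints the two commands that would be run, which is a cheap way to review the node list first.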
Upgrade FSTs
Stop the old service : service --skip-redirect eos stop
To latest version :
yum upgrade "eos-*"
To specific version (e.g 4.2.12) :
yum upgrade eos-server-4.2.12-1.el7.cern eos-client-4.2.12-1.el7.cern eos-debuginfo-4.2.12-1.el7.cern
Start service : systemctl start eos
Expect a full boot, because the FSTs need to fully resynchronize all filesystems with the MGM (probably 30 min to 1 hour, depending on the number of files)
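To upgrade several FSTs at a time, the steps above could be laid out as a batch plan. The sketch below only prints the plan and executes nothing. fst_upgrade_plan is a hypothetical helper, and running the steps over ssh as root is an assumption about the site setup; waiting for each batch to finish booting is left to the operator.

```shell
#!/usr/bin/env bash
# Sketch: print an FST upgrade plan grouped in batches of a given size.
# fst_upgrade_plan is a hypothetical helper; it prints commands only.
# Usage: fst_upgrade_plan BATCH_SIZE HOST [HOST...]
fst_upgrade_plan() {
    local size="$1"; shift
    local n=0 host
    # Reject zero, negative, or non-numeric batch sizes
    if ! [ "$size" -gt 0 ] 2>/dev/null; then
        echo "batch size must be a positive integer" >&2
        return 1
    fi
    for host in "$@"; do
        # Print a header at the start of every batch
        if [ $((n % size)) -eq 0 ]; then
            echo "# batch $((n / size + 1))"
        fi
        echo "ssh $host 'service --skip-redirect eos stop && yum -y upgrade \"eos-*\" && systemctl start eos'"
        n=$((n + 1))
    done
}
```

For example, fst_upgrade_plan 3 followed by the 17 FST host names would print six batches, the last one holding two hosts.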
Finalize
When all FSTs have booted and everything works, you can switch back to read/write :
eos access rm stall w
Changes in new version
Service management systemd
The eos server now fully uses systemd, so services are handled with the systemctl command. Commands are :
systemctl start eos : to start all eos services (including eossync)
systemctl restart eos@* : to restart all eos services
systemctl start eos@mgm : to start just the mgm
systemctl restart eossync@* : to restart the eossync services
systemctl status eossync@* : to get the synchronization status (not as easily readable as in the old version)
Details are here http://eos-docs.web.cern.ch/eos-docs/eos_services.html
Master/slave switch
The previous command service --skip-redirect eos master mq doesn’t work any more. The replacements are :
systemctl start eos@master
systemctl start eos@slave