Preparing for updating EOS on a production instance

barbet · March 31, 2021, 8:38am

Hello,
Our EOS production instance for Alice is still running EOS version 4.7.7. I would like to update the entire cluster to the latest stable version, which I believe is 4.8.31.
I had a quick look at the release notes, but I would rather ask whether it is OK to jump from 4.7.7 directly to 4.8.31 and whether there are specific precautions to take.
Also, my plan is to update in this order:

the slave manager
force role exchange so that it becomes the master
update the new slave manager
update the servers (FSTs)

Is this approach OK?
Thank you
JM

esindril · April 1, 2021, 10:00am

Hi Jean Michel,

There were quite a few important changes between those releases, and it is hard to say whether running in mixed mode has any unwanted side effects.
One important step: you need to disable the converter until all the nodes are updated to the same version.
I would suggest updating to the latest release, 4.8.40, which is running stably in production on our setup, so that you don’t need to do another update later on. Quite a number of issues were fixed between the .31 and .40 releases.

I will promote the .40 release to the stable repo today.

Cheers,
Elvin

barbet · April 6, 2021, 9:55am

Hi Elvin,
Thank you very much. Could you remind me of the commands to disable and re-enable the converter? I will start looking at the update shortly, although all servers are currently busy rebalancing the cluster following the addition of new servers. Unless you advise me that updating during this rebalancing is a bad idea…
JM

barbet · April 12, 2021, 8:44am

Hi Elvin,
I cannot see version 4.8.40 in the stable repo. Is it OK to update while balancing is ongoing?
Thank you
JM

esindril · April 12, 2021, 8:58am

Hi Jean Michel,

I’ve just added 4.8.40 to the stable repo.
The converter is used for the balancing so you will need to stop all balancing during the update.

Cheers,
Elvin

barbet · April 12, 2021, 9:34am

Thank you very much Elvin,
Trying yum update on a test server, I get a dependency error:

[…]
--> Finished Dependency Resolution
Error: Package: eos-server-4.8.40-1.el7.cern.x86_64 (eos-citrine)
Requires: eos-folly = 2019.11.11.00-1.el7.cern

Is it possible that the requirement statement in eos-server-4.8.40 still mentions eos-folly 2019.11.11?
JM

esindril · April 12, 2021, 9:48am

Hi JM,

That is not new; we had the same dependency before as well, unless your system updated to the 2020 version, in which case you need to downgrade eos-folly. We had a plan to move to a new folly version and built the packages, but we have not made the move yet. Newer eos packages have a strict dependency on eos-folly-2019 while old ones did not, which is probably why you have the 2020 version.

Cheers,
Elvin
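
A minimal sketch of how the installed eos-folly release could be compared with the one that eos-server-4.8.40 requires before retrying the update. The required version string is copied from the yum error above. The function names and the printed `yum downgrade` hint are illustrative assumptions, not something confirmed in this thread, so check them against what your repository actually offers.

```shell
#!/bin/bash
# Version required by eos-server-4.8.40-1, copied from the yum error above.
REQUIRED_FOLLY="2019.11.11.00-1.el7.cern"

# folly_matches VERSION-RELEASE   (hypothetical helper name)
# Returns 0 if the given version-release string equals the required one.
folly_matches() {
    [ "$1" = "$REQUIRED_FOLLY" ]
}

# check_folly   (hypothetical helper name)
# Ask rpm for the installed eos-folly and print a hint; changes nothing.
check_folly() {
    local installed
    installed=$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' eos-folly) || {
        echo "eos-folly is not installed"
        return 1
    }
    if folly_matches "$installed"; then
        echo "eos-folly $installed matches the required version"
    else
        echo "eos-folly $installed differs from the required version; consider:"
        echo "  yum downgrade eos-folly-$REQUIRED_FOLLY"
    fi
}
```

Running `check_folly` on the test server only prints the state and a suggested command, so it is safe to try before touching any package.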

barbet · April 12, 2021, 1:50pm

Elvin,
I updated to eos-server-4.8.40-1. Incidentally, I also reinstalled the VM on which EOS manager runs but the eos services cannot start because of:

[QCLIENT - ERROR - getNextEndpoint:112] Unable to resolve any endpoints, possible trouble with DNS

The QDB cluster looks OK, although the server mentioned in mgmofs.cfgredishost is not the leader… Could that be the reason? Can this variable take a list of hosts?

JM

esindril · April 12, 2021, 2:18pm

Hi Jean Michel,

Yes, it can also take a list of hosts, but it also works with an alias if it points to the correct thing.
I guess you mean mgmofs.qdbcluster; there is no mgmofs.cfgredishost setting in the /etc/xrd.cf.mgm config file.

Cheers,
Elvin

barbet · April 12, 2021, 2:23pm

In fact I have both:

mgmofs.cfgtype quarkdb
mgmofs.cfgredisport 6379
mgmofs.cfgredishost cceosqdb01.in2p3.fr

and

mgmofs.qdbpassword <quarkdb_password>
mgmofs.qdbcluster cceosqdb01.in2p3.fr:7777,cceosqdb02.in2p3.fr:7777,cceosqdb03.in2p3.fr:7777

The first set is (was) supposed to instruct EOS to store and load its config from QDB… Has it changed?

JM

esindril · April 12, 2021, 2:49pm

From the first batch only mgmofs.cfgtype is used; I don’t think the others were ever used. The mgmofs.qdbcluster setting is used for all communication with QDB.

Elvin
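
Since the QClient error points at DNS, one way to rule DNS in or out is to check that each host named in mgmofs.qdbcluster resolves from the MGM machine. The sketch below is only an illustration: the helper names `qdb_hosts` and `check_qdb_dns` are made up, and it assumes `getent` is available, as it is on a standard CentOS 7 install.

```shell
#!/bin/bash
# qdb_hosts "VALUE"   (hypothetical helper name)
# Split a qdbcluster value on commas, spaces and tabs, and print one host
# per line with the trailing :port removed.
qdb_hosts() {
    echo "$1" | tr -s ', \t' '\n\n\n' | sed -e 's/:[0-9]*$//' -e '/^$/d'
}

# check_qdb_dns "VALUE"   (hypothetical helper name)
# Report for each host whether the local resolver can resolve it.
check_qdb_dns() {
    local host
    for host in $(qdb_hosts "$1"); do
        if getent hosts "$host" > /dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
        fi
    done
}
```

For example, `check_qdb_dns "cceosqdb01.in2p3.fr:7777,cceosqdb02.in2p3.fr:7777,cceosqdb03.in2p3.fr:7777"` prints one OK or FAIL line per host. If every host resolves, the cause of the error is likely elsewhere.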

barbet · April 13, 2021, 6:42am

Hello Elvin,
This morning I downgraded eos back to version 4.8.31-1 with the following command:

yum downgrade eos-server-4.8.31-1.el7.cern.x86_64 eos-client-4.8.31-1.el7.cern.x86_64 eos-xrootd-4.12.5-1.el7.cern.x86_64

Then I restarted the EOS services and it works (without any other modification to the network, firewalls, or configuration files)… So I think there may be something wrong with version 4.8.40-1…
How can I help ?
Thank you
JM

esindril · April 13, 2021, 8:08am

Hi JM,

So what exactly is the problem? Can you paste some logs?
I don’t understand this statement: eos manager runs but the eos services cannot start
To which eos services are you referring? The FSTs? Can you also paste your /etc/xrd.cf.fst config file?

Thanks,
Elvin

barbet · April 13, 2021, 8:14am

Hi Elvin,
The full statement was:
I also reinstalled the VM on which EOS manager runs but the eos services cannot start
In short: the eos mgm cannot start.
I am going to update again to 4.8.40-1 and send you a more complete log of the mgm trying to start.
JM

barbet · April 13, 2021, 8:21am

Updated this way:

[root@cceostest01 ~]$ yum update eos-server.x86_64 eos-client.x86_64 eos-xrootd.x86_64

Restarted EOS mgm:

[root@cceostest01 ~]$ systemctl stop eos@*
[root@cceostest01 ~]$ systemctl start eos

Log extract from /var/log/eos/mgm/xrdlog.mgm
[…]

################################################################################
210413 10:17:28 3143 Starting on Linux 3.10.0-1160.24.1.el7.x86_64
Copr. 2004-2012 Stanford University, xrd version v4.12.8
++++++ xrootd mgm@cceostest01.in2p3.fr initialization started.
Config using configuration file /etc/xrd.cf.mgm
=====> xrd.sched mint 8 maxt 256 idle 64
Config maximum number of connections restricted to 65000
Config maximum number of threads restricted to 7260
Copr. 2012 Stanford University, xrootd protocol 4.0.0 version v4.12.8
++++++ xrootd protocol initialization started.
=====> xrootd.fslib libXrdEosMgm.so
=====> xrootd.seclib libXrdSec.so
=====> xrootd.async off nosf
=====> xrootd.chksum adler32
=====> all.export / nolock
Config exporting /
Plugin loaded
++++++ Authentication system initialization started.
Plugin loaded
=====> sec.protocol unix
Plugin loaded
=====> sec.protocol sss -c /etc/eos.keytab -s /etc/eos.keytab
Plugin loaded
210413 10:17:28 3143 secgsi_InitOpts: *** ------------------------------------------------------------ ***
210413 10:17:28 3143 secgsi_InitOpts: Mode: server
210413 10:17:28 3143 secgsi_InitOpts: Debug: 0
210413 10:17:28 3143 secgsi_InitOpts: CA dir: /etc/grid-security/certificates/
210413 10:17:28 3143 secgsi_InitOpts: CA verification level: 1
210413 10:17:28 3143 secgsi_InitOpts: CRL dir: /etc/grid-security/certificates/
210413 10:17:28 3143 secgsi_InitOpts: CRL extension: .r0
210413 10:17:28 3143 secgsi_InitOpts: CRL check level: 0
210413 10:17:28 3143 secgsi_InitOpts: Certificate: /etc/grid-security/cceostest.in2p3.fr.pem
210413 10:17:28 3143 secgsi_InitOpts: Key: /etc/grid-security/cceostest.in2p3.fr.key
210413 10:17:28 3143 secgsi_InitOpts: Proxy delegation option: 0
210413 10:17:28 3143 secgsi_InitOpts: GRIDmap file: /etc/grid-security/grid-mapfile
210413 10:17:28 3143 secgsi_InitOpts: GRIDmap option: 2
210413 10:17:28 3143 secgsi_InitOpts: GRIDmap cache entries expiration (secs): 600
210413 10:17:28 3143 secgsi_InitOpts: Client proxy availability in XrdSecEntity.endorsement: 0
210413 10:17:28 3143 secgsi_InitOpts: VOMS option: 1
210413 10:17:28 3143 secgsi_InitOpts: MonInfo option: 1
210413 10:17:28 3143 secgsi_InitOpts: Crypto modules: ssl
210413 10:17:28 3143 secgsi_InitOpts: Ciphers: aes-128-cbc:bf-cbc:des-ede3-cbc
210413 10:17:28 3143 secgsi_InitOpts: MDigests: sha1:md5
210413 10:17:28 3143 secgsi_InitOpts: Trusting DNS for hostname checking
210413 10:17:28 3143 secgsi_InitOpts: *** ------------------------------------------------------------ ***
210413 10:17:28 3143 secgsi_GetSrvCertEnt: do not have certificate for the issuing CA ‘5e02f50a.0|14e86c33.0’
210413 10:17:28 3143 secgsi_Init: problems loading srv cert
=====> sec.protocol gsi -crl:0 -cert:/etc/grid-security/cceostest.in2p3.fr.pem -key:/etc/grid-security/cceostest.in2p3.fr.key -gridmap:/etc/grid-security/grid-mapfile -d:0 -gmapopt:2 -vomsat:1 -moninfo:1
=====> sec.protbind * only gsi sss unix
Config 4 authentication directives processed in /etc/xrd.cf.mgm
------ Authentication system initialization completed.
++++++ Protection system initialization started.
Config warning: Security level is set to none; request protection disabled!
Config Local protection level: none
Config Remote protection level: none
------ Protection system initialization completed.
Config Routing for 10.180.10.7: local pub4 prv4
Config Route all4: 10.180.10.7 Dest=[::10.180.10.7]:1094
Plugin loaded
++++++ (c) 2015 CERN/IT-DSS MgmOfs (meta data redirector) 4.8.40
=====> mgmofs enforces SSS authentication for XROOT clients
jemalloc is loaded!
jemalloc heap profiling is disabled
=====> mgmofs.hostname: cceostest01.in2p3.fr
=====> mgmofs.hostpref: cceostest01
=====> mgmofs.managerid: cceostest01.in2p3.fr:1094
=====> mgmofs.fs: /
=====> mgmofs.targetport: 1095
=====> mgmofs.instance : eoslcgfr
=====> mgmofs.metalog: /var/eos/md
=====> mgmofs.txdir: /var/eos/tx
=====> mgmofs.authdir: /var/eos/auth
=====> mgmofs.reportstorepath: /var/eos/report
=====> mgmofs.cfgtype: quarkdb
=====> mgmofs.nslib : /usr/lib64/libEosNsQuarkdb.so
=====> mgmofs.qdbpassword length : 42
=====> mgmofs.qdbcluster : cceosqdb01.in2p3.fr:7777,cceosqdb02.in2p3.fr:7777,cceosqdb03.in2p3.fr:7777
=====> mgmofs.redirector : false
=====> mgmofs.broker : root://localhost:1097//eos/cceostest01.in2p3.fr/mgm
=====> mgmofs.defaultreceiverqueue : /eos/*/fst
=====> mgmofs.fs: /
=====> mgmofs.errorlog : enabled
=====> all.role: manager
=====> setting message filter: Process,AddQuota,Update,UpdateHint,Deletion,PrintOut,SharedHash,work
=====> comment log in /var/log/eos/mgm/logbook.log
=====> eosxd stacktraces log in /var/log/eos/mgm/eosxd-stacktraces.log
=====> eosxd logtraces log in /var/log/eos/mgm/eosxd-logtraces.log
=====> mgmofs.alias: cceostest.in2p3.fr
210413 10:17:28 time=1618301848.600738 func=Configure level=NOTE logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@cceostest01.in2p3.fr:1094 tid=00007f25f516c780 source=XrdMgmOfsConfigure:1534 tident= sec= uid=0 gid=0 name= geo=“” MGM_HOST=cceostest01.in2p3.fr MGM_PORT=1094 VERSION=4.8.40 RELEASE=1 KEYTABADLER=c98b24b8 SYMKEY=8fsZLKP1M4kg5TLMdL4QPmJmirE=
210413 10:17:28 time=1618301848.602398 func=set level=INFO logid=static… unit=mgm@cceostest01.in2p3.fr:1094 tid=00007f25f516c780 source=InstanceName:39 tident= sec=(null) uid=99 gid=99 name=- geo=“” Setting global instance name => eoslcgfr
210413 10:17:28 time=1618301848.602534 func=AddBroker level=INFO logid=static… unit=mgm@cceostest01.in2p3.fr:1094 tid=00007f25f516c780 source=XrdMqClient:173 tident= sec=(null) uid=99 gid=99 name=- geo=“” msg=“add broker” url=“root://localhost:1097//eos/cceostest01.in2p3.fr/mgm?xmqclient.advisory.status=1&xmqclient.advisory.query=1&xmqclient.advisory.flushbacklog=1”
[QCLIENT - ERROR - getNextEndpoint:112] Unable to resolve any endpoints, possible trouble with DNS
[QCLIENT - ERROR - getNextEndpoint:112] Unable to resolve any endpoints, possible trouble with DNS
[QCLIENT - ERROR - getNextEndpoint:112] Unable to resolve any endpoints, possible trouble with DNS
210413 10:17:28 time=1618301848.603666 func=Supervisor level=NOTE logid=afc1ca0a-9c30-11eb-b426-fa163eebf630 unit=mgm@cceostest01.in2p3.fr:1094 tid=00007f25b4bfd700 source=QdbMaster:237 tident= sec= uid=0 gid=0 name= geo=“” msg=“set up booting stall rule”
[QCLIENT - ERROR - getNextEndpoint:112] Unable to resolve any endpoints, possible trouble with DNS
[QCLIENT - ERROR - getNextEndpoint:112] Unable to resolve any endpoints, possible trouble with DNS
[QCLIENT - ERROR - getNextEndpoint:112] Unable to resolve any endpoints, possible trouble with DNS
[QCLIENT - ERROR - getNextEndpoint:112] Unable to resolve any endpoints, possible trouble with DNS
[QCLIENT - ERROR - getNextEndpoint:112] Unable to resolve any endpoints, possible trouble with DNS
[QCLIENT - ERROR - getNextEndpoint:112] Unable to resolve any endpoints, possible trouble with DNS
[QCLIENT - ERROR - getNextEndpoint:112] Unable to resolve any endpoints, possible trouble with DNS
[QCLIENT - ERROR - getNextEndpoint:112] Unable to resolve any endpoints, possible trouble with DNS

JM
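
When comparing start-up attempts across versions, it can help to reduce a log like the one above to a count of QClient errors. This is a small sketch using only `grep`, `sort` and `uniq`; the function names are made up for illustration, and the matched prefix is taken from the log lines shown above.

```shell
#!/bin/bash
# count_qclient_errors LOGFILE   (hypothetical helper name)
# Print the number of lines in LOGFILE that contain a QClient error.
count_qclient_errors() {
    grep -c -F '[QCLIENT - ERROR' "$1"
}

# distinct_qclient_errors LOGFILE   (hypothetical helper name)
# Print each distinct QClient error line with its number of occurrences.
distinct_qclient_errors() {
    grep -F '[QCLIENT - ERROR' "$1" | sort | uniq -c
}
```

For example, `count_qclient_errors /var/log/eos/mgm/xrdlog.mgm` can be run before and after a configuration change to see whether the errors stop accumulating.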

esindril · April 13, 2021, 8:24am

Can you also paste the /etc/xrd.cf.mgm config?

esindril · April 13, 2021, 8:34am

You need to modify mgmofs.qdbcluster to list the entries with a space (rather than a comma) as the separator.
After this everything should work fine.

Cheers,
Elvin

barbet · April 13, 2021, 8:43am

Elvin,
That is exactly it! Replacing the commas with spaces in both config files, /etc/xrd.cf.mgm and /etc/xrd.cf.mq, makes it work.
Is this something new that changed between EOS versions 4.8.31 and 4.8.40?
Do you still want to see /etc/xrd.cf.mgm?
Thank you very much.
JM

esindril · April 13, 2021, 8:55am

Hi JM,

Yes, the parsing changed to make it consistent with how it works in QuarkDB. Sorry, I forgot about this. I don’t need to see the config file. Glad things work now!

Cheers,
Elvin
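
The separator change discussed above can be sketched as a small filter that rewrites commas to spaces on the qdbcluster line only and leaves every other line untouched. The function name is made up, and the pattern assumes the directive in /etc/xrd.cf.mq also ends in `.qdbcluster`, which the thread does not confirm. Review the result with `diff` before replacing any file.

```shell
#!/bin/bash
# fix_qdbcluster_separators   (hypothetical helper name)
# Read a config on stdin and write it to stdout. On lines whose directive
# ends in ".qdbcluster" (e.g. mgmofs.qdbcluster), replace every comma with
# a space; all other lines pass through unchanged.
fix_qdbcluster_separators() {
    sed -e '/^[[:space:]]*[[:alnum:]_]*\.qdbcluster[[:space:]]/ s/,/ /g'
}
```

A cautious way to use it is `fix_qdbcluster_separators < /etc/xrd.cf.mgm > /tmp/xrd.cf.mgm.new`, then `diff /etc/xrd.cf.mgm /tmp/xrd.cf.mgm.new`, and only then copy the new file into place and restart the service.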

barbet · April 15, 2021, 9:11am

Hi Elvin,
I have updated a test machine that will be rebooted tomorrow for system updates, including the kernel (I will be on site), and I will perform the updates on the production managers (round-robin update) next week.

Regarding stopping the converter, the current state is converter off but balancer on:

[root@naneosmgr01(EOSMASTER) ~]# eos space status default

Space Variables
…
balancer := on
balancer.node.ntx := 4
balancer.node.rate := 100
balancer.threshold := 10
converter := off
converter.ntx := 2
drainer.node.nfs := 5
drainer.node.ntx := 40
drainer.node.rate := 500
drainperiod := 3600
geotagbalancer := off
geotagbalancer.ntx := 10
geotagbalancer.threshold := 5
graceperiod := 86400
groupbalancer := off
groupbalancer.ntx := 10
groupbalancer.threshold := 5
groupmod := 1
groupsize := 70
headroom := 50.00 GB
quota := off
scaninterval := 604800
tracker := off

=> Do I have to set the balancer to off? What is the exact command?

Thank you

JM
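
The thread ends without an answer to this question. As a sketch only: the variables in the status output above are normally changed with `eos space config <space> space.<key>=on|off`, so the helper below prints the commands that would switch the balancer and converter of a space to a given state. It prints and never runs anything; the function name is made up, and the command syntax should be checked against the `eos space` help of the installed version before use.

```shell
#!/bin/bash
# space_toggle_cmds SPACE STATE   (hypothetical helper name)
# Print, without running them, the commands that would set the balancer
# and the converter of SPACE to STATE (on or off).
space_toggle_cmds() {
    local space=$1 state=$2 key
    for key in balancer converter; do
        echo "eos space config $space space.$key=$state"
    done
}
```

`space_toggle_cmds default off` prints the two commands for the default space. After running them on the master MGM, `eos space status default` should show `balancer := off` and `converter := off`; calling the helper with `on` gives the commands to re-enable both once every node runs the same version.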
