In our production deployments, we tend to use the contents of /etc/eos.keytab for QDB authentication as well, but you can use any sufficiently long string as a password.
qdbpassword_file simply specifies a file whose contents will be used as the password to authenticate against QDB. If the QDB instance has no password set, things will work without it, but setting up a password is highly recommended.
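For illustration, one way to generate such a password file (the path and length here are arbitrary example choices, not EOS requirements; as noted above, production setups often just reuse /etc/eos.keytab):

```shell
# Generate a sufficiently long random password and store it in a file
# readable only by its owner (path is just an example).
openssl rand -base64 32 > /tmp/eos_qdb_password
chmod 400 /tmp/eos_qdb_password
```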
In /etc/xrd.cf.quarkdb?
[…]
redis.password_file /etc/eos.keytab
[…]
b) EOS MGM configuration /etc/xrd.cf.mgm:
[…]
mgmofs.qdbpassword_file /etc/eos.keytab
[…]
Right?
If a password string is used instead of a file, could one use the EOS_QUARKDB_PASSWORD directive in /etc/sysconfig/eos_env? And would it be sufficient to configure both QuarkDB and the EOS MGM?
Yes, exactly. Both QDB and the MGM are configured to use password files with identical contents.
I think EOS_QUARKDB_PASSWORD is a configuration option only relevant to the namespace tests… sorry about that. To use a string instead of a password file, try mgmofs.qdbpassword instead of mgmofs.qdbpassword_file.
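For example, the MGM config line would then look something like this (the password string is a placeholder, not a recommended value):

```
# /etc/xrd.cf.mgm -- inline password instead of a password file
mgmofs.qdbpassword some_sufficiently_long_secret_string
```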
On a test instance that was correctly migrated to the QuarkDB namespace, we have been trying the configuration-in-QuarkDB and master/slave setup, but there is an issue: the slave boots and loads the namespace, but the boot state is not correct:
# ------------------------------------------------------------------------------------
# Namespace Statistics
# ------------------------------------------------------------------------------------
ALL Files 29641 [failed] (1558626981s)
ALL Directories 16597
ALL Total boot time 1558626978 s
#
Even though we can browse the files using the eos command line, changes on the master do not seem to be replicated to the slave.
The configuration is correctly stored in QuarkDB:
# eos config ls
Existing Configurations on QuarkDB
================================
created: Thu May 23 12:25:43 2019 name: backup
created: Fri May 24 11:08:37 2019 name: default *
But it seems we still have the old replication scheme (on both MGMs), with a mention of file-based configuration. Is that normal?
ALL Replication mode=master-rw state=master-rw master=master.cidsn.jrc.it configdir=/var/eos/config/master.cidsn.jrc.it/ config=default mgm:slave.cidsn.jrc.it=down mq:slave.cidsn.jrc.it:1097=ok
The slave can see the master (mgm:…=ok), but the master doesn’t see the slave (mgm:…=down).
So I suppose something is missing; maybe someone can spot what it could be?
In addition to this, I have some questions about the QuarkDB master/slave architecture:
How do the FSTs learn about the current master? Do we have to ensure that the hostnames set in the EOS_MGM_ALIAS and EOS_BROKER_URL environment variables always resolve to the current master, for instance via some mechanism that updates DNS depending on the current master?
Is all the configuration supposed to be replicated to the slave in real time as well (fs status, vid, space configuration, etc.)?
You need to enable the QuarkdbMaster class, which is different from the old Master implementation that you are apparently using. Therefore, put the following env variable in the /etc/sysconfig/eos_env file on both the master and the slave machines:
EOS_USE_QDB_MASTER=1
So nothing in particular needs to be done on the FST side, right?
I ask because sometimes when an MGM starts as slave, it seems it can’t see the FSTs (they have unknown status).
In fact, the question I am left with is: how do we get all MGMs (master and slave) contacted by the FSTs, so that each of them sees the FSTs as online?
And in general, what would be the correct setup so that clients also contact the new master when a new master is elected? Should it be a DNS alias that automatically changes based on which node holds the master lease?
One thing I forgot to mention is that the mq daemons also need the pointers to the QDB cluster. Therefore, in /etc/xrd.cf.mq you also need to put:
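The exact lines are not quoted in this post; based on the mq.qdbcluster directive discussed further down in the thread, they presumably look something like this (the password_file directive is an assumption mirroring the MGM configuration):

```
# /etc/xrd.cf.mq -- pointers to the QDB cluster
mq.qdbcluster node1:7777,node2:7777,node3:7777
mq.qdbpassword_file /etc/eos.keytab
```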
With this the slave should get the correct status of the FSTs.
Nevertheless, if you want clients to always connect to the correct master, then you either have to do some DNS trick to point the alias to the current master, or you can deploy another MGM that acts purely as a redirector and redirects clients to the correct MGM master.
For this you need to set up a route using the “eos route” command on the newly deployed MGM redirector, something like this:
eos route link /eos/ eos-mgm1.cern.ch:1094:8000,eos-mgm2.cern.ch:1094:8000
The MGM redirector will continuously poll the status of the other MGMs in the route and figure out who is the master, thereby properly redirecting clients to the correct MGM. There is a small penalty for the redirection, but for normal operations this should not be noticeable.
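To double-check the setup afterwards, recent EOS versions can list the configured routes (assuming the ls subcommand of eos route is available in your version):

```
eos route ls
```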
If you do a master-slave switch, the MQ on the old master will still be the “master” MQ. Therefore, you need this configuration also in the MQ to have everything working fine after you restart one of the MGMs.
Otherwise, you can end up with missing (FST) updates after you do a transition and then restart both the MGM and MQ on the new slave.
Thank you Elvin for this clarification; indeed, this mq configuration seems to solve the problem.
With this, the MQ will automatically switch to the same role (slave or master) as the MGM running on the same host, correct?
Another aspect I’m not sure how to handle is the default maximum cache sizes. Changing them with the eos ns cache set -d/-f command does not seem to persist across a restart, nor even across a master/slave switch. We have plenty of RAM, and we need to set them higher than the defaults (30M files and 3M directories).
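For reference, and assuming the syntax of the command mentioned above, the calls we issued look roughly like this, with the targets stated above:

```
eos ns cache set -f 30000000   # max number of cached files (30M)
eos ns cache set -d 3000000    # max number of cached directories (3M)
```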
I added the mq.qdb lines to /etc/xrd.cf.mq, but I was wondering about the best way to point to the QDB cluster. Which of these should we use?
mq.qdbcluster localhost:7777
mq.qdbcluster round-robin-dns-alias:7777
mq.qdbcluster node1:7777,node2:7777,node3:7777
I also noticed that when executing the eos ns master other command, the message is misleading (it says that the current machine is now the master, which is wrong…):
[root@nanxrd15(EOSMASTER) ~]#eos -b ns | grep Repli
ALL Replication is_master=true master_id=nanxrd15.in2p3.fr:1094
[root@nanxrd15(EOSMASTER) ~]#eos -b ns master other
success: <nanxrd15.in2p3.fr:1094> is now the master
Is there a simple way to check whether the mq daemon is master or slave?
And to finish, I find that the /var/log/eos/mq/xrdlog.mq log on the slave mq keeps growing with lines such as:
I understand that after changing the mq config file, the mq daemon needs to be reloaded. Is it safe to restart the master MQ while in activity, or is there a risk of losing some messages and causing data corruption?
I confirm @barbet’s observation of the incorrect message when doing a master/slave switch with eos ns master other (it mentions the old master instead of the new one). And it would indeed help in troubleshooting if we had an easy way to know whether an MQ is master or slave.
Sorry for the late reply, I was on holiday last week. For your first question, you can use either of the last two options, but for simplicity I would go for option 3, where you list all the QDB nodes. The qclient will take care of connecting to the correct master, so you don’t have to worry about that.
Indeed, the message is a bit misleading, but not totally wrong: at the moment you issue the command, that machine is still the master, but it will soon drop out. I will improve that.
For your last question: the messages in the logs are normal, and they are also an indication that the current MQ is acting as a slave. There is no other way at the moment to tell whether the MQ is acting as master or slave. As I said before, we plan to drop this daemon soon, so all of this will be greatly simplified.