# ------------------------------------------------------------------------------------
# Namespace Statistics
# ------------------------------------------------------------------------------------
ALL Files 86058 [booted] (0s)
ALL Directories 16629
ALL Total boot time 1 s
# ------------------------------------------------------------------------------------
ALL Replication is_master=true master_id=s-jrciprcids90v.cidsn.jrc.it:1094
# ------------------------------------------------------------------------------------
ALL files created since boot 600
ALL container created since boot 0
# ------------------------------------------------------------------------------------
ALL current file id 314502
ALL current container id 19997
# ------------------------------------------------------------------------------------
ALL eosxd caps 0
ALL eosxd clients 3
# ------------------------------------------------------------------------------------
ALL File cache max num 30000000
ALL File cache occupancy 75993
ALL In-flight FileMD 0
ALL Container cache max num 3000000
ALL Container cache occupancy 1933
ALL In-flight ContainerMD 0
# ------------------------------------------------------------------------------------
ALL memory virtual 2.65 GB
ALL memory resident 473.95 MB
ALL memory share 21.87 MB
ALL memory growths 561.50 MB
ALL threads 221
ALL fds 267
ALL uptime 105307
# ------------------------------------------------------------------------------------
ALL drain info id=default, thread_pool_min=1, thread_pool_max=400, thread_pool_size=1, queue_size=0
# ------------------------------------------------------------------------------------
This is our test instance, but we observe the same behaviour on our production instance as well.
OK, this explains why it's not working as it should.
The version you have is still missing some features needed for the HA setup to work.
In particular, this is what prevents the balancer from being enabled: https://gitlab.cern.ch/dss/eos/blob/4.4.23/mgm/QdbMaster.hh#L140
I suggest using any version >= 4.4.47 for the HA setup to work properly.
OK, thanks, so we will move to the testing repository.
Do you consider it safe enough to run this newer version in production?
Edit: I also had another question about the central drain: it seems to be recommended when using the QDB namespace, but is there any particular configuration to change or to pay attention to?
We have a diverse setup running in production with the oldest version being 4.4.34. We also have one instance running 4.5.2 and things are stable.
For the central drain you need to set the following configuration option in /etc/xrd.cf.mgm to enable it:
mgmofs.centraldrain true
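For context, here is a minimal sketch of how that directive might sit in the MGM configuration file; the comment and restart command are illustrative, and the systemd unit name assumes a standard EOS citrine install:

```
# /etc/xrd.cf.mgm (excerpt) -- enable the central drain engine
mgmofs.centraldrain true
```

After adding the line, the MGM typically needs a restart (e.g. `systemctl restart eos@mgm` on a systemd-based setup) for the setting to take effect.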
Then you should see a summary at the end of the eos ns command output about the thread pool used for draining. Something along these lines:
# ------------------------------------------------------------------------------------
ALL drain info thread_pool=central_drain min=16 max=1000 size=89 queue_size=85
# ------------------------------------------------------------------------------------
You can increase the size of the thread pool by using the following command:
eos ns max_drain_threads <num>
but the default should be ok in general.
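As a sketch, resizing the pool and then verifying the change could look like the following; the value 256 is just an illustrative choice, and these commands need to run against a live MGM:

```shell
# Raise the maximum number of central-drain worker threads (illustrative value)
eos ns max_drain_threads 256

# The drain summary at the end of `eos ns` should reflect the new max
eos ns | grep 'drain info'
```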
You can also monitor the number of started/successful/failed drain jobs by doing:
> eos ns stat | grep DrainCentral
all DrainCentralFailed 4.37 M 0.00 0.15 0.57 1.38 -NA- -NA-
all DrainCentralStarted 122.10 M 0.00 1.37 6.58 8.00 -NA- -NA-
all DrainCentralSuccessful 117.73 M 0.00 1.22 5.99 6.63 -NA- -NA-
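Since these are monotonically growing counters, a quick ratio gives a feel for overall drain health. As a hypothetical helper (the counter values below are the ones from the sample output above, with the human-readable suffixes expanded; a real script would pipe in `eos ns stat` instead), one could compute the success rate with awk:

```shell
# Sample `eos ns stat | grep DrainCentral` counters (expanded from 122.10 M etc.)
# used here instead of querying a live MGM.
sample_counters() {
  cat <<'EOF'
all DrainCentralFailed 4370000
all DrainCentralStarted 122100000
all DrainCentralSuccessful 117730000
EOF
}

# Success rate = successful jobs / started jobs
sample_counters | awk '
  $2 == "DrainCentralStarted"    { started = $3 }
  $2 == "DrainCentralSuccessful" { ok = $3 }
  END { if (started > 0) printf "drain success: %.1f%%\n", 100 * ok / started }
'
# prints: drain success: 96.4%
```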
You can further control the draining with the following parameters:
space config <space-name> space.drainer.node.ntx=<#> : configure the number of parallel draining transfers per node [ default=2 (streams) ]
space config <space-name> space.drainer.fs.ntx=<#> : configure the number of parallel draining transfers per fs (Valid only for central drain) [ default=5 ]
space config <space-name> space.drainperiod=<sec> : configure the default drain period if not defined on a filesystem (see fs for details)
All logs related to the draining activity are collected in /var/log/eos/mgm/DrainJob.log.
Thank you @esindril for the drain information; we did indeed get it working correctly on the test instance.
However, on our first attempts, almost all DrainJobs timed out after 10 minutes (operation expired), whereas all other activity (FUSE access, balancing, conversion, …) was working correctly, and the corresponding files were healthy. There was one FST running an older version (4.4.25 instead of 4.4.46) that was upgraded (to 4.4.47); then, after a while and some MGM restarts, draining started to work consistently. Could there be some minimum version (FST/MGM) required for the central drain to work correctly?
Glad to hear that everything is working as expected. As far as versions are concerned, there were indeed quite a few modifications between 4.4.23 and 4.4.47, but I couldn't spot any clear incompatibility that would make the drain time out. If you want, we can investigate further.
Today a drain fired up after a disk failure, and the central drain is working correctly. However, for our first central drain operation, we find that it could go faster: the DrainCentralStarted rate is limited to 2.75.
all DrainCentralStarted 52.41 K 2.75 2.61 2.77 2.76 -NA
all DrainCentralSuccessful 52.41 K 2.75 2.80 2.81 2.76 -NA
Indeed, we observe from DrainJob.log that 11 drain requests are launched every 4 seconds (hence the 2.75 rate) and finish in less than one second; then for 3 seconds nothing happens.
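The arithmetic behind that rate is straightforward: 11 jobs per 4-second window averages out to exactly the reported value. As a trivial sanity check:

```shell
# 11 drain jobs launched every 4 seconds -> average jobs per second
awk 'BEGIN { printf "%.2f\n", 11 / 4 }'
# prints: 2.75
```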
How can we increase the number of parallel requests that can be launched? There is currently only one filesystem draining, within a group of 37 filesystems.
Could you also post the configuration of the drain thread pool by running “eos ns”?
In principle, if there is not enough parallelism one can increase the size of the drain thread pool, but for such a recommendation I first need to see the output of “eos ns”.
Thank you for your answer. You are right, I forgot to post this information, and I didn't consider changing it. These are the default values that are mentioned to be OK, and from what I understand, they are much higher than what we observe (400 max threads).
# ------------------------------------------------------------------------------------
ALL drain info thread_pool=central_drain min=80 max=400 size=80 queue_size=0
# ------------------------------------------------------------------------------------
Currently no drain is running any more, but from what I remember this output was the same while draining.
Let’s revisit this once you have some draining running. Another thing to monitor is the network IO in the cluster, to see if there is any other significant activity happening at the same time.
I’d like to know more about how to set up the MGM redirector: is it just an empty MGM with its own QuarkDB backend that has routes configured to point at the actual EOS cluster MGMs?
Yes, exactly, in this case the QuarkDB backend can be a standalone one since you only have to create the redirection paths in the MGM redirector. Afterwards, you set up the routes and that is pretty much it. You can find more info about the route setup here: http://eos-docs.web.cern.ch/eos-docs/configuration/route.html
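For illustration, assuming the redirector should forward everything under a path like /eos/foo to the MGM of an existing instance (the path, hostname, and port below are made up), the route setup might look like this; if the syntax differs in your version, check the route documentation linked above:

```shell
# Hypothetical route: requests under /eos/foo get redirected to mgm01
eos route link /eos/foo/ mgm01.example.com:1094

# List the configured routes to verify
eos route ls
```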
We had another drain this week. It seems to have gone faster than the previous one. Could it just be that changing the values does not affect an already running drain?