I am currently draining servers. We are not using RAIN; the filesystems on the servers are partitions of a large RAID6 volume. I have some questions:
When trying to set the parameters for draining, I observed that I could not set
drainer.node.ntx to more than 10. Also, I wanted to set drainperiod to 7 days (604800 seconds), but it seems it is still 86400.
Is the graceperiod taking over? What is the respective impact of these 2 timeouts?
I put 2 FS in drain but noticed that they were processed one after the other, so there is no parallelism here.
Is this by design?
When draining starts, it looks like the (auto) balancer is temporarily suspended. Does it restart
automatically after draining is over, or does one have to restart it manually? If so, how?
Among the other servers in the cluster that are pulling data as part of the drain procedure, it seems that
some are pulling data more efficiently than others (at least for some time). Can this be
explained?
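For reference, the 7-day value from the first question and the console lines that would go with it can be sketched as below. This is only a sketch: the key name `space.drainperiod` is an assumption made by analogy with `space.drainer.node.ntx`, and should be checked against the `space config` help before use. The script only prints the lines and sends nothing to EOS.

```shell
#!/bin/sh
# Sketch: derive the drain period in seconds and print the console lines
# that would set and then display it. The lines are echoed, not executed.

SPACE="default"                       # space name used in this thread
DAYS=7                                # desired drain period in days
DRAIN_PERIOD=$((DAYS * 24 * 3600))    # 7 days expressed in seconds

# ASSUMPTION: "space.drainperiod" is the space-level key name; verify it
# with the EOS console help before applying.
CMD_PERIOD="space config ${SPACE} space.drainperiod=${DRAIN_PERIOD}"
CMD_STATUS="space status ${SPACE}"

echo "${CMD_PERIOD}"
echo "${CMD_STATUS}"
```

Running `space status` afterwards is a way to see whether the new value was actually stored or whether 86400 is still in effect.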
I am still draining servers (it is slow because the servers have 2x1 Gbit/s Ethernet adapters). This morning I tried to start all remaining drains at the same time, hoping that they would be processed in parallel, but nope: I can already see that only the first one is progressing. So, asking the question again: is there a way to have drain operations done in parallel?
Hi Jean Michel,
I guess you are using the distributed drain, right? (The new central drain is not enabled by default.)
I see that you have a maximum of 10 transfers per node for all filesystems, and you said that you cannot raise this limit.
I just tried, and I can raise the limit without problems:
space config default space.drainer.node.ntx=40
success: setting drainer.node.ntx=40
EOS Console [root://localhost] |/eos/dev01/test/andrea/> space status default
I successfully changed the drainer.node.ntx to 40 (I do not understand why I could not in the first place).
But it does not change the way the filesystems are drained, that is, one after the other. The only thing that changed is the number of processes performing eoscp on the target nodes. The network traffic on our EOS cluster can be seen here: http://alimonitor.cern.ch?2692 (a grid certificate allowed by Alice is needed).
Currently 10 filesystems are draining, but only the first one is progressing:
Do you mean that you have only one scheduling group in your system, or that the FSs that are draining belong to the same scheduling group?
In the second case, if all your FSs under drain belong to the same scheduling group, only FSs in that same scheduling group can pull data from them… Maybe there are not many FSs in that scheduling group which are not under drain?
I have tried to reproduce this behaviour on my testbed but could not. I moved all FSs to the same scheduling group (default.0) and tried to drain 3 FSs, and I could see them being drained in parallel.