Central drain

First time draining using the central drain. It’s going somewhat slowly, and I’m wondering what the safest way to speed it up would be. I’m aware there are limits given network and disk speeds, etc. Just looking to incrementally poke some settings to see if it helps.

Can someone who is familiar with central draining give us some pointers?

Hi Daniel,

You can configure the pool of threads that deal with drain requests using the following command:
eos ns max_drain_threads <#>
This sets the size of the pool of threads that handle drain jobs. You can check the status of the drain pool in the output of ‘eos ns’. You can also raise the maximum number of drain jobs running in parallel per file system and per node: these are the usual “drainer.fs.ntx” and “drainer.node.nfs” parameters.

These knobs should be enough to speed up the draining.
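For reference, these settings are applied with ‘eos ns’ and ‘eos space config’. A minimal sketch (the space name “default” and the numeric values below are assumptions; substitute your own):

```shell
# Grow the central-drain thread pool (maximum number of threads).
eos ns max_drain_threads 100

# Allow more parallel drain transfers per draining file system,
# and more simultaneously draining file systems per node.
# "default" is an assumed space name; values are illustrative.
eos space config default space.drainer.fs.ntx=20
eos space config default space.drainer.node.nfs=10
```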


Right, I’ve been watching the values in ‘eos ns’:

ALL drain info thread_pool=central_drain min=10 max=100 size=23 queue_size=11

100 threads should be plenty (this must be the default; I’ve never changed it).

I have 6 filesystems draining on two FSTs with these values configured:
drainer.fs.ntx := 8
drainer.node.nfs := 5
drainer.node.ntx := 16
drainer.node.rate := 25
drainperiod := 259200

In the help section for ‘eos space’ we have:
space.drainer.node.rate=<MB/s> : configure the nominal transfer bandwidth per running transfer on a node [ default=25 (MB/s) ]
space.drainer.node.ntx=<#> : configure the number of parallel draining transfers per node [ default=2 (streams) ]
space.drainer.node.nfs=<#> : configure the number of max draining filesystems per node (Valid only for central drain) [ default=5 ]
space.drainer.retries=<#> : configure the number of retries for the draining process (Valid only for central drain) [ default=1 ]
space.drainer.fs.ntx=<#> : configure the number of parallel draining transfers per fs (Valid only for central drain) [ default=5 ]

Even though they don’t mention central drain specifically, are space.drainer.node.rate and space.drainer.node.ntx still used?

Hi Daniel,

Only the two parameters that I mentioned in my previous comment are used by the central draining. The rest will soon go away and have no effect on current performance. The fact that the queue_size is 11 means there are actually not that many files queued to drain. So how much are you actually draining? Can you also post the output of ‘eos fs ls -d’?


Yeah, knew I should have posted that in my last one…

In ‘eos ns’ the queue size varies. I’ve seen it as high as 30, but never higher.

ALL      drain info                       thread_pool=central_drain min=10 max=100 size=15 queue_size=26
│host                    │port│    id│                            path│       drain│    progress│       files│  bytes-left│   timeleft│      failed│
 cmseos16.fnal.gov        1095     56                   /storage/data1     draining           30     346.72 K     20.85 TB      169573            1 
 cmseos16.fnal.gov        1095     57                   /storage/data2     draining           23     508.80 K     27.43 TB      169578            9 
 cmseos16.fnal.gov        1095     58                   /storage/data3     draining           25     456.40 K     31.13 TB      169584            2 
 cmseos17.fnal.gov        1095     59                   /storage/data1     draining           33     323.50 K     19.85 TB      169588            2 
 cmseos17.fnal.gov        1095     60                   /storage/data2     draining           24     496.11 K     27.09 TB      169590            3 
 cmseos17.fnal.gov        1095     61                   /storage/data3     draining           22     528.63 K     33.86 TB      169595            1 

In this case I would increase drainer.fs.ntx to 100 or so, since you have threads available but not enough transfers being put in the queue. This should speed things up.
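That change would look something like this (the space name “default” is an assumption; use whichever space the draining file systems belong to):

```shell
# Queue more parallel drain transfers per draining file system.
eos space config default space.drainer.fs.ntx=100

# Verify the new value and watch the drain pool fill up.
eos space status default | grep drainer
eos ns | grep drain
```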

Note also that changing these parameters does not impact drains that are already running; we have observed that in the past as well. There are some details in this post.

The default values are quite low for a drain to be fast enough (even when draining only one FS). On our instance we settled on the values below, and they fit normal use (draining one or two FSs). I'm not sure whether they could cause an overuse of resources when draining many FSs at the same time, as you are doing.

drainer.fs.ntx                   := 50
drainer.node.nfs                 := 50
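Applied via ‘eos space config’, those values would look like this (again, the space name “default” is an assumption):

```shell
# Values one site reports using for routine drains of one or two FSs.
eos space config default space.drainer.fs.ntx=50
eos space config default space.drainer.node.nfs=50
```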