EOS drain operations run at sub-optimal rates despite low IO contention on the FSTs and fsids involved, far below the rsync rates we achieve between the same systems.
For example, regardless of drainer config settings we do not appear to be able to exceed ~800 MB/min (~1 TB/day) when draining a single fsid. Concurrent drains increase the aggregate, but still fall far short of what rsync, etc. achieve.
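A quick sanity check on that conversion (simple arithmetic, decimal units):

```shell
# ~800 MB/min sustained over 24h, expressed in decimal TB/day
rate=$(awk 'BEGIN { printf "%.2f", 800 * 60 * 24 / 1e6 }')
echo "$rate TB/day"   # ~1.15 TB/day, i.e. roughly 1 TB/day
```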
There is very light load on the source FST, and essentially zero IO contention on the FSTs serving the destination fsids within the scheduling groups being drained to.
We have six FSTs, each with 10G networking and JBODs attached via multiple SAS connections.
Running fio, rsync, etc. directly on the underlying file systems shows vastly higher IOPS and throughput than what draining achieves.
- What is the highest drain rate one can expect, given low IO contention and proper drain tunings?
- Are other sites able to achieve drain rates above ~1 TB/day per fsid?
- What guidelines exist for determining optimum drainer.node.ntx and drainer.node.rate values?

We plug in various values for ntx and rate, but the performance remains capped.
Currently we are draining four fsids and seeing:
18:20:06 # eos config dump -g | grep drain | grep default
global:/config/eosaliceornl/space/default#drainer.node.nfs => 5
global:/config/eosaliceornl/space/default#drainer.node.ntx => 10
global:/config/eosaliceornl/space/default#drainer.node.rate => 200
global:/config/eosaliceornl/space/default#drainperiod => 259200
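For context, those values were set with commands along these lines (example values only, not recommendations; the `eos space config` key names follow our `config dump` output above and may differ by EOS version):

```shell
# Adjust drainer concurrency and rate caps on the 'default' space
eos space config default space.drainer.node.ntx=10    # concurrent drain transfers per node
eos space config default space.drainer.node.rate=200  # MB/s cap per drain transfer
eos space config default space.drainperiod=259200     # drain deadline in seconds (3 days)
```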
18:16:47 # eos io stat -x
┌───┬────────────────────────┬────────┬────────┬────────┬────────┐
│io │ application│ 1min│ 5min│ 1h│ 24h│
└───┴────────────────────────┴────────┴────────┴────────┴────────┘
out eos/draining 2.37 G 8.58 G 181.71 G 2.10 T
out eoscp 0 0 498.59 M 498.59 M
out other 913.61 M 3.27 G 27.54 G 314.12 G
out tpc 681.07 M 6.99 G 60.19 G 930.69 G
in eos/draining 2.37 G 8.60 G 181.71 G 2.10 T
in transfer-3rd 0 0 0 2.72 M
in converter 752.04 M 6.99 G 60.19 G 930.86 G
in other 0 0 62.76 M 1.25 G
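The 24h draining figure above works out to a very modest aggregate rate (quick arithmetic, treating 2.10 T as decimal TB):

```shell
# 2.10 TB drained in 24h, as an average rate across all draining fsids
agg=$(awk 'BEGIN { printf "%.1f", 2.10e12 / 86400 / 1e6 }')
echo "$agg MB/s aggregate"   # ~24.3 MB/s
```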
18:39:22 # eos fs ls -d
┌──────────────────────────┬────┬──────┬────────────────────────────────┬────────────┬────────────┬────────────┬────────────┬───────────┬──────┬──────┐
│host │port│ id│ path│ drainstatus│ progress│ files│ bytes-left│ timeleft│ retry│ wopen│
└──────────────────────────┴────┴──────┴────────────────────────────────┴────────────┴────────────┴────────────┴────────────┴───────────┴──────┴──────┘
warp-ornl-cern-01.ornl.gov 1095 25 /warpfs/n02/cern-01 draining 0 903.81 K 35.07 TB 84415 0 0
warp-ornl-cern-01.ornl.gov 1095 26 /warpfs/n02/cern-02 draining 0 866.03 K 33.27 TB 73058 0 0
warp-ornl-cern-01.ornl.gov 1095 27 /warpfs/n02/cern-03 draining 0 856.51 K 35.56 TB 73071 0 0
warp-ornl-cern-01.ornl.gov 1095 28 /warpfs/n02/cern-04 draining 0 910.22 K 36.75 TB 73075 0 0
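Assuming `timeleft` is seconds remaining in the drain period, a rough estimate of the sustained rate fsid 25 would need in order to finish on time:

```shell
# 35.07 TB left with 84415 s remaining (fsid 25 in the listing above)
need=$(awk 'BEGIN { printf "%.0f", 35.07e12 / 84415 / 1e6 }')
echo "$need MB/s required"   # ~415 MB/s, far above what we observe
```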
On a destination FST, tail -f /eos/log/fst/eoscp.log shows the effective copy rate varies, but seldom exceeds ~40 MB/s, even though iowait on the FST stays low (sar output below):
[eoscp] #################################################################
[eoscp] # Date : ( 1529944965 ) Mon Jun 25 18:42:45 2018
[eoscp] # auth forced=sss krb5=<none> gsi=<none>
[eoscp] # Source Name [00] : root://warp-ornl-cern-01.ornl.gov:1095//replicate:07930e08
[eoscp] # Destination Name [00] : root://warp-ornl-cern-02.ornl.gov:1095//replicate:07930e08
[eoscp] # Data Copied [bytes] : 3528358
[eoscp] # Realtime [s] : 0.521000
[eoscp] # Eff.Copy. Rate[MB/s] : 6.772280
[eoscp] # Bandwidth[MB/s] : 200
[eoscp] # Write Start Position : 0
[eoscp] # Write Stop Position : 3528358
[eoscp] #################################################################
[eoscp] # Date : ( 1529944965 ) Mon Jun 25 18:42:45 2018
[eoscp] # auth forced=sss krb5=<none> gsi=<none>
[eoscp] # Source Name [00] : root://warp-ornl-cern-01.ornl.gov:1095//replicate:0678fb3e
[eoscp] # Destination Name [00] : root://warp-ornl-cern-02.ornl.gov:1095//replicate:0678fb3e
[eoscp] # Data Copied [bytes] : 10344
[eoscp] # Realtime [s] : 0.670000
[eoscp] # Eff.Copy. Rate[MB/s] : 0.015439
[eoscp] # Bandwidth[MB/s] : 200
[eoscp] # Write Start Position : 0
[eoscp] # Write Stop Position : 10344
[eoscp] #################################################################
[eoscp] # Date : ( 1529944964 ) Mon Jun 25 18:42:44 2018
[eoscp] # auth forced=sss krb5=<none> gsi=<none>
[eoscp] # Source Name [00] : root://warp-ornl-cern-01.ornl.gov:1095//replicate:078121ce
[eoscp] # Destination Name [00] : root://warp-ornl-cern-02.ornl.gov:1095//replicate:078121ce
[eoscp] # Data Copied [bytes] : 9280589
[eoscp] # Realtime [s] : 0.345000
[eoscp] # Eff.Copy. Rate[MB/s] : 26.900258
[eoscp] # Bandwidth[MB/s] : 200
[eoscp] # Write Start Position : 0
[eoscp] # Write Stop Position : 9280589
[eoscp] #################################################################
[eoscp] # Date : ( 1529944965 ) Mon Jun 25 18:42:45 2018
[eoscp] # auth forced=sss krb5=<none> gsi=<none>
[eoscp] # Source Name [00] : root://warp-ornl-cern-01.ornl.gov:1095//replicate:07b48c9c
[eoscp] # Destination Name [00] : root://warp-ornl-cern-02.ornl.gov:1095//replicate:07b48c9c
[eoscp] # Data Copied [bytes] : 37070844
[eoscp] # Realtime [s] : 1.100000
[eoscp] # Eff.Copy. Rate[MB/s] : 33.700766
[eoscp] # Bandwidth[MB/s] : 200
[eoscp] # Write Start Position : 0
[eoscp] # Write Stop Position : 37070844
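A quick way to aggregate these entries into total bytes moved and peak effective rate (awk sketch; field layout assumed from the excerpts above, with sample values inlined — in practice pipe in `tail -n 200 /eos/log/fst/eoscp.log` instead):

```shell
# Sum 'Data Copied' and track the peak 'Eff.Copy. Rate' per log slice
summary=$(printf '%s\n' \
  '[eoscp] # Data Copied [bytes]     : 3528358' \
  '[eoscp] # Eff.Copy. Rate[MB/s]    : 6.772280' \
  '[eoscp] # Data Copied [bytes]     : 37070844' \
  '[eoscp] # Eff.Copy. Rate[MB/s]    : 33.700766' |
  awk -F': *' '
    /Data Copied \[bytes\]/     { bytes += $2 }
    /Eff\.Copy\. Rate\[MB\/s\]/ { if ($2 > peak) peak = $2 }
    END { printf "%d bytes moved, peak %.1f MB/s", bytes, peak }')
echo "$summary"
```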
sar confirms iowait on the FST is only ~4%:
10:50:01 AM CPU %user %nice %system %iowait %steal %idle
11:00:01 AM all 0.50 0.00 10.86 3.62 0.00 85.02
11:10:01 AM all 0.59 0.00 11.84 3.66 0.00 83.91
11:20:01 AM all 0.50 0.00 11.34 3.68 0.00 84.48
11:30:01 AM all 0.77 0.00 10.84 4.30 0.00 84.09
11:40:01 AM all 0.71 0.00 11.80 4.03 0.00 83.46
11:50:01 AM all 0.50 0.00 11.84 3.52 0.00 84.14
12:00:01 PM all 0.52 0.00 12.53 3.71 0.00 83.24