
Drain Stalling Despite Available Capacity


(Pete Eby) #1

Attempting to drain these two fsids results in the drains stalling, but nothing is logged in /var/log/eos/fst/eoscp.log on the FST.

Any hints on how to get the drains working?

23:44:05 # eos fs ls ornl-cern-02.ornl.gov

...ornl-cern-02.ornl.gov (1095)     59  /warpfs/cern-09        default.2                        booted          drain     stalling   online
...ornl-cern-02.ornl.gov (1095)     60  /warpfs/cern-10        default.3                        booted          drain     stalling   online

There is room in each respective group:

root@alice-eos-01.ornl.gov:~
23:43:53 # eos group ls --io
#----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#           name # diskload # diskr-MB/s # diskw-MB/s #eth-MiB/s # ethi-MiB # etho-MiB #ropen #wopen # used-bytes #  max-bytes # used-files # max-files #  bal-shd #drain-shd
#----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
default.0              0.00            0            0       4887          1        457     67      0    155.22 TB    200.07 TB       7.04 M     87.62 G          0          0
default.1              0.00            0            0       4887          1        457     69      0    181.51 TB    196.78 TB       9.42 M     29.85 G          0          0
default.2              0.00            0            0       4887        268        407    307      0    149.45 TB    206.67 TB       8.84 M    111.79 G          0          0
default.3              0.00            0            0       4887        268        407    288      0    174.50 TB    203.37 TB       9.24 M     56.42 G          0          0
default.4              0.00            0            0       4887          2        688    315      0    175.33 TB    203.37 TB       9.14 M     54.80 G          0          0
default.5              0.00            0            0       4887          2        688    293      0    175.83 TB    203.37 TB       9.15 M     53.83 G          0          0
default.6              0.00            0            0       3695          1        401     34      0     99.82 TB    163.28 TB       7.31 M    123.95 G          0          0
default.7              0.00            0            0       4887          1        407    110      0    162.80 TB    236.39 TB       6.67 M    143.76 G          0          0
recovery               0.00            0            0       4768          0          8      0      0     67.04 GB     67.10 GB            0      4.19 M          0          0

The log on the FST performing the drains is empty:

root@warp-ornl-cern-02.ornl.gov:~
23:42:18 # du -sh /var/log/eos/fst/eoscp.log*
512     /var/log/eos/fst/eoscp.log
512     /var/log/eos/fst/eoscp.log-20180329.gz
512     /var/log/eos/fst/eoscp.log-20180330.gz
512     /var/log/eos/fst/eoscp.log-20180331.gz
512     /var/log/eos/fst/eoscp.log-20180401.gz
512     /var/log/eos/fst/eoscp.log-20180402.gz
512     /var/log/eos/fst/eoscp.log-20180403.gz
512     /var/log/eos/fst/eoscp.log-20180404.gz
512     /var/log/eos/fst/eoscp.log-20180405.gz
512     /var/log/eos/fst/eoscp.log-20180406.gz
512     /var/log/eos/fst/eoscp.log-20180407.gz
512     /var/log/eos/fst/eoscp.log-20180408.gz
512     /var/log/eos/fst/eoscp.log-20180409.gz
512     /var/log/eos/fst/eoscp.log-20180410.gz

root@warp-ornl-cern-02.ornl.gov:~
23:42:50 # zcat /var/log/eos/fst/eoscp.log*
gzip: /var/log/eos/fst/eoscp.log: unexpected end of file

Space config:

root@alice-eos-01.ornl.gov:~
23:48:59 # eos space status default
# ------------------------------------------------------------------------------------
# Space Variables
# ....................................................................................
autorepair                       := on
balancer                         := off
balancer.node.ntx                := 12
balancer.node.rate               := 500
balancer.threshold               := 20
converter                        := on
converter.ntx                    := 20
drainer.node.ntx                 := 10
drainer.node.rate                := 100
drainperiod                      := 259200
geotagbalancer                   := off
geotagbalancer.ntx               := 10
geotagbalancer.threshold         := 5
graceperiod                      := 86400
groupbalancer                    := off
groupbalancer.ntx                := 18
groupbalancer.threshold          := 5
groupmod                         := 8
groupsize                        := 4
headroom                         := 50.00 GB
quota                            := off
scaninterval                     := 1814400
stat.converter.active            := 0

(Franck Eyraud) #2

Hi Pete,

We had such a situation once: some files were failing to drain for some reason (one cause was outdated .xsmap files).

You can inspect the drain situation with the command eos fs ls -d, which will show you the useful information of how many files remain. Then I think the right way to extract the list of affected files is eos fs dumpmd <fsid> --path

You can then investigate why these files fail to drain, and potentially convert them manually so that they are moved elsewhere.
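A rough sketch of that workflow, as a sequence of commands (fsid 59 is taken from the question above for illustration; the exact output format of dumpmd and the layout argument to eos file convert are assumptions, so check the help output on your own instance first):

```shell
# Show drain progress per filesystem, including the number of files remaining
eos fs ls -d

# Dump the paths of the files still resident on one of the stalling filesystems
# (59 is one of the fsids from the question; adjust for your case)
eos fs dumpmd 59 --path > /tmp/fs59_files.txt

# Inspect each remaining file to see why it will not drain, and optionally
# rewrite it elsewhere with a manual conversion.
# NOTE: the layout/space arguments to "eos file convert" below are an
# assumption; consult "eos file convert --help" for the exact syntax.
while read -r path; do
    eos file info "$path"
    # eos file convert "$path" replica:2 default
done < /tmp/fs59_files.txt
```

The conversion line is left commented out on purpose: it is usually worth looking at a few eos file info reports first, since the failures may share a single cause (like the stale .xsmap files mentioned above) that can be fixed more directly.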