I have a few questions about the group balancing activity. It has been running on our instance for the last 2 weeks without particular issues, but some details puzzle us.
First, how should we understand the threshold parameter? The documentation says:

"To avoid oscillations a threshold parameter defines when group balancing stops e.g. the deviation from the average in a group is less then the threshold parameter."

But we observe that balancing is only triggered when the threshold is set much lower than what we would expect from that sentence. For instance, on our instance group balancing stopped with threshold=1, although groups were filled between 84.6% and 89.3% (not counting the special case below) while the instance-wide average is around 86.3%. To keep group balancing going, we would have to lower the threshold below 1% (0.9%).
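To make sure we are not misreading that sentence, here is the interpretation we assumed, applied to the numbers above. This is only a minimal sketch in Python of our reading of the documentation, not of the actual EOS implementation:

```python
# Our reading of the documented stop condition, applied to the numbers
# quoted above (illustration only, not the actual EOS code).
avg_fill = 86.3            # instance-wide average fill [%]
group_fill = {
    "default.43": 84.6,    # least filled group
    "default.5": 89.3,     # most filled group (special case default.8 excluded)
}
threshold = 1.0            # threshold parameter [%]

for name, fill in group_fill.items():
    deviation = abs(fill - avg_fill)
    verdict = "keep balancing" if deviation > threshold else "stop"
    print(f"{name}: deviation {deviation:.1f}% -> {verdict}")

# With this interpretation both groups deviate by more than 1% from the
# average, so we would expect balancing to continue rather than stop.
```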
Second, we have noticed that after the group balancing activity all groups are well balanced (at least the less full ones), except the fullest one, which is more than 2% fuller than the second fullest (91% vs 89%).
# eos group ls | sort -k6h
┌──────────┬────────────────┬────────────┬──────┬────────────┬────────────┬────────────┬──────────┬──────────┬──────────┐
│type      │            name│      status│ N(fs)│ dev(filled)│ avg(filled)│ sig(filled)│ balancing│   bal-shd│ drain-shd│
└──────────┴────────────────┴────────────┴──────┴────────────┴────────────┴────────────┴──────────┴──────────┴──────────┘
 groupview        default.43           on     32        25.66        84.66        14.59       idle          0          0
 groupview        default.29           on     32        24.69        84.69        13.88       idle          0          0
 groupview        default.36           on     32        26.69        84.69         5.00       idle          0          0
 groupview        default.41           on     32        24.69        84.69        14.54       idle          0          0
 groupview        default.32           on     32        24.72        84.72        14.46       idle          0          0
 groupview        default.27           on     32        24.75        84.75        14.32       idle          0          0
 ...
 groupview         default.2           on     46        19.50        88.50         7.92       idle          0          0
 groupview        default.21           on     46        17.70        88.70         6.89       idle          0          0
 groupview        default.14           on     46        14.28        89.28         5.90       idle          0          0
 groupview         default.5           on     46        14.30        89.30         6.23       idle          0          0
 groupview         default.8           on     46        11.63        91.63         5.33       idle          0          0
We could check, by counting the source groups in the GroupBalancer.log file, that this group default.8 is selected as a source for balancing only about half as often as the other groups:
  47925 src_group=default.8
  55252 src_group=default.18
  72864 src_group=default.10
  95319 src_group=default.22
  95535 src_group=default.17
  95551 src_group=default.5
  95590 src_group=default.21
  95782 src_group=default.14
  96433 src_group=default.2
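For reference, a small script along these lines reproduces the counting. It assumes the GroupBalancer.log lines carry src_group=<name> tokens, as they do on our version; the exact log format may differ elsewhere:

```python
# Count how often each group appears as a balancing source in the log
# (assumes whitespace-separated "src_group=<name>" tokens).
from collections import Counter

counts = Counter()
with open("GroupBalancer.log") as log:
    for line in log:
        for token in line.split():
            if token.startswith("src_group="):
                counts[token] += 1

for token, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{n:7d} {token}")
```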
So there is probably an inconsistency in the group balancing algorithm, which does not select the fullest group often enough. Since our version is old (4.5.15), maybe this has already been fixed, but in case it is not, it might be interesting to look into.
Even if it has been fixed, we do not plan to upgrade soon. Would there be a correct way to balance (i.e. drain) this group manually? For instance, finding the biggest files in this group and moving them out by hand.
As a workaround, I have thought of temporarily disabling this group for writing, if possible. Using the Geo scheduler, I understand that this could be done this way:
eos geosched disabled add JRC plct "default.8"
Is this advisable? Are there any side effects?
One last small comment: in the group ls output, we noticed that for every group the decimal part is the same in both the dev(filled) and avg(filled) columns (e.g. 11.63 / 91.63 for default.8), which is very unlikely to reflect the real values. Is this only a display error in the output? In some cases it might help to have the exact value. We observed this in both v4.5.15 and v4.6.8.