Some questions about group balancing

Hello,

I have a few questions about the group balancing activity. It has been running on our instance for the last 2 weeks, without any particular issue, but some details puzzle us.

1

First, how should we understand the threshold parameter? The documentation says:

To avoid oscillations a threshold parameter defines when group balancing stops e.g. the deviation from the average in a group is less then the threshold parameter.

But we observe that balancing is only triggered when the threshold is set much lower than what we would expect from the above sentence. For instance, on our instance the group balancing stopped with threshold=1, but groups were filled between 84.6% and 89.3% (not counting the special case below), whereas the average over the whole instance is around 86.3%. To go on with group balancing, we would need to change the threshold to a value below 1% (0.9%).
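
To make our expectation explicit, this is the rule we assumed from the documentation (just our reading, not the actual EOS code), checked against the two extreme fill values quoted above:

    # our reading of the documented rule (NOT the actual EOS code):
    # keep balancing a group while |fill - average| is larger than the threshold
    avg=86.3; threshold=1
    for fill in 84.6 89.3; do
      awk -v f="$fill" -v a="$avg" -v t="$threshold" 'BEGIN {
        d = (f > a) ? f - a : a - f
        printf "fill=%.1f%% deviation=%.1f%% -> %s\n", f, d, (d > t) ? "expect balancing" : "within threshold"
      }'
    done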

2

Second, we have noticed that after the group balancing activity all groups are well balanced (at least the less full ones), except the fullest one, which is more than 2% fuller than the second fullest (91% vs 89%).

# eos group ls | sort -k6h
┌──────────┬────────────────┬────────────┬──────┬────────────┬────────────┬────────────┬──────────┬──────────┬──────────┐
│type      │            name│      status│ N(fs)│ dev(filled)│ avg(filled)│ sig(filled)│ balancing│   bal-shd│ drain-shd│
└──────────┴────────────────┴────────────┴──────┴────────────┴────────────┴────────────┴──────────┴──────────┴──────────┘
 groupview        default.43           on     32        25.66        84.66        14.59       idle          0          0 
 groupview        default.29           on     32        24.69        84.69        13.88       idle          0          0 
 groupview        default.36           on     32        26.69        84.69         5.00       idle          0          0 
 groupview        default.41           on     32        24.69        84.69        14.54       idle          0          0 
 groupview        default.32           on     32        24.72        84.72        14.46       idle          0          0 
 groupview        default.27           on     32        24.75        84.75        14.32       idle          0          0 
...
 groupview         default.2           on     46        19.50        88.50         7.92       idle          0          0 
 groupview        default.21           on     46        17.70        88.70         6.89       idle          0          0 
 groupview        default.14           on     46        14.28        89.28         5.90       idle          0          0 
 groupview         default.5           on     46        14.30        89.30         6.23       idle          0          0 
 groupview         default.8           on     46        11.63        91.63         5.33       idle          0          0 

We could check, by counting the source groups in the GroupBalancer.log file, that the group default.8 is selected only about half as often as the other groups as a source for balancing:

  47925 src_group=default.8
  55252 src_group=default.18
  72864 src_group=default.10
  95319 src_group=default.22
  95535 src_group=default.17
  95551 src_group=default.5
  95590 src_group=default.21
  95782 src_group=default.14
  96433 src_group=default.2
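
For the record, a count like the one above can be produced with a one-liner along these lines (the log path is only an assumption based on the usual /var/log/eos/mgm location; adjust it to wherever your MGM writes GroupBalancer.log):

    # count how often each group appears as a balancing source, least selected first
    grep -o 'src_group=[^ ]*' /var/log/eos/mgm/GroupBalancer.log | sort | uniq -c | sort -n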

So there is probably an inconsistency in the group balancing algorithm, which doesn’t select the fullest group often enough. Since our version is old (4.5.15), maybe this has already been fixed, but in case it has not, it might be worth looking into.

Now, even if it is fixed, we do not plan to upgrade soon. Would there be a correct way to balance (i.e. drain) this group manually? For instance, finding the biggest files in this group and moving them out by hand.
As a workaround, I have also thought of temporarily disabling this group for writing, if possible. Using the Geo scheduler, I understand that this could be done this way:

eos geosched disabled add JRC plct "default.8"

Is it advisable? Are there any side effects?
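
If we go this way, I assume the setting can be listed and reverted later with the matching subcommands (a sketch rather than verified syntax, to be double-checked with eos geosched -h on our version):

    # assumed counterparts of 'disabled add' - verify against 'eos geosched -h'
    eos geosched disabled show JRC plct "default.8"
    eos geosched disabled rm JRC plct "default.8"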

3

One last small comment: in the group ls output, we noticed that the decimal part is always the same for every group in both the dev(filled) and avg(filled) columns (e.g. 11.63 vs 91.63), which is very unlikely to be true. Is this only a display error in the output? In some cases, it might help to have the exact value. We observed this in both v4.5.15 and v4.6.8.

Hi Franck,

Thank you for the detailed report and for the issues you raised. GroupBalancing is one of the next areas that I would like to tackle, in the sense of enhancing the monitoring information and improving the integration with the QDB namespace. Let me take your points one by one:

  1. Indeed this looks fishy. Just to get a better understanding, could you enable debugging on the MGM machine for a few tens of seconds (see the note after this list) and at the same time collect the following info from the MGM logs:
    tail -f /var/log/eos/mgm/xrdlog.mgm | grep "threshold="

  2. Our operators have not spotted this. I will look at our instances to see if there is something similar, and I will also review this code and come back to you - it does look like some bias in the selection algorithm.
    I don’t think the geo balancer would help in this case, since that one only takes the geotag into consideration, so unless you have your disks tagged with the same geotag as the group they belong to (which would be kind of weird) I don’t think this would work. The best option is a random selection of files from that group and triggering a conversion to a different group. This is also what the GroupBalancer does internally.

  3. Yes, there is something strange here as well. I see it also on our instances, but not for all entries; indeed, in the majority of cases the decimal part matches. I’ll have a look.
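
Regarding the debugging for point 1: on 4.x something along the following lines should raise and then restore the MGM log level, but please double-check the exact syntax with eos debug -h on your version:

    eos debug debug    # raise the MGM log level to debug
    # ... collect the grep output for a few tens of seconds ...
    eos debug info     # restore the usual log level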

Cheers,
Elvin

Thank you for your answer Elvin,

About point 2, I was not talking about the geo balancer (we do not use it, and I wasn’t aware that groups also have an associated geotag) to actively move data, but about temporarily changing the placement policy (used inside the Geo Scheduler, if I understand correctly) to prevent this group from being filled with new data, with the command eos geosched disabled add geotag plct "default.8".

I also thought about manual GroupBalancing by forcing the conversion of some selected files. Do you have any hints on how to best list the files in a group (potentially sorted by size, so that the biggest ones can be moved first)? Some search query in QuarkDB? Or directly listing them on the filesystems?

Hi Franck,

No, groups don’t have a geotag; sorry, my fault, I misread the command you suggested. Indeed, you can use this command to disable placement into that particular group.

There is no easy way to get the largest files from a particular group. What you can do is use eos-ns-inspect to get a dump, then grep for locations on fsids from that group and then sort them - all of this is a highly manual job …
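
Roughly along these lines - everything below is indicative: the eos-ns-inspect options, the dump field names (size=, locations=) and the fsids are placeholders you would need to adapt to your setup and version:

    # 1) dump the file metadata from QuarkDB (check eos-ns-inspect --help for the exact options)
    eos-ns-inspect scan --members <qdb-host>:7777 --password-file <qdb-password-file> > ns-dump.txt
    # 2) find the fsids belonging to the group, e.g. from the group column of 'eos fs ls'
    eos fs ls | grep -w 'default\.8'
    # 3) keep entries located on those fsids (here 101,102,103 as an example) and sort by size
    grep -E 'locations=[^ ]*\b(101|102|103)\b' ns-dump.txt \
      | sed -n 's/.*size=\([0-9]*\).*/\1 &/p' | sort -rn | head -20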

Cheers,
Elvin

Hi Franck,

Point 3 is now fixed by the following commit and will be available in 4.8.12:
https://gitlab.cern.ch/dss/eos/-/commit/a2a1513b50e0b7a82eb57440b0f599a78a6f7e1c

This also influences the way the selection process happens for point 1, but nevertheless, if you can provide the logs, that would help a bit in clearing things up.

Thanks for reporting it,
Elvin

Hi Elvin,

Happy that I could help identify a small issue.

The debug level was activated for 20 seconds. The threshold= lines contain the following kind of information; I’m not sure it is enough to understand the problem (the group name is not specified on the line; the groupbalancer.threshold parameter was 1 at the time):

diff=-0.01 threshold=0.01
diff=0.01 threshold=0.01
diff=0.02 threshold=0.01
diff=0.03 threshold=0.01
diff=0.05 threshold=0.01

Some other information about the groups that could help:

group=default.11 average=85.66
group=default.15 average=85.66
group=default.20 average=85.66
group=default.41 average=85.66
group=default.45 average=85.66
group=default.27 average=85.67
...
group=default.22 average=88.71
group=default.2 average=88.82
group=default.21 average=89.02
group=default.14 average=89.81
group=default.5 average=89.82
group=default.8 average=92.12

This also appears:

No groups under the average! 
New average calculated: 86.66 % 

So maybe the group balancer isn’t triggered because it only looks at the groups below the average (many, many groups are close to the lower bound of 85.66%), and not at the fact that some groups are much fuller than the average. In that case it would be normal that we need to lower the threshold again to go on with GroupBalancing.
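
To spell the numbers out, here is a tiny check of this reading (our interpretation, not the actual EOS code) applied to two of the groups listed above: with average=86.66 and threshold=1, a destination would need to be below 85.66% full and a source above 87.66%:

    printf 'group=default.11 average=85.66\ngroup=default.8 average=92.12\n' \
      | awk -v avg=86.66 -v thr=1 '{ split($2, a, "="); f = a[2] + 0;
          side = (f < avg - thr) ? "destination candidate" : (f > avg + thr) ? "source candidate" : "within threshold";
          print $1, f "%", side }'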

I took this from the GroupBalancer.log file, which seems more readable than the xrdlog.mgm file. I can provide extracts of both of them, but only via email, as I can’t upload them here. Let me know if you think this is necessary.

Thank you also for your suggestion about the file selection; we will do that if we consider it really critical, which is not yet the case.

Hi Franck,

Your analysis perfectly describes what is happening here. Unfortunately, the algorithm for group balancing is not as smart as it should be … quite the opposite, in fact. This will be an opportunity for me to do some more in-depth refactoring.

Thanks,
Elvin

Hi Elvin,

The GroupBalancer is very convenient anyway in the current version. There is always room for improvement, but it is difficult to foresee all cases; maybe addressing our case would make it less efficient for another one. In our case, most of the groups are filled close to the average, and just a few of them are quite a lot above it.

One lead could be to trigger the group balancing not only when one group is above the average and another is below it, but when one group is above OR one is below, so that it handles such a non-linear distribution of group fill ratios.

When trying to understand how the threshold was taken into account, I was looking at the fill ratio of the emptiest group (let’s say 70%) and of the fullest one (let’s say 90%), and so I was spontaneously expecting that the group balancing would trigger if the threshold were below 10% (half of 20%); but this assumed that the global average was around 80%, which was not the case.