OK, thank you for the clarification, I wasn't aware (or had forgotten) that we could already test the new balancer on version 5.1…
Regarding our issue, we haven't managed to reproduce it yet, but we observed an interesting behaviour in the logs. After a threshold change, and until the next one, the GetUnbalancedGroups
messages disappeared, whereas they usually repeat every minute. For instance, below I am only showing group default.68
(but it was the same for all groups) and the config threshold changes: just after the threshold was changed from 85 to 80, only one "update balancer stats"
event occurred, then none until we changed the threshold back to 84.9, at which point the activity went back to normal. After this I gradually decreased it to 80 again and it didn't happen a second time. Not sure if this helps you understand what might have happened:
240912 10:49:00 time=1726130940.553063 func=Balance level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=85.00
240912 10:49:00 time=1726130940.560669 func=GetUnbalancedGroups level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.96 threshold=85.00
240912 10:50:01 time=1726131001.409279 func=Balance level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=85.00
240912 10:50:01 time=1726131001.418496 func=GetUnbalancedGroups level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.96 threshold=85.00
240912 10:51:03 time=1726131063.287049 func=Balance level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=85.00
240912 10:51:03 time=1726131063.295449 func=GetUnbalancedGroups level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.96 threshold=85.00
240912 10:51:19 time=1726131079.238545 func=HandleProtobufRequest level=INFO logid=2da47130-70e4-11ef-99d1-48df374dec7c unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f7039bfe700 source=ProcInterface:206 tident=root.67089:1531@localhost sec=sss uid=0 gid=0 name=daemon geo="JRC" cmd_proto={"space":{"config":{"mgmspaceName":"default","mgmspaceKey":"space.balancer.threshold","mgmspaceValue":"80"}}}
240912 10:51:19 time=1726131079.238904 func=SetConfigValue level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f7039bfe700 source=QuarkDBConfigEngine:369 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="store config" key="/config/jeodpp/space/default#balancer.threshold" val="80"
240912 10:51:19 time=1726131079.238952 func=PublishConfigChange level=INFO logid=99d3f236-6a86-11ef-a675-48df374dec7c unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f7039bfe700 source=IConfigEngine:215 tident=<service> sec= uid=0 gid=0 name= geo="" msg="publish configuration change" key="global:/config/jeodpp/space/default#balancer.threshold" val="80"
240912 10:52:03 time=1726131123.972236 func=Balance level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=80.00
240912 10:52:03 time=1726131123.979119 func=GetUnbalancedGroups level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.96 threshold=80.00
240912 11:14:25 time=1726132465.213404 func=HandleProtobufRequest level=INFO logid=67bf66f6-70e7-11ef-9be1-48df374dec7c unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f2f72bb8700 source=ProcInterface:206 tident=root.72662:2245@localhost sec=sss uid=0 gid=0 name=daemon geo="JRC" cmd_proto={"space":{"config":{"mgmspaceName":"default","mgmspaceKey":"space.balancer.threshold","mgmspaceValue":"84.9"}}}
240912 11:14:25 time=1726132465.213911 func=SetConfigValue level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f2f72bb8700 source=QuarkDBConfigEngine:369 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="store config" key="/config/jeodpp/space/default#balancer.threshold" val="84.9"
240912 11:14:25 time=1726132465.213969 func=PublishConfigChange level=INFO logid=99d3f236-6a86-11ef-a675-48df374dec7c unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f2f72bb8700 source=IConfigEngine:215 tident=<service> sec= uid=0 gid=0 name= geo="" msg="publish configuration change" key="global:/config/jeodpp/space/default#balancer.threshold" val="84.9"
240912 11:15:17 time=1726132517.754109 func=Balance level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=84.90
240912 11:15:17 time=1726132517.761674 func=GetUnbalancedGroups level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.95 threshold=84.90
240912 11:16:18 time=1726132578.494337 func=Balance level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=84.90
240912 11:16:18 time=1726132578.501152 func=GetUnbalancedGroups level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.95 threshold=84.90
240912 11:17:20 time=1726132640.692484 func=Balance level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=84.90
240912 11:17:20 time=1726132640.698395 func=GetUnbalancedGroups level=INFO logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.95 threshold=84.90
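For reference, the two threshold changes visible in the HandleProtobufRequest entries above were done with the usual space config command; I believe the equivalent CLI calls are the ones below (the key name is taken from the cmd_proto in the log, and the status line is just how I double-check the applied value):

eos space config default space.balancer.threshold=80
eos space config default space.balancer.threshold=84.9
eos space status default | grep balancer.threshold   # check the currently applied value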
How do we increase the thread pool for balance jobs? We might get more efficient balancing if we increase the default values (the current output is below, followed by my guess at the commands):
ALL balancer info pool=balance min=10 max=100 size=100 queue_sz=986 space=default
ALL tracker info tracker=balance size=1086
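From what I understand of the new balancer, the pool and queue limits above should be tunable per space with something like the commands below, but I am not sure of the exact key names (max-thread-pool-size / max-queue-size is my guess based on the defaults shown, 100 threads and a ~1000-entry queue), so please correct me if they differ:

eos space config default space.balancer.max-thread-pool-size=200   # guessed key name
eos space config default space.balancer.max-queue-size=2000        # guessed key name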
Edit: you may remember that our instance contains a very large number of files (about 2 billion). Could this behaviour be linked to the fact that it takes a long time to list the files to be balanced? The same thing occurred again after the restart (done to fix the kerberos cache option mentioned above), when the balancer was switched back on with the same low threshold: a single GetUnbalancedGroups event per group, a very low (<1 Hz) BalanceStarted rate, and zeros in the bal.shd columns, 15 minutes after the change. Maybe waiting long enough will unlock it…
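In case it helps, these are roughly the commands I am using to watch the activity (assuming our standard MGM log location under /var/log/eos/mgm/; the bal-shd figures come from the group listing):

grep GetUnbalancedGroups /var/log/eos/mgm/xrdlog.mgm | tail   # per-group collection events
eos ns stat | grep -i balance                                 # Balance* counter rates
eos group ls                                                  # bal-shd column per group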
Edit2: Indeed, this morning we observed that it had started, apparently about 20 minutes after enabling it. In the past we had some issues with the high number of files, but that was much improved, and with the old balancer on version 5.1.22 balancing started immediately, so I was expecting the same from this new implementation. But waiting 20 minutes for balancing to start is fine, and since it doesn't come with the additional latency on the instance that we sometimes saw in version 4, it isn't a problem. Sorry for the false alarm.