Important notes for the eos-5.2.0 release

Hi Franck,

Yes, indeed the balancing column will actually disappear in 5.3.0, and one should just rely on the last column to figure out whether there is any activity happening or not. This was just syntactic sugar which, in the new implementation, is not very clean to implement.

For the eos ns stat output, these are new tags that differentiate the new implementation of the balancer from the old one. We could not reuse the same tags because in 5.2.0 one could switch from one implementation to the other, and we wanted to see the effect and performance of the new one separately. In 5.3.0 only the Balancer.... tags are available.
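As an illustration, the new counters can be listed directly from the namespace statistics, assuming the tags keep the Balancer prefix mentioned above:

eos ns stat | grep Balancer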

Concerning the parameter names, nothing has changed so there is no need to modify the configuration.
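For example, the balancer threshold is still set with the same key as before (the value 80 and the space name default are just for illustration):

eos space config default space.balancer.threshold=80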

Cheers,
Elvin

Thank you for the detailed explanations. I reckon the balancing state information is not essential, and it is better to remove it if it is difficult to implement now. But it was particularly handy for quickly checking the balancing situation after changing the parameters, so when I didn’t see it after the upgrade, I wondered if there was an issue :sweat_smile:.

You mention that it is possible to choose the balancer implementation; how can this be done, out of curiosity? I’m not sure we will try that if the new one works well, but I had a strange situation where I changed the threshold to a value involving many more groups to balance (5% below the maximum deviation) and the balancing almost stopped completely, doing very few transfers, with all bal-shd columns at 0. When I increased the parameter a bit, it started again correctly. Is this behaviour explainable?

Another question: you mention version 5.3, but there is not yet any candidate to install, right? I do not see any package in the testing repository, not even in el-9.

Hi Franck,

The old implementation was actually removed in 5.2.0, so the different performance tags were introduced much earlier, in 5.1.0, where one could change the balancer engine. The change in 5.1.0 was done by setting the new_balancer=on attribute on the space, which would automatically turn the old balancer off, i.e. balancer=off. This option is no longer there in 5.2.0, and the attribute balancer=on/off is the only one used and refers to the latest balancer implementation.
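Roughly, the switch looked like the commands below (a sketch on a space named default; the exact key spelling for the 5.1.x attribute may differ slightly, so check the documentation of your version):

eos space config default space.new_balancer=on   # 5.1.x only: enables the new engine and turns the old balancer off
eos space config default space.balancer=on       # 5.2.0 and later: the only remaining attribute, refers to the new FsBalancer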

We have been running the new FsBalancer without issues in our production instances for a long time and haven’t experienced such a behavior. So, in case this is reproducible, you can collect some logs from the MGM, for example grepping for GetUnbalancedGroups, and then you will see exactly which are the candidate groups for balancing. It also depends on the size of the thread pool for balance jobs, which you can inspect with eos ns | grep balance. This should already provide you with more information to infer whether or not things are working as expected.
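As a concrete illustration, assuming the standard MGM log location (adjust the path to your deployment):

grep GetUnbalancedGroups /var/log/eos/mgm/xrdlog.mgm | tail -n 20   # candidate groups for balancing
eos ns | grep balance                                               # balance thread pool and tracker status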

Indeed, 5.3.0 is not yet released, but it will come out in the coming period.

Cheers,
Elvin

OK, thank you for the clarification, I wasn’t aware (or had forgotten) that we could already test the new balancer on version 5.1…

Regarding our issue, we haven’t managed to reproduce it yet, but an interesting behaviour shows up in the logs. After the threshold change, and until the next one, the GetUnbalancedGroups messages disappeared, whereas they usually repeat every minute. For instance, here I am only showing group default.68 (but it was the same for all groups) and the config threshold changes: just after the threshold was changed from 85 to 80, only one "update balancer stats" event occurred, then none until we changed the threshold back to 84.9, and the activity went back to normal. After this I gradually decreased it to 80 and this did not happen again. Not sure if this helps you understand what might have happened:

240912 10:49:00 time=1726130940.553063 func=Balance                  level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202                 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=85.00
240912 10:49:00 time=1726130940.560669 func=GetUnbalancedGroups      level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435                    tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.96 threshold=85.00
240912 10:50:01 time=1726131001.409279 func=Balance                  level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202                 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=85.00
240912 10:50:01 time=1726131001.418496 func=GetUnbalancedGroups      level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435                    tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.96 threshold=85.00
240912 10:51:03 time=1726131063.287049 func=Balance                  level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202                 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=85.00
240912 10:51:03 time=1726131063.295449 func=GetUnbalancedGroups      level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435                    tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.96 threshold=85.00
240912 10:51:19 time=1726131079.238545 func=HandleProtobufRequest    level=INFO  logid=2da47130-70e4-11ef-99d1-48df374dec7c unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f7039bfe700 source=ProcInterface:206              tident=root.67089:1531@localhost sec=sss   uid=0 gid=0 name=daemon geo="JRC" cmd_proto={"space":{"config":{"mgmspaceName":"default","mgmspaceKey":"space.balancer.threshold","mgmspaceValue":"80"}}}
240912 10:51:19 time=1726131079.238904 func=SetConfigValue           level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f7039bfe700 source=QuarkDBConfigEngine:369        tident= sec=(null) uid=99 gid=99 name=- geo="" msg="store config" key="/config/jeodpp/space/default#balancer.threshold" val="80"
240912 10:51:19 time=1726131079.238952 func=PublishConfigChange      level=INFO  logid=99d3f236-6a86-11ef-a675-48df374dec7c unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f7039bfe700 source=IConfigEngine:215              tident=<service> sec=      uid=0 gid=0 name= geo="" msg="publish configuration change" key="global:/config/jeodpp/space/default#balancer.threshold" val="80"
240912 10:52:03 time=1726131123.972236 func=Balance                  level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202                 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=80.00
240912 10:52:03 time=1726131123.979119 func=GetUnbalancedGroups      level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435                    tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.96 threshold=80.00
240912 11:14:25 time=1726132465.213404 func=HandleProtobufRequest    level=INFO  logid=67bf66f6-70e7-11ef-9be1-48df374dec7c unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f2f72bb8700 source=ProcInterface:206              tident=root.72662:2245@localhost sec=sss   uid=0 gid=0 name=daemon geo="JRC" cmd_proto={"space":{"config":{"mgmspaceName":"default","mgmspaceKey":"space.balancer.threshold","mgmspaceValue":"84.9"}}}
240912 11:14:25 time=1726132465.213911 func=SetConfigValue           level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f2f72bb8700 source=QuarkDBConfigEngine:369        tident= sec=(null) uid=99 gid=99 name=- geo="" msg="store config" key="/config/jeodpp/space/default#balancer.threshold" val="84.9"
240912 11:14:25 time=1726132465.213969 func=PublishConfigChange      level=INFO  logid=99d3f236-6a86-11ef-a675-48df374dec7c unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f2f72bb8700 source=IConfigEngine:215              tident=<service> sec=      uid=0 gid=0 name= geo="" msg="publish configuration change" key="global:/config/jeodpp/space/default#balancer.threshold" val="84.9"
240912 11:15:17 time=1726132517.754109 func=Balance                  level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202                 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=84.90
240912 11:15:17 time=1726132517.761674 func=GetUnbalancedGroups      level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435                    tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.95 threshold=84.90
240912 11:16:18 time=1726132578.494337 func=Balance                  level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202                 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=84.90
240912 11:16:18 time=1726132578.501152 func=GetUnbalancedGroups      level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435                    tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.95 threshold=84.90
240912 11:17:20 time=1726132640.692484 func=Balance                  level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsBalancer:202                 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="update balancer stats" threshold=84.90
240912 11:17:20 time=1726132640.698395 func=GetUnbalancedGroups      level=INFO  logid=static.............................. unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f6fd5bf4700 source=FsView:3435                    tident= sec=(null) uid=99 gid=99 name=- geo="" msg="collect group info" group=default.68 max_dev=22.95 threshold=84.90

How do we increase the thread pool for balance jobs? We might get more efficient balancing if we increase the default values:

ALL      balancer info                    pool=balance        min=10  max=100  size=100  queue_sz=986 space=default
ALL      tracker info                     tracker=balance size=1086

Edit: maybe you remember that our instance contains a lot of files (many, many, i.e. 2 billion). Could it be that this behaviour is linked to the fact that it takes a lot of time to list the files to be balanced? The same thing occurred again after the restart (to fix the kerberos cache option mentioned above), when the balancer was put back to on with the same low threshold… i.e. one single GetUnbalancedGroups event per group, a very low (<1Hz) BalanceStarted rate, and zeros in the bal.shd columns, 15 minutes after the change. Maybe waiting long enough will unlock it…

Edit2: Indeed, this morning I observed that it had started, seemingly 20 minutes after enabling it. In the past we had some issues with the high number of files, but that was much improved, and with the old balancing on version 5.1.22, balancing was starting immediately, so I was expecting the same with this new implementation. But waiting 20 minutes for balancing to start is OK, and since it doesn’t come with additional latency on the instance like it sometimes did in version 4, it does not disturb us. Sorry for the false alarm.

Hi Franck,

There are two things that could explain such a behavior. One is the delay to load the file identifiers corresponding to the file systems that need to be balanced. This is required since we randomly pick files, and for this the full list needs to be loaded on the MGM from QDB. This is done asynchronously, outside any locks, so it does not impact the performance of the rest of the MGM.
Secondly, it could be that the balancer actually started doing some work, but all the slots are busy with jobs, and the balancer is just waiting to submit more work; during this time the GetUnbalancedGroups method is not called.

If you can send me the output from the MGM logs during the time interval mentioned in your previous comment, grepping for FsBalancer, I might be able to provide a more accurate explanation. Either way, this is part of the expected way of working, and in general the priority of the balancer is not high, so this should not be an issue from an operational point of view.
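Something along these lines should do, again assuming the standard MGM log location; the date prefix is just an example matching the hours shown in your log excerpt:

grep FsBalancer /var/log/eos/mgm/xrdlog.mgm | grep "^240912 1[01]:"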

You can set the following configuration on the corresponding space to modify the size of the thread pool used for balancing jobs:

space config <space-name> space.balancer.max-thread-pool-size=<#>     : configure the maximum number of threads to be used in the balancer thread pool [ default=100 (threads) ]
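For example, to double the default pool size on the default space (200 is just an illustrative value):

eos space config default space.balancer.max-thread-pool-size=200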

Cheers,
Elvin

Dear @esindril,

Sorry I didn’t write to you earlier.

I have gathered logs when changing the balancer threshold. I can confirm that when introducing a large number of groups into the balancing scheduler (by lowering the balancer threshold), the pause before the balancer resumes its work can be long. Here is a log from last week, where the balancing activity was reduced (but not completely stopped) for around 45 minutes, in case it can help you understand what is going on. But again, it is just puzzling, not critical at all; balancing resumes after all.

Log file (Access with Eos@Mgm20240923; last time you had to open it in a private window to download it, also I had it working in Chrome but not Firefox)

I also have to report an issue with the FSTs after they were upgraded to v5.2.24: it has happened to 3 of them that they hang and are seen as offline by the MGM, although the service is still running, and when stopping it, it ends up being killed, so maybe some internal deadlock? I will send the stack traces and log files that I managed to extract to eos-support by email, hoping they can help.

Hi Franck,

For anyone reading this post: the issue concerning the FSTs hanging or crashing has been identified. It is triggered on instances that have the FST asynchronous close functionality enabled and EOS FUSE mountpoints accessing that instance. The problem is fixed in version 5.2.26 and requires an update of the FSTs.

Cheers,
Elvin