Correct way to add new FST nodes

A thread to discuss the correct way to add new nodes to an instance, because we ran into an issue this morning while adding our first node in a while (and many EOS versions).

While registering the disks (using the eos fs add command), we enabled the node on the MGM (using eos node set ... on) and the MGM became blocked (every command and access hung, except the eos ns commands).
It took 3 tries to restart it correctly, after stopping the new node and disabling the balancer. During the first 2 restarts the MGM blocked again in the same way as in the initial incident.
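For context, the sequence we ran was roughly of the following form (the UUID, host, port and mountpoints are placeholders, and the exact eos fs add arguments may differ slightly between versions):

eos fs add <fs-uuid> <new-fst-host>:1095 /data01 default
eos fs add <fs-uuid> <new-fst-host>:1095 /data02 default
...
eos node set <new-fst-host>:1095 on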

We didn’t find anything in the log, except messages of the following kind at the time of the first block (but not at the 2 subsequent restarts), concerning the newly added node; they could just as well be a consequence of the block:

191205 09:55:06 time=1575536106.127250 func=open                     level=NOTE  logid=ebe3b860-173c-11ea-8f95-48df374dec7c unit=mgm@s-jrciprjeop214p.cidsn.jrc.it:1094 tid=00007f7f302e4700 source=IProcCommand:66                tident=root.17287:876@s-jrciprjeos010p sec=sss   uid=2 gid=2 name=daemon geo="JRC" command not ready, stall the client 5 seconds

We were then able to add 5 nodes correctly by registering the disks while the EOS FST daemon was stopped, enabling the node, and only then starting the FST daemon, all with the balancer disabled.
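As a rough sketch, the sequence that worked was along these lines (the placeholders and the systemd unit name are indicative and have to be adapted to the local setup):

# on the MGM: make sure the balancer is disabled for the target space
eos space config default space.balancer=off
# on the new FST host: keep the FST daemon stopped
systemctl stop eos@fst
# on the MGM: register each disk, then enable the node
eos fs add <fs-uuid> <new-fst-host>:1095 /data01 default
eos node set <new-fst-host>:1095 on
# on the new FST host: only now start the FST daemon
systemctl start eos@fst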

Our questions are:

  • do you have an idea where this block could come from?
  • could the balancer struggle when new disks are added, because it sees many files to balance, and block the MGM? If yes, what is the gentle way to activate the balancer? We had increased the threshold to 70 before adding the node (see the command sketch after this list).
  • is there a recommended moment to issue the eos node set ... on command?
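For reference, the balancer settings mentioned above are changed with commands of this form (assuming the space is called default):

eos space config default space.balancer.threshold=70
eos space config default space.balancer=on    # or off to disable it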

The version is 4.5.17 on all FSTs of the instance, and the MGM is on 4.5.15.


Another observation about adding disks is that the values space.scaninterval and space.headroom are not propagated to the newly added disks.

And since neither eos node config <nodename> scaninterval nor headroom allows setting these values on all filesystems of a node (error: the specified key is not known - consult the usage information of the command), we need to set them individually on each FS. Is that correct, or are we missing something?
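What we end up doing is setting the values on every filesystem of the node individually, along these lines (the filesystem ids come from eos fs ls <nodename>, and the values are placeholders):

# repeat for each <fsid> of the new node
eos fs config <fsid> scaninterval=<seconds>
eos fs config <fsid> headroom=<size>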

could the balancer struggle when new disks are added, because it sees many files to balance, and block the MGM?

After having the balancer active for some days, and slowly decreasing the threshold depending on the filling status of the groups so that only 4 or 5 of them are in balancing, we observe that such decreases in the threshold (of 1%, even 0.3%) make the MGM unresponsive for some seconds, maybe up to around one minute. The average command execution time goes up from 1 ms to 400 ms within 2 minutes, then goes back down. In addition, the MGM memory footprint climbs abruptly by 1 or 2 GB. Is that a known/expected behaviour?
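For completeness, we follow these figures with the namespace commands, which stay responsive during the slowdown (the exact fields to look at may differ per version):

eos ns         # shows the MGM memory footprint
eos ns stat    # shows per-command execution time statistics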

We didn’t try to decrease it by more than that, since we are wondering whether the full block of the MGM mentioned in the first post might happen again. Indeed, at the time of adding the node, the balancer was active, and the threshold meant that all groups would go into balancing. We have 48 groups, each of them containing between 26M and 42M files for 200 TB to 300 TB. Maybe these figures have an impact on how the balancer behaves?

So, regarding the initial question of this post, could “disable the balancer, or set the threshold to 100% before adding a new node” be the answer? But then, what happens if the threshold is set to too low a value with a very large number of potential files to balance?

Hello,

I’m reviving this thread because we had a very similar issue again in the last few days.

While adding some disks from a new host, the instance became unavailable, with the famous message command not ready, stall the client 5 seconds; nothing was possible except eos ns commands.

This happened twice in a row. The first FST had added 5 disks (with a 5-second delay between each eos fs add command) before this happened. We realized that the balancer was active at that time, and thought this was the root cause.

The instance was restarted (with trouble; as in this thread, we stopped many fusex clients to avoid the instance being in a blocked state just after the restart).

But the day after, another FST, this time with the balancer disabled, caused the same issue. We again needed to restart the MGM twice, but this time there was no need to switch off the fusex clients.

Reading back through this thread, we see that last year we managed by adding the disks while the FST daemon was shut down. Could this be a safe way to avoid bringing the instance down again? We have several FSTs to add in the coming days.
During the last year, we also added some nodes successfully without needing to shut the FST down.

Or do you see any other explanation for this incident? It seems that there is some deadlock, but the logs do not help us at all in understanding which components are in conflict. It seems that the balancer is not involved, since we also got the issue while it was disabled.

Can someone explain the procedure generally used to add nodes & disks to an instance?

Unfortunately, we haven’t yet been able to plan the upgrade of the instance, so we are still stuck with versions 4.5.15/4.5.17.

Hi Franck,

Unfortunately, the version that you are running is quite old, and we have made a number of improvements over time concerning these types of issues. If I remember correctly, our ops team also saw such problems in the past, but nothing new has been reported in the last ~6 months.

A few things that have improved in the meantime and come to mind are:

  • proper prefetching and random selection of files to balance on the MGM
  • optimized handling of the configuration in QDB so that it is more efficient and doesn’t block the MGM in case the configuration size is large
  • various improvements to the locking inside the MGM when it comes to adding new nodes/fs

If this happens again, it would help a lot to have a stack trace of the MGM process that we could inspect to better identify the root issue. For this you can use the eu-stack -p <pid> tool, which is very fast, or the usual gdb option is still available - though with the old version of EOS that you are running, I fear gdb (from devtoolset-8) will crash …
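For example, something like the following captures all the thread stacks to a file (where <mgm_pid> is the pid of the MGM xrootd process):

eu-stack -p <mgm_pid> > /tmp/mgm-stack-$(date +%s).txt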

Cheers,
Elvin

Hi Elvin,

Thank you for your answer and for confirming that the issue we had is probably fixed in recent versions (the QDB configuration optimization and the locking improvements could indeed help in that case).
We definitely need to plan an upgrade. OK for the stack trace; we will try that in case of another issue (hopefully there won’t be one).

I was thinking about the improvement linked to the QDB configuration: is it possible that ours is big and could cause some blocking? How can we check its size?

So regarding the way to add FSTs, it seems that the procedure we use should be correct, right?

Cheers,
Franck

Hi Franck,

You can check the size of the configuration by running the following command:
redis-cli -p 7777 hgetall eos-config:default > /tmp/dump.cfg

Then look at the size of the file: that is the size of your configuration. Concerning the FST procedure, yes, everything looks correct. I would add a longer sleep (10-20 s) between the fs add commands, but otherwise all looks good.
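For example (the dump path is the one from the command above, and the uuid/host/mountpoint values in the fs add lines are only placeholders):

ls -lh /tmp/dump.cfg    # the file size is roughly the configuration size
# spacing out the registrations, e.g.:
eos fs add <fs-uuid-1> <new-fst-host>:1095 /data01 default; sleep 15
eos fs add <fs-uuid-2> <new-fst-host>:1095 /data02 default; sleep 15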

Cheers,
Elvin