MGM hangs for some minutes (with QuarkDB namespace & config)

Hello,

We want to report two incidents that we had over the last few days with the MGM in our QuarkDB setup:

For a long period (20 minutes) the MGM seems to hang, or at least has difficulty accepting new connections. As a result, after a while it considers all FSTs offline (because they can’t connect to it), and the clients that are still connected to the MGM get “Network unreachable” when reading, or “No space left on device” when writing.

Things settle down by themselves after a while. We still need to figure out whether these incidents were caused by some unusual activity from our users. We didn’t see anything particular, except perhaps many deletions at the time of the problem or just after it, but the numbers don’t seem that large (100K to 300K). Could that be linked?

All our servers are running 4.4.23. If this is a known issue for you, and maybe already solved in a later version, could you point out the version that might solve it? We will probably plan an upgrade anyway; we just need to pick the correct version.

Thank you

PS: Another episode just occurred today, but much shorter (2-3 minutes) and with no sign of file deletion.

Hi Franck,

We’ve also seen some instabilities in the presence of a high number of deletions, but for the moment we don’t have a clear picture of where the bottleneck comes from. Would it be possible to share with us the MGM and MQ logs and one FST log from when such an incident happens?

You can send me the links via email if you want.

Thanks,
Elvin

OK, I will extract the logs and send them to you, hoping that it can help you find something.