We want to report two incidents that we had over the last few days with the MGM in our QuarkDB setup:
For a long period (20 minutes) the MGM seems to hang, or at least has difficulty accepting new connections. As a result, after a while it considers all FSTs offline (because they can’t connect to it), and the clients that are still connected to the MGM get “Network unreachable” when reading, or “No space left on device” when writing.
Things settle down by themselves after a while. We still need to figure out whether these incidents were caused by some unusual activity from our users. We didn’t see anything in particular, except perhaps many deletions at the time of, or just after, the problem, and even those don’t seem that numerous (100K to 300K). Could that be linked?
All our servers are running 4.4.23. If this is a known issue for you, and perhaps already fixed in a later version, could you point us to the version that fixes it? We will probably plan an upgrade anyway, and would like to pick the correct version.
PS: Another episode occurred today, but it was much shorter (2-3 minutes) and showed no sign of file deletions.