We’ve had a few cases of the MGM’s metadata log disk filling up to the point where it could no longer perform a compactification. Unfortunately, EOS doesn’t handle this very well.
Looking at mgm/Master.cc, Master::Supervisor() has some code for handling writes to a nearly full metadata disk, but as far as I can see, the compactification code has no similar check. As a result, compactifications can fail, requiring an MGM restart and possibly a manual recovery of the mdlog files.
In one extreme case, I ended up in a situation where I had to copy the directory mdlog out of a file handle in /proc for the MGM thread, after the file had been unlinked from the filesystem.
I could write a check in the compactification section to handle this, although I’m leaning towards making the MGM write-stalling threshold configurable, or basing it on the current mdlog sizes. I propose setting the threshold at a percentage above the current metadata size. This means compactification would succeed for at least the current pass, which makes the condition easy to log and detect before it becomes an actual production issue, and allows earlier handling of situations that otherwise require heroic effort to recover from.
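To make the proposal concrete, a rough sketch of such a check might look like the following. This is not existing EOS code: the names HasCompactionHeadroom and ShouldStallWrites and the marginPercent parameter are hypothetical. It also assumes that the compacted copy is written to the same filesystem as the mdlog file and is, at worst, as large as the current file.

```cpp
#include <cstdint>
#include <filesystem>
#include <system_error>

// Hypothetical helper (not part of EOS): true when freeBytes can hold a
// worst-case compacted copy of the changelog plus marginPercent on top.
bool HasCompactionHeadroom(std::uintmax_t mdlogBytes,
                           std::uintmax_t freeBytes,
                           unsigned marginPercent)
{
  // required = mdlogBytes * (100 + marginPercent) / 100, with the
  // multiplication split so it does not overflow for large files.
  const std::uintmax_t required =
      mdlogBytes
      + (mdlogBytes / 100) * marginPercent
      + ((mdlogBytes % 100) * marginPercent) / 100;
  return freeBytes >= required;
}

// Hypothetical wrapper: decide whether writes should be stalled, based on
// the current size of one mdlog file and the space left on its filesystem.
// Assumption: if either value cannot be read, stalling is the safer choice.
bool ShouldStallWrites(const std::filesystem::path& mdlogFile,
                       unsigned marginPercent)
{
  std::error_code ec;

  const std::uintmax_t size = std::filesystem::file_size(mdlogFile, ec);
  if (ec) {
    return true;
  }

  // space() reports on the filesystem that contains the given path.
  const std::filesystem::space_info info =
      std::filesystem::space(mdlogFile, ec);
  if (ec) {
    return true;
  }

  return !HasCompactionHeadroom(size, info.available, marginPercent);
}
```

With a margin of 20, for example, a 100 GB mdlog file would stall writes once less than 120 GB is available, which leaves room for one more compactification pass. In practice the file and directory mdlog files would each need checking, or their sizes summed if they share a disk.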
How are others managing this at the moment?