while running eos touch in a loop today i managed to fill up the quarkdb ns directory, resulting in 2 out of 3 quarkdb nodes crashing due to an io error. this apparently also caused the mgm to crash, because it couldn’t reserve inode(s).
not entirely sure what a good solution to this would be - maybe have the mgm return an error message instead of crashing, or die neatly?
i’ve also kept the stack trace if it’s of any interest.
Indeed, for now a full disk will result in QDB crashing, as it’s not able to handle such an error. I plan on having it switch into read-only mode in such a case, while printing lots of scary warnings in the log.
The MGM crash is probably a bug: in theory it shouldn’t crash if QDB is down. The stack trace would be appreciated.
switching to read-only sounds good! @davidjericho also posted something similar (but about the in-memory namespace) mentioning that the mgm tries to handle writes to an almost full disk - could something similar be done for quarkdb?
that is, maybe start yelling/warning as the disk is starting to get full, then switch to read only at some higher threshold?
Yes indeed, we should start yelling before the read-only threshold is reached. I’ll just have a background thread poll every few seconds to check how full the partition holding QDB is, and start warning / trigger read-only when needed.
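For anyone following along, here is a rough sketch of the two-threshold polling idea in Python. It is not QuarkDB's actual implementation, just an illustration: the threshold values (85% and 95%), the 5 second interval, and the names `classify`, `used_fraction` and `DiskWatcher` are all made up for the example.

```python
import logging
import shutil
import threading

# Hypothetical thresholds, chosen only for illustration.
WARN_THRESHOLD = 0.85       # start warning at 85% full
READONLY_THRESHOLD = 0.95   # switch to read-only at 95% full


def classify(fraction, warn=WARN_THRESHOLD, readonly=READONLY_THRESHOLD):
    """Map a used-space fraction (0.0 to 1.0) to 'ok', 'warn' or 'read-only'."""
    if fraction >= readonly:
        return "read-only"
    if fraction >= warn:
        return "warn"
    return "ok"


def used_fraction(path):
    """Return the used fraction of the partition that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


class DiskWatcher:
    """Background thread that re-checks partition fullness every `interval` seconds."""

    def __init__(self, path, interval=5.0, probe=used_fraction):
        self.path = path
        self.interval = interval
        self.probe = probe          # injectable, so the logic can be tested
        self.state = "ok"
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def poll_once(self):
        """Check the partition once, log if needed, and return the new state."""
        fraction = self.probe(self.path)
        self.state = classify(fraction)
        if self.state == "warn":
            logging.warning("partition for %s is %.0f%% full",
                            self.path, fraction * 100)
        elif self.state == "read-only":
            logging.error("partition for %s is %.0f%% full, refusing writes",
                          self.path, fraction * 100)
        return self.state

    def _run(self):
        while not self._stop.is_set():
            self.poll_once()
            # wait() returns early when stop() is called
            self._stop.wait(self.interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

The write path would then check `watcher.state` and reject writes while it is `"read-only"`. Starting it looks like `watcher = DiskWatcher("/some/path"); watcher.start()`, where the path is a placeholder for wherever the database lives.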