Next week, the network team will perform maintenance on our infrastructure switch. It will last around 2 hours and may interrupt connections between all hosts (clients, FSTs, MGM, QDB nodes).
Any suggestions on how to handle this from the EOS point of view?
Is it better to switch off all the daemons before the maintenance and restart them afterwards, or to leave them running and let them recover by themselves?
If we shut down all the FSTs, would it be safe to restart them all at once (we have 50 of them, with 1800+ file systems in total), or is there a risk of overloading the MGM?
Our instance is running MGM v4.5.15 and FSTs v4.5.17, with QuarkDB 0.4.2.
Probably the cleanest way would be to stop the services in an orderly fashion and then restart them once the intervention is over. I believe that in 4.5.20 the dumpmd can already connect to QuarkDB if the fstofs.qdbcluster directive is present in the /etc/xrd.cf.fst file.
Also, automatic dumpmd has been disabled for some time now, so I don’t think there should be any issues in this respect.
Thank you for your answer. Just to be sure I understand: the orderly fashion would be to switch off the FSTs, then the MGM, then QDB, correct?
About the dumpmd detail, do you mean that restarting all FSTs at the same time should not hurt?
I checked and we do not have the fstofs.qdbcluster option in our xrd.cf.fst files. But I do not know whether that is because we weren’t aware that we should add it, or because the feature is not available. I’ll check that.
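For anyone wanting to run the same check across many FSTs, here is a minimal sketch that looks for an uncommented fstofs.qdbcluster line in the text of an xrd.cf.fst file. The function name find_directive is mine, the sample config is a made-up excerpt, and the QDB host names and port in it are placeholders, not values from our instance. I am also assuming the directive takes a space-separated list of host:port entries, so please check that against the documentation for your version.

```python
def find_directive(config_text, directive="fstofs.qdbcluster"):
    """Return the value of the first uncommented `directive` line, or None."""
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if parts[0] == directive:
            return parts[1].strip() if len(parts) > 1 else ""
    return None


# Hypothetical excerpt of /etc/xrd.cf.fst; hosts and port are placeholders.
sample = """\
# hypothetical excerpt of /etc/xrd.cf.fst
xrd.port 1095
#fstofs.qdbcluster old-qdb.example.org:7777
fstofs.qdbcluster qdb1.example.org:7777 qdb2.example.org:7777
"""

print(find_directive(sample))
```

On a real node one would read the file content with open("/etc/xrd.cf.fst").read() and pass it to the function; a None result means the directive is missing or only present as a comment.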
OK, thank you for your answer, we will do it this way. We will also add the fstofs.qdbcluster directive, which is missing from our FST configuration and seems to be already supported in v4.5.17, to avoid any issue when restarting the FSTs if they decide to do a full boot.
Just one clarification about this: in our procedure, we usually set all the fs to off with eos node config hostname configstatus=off before shutting down a node.
But wouldn’t it also be correct (maybe better?) to use eos node set hostname off? Or does it have some drawback?
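To make the two variants concrete for a list of nodes, here is a small sketch that only builds the command strings and does not run anything. The helper name node_off_commands and the host names are hypothetical; the two command forms are the ones quoted above. The exact node identifier to pass (plain hostname or something else) should be taken from what eos node ls shows on your instance.

```python
def node_off_commands(hosts, use_node_set=False):
    """Build one 'switch off' command per node, without executing it."""
    commands = []
    for host in hosts:
        if use_node_set:
            # Variant asked about: act on the node itself
            commands.append(f"eos node set {host} off")
        else:
            # Current procedure: set all file systems of the node to off
            commands.append(f"eos node config {host} configstatus=off")
    return commands


# Hypothetical host names, for illustration only.
hosts = ["fst01.example.org", "fst02.example.org"]

for cmd in node_off_commands(hosts):
    print(cmd)
```

Printing the commands first makes it easy to review them before pasting them into a shell, which seems safer than executing them blindly on 50 nodes.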