Planned network interruption: how to better handle an EOS instance?

franck-jrc · June 15, 2020, 10:24am

Hello,

Next week, the network team will perform maintenance on our infrastructure switch. It will last around 2 hours and may interrupt all connections between all hosts (clients, FSTs, MGM, QDB nodes).

Any suggestions on how to handle this from the EOS point of view?

Is it better to switch off all the daemons before the maintenance and restart them afterwards, or to leave them running and let them recover by themselves?

If we shut down all the FSTs, would it be safe to restart them all at once (we have 50 of them, with a total of 1800+ file systems), or is there a risk of blowing up the MGM?

Our instance is running MGM v4.5.15, FSTs v4.5.17, with QuarkDB 0.4.2

Thank you for your help,

Franck

@esindril, do you maybe have some ideas about this?

esindril · June 19, 2020, 2:33pm

Hi Franck,

Probably the cleanest way would be to stop the services in an orderly fashion and then restart them once the intervention is over. I believe that 4.5.20 already has the dumpmd that can connect to QuarkDB if the fstofs.qdbcluster configuration is present in the /etc/xrd.cf.fst file.

Also, automatic dumpmd has been disabled for some time now, so I don’t think there should be any issues in this respect.

Cheers,
Elvin

franck-jrc · June 20, 2020, 6:58am

Hi Elvin,

Thank you for your answer. Just to be sure I understand: the orderly fashion would be to switch off the FSTs, then the MGM, then QDB, correct?

About the dumpmd detail, do you mean that restarting all FSTs at the same time should not hurt?

I checked and we do not have the fstofs.qdbcluster option in our xrd.cf.fst files. But I do not know whether that is because we weren’t aware that we should add it, or because the feature is not available in our version. I’ll check that.
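A quick way to run this check on each FST could look like the sketch below. It only greps for an uncommented fstofs.qdbcluster line; the function names are made up for illustration, and the default path is the /etc/xrd.cf.fst file mentioned above.

```shell
# Return success if the given FST config file contains an uncommented
# fstofs.qdbcluster directive. Lines starting with '#' do not match.
# has_qdbcluster and check_qdbcluster are hypothetical helper names.
has_qdbcluster() {
  local cfg="${1:-/etc/xrd.cf.fst}"
  grep -Eq '^[[:space:]]*fstofs\.qdbcluster[[:space:]]+' "$cfg"
}

# Print a one-line report for the given config file.
# A missing or unreadable file is also reported as "missing".
check_qdbcluster() {
  local cfg="${1:-/etc/xrd.cf.fst}"
  if has_qdbcluster "$cfg"; then
    echo "fstofs.qdbcluster is set in $cfg"
  else
    echo "fstofs.qdbcluster is missing from $cfg"
  fi
}
```

Calling check_qdbcluster with no argument inspects /etc/xrd.cf.fst on the local host; running it across all FSTs (for example over ssh) is left to whatever tooling is already in place.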

esindril · June 22, 2020, 8:09am

Hi Franck,

The orderly fashion would be: put the nodes to off, wait until no more writes are happening (node ls --io), then indeed stop the FSTs, the MGM and QDB, in that order.
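As a rough sketch of that sequence, assuming the eos CLI is run from a host with admin rights on the MGM: the eos node set and eos node ls --io commands are the ones used in this thread, while the helper names, the DRY_RUN switch, the ssh-based remote stop and the systemd unit names passed to it are assumptions that have to be adapted to the local setup.

```shell
# DRY_RUN=1 (the default here) only prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: put every FST node given as argument to off.
nodes_off() {
  local host
  for host in "$@"; do
    run eos node set "$host" off
  done
}

# Step 2: show per-node I/O so the operator can confirm that writes have stopped.
# This does not wait by itself; the output has to be inspected (and re-run) manually.
show_io() {
  run eos node ls --io
}

# Step 3: stop one daemon on one host. Call it for the FSTs first, then the MGM,
# then the QDB nodes. The unit name is passed in by the caller because it is
# installation specific; stopping via ssh + systemctl is an assumption.
stop_service() {
  local host="$1" unit="$2"
  run ssh "$host" systemctl stop "$unit"
}
```

With DRY_RUN left at 1, calling nodes_off with a list of FST hostnames followed by show_io just prints the commands that would be issued, which is a convenient way to review the plan before the maintenance window.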

Yes, it should be fine restarting all FSTs at once.

Cheers,
Elvin

franck-jrc · June 22, 2020, 9:44am

OK, thank you for your answer, we will do it this way. We will also add the fstofs.qdbcluster directive, which is missing from our FST configuration, since it seems to be supported already in v4.5.17. This should avoid any issue when restarting the FSTs if they decide to do a full boot.

franck-jrc · June 23, 2020, 8:00am

Just one clarification about this: in our procedure, we usually set all the fs to off with eos node config hostname configstatus=off before shutting down a node.
But wouldn’t it also be correct (maybe better?) to use eos node set hostname off? Or does it have some drawback?

esindril · June 24, 2020, 8:34pm

No drawback, eos node set <hostname> off is also good enough.

franck-jrc · June 29, 2020, 9:53am

Thank you for your support @esindril! The maintenance went well.
