For those interested: today I had to apply firmware updates to an EOS file server (FST role), and a reboot was required. There may be similar cases for other kinds of maintenance operations. I asked Andreas how to do it properly, and here are the steps:
Put the node in read-only mode: EOS Console [root://localhost] |/> node config mynode.fqdn:1095 configstatus=ro
Wait until all write operations have stopped, monitoring with: node ls --io mynode.fqdn
Then stop the EOS FST service: service eos stop (or the equivalent systemctl command)
Perform maintenance operations including reboot as required
After the server has restarted and is ready, start the EOS services
Check the node's EOS filesystems for booted status: EOS Console [root://localhost] |/> fs ls mynode.fqdn
When all filesystems have booted, put the node back in rw mode: EOS Console [root://localhost] |/> node config mynode.fqdn:1095 configstatus=rw
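The sequence above can be sketched as a small shell script. Note this is only a sketch under my own assumptions: the wait_for_zero helper, the count_writers placeholder, the POLL_INTERVAL knob, and the eos@fst unit name are mine, not EOS tooling, and the eos commands themselves are left commented out for manual review:

```shell
#!/bin/sh
# Sketch of the read-only maintenance sequence (assumptions noted above).
NODE=mynode.fqdn   # adjust to your FST host

# Poll a command until it prints 0 (e.g. a count of open writes on the node).
wait_for_zero() {
    while n=$("$@"); [ "${n:-1}" -gt 0 ]; do
        echo "still $n active, waiting..." >&2
        sleep "${POLL_INTERVAL:-30}"
    done
}

# Run the eos steps by hand (or uncomment); count_writers is a placeholder
# for however you extract the write count from 'eos node ls --io':
# eos node config "$NODE:1095" configstatus=ro
# wait_for_zero count_writers "$NODE"
# systemctl stop eos@fst
# ...firmware update / reboot...
# systemctl start eos@fst
# eos fs ls "$NODE"        # wait until every filesystem shows 'booted'
# eos node config "$NODE:1095" configstatus=rw
```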
To make the procedure completely transparent for reading clients as well (avoiding a read error + retry), one also needs to avoid shutting down an FST while reads are ongoing.
To avoid scheduling new transfers to the node, in addition to the previous steps you first need to set the node off and wait until all the load (both read and write) goes away: eos node set mynode.fqdn:1095 off
Once the node has been updated, set it back to on (and read-write): eos node set mynode.fqdn:1095 on
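Since both read and write load must be drained before the FST is stopped, a tiny check helper can gate the shutdown. The ropen=/wopen= key=value layout below is my assumption about what the io stats line looks like, not the documented output of node ls --io, so verify it against your instance before relying on it:

```shell
#!/bin/sh
# Sketch: decide whether an FST is idle from one line of io stats.
# Assumes the line contains 'ropen=<n>' and 'wopen=<n>' tokens; the real
# 'eos node ls --io' layout may differ, so adapt the parsing.
node_idle() {
    line=$1
    r=$(printf '%s' "$line" | sed -n 's/.*ropen=\([0-9]*\).*/\1/p')
    w=$(printf '%s' "$line" | sed -n 's/.*wopen=\([0-9]*\).*/\1/p')
    # A missing field counts as busy, so a format mismatch never reports idle.
    [ "${r:-1}" -eq 0 ] && [ "${w:-1}" -eq 0 ]
}

# Usage idea: after 'node set ... off', only stop the FST service once
# node_idle returns true for the node's stats line.
```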
We are using this in EOSUSER/CERNBox to make the procedure completely transparent for our clients.
On our instance, files are sometimes open for many hours (long-running processes); other times some closes are missing and the files stay open forever. Waiting for ropen or wopen to reach 0 can therefore take forever, and we have to cut the connections off.
Is there a way to list which files are kept open, and by whom, on one FS or FST, so that we can decide whether it is safe to shut down the FST anyway, or notify the users? I can only see the hotfiles information for one FS, but that does not cover all of them.