A long-overdue upgrade at JRC: any suggestions?

Hello,

We have been delaying the upgrade of our EOS production instance for too long. This was partly for internal reasons, and partly because our instance is very stable with the versions we currently run (MGM 4.5.15, FSTs 4.5.17) and we prefer to avoid frequent upgrades.

But now it is time to take a step forward, especially with the promising upcoming EOS5 that we saw at last week's workshop, so we would like to move to a recent 4.x version first.

First question, for EOS experts or other site managers who have upgraded recently: which version would you recommend for us?

Second question, more practical: which of these two upgrade scenarios do you suggest?

  • Gradual: upgrade the MGM, then do a rolling upgrade of the FSTs within a few days. This reduces the downtime to just the MGM upgrade (using HA requires a DNS change, so we prefer not to use the switch and to stick to the current master), but the MGM and FSTs will run with mismatched versions. We know that this makes the converter (and therefore the group balancer) unusable because of a change, so it needs to be disabled (which is already the case), but is there any other known downside? We are running a test instance with an MGM at 4.8.x and FSTs at 4.5.17, and it runs well, but the load is not comparable to that of the production instance, so behavior could differ a lot.
  • Full stop: stop the whole instance, upgrade all nodes, and start them all at once. This would require a longer downtime (more than 60 FSTs to upgrade, restart, and check that none of them has an issue with the upgrade).
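
Whichever scenario is chosen, one practical check is whether all nodes report the same version at a given moment. The following is only a minimal sketch, assuming the versions have already been collected into a dictionary by whatever inventory or monitoring is in use. The hostnames and the helper names `group_by_version` and `is_consistent` are hypothetical and not part of any EOS tooling.

```python
from collections import defaultdict


def group_by_version(node_versions):
    """Group hostnames by the version string each one reports."""
    groups = defaultdict(list)
    for host, version in node_versions.items():
        groups[version.strip()].append(host)
    # Sort hostnames so the report is stable between runs.
    return {version: sorted(hosts) for version, hosts in groups.items()}


def is_consistent(node_versions):
    """True when every node reports the same version (or there are no nodes)."""
    return len(group_by_version(node_versions)) <= 1


# Hypothetical inventory taken in the middle of a rolling upgrade.
reported = {
    "fst01.example.org": "4.8.91",
    "fst02.example.org": "4.5.17",
    "fst03.example.org": "4.8.91",
}

for version, hosts in sorted(group_by_version(reported).items()):
    print(version, hosts)
print("consistent:", is_consistent(reported))
```

With more than 60 FSTs, a grouped report like this makes it easy to spot the few nodes still on the old version.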

Any other suggestions about the large version change (intermediate version, configuration changes, etc.) are welcome.

Thank you,

Franck

Any feedback on this?

Best
Armin

Hi Armin,
we don’t really have hands-on experience with such a large version jump. It is clear that a short downtime and a consistent upgrade is safer. I will try to check with Elvin tomorrow whether there is some incompatible change in the communication from 4.5.17 FSTs to new MGMs … will let you know!

Thanks for checking!

So once we have an updated version, we should run updates more frequently then, right? How is this done at CERN for EOS instances with a lot of users where downtimes should probably be avoided, like the one used for CERNBox? For smaller version jumps I assume the updates can be done without downtimes then.

For small changes you can probably use the script we are using for updates and adapt it.
Essentially it puts an FST in RO mode when nobody writes, updates the software, restarts it, and puts it back into RW.
Still, for this big update, just announce a short downtime and update all the FSTs in one go.
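
The rolling procedure described above (RO, update, restart, RW) can be sketched as a small loop. This is not the CERN script, only an illustration under stated assumptions: every action is passed in as a callback, and the names `set_mode`, `upgrade`, `restart` and `is_healthy` are hypothetical placeholders for whatever commands a site actually uses. Waiting until nobody writes is assumed to happen inside `set_mode`.

```python
from typing import Callable, Iterable, List


def rolling_upgrade(
    fsts: Iterable[str],
    set_mode: Callable[[str, str], None],
    upgrade: Callable[[str], None],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade FSTs one at a time and return the hosts that completed.

    The loop stops at the first host that is not healthy after its restart,
    leaving that host read-only so it can be inspected.
    """
    completed = []
    for host in fsts:
        set_mode(host, "ro")      # stop new writes to this FST
        upgrade(host)             # install the new packages
        restart(host)             # restart the FST service
        if not is_healthy(host):
            break                 # do not touch the remaining nodes
        set_mode(host, "rw")      # put the FST back into service
        completed.append(host)
    return completed


# Dry run with stub callbacks that only record what would be done.
log = []
upgraded = rolling_upgrade(
    ["fst01", "fst02"],
    set_mode=lambda host, mode: log.append((host, mode)),
    upgrade=lambda host: log.append((host, "upgrade")),
    restart=lambda host: log.append((host, "restart")),
    is_healthy=lambda host: True,
)
print(upgraded)
print(log)
```

Keeping the actions as callbacks makes it possible to do a dry run first, and stopping at the first unhealthy node limits the damage if the new version misbehaves.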

Hello,

We are finally planning this upgrade for the coming weeks, as we unfortunately had to work on other priorities until now.

We will therefore shut down our whole production instance, upgrade to the latest version, restart the MGM, then the FSTs, and hope to have a working instance with the latest 4.8.x version.

Do you have any suggestions about the following points:

  • Which version should we pick? We see that the stable repository has 4.8.62, the testing repository has 4.8.91, and the version mentioned on the website is 4.8.66. But maybe you have advice based on what you have experienced running your instances over the last months.
  • Has anything changed in the configuration that we would need to review during this large upgrade step?
  • We will of course make a backup of the QuarkDB data; do you consider it necessary to also back up the LevelDB data on each FST? If yes, what would be the correct way to do it?
  • Is there anything else to consider apart from the points above?
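
For the version question, ordering the candidates numerically avoids surprises from plain string comparison. A minimal sketch: the three versions are the ones listed above, while `version_key` is a hypothetical helper and not part of any EOS tooling. The newest version is not necessarily the recommended one, so this only orders the candidates.

```python
def version_key(text):
    """Turn '4.8.62' into (4, 8, 62) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# The three versions mentioned above, keyed by where each was seen.
candidates = {
    "stable repository": "4.8.62",
    "testing repository": "4.8.91",
    "website": "4.8.66",
}

# Sort from oldest to newest using the numeric key.
ordered = sorted(candidates.items(), key=lambda item: version_key(item[1]))
for source, version in ordered:
    print(f"{version}  ({source})")

newest_source, newest_version = ordered[-1]
```

The numeric key matters once a component reaches three digits: as strings, "4.8.100" sorts before "4.8.91", while as tuples it sorts after.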

Hi Franck,
we run 4.9.91 in production on CERNBOX. That should be good for you!

We now also run all LHC instances on EOS5.

Which version are you upgrading from?

There is no need to back up LevelDB on the FSTs.

Hi Andreas,

Thank you for your answer. We are still upgrading from 4.5.15/4.5.17, which is why we are asking for extra suggestions, if you can think of any.

I suppose that you mean 4.8.91 for your CERNBox production (I do not see any 4.9.91 version).

We are also thinking of upgrading to EOS5 after that, but we first need to align on 4.x.x.