Eos5 FST Unable to get free physical space

Hello,

We’ve added two new 1.5PB eos 5.1.19 FSTs to our otherwise eos 4.8.40 cluster, ORNL::EOS (MGM with multiple FSTs). The plan is to drain the old FSTs to the new ones and then install new MGMs to replace and upgrade the existing ones.

The new eos 5.1.19 node appears healthy, is active, and its fsids boot without issue.

However, xrdcp write attempts result in

Server responded with an error: [3009] Unable to get free physical space /eos/aliceornleos/hosts; No space left on device (destination)

Drain operations log the following on the MGM in /var/log/eos/mgm/Clients.log:

230512 14:29:14 ERROR [10367/01395] alienmaster ::Emsg Unable to get free physical space /eos/aliceornleos/cond/10/38031/9213f88e-f0c0-11ed-ad5b-b47af1a61b9a; No space left on device

The fst geotag is set to e204ah75 (same as others) and I set the default tag as well via eos vid set geotag default e204ah75. All fsids are in the same default.0 namespace, no RAIN, etc.

I’ve not identified errors in logs other than the above, nor other applicable solutions searching in the eos-community. Any suggestions for what else to check are appreciated.

Cheers,
Pete

This might be a stupid question, but did you check the permissions and ownership on the root of the filesystems? Should be daemon:daemon and 755 if I recall correctly.


Dan Szkola
FNAL

Hi Dan, not a silly question at all - I’ve made that mistake many times.

In this case permissions are daemon:daemon 755. Neither the FSTs nor the MGMs log errors when booting the fsids, and each fsid contains the files scrub.re-write.1 and scrub.write-once.1, so eos was able to at least create those.

Pete

I changed the geotag from e204ah75 to e204::ah75 per the second point in this post though the issue persists.

Any other suggestions? The fsids boot, appear healthy, and eos created the scrub.re-write.1 files with correct ownership.

Thank you,
Pete

Hi Pete,

Did you also update the nominalsize value for your space after adding the new capacity?
What is the output of eos space ls?
Do you get any errors in the MGM logs when trying to do such a transfer?

Thanks,
Elvin

Hi @esindril - sorry for the delay, I’ve been traveling. Unfortunately we are seeing no errors in the MGM log when attempting the xrdcp to the activated fsids, only the client-side “error: [3009] Unable to get free physical space”.

With the newly added eos-fst-12.ornl.gov fst set to off status, the space ls is:

[root@ornl-eos-01 data]# eos space ls default
┌──────────┬────────────────┬────────────┬────────────┬──────┬─────────┬───────────────┬──────────────┬─────────────┬─────────────┬──────┬──────────┬───────────┬───────────┬──────┬────────┬───────────┬──────┬────────┬───────────┐
│type      │            name│   groupsize│    groupmod│ N(fs)│ N(fs-rw)│ sum(usedbytes)│ sum(capacity)│ capacity(rw)│ nom.capacity│ quota│ balancing│  threshold│  converter│   ntx│  active│        wfe│   ntx│  active│ intergroup│
└──────────┴────────────────┴────────────┴────────────┴──────┴─────────┴───────────────┴──────────────┴─────────────┴─────────────┴──────┴──────────┴───────────┴───────────┴──────┴────────┴───────────┴──────┴────────┴───────────┘
 spaceview           default            4          120    563       116         1.40 PB        4.02 PB     863.86 TB           0 B    off        off          20         off      2        0         off      1        0          on 

When the node is activated, capacity(rw) changes from 863.86 TB to (negative) -6.60 PB

[root@ornl-eos-01 data]# eos space ls default
┌──────────┬────────────────┬────────────┬────────────┬──────┬─────────┬───────────────┬──────────────┬─────────────┬─────────────┬──────┬──────────┬───────────┬───────────┬──────┬────────┬───────────┬──────┬────────┬───────────┐
│type      │            name│   groupsize│    groupmod│ N(fs)│ N(fs-rw)│ sum(usedbytes)│ sum(capacity)│ capacity(rw)│ nom.capacity│ quota│ balancing│  threshold│  converter│   ntx│  active│        wfe│   ntx│  active│ intergroup│
└──────────┴────────────────┴────────────┴────────────┴──────┴─────────┴───────────────┴──────────────┴─────────────┴─────────────┴──────┴──────────┴───────────┴───────────┴──────┴────────┴───────────┴──────┴────────┴───────────┘
 spaceview           default            4          120    563       157         1.40 PB        4.02 PB      -6.60 PB           0 B    off        off          20         off      2        0         off      1        0          on 

Re: nominalsize: it was undefined per eos space status default. We have not previously set this value when adding new FSTs; the added capacity was recognized automatically. The eos docs state “It is possible to set a physical space restriction using the space parameter nominalsize”, though the behavior above (negative capacity(rw)) seems counterintuitive given that description.

I set eos space config default space.nominalsize=4P; however, eos space ls default still returns a negative capacity(rw) value of -6.60 PB.

Thank you for any additional suggestions you may have. We’d like to begin using the newly added PB of space, and an identical node as well.
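
The symptom in the two eos space ls outputs above can be caught automatically with a small sanity check. The following Python sketch parses the human-readable sizes and flags a capacity(rw) value that is negative or larger than sum(capacity). It assumes decimal units (1 TB = 10^12 bytes) and that capacity(rw) should never exceed sum(capacity); both are assumptions, and the function names are hypothetical.

```python
import re

# Decimal multipliers; whether eos prints decimal or binary units is an
# assumption here and does not affect the sign check.
_UNITS = {
    "B": 1,
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
}


def parse_size(text):
    """Convert a string such as '863.86 TB' or '-6.60 PB' to bytes."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    value, unit = match.groups()
    try:
        factor = _UNITS[unit.upper()]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}") from None
    return float(value) * factor


def capacity_rw_problems(capacity_rw, sum_capacity):
    """Return a list of inconsistencies between the two reported values."""
    rw = parse_size(capacity_rw)
    total = parse_size(sum_capacity)
    problems = []
    if rw < 0:
        problems.append("capacity(rw) is negative")
    if rw > total:
        problems.append("capacity(rw) exceeds sum(capacity)")
    return problems
```

With the values from this thread, capacity_rw_problems("863.86 TB", "4.02 PB") reports nothing, while capacity_rw_problems("-6.60 PB", "4.02 PB") reports the negative value.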

Hi Pete,

Hard to say what triggers this weird behavior, but I guess there is some incompatibility that we are not aware of between your installed version and the new one you are trying to deploy. Given that 4.8.40 is almost 2 years old, I would try a less “aggressive” update strategy, namely:

  • first update your current instance to the latest eos 4 release, i.e. 4.8.102
  • add the new nodes with the same version (4.8.102)
  • once you decide you want to move to EOS 5 (which I would encourage asap), I can give you some instructions on the configuration changes needed for this.

Hope this helps!
Cheers,
Elvin

Hi @esindril

How about we proceed as follows: replace the current MGM with a freshly installed eos5 instance (we have new MGM hardware to deploy) and export the instance configuration to QuarkDB.

Then upgrade the remaining eos4 FSTs to eos5 in place.

Will doing so in that order allow the environment to continue to function as we then upgrade in place each fst?

Any suggested additional instructions on upgrading to eos5 (mgm or fst) are welcome.

Pete

Hi Pete,

Given the issues that you are seeing, I would be reluctant to perform such an upgrade. We have no experience going from such an old version (4.8.40) directly to EOS 5, therefore I would suggest going the safer route and doing the update in stages.

If you can afford the downtime, you can do everything in one go to EOS 5 and then you can continue with adding the new FSTs. It all depends on your constraints.

You can find instructions for the migration from EOS 4 to 5 here:
https://eos-docs.web.cern.ch/quickstart/update_eos4to5.html

Cheers,
Elvin