Eos quota used bytes reported exceeds space capacity

peby · July 11, 2018, 3:46pm

We are re-deploying our FSTs, all but one of which are currently nearly empty.

The total capacity of the FSTs activated so far is 1 PB, with ~250 TB used.

However, eos quota reports 1.2 PB of ‘used bytes’ for user:group cern.

Why is eos quota reporting more ‘used bytes’ than the space ‘capacity’?

Cheers,
Pete

root@alice-eos-01.ornl.gov:~
17:34:19 # eos quota ls

┏━> Quota Node: /eos/aliceornl/
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│user      │used bytes│logi bytes│used files│aval bytes│aval logib│aval files│ filled[%]│vol-status│ino-status│
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
 adm               0 B        0 B          0        0 B        0 B          0   100.00 %    ignored    ignored
 cern          1.22 PB    1.22 PB    30.68 M        0 B        0 B          0   100.00 %    ignored    ignored
 daemon      185.01 MB  185.01 MB         14        0 B        0 B          0   100.00 %    ignored    ignored
 root         10.48 MB   10.48 MB          0        0 B        0 B          0   100.00 %    ignored    ignored

┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│group     │used bytes│logi bytes│used files│aval bytes│aval logib│aval files│ filled[%]│vol-status│ino-status│
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
 adm               0 B        0 B          0        0 B        0 B          0   100.00 %    ignored    ignored
 cern          1.22 PB    1.22 PB    30.68 M        0 B        0 B          0   100.00 %    ignored    ignored
 daemon      185.01 MB  185.01 MB         14        0 B        0 B          0   100.00 %    ignored    ignored
 nobody            0 B        0 B          0        0 B        0 B          0   100.00 %    ignored    ignored
 root         10.48 MB   10.48 MB          0        0 B        0 B          0   100.00 %    ignored    ignored

┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│summary   │used bytes│logi bytes│used files│aval bytes│aval logib│aval files│ filled[%]│vol-status│ino-status│
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
 All users     1.22 PB    1.22 PB    30.68 M        0 B        0 B          0   100.00 %    ignored    ignored
 All groups    1.22 PB    1.22 PB    30.68 M        0 B        0 B          0   100.00 %    ignored    ignored
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛



root@alice-eos-01.ornl.gov:~
17:34:24 # eos space ls
┌──────────┬────────────────┬────────────┬────────────┬──────┬─────────┬───────────────┬──────────────┬─────────────┬─────────────┬──────┬──────────┬───────────┬───────────┬──────┬────────┬───────────┬──────┬────────┬───────────┐
│type      │            name│   groupsize│    groupmod│ N(fs)│ N(fs-rw)│ sum(usedbytes)│ sum(capacity)│ capacity(rw)│ nom.capacity│ quota│ balancing│  threshold│  converter│   ntx│  active│        wfe│   ntx│  active│ intergroup│
└──────────┴────────────────┴────────────┴────────────┴──────┴─────────┴───────────────┴──────────────┴─────────────┴─────────────┴──────┴──────────┴───────────┴───────────┴──────┴────────┴───────────┴──────┴────────┴───────────┘
 spaceview           default            6            8     31        21       253.59 TB        1.00 PB     644.98 TB           0 B    off        off          20          on     20       10                  0        0          on
 spaceview         default:0            0            0      0         0             0 B            0 B           0 B           0 B    off        off          20         off      2        0         off      1        0         off

peby · July 17, 2018, 8:37pm

Am I misinterpreting?

eos quota seems to believe there are 1.24 PB of used bytes, whereas eos space shows (accurately) 276 TB used.

Attempting to set a quota of 0.8 PB therefore shows the quota as ‘exceeded’.

The uid/gid quota settings are stored in the default.eoscf config. However, where does ‘eos quota’ obtain the value it reports as ‘used bytes’, and why does this value differ from what is actually shown as used in the space?

Pete

root@alice-eos-01.ornl.gov:~
22:29:43 # eos quota set -p /eos/aliceornl/ -u cern -i 100M -v .8P
success: updated volume quota for uid=602 for node /eos/aliceornl/
success: updated inode quota for uid=602 for node /eos/aliceornl/
root@alice-eos-01.ornl.gov:~
22:30:16 # eos quota ls

┏━> Quota Node: /eos/aliceornl/
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│user      │used bytes│logi bytes│used files│aval bytes│aval logib│aval files│ filled[%]│vol-status│ino-status│
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
 adm               0 B        0 B          0        0 B        0 B          0   100.00 %    ignored    ignored
 cern          1.24 PB    1.24 PB    32.46 M  800.00 TB  800.00 TB   100.00 M   100.00 %   exceeded         ok
 daemon            0 B        0 B          0        0 B        0 B          0   100.00 %    ignored    ignored
 root         10.48 MB   10.48 MB          0        0 B        0 B          0   100.00 %    ignored    ignored

┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│group     │used bytes│logi bytes│used files│aval bytes│aval logib│aval files│ filled[%]│vol-status│ino-status│
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
 adm               0 B        0 B          0        0 B        0 B          0   100.00 %    ignored    ignored
 cern          1.24 PB    1.24 PB    32.46 M        0 B        0 B          0   100.00 %    ignored    ignored
 daemon            0 B        0 B          0        0 B        0 B          0   100.00 %    ignored    ignored
 nobody            0 B        0 B          0        0 B        0 B          0   100.00 %    ignored    ignored
 root         10.48 MB   10.48 MB          0        0 B        0 B          0   100.00 %    ignored    ignored

┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│summary   │used bytes│logi bytes│used files│aval bytes│aval logib│aval files│ filled[%]│vol-status│ino-status│
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
 All users     1.24 PB    1.24 PB    32.46 M  800.00 TB  800.00 TB   100.00 M   100.00 %   exceeded         ok
 All groups    1.24 PB    1.24 PB    32.46 M        0 B        0 B          0   100.00 %    ignored    ignored
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛


root@alice-eos-01.ornl.gov:~
22:30:28 # grep quota /var/eos/config/alice-eos-01.ornl.gov/default.eoscf
global:/config/eosaliceornl/space/2#quota => off
global:/config/eosaliceornl/space/4#quota => off
global:/config/eosaliceornl/space/7#quota => off
global:/config/eosaliceornl/space/default#quota => off
global:/config/eosaliceornl/space/default:0#quota => off
global:/config/eosaliceornl/space/recovery#quota => off
global:/config/eosaliceornl/space/spare#quota => off
global:/config/eosaliceornl/space/test#quota => off
quota:/eos/aliceornl/:uid=602:userbytes => 800000000000000
quota:/eos/aliceornl/:uid=602:userfiles => 100000000


root@alice-eos-01.ornl.gov:~
22:30:43 # eos space ls
┌──────────┬────────────────┬────────────┬────────────┬──────┬─────────┬───────────────┬──────────────┬─────────────┬─────────────┬──────┬──────────┬───────────┬───────────┬──────┬────────┬───────────┬──────┬────────┬───────────┐
│type      │            name│   groupsize│    groupmod│ N(fs)│ N(fs-rw)│ sum(usedbytes)│ sum(capacity)│ capacity(rw)│ nom.capacity│ quota│ balancing│  threshold│  converter│   ntx│  active│        wfe│   ntx│  active│ intergroup│
└──────────┴────────────────┴────────────┴────────────┴──────┴─────────┴───────────────┴──────────────┴─────────────┴─────────────┴──────┴──────────┴───────────┴───────────┴──────┴────────┴───────────┴──────┴────────┴───────────┘
 spaceview           default            6            8     31        21       276.87 TB      999.99 TB     644.97 TB           0 B    off         on          20          on     20        8                  0        0          on

gbitzes · July 17, 2018, 10:06pm

Hi Peter,

Indeed, this looks odd. I’ll have a look at the code to check if quotas are accounted correctly, and try to reproduce what you’re seeing.

The in-memory namespace reconstructs the quota values on reboot when replaying the changelogs, so it would be interesting to see if they change then. (not suggesting to reboot just for this, but a thing to keep in mind when it does happen - please let us know if the numbers change)

By the way, Elvin and Andreas are on holiday this week, let’s wait to see if they know something more.

Cheers,
Georgios

mvala · July 18, 2018, 3:09am

Hi,

Most quota bugs were fixed starting with eos v4.3.4. You can try that.

Ciao
Martin

peby · July 18, 2018, 3:19pm

Thanks Georgios and Martin.

The latest eos published in the Citrine repo we are using at https://storage-ci.web.cern.ch/storage-ci/eos/citrine/tag/el-6/x86_64/ is 4.2.28 - is this the correct production release repo?

What is the current version recommended for production?

Cheers,
Pete

mvala · July 18, 2018, 3:44pm

There is a testing repo. You can give it a try. I find our eos stable running 4.3.4:
https://storage-ci.web.cern.ch/storage-ci/eos/citrine/tag/testing/el-6/x86_64/

peby · July 23, 2018, 3:15pm

Hi Georgios,

Fwiw, a reboot of the mgm and all fsts did not resolve the issue.

eos quota ls still shows used bytes as 1.29 PB, while eos space ls shows sum(usedbytes) of 336 TB.

We are a bit reluctant to move to the eos testing branch for our production T2 site.

Cheers,
Pete

apeters · July 23, 2018, 3:36pm

Hi Pete,
did you have a mixture of Beryl and Citrine running? It may be that files were committed with a very large size although they were only created by a truncate operation, so they don’t take up space on disk.
In Beryl, a 1 TB truncate was used to indicate a certain condition because Xrootd 3 lacked a plug-in, and if you mixed Beryl with Citrine, it did indeed truncate to 1 TB when talking to a Citrine FST.
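
To illustrate the general effect (this is only a local-filesystem analogy, not EOS code): a file created purely by truncate has a large logical size but usually occupies almost no blocks on disk. The sketch below assumes a POSIX system whose filesystem supports sparse files; the helper names are made up for the example.

```python
import os
import tempfile


def make_truncated_file(directory, size):
    """Create an empty file and extend it to `size` bytes with ftruncate only."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.ftruncate(fd, size)  # sets the logical size, writes no data
    finally:
        os.close(fd)
    return path


def logical_and_allocated(path):
    """Return (logical size, bytes actually allocated on disk) for `path`."""
    st = os.stat(path)
    # st_blocks counts 512-byte blocks on POSIX systems.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = make_truncated_file(tmp, 64 * 1024 * 1024)
        logical, allocated = logical_and_allocated(path)
        print(f"logical size : {logical} bytes")
        print(f"allocated    : {allocated} bytes")
```

An accounting based on the logical size and one based on disk usage would disagree for such a file, which is the kind of gap described above.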
The other possibility is that you actually have more files registered in the namespace than exist on the storage nodes. You might try dumping the namespace with ‘find --size’ and sorting by size to figure out whether the extra space comes from one or a few extremely large files.
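
A rough sketch of the sort-by-size step, in case it is useful. It assumes the dump was saved to a text file and that each line looks like `path=<path> size=<bytes>`; that line format, the script name and the dump file name are assumptions, so adjust the pattern to whatever your version prints.

```python
import heapq
import re
import sys

# Assumed line format: "path=<path> size=<bytes>" (adjust if yours differs).
LINE_RE = re.compile(r"^path=(?P<path>.*?)\s+size=(?P<size>\d+)\b")


def summarize(lines, top_n=20):
    """Return (file count, total logical bytes, top_n largest (size, path))."""
    total = 0
    count = 0
    top = []  # min-heap holding the largest entries seen so far
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not look like file entries
        size = int(match.group("size"))
        total += size
        count += 1
        item = (size, match.group("path"))
        if len(top) < top_n:
            heapq.heappush(top, item)
        else:
            heapq.heappushpop(top, item)  # keep only the top_n largest
    return count, total, sorted(top, reverse=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as dump:
        count, total, top = summarize(dump)
    print(f"{count} files, {total} logical bytes in total")
    for size, path in top:
        print(f"{size:>20d}  {path}")
```

For example, redirect the ‘find --size’ output for /eos/aliceornl/ into dump.txt and run `python3 largest_files.py dump.txt`. If a handful of entries around 1 TB account for most of the total, that would point to the truncate case rather than to missing replicas.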
