Difference between accounting report and space ls for default space

Dear EOS people,
for space ls we have:

[root@grid67 atlas]# eos space ls default

type       name     groupsize  groupmod  N(fs)  N(fs-rw)  sum(usedbytes)  sum(capacity)  capacity(rw)  nom.capacity  sched.capacity  quota  balancing  threshold  converter  ntx  active  wfe  ntx  active  intergroup
spaceview  default  128        128       184    180       1.45 PB         4.32 PB         4.23 PB       10.00 PB      2.79 PB         on     on         20         on         2    0       off  1    0       off

where the used capacity is reported as 1.45 PB,
while from the accounting report we get only ~1 PB (base 2).

Do you have any idea where the difference could come from?
(We have a plain layout for the files everywhere.)

thank you in advance
best
e.v.
p.s.
we are running 5.0.18 and we believe the correct value is ~1 PB of stored data in the default space

hello
any hint about this topic?
thank you in advance
best
e.v.

eos space ls

grabs the statvfs info from all connected filesystems and sums it up.

This is reported to be 1.45 PB for all filesystems in the default space.

The accounting report, on the other hand, comes from the quota accounting and depends on the settings: depending on the configuration, it can cover only a part of the overall size. It can also be that there are orphaned files on your filesystems.
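
To see the two numbers side by side, something like this should work (a sketch; the key names follow the key=value monitoring format quoted later in this thread):

```
# statvfs-based view (monitoring format, one line per space)
eos space ls -m | tr ' ' '\n' | grep usedbytes

# quota accounting view
eos quota ls -m
```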

hello Andreas,
thank you.
From eos fsck stat it does not look like I have orphaned files (e.g. orphans_n : 44).
In order to solve some ALICE issue we use
EOS_MGM_STATVFS_ONLY_QUOTA=1
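
(For context, EOS_MGM_STATVFS_ONLY_QUOTA is an MGM environment variable; a minimal sketch of where it would typically be set, assuming a standard EOS RPM layout:)

```
# /etc/sysconfig/eos_env on the MGM (path is an assumption from a standard install):
# base the statvfs-reported numbers on the quota accounting only
EOS_MGM_STATVFS_ONLY_QUOTA=1
```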
best
e.v.

hello Andreas
I think I found the issue: if I rm a quota node and set it again (e.g. for the group),
the quota data are reset for that particular node even if the node is not empty.
Is there a way to re-populate the quota data in a quota node?
e.g.
eos ls -l /eos/grif/dteam
-rw-r--r-- 1 dte000 dteam        149 Jun 26 11:51 16745.null
-rw-r--r-- 1 dte000 dteam        149 Jun 26 11:51 16916.null
-rw-r--r-- 1 dtes   dteam 1073741824 Jun 26 10:26 1g.zeros
-rw-r--r-- 1 dtes   dteam 1073741824 Jun 26 10:27 2g.zeros
-rw-r--r-- 1 dtes   dteam 1073741824 Jun 26 10:27 3g.zeros
-rw-r--r-- 1 dte000 dteam        149 Jun 26 11:51 5142.null
-rw-r--r-- 1 dte000 dteam        149 Jun 26 11:51 8827.null

but after the re-set the quota node reports:
┏━> Quota Node: /eos/grif/dteam/
group  used bytes  logi bytes  used files  aval bytes  aval logib  aval files  filled[%]  vol-status  ino-status
99     0 B         0 B         0           0 B         0 B         0           100.00 %   ignored     ignored
dteam  0 B         0 B         0           33.00 GB    33.00 GB    20.00 K     0.00 %     ok          ok

FYI
best
e.v.

hello all
do you have any hint for this quota issue?
thank you in advance
best
e.v.

Yes, there is a command to recompute a quota node after modification:

eos ns recompute_quotanode cid:<decimal_id>|cxid:<hex_id>
    recompute the specified quotanode

Ideally nobody is writing there when you recompute!
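
For example (a sketch; the container id is illustrative and can be looked up with eos fileinfo on the quota node directory):

```
# look up the container id of the quota node directory
eos fileinfo /eos/grif/dteam/

# recompute its quota accounting (decimal id 1234 is illustrative)
eos ns recompute_quotanode cid:1234
```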

hello Andreas
Thank you for the hint; I ran some recomputes and solved some issues with quota.

In the meanwhile, the actual problem that I described is connected with XFS and dynamic preallocation.

I have a strange issue on the XFS of the EOS partitions on the FSTs:
```
2.1G 000701f3 4.0G 000701f3
9.7G 00070214  12G 00070214
9.7G 000702bf  12G 000702bf
1.6G 0007041b 1.6G 0007041b
```
The first column is the apparent size of the file
and the third is the allocated size on XFS;
for some files there is a difference.
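
(A listing like the above can be produced along these lines; a sketch, with an illustrative partition path:)

```
# compare apparent size vs. allocated size for every file in a directory
cd /fspool/disk06/0000083c
paste <(du -h --apparent-size -- *) <(du -h -- *)
```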

From filefrag 000702bf I have:
000702bf: 7 extents found

The files with a difference between allocated size and apparent size exhibit large fragmentation.

If I defragment the file, the difference disappears.
I think it is related to how the kernel/XFS deals with dynamic preallocation.

Did you see this issue in the past? Could it be related to EOS and how eos/xrootd "closes" the files?

It is true that we do not run massive deletions or drains in our space.
I think that for the reason above I see a difference between the used space (statvfs()) and the quota accounting report (apparent size).

thank you in advance
best
e.v.
p.s.
It is supposed that if the system needs the space, the preallocation will be freed automatically,
but I am not sure if the difference comes from preallocation or from file fragmentation.
Also, I do not understand why a half-full (50%) partition has fragmented files (with sizes of more than 1-2 GB).

hello Andreas
do we have any news or hints about this issue?
thank you in advance
best
e.v.

When you upload a file with xrdcp or eoscp it fallocates the target file size. Maybe you get some hint about what happened when you track how these files were generated. Did you start the instance with empty XFS filesystems, or did you reuse old ones which were never defragmented?

hello Andreas,

We started from fresh new XFS filesystems.
Most of the files (e.g. for the ATLAS VO) are created with HTTPS/TPC via FTS,
but it might be that this method does not use fallocate(); I will make a few checks.
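
One such check could look like this (a sketch, assuming the FST runs a single xrootd process):

```
# watch for fallocate() calls on the FST while a transfer is running
strace -f -e trace=fallocate -p "$(pgrep -x xrootd | head -1)"
```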
thank you for the hint

best
e.v

Hello Andreas, I am returning to the issue with some findings.

On EOS we have a mismatch between the used space as calculated directly from the FSTs via statvfs() calls and the quota accounting for the default space:

sum.stat.statfs.usedbytes=4901489687896064
while the quota report says "usedsize" : 4026197504675803
so 875,292,183,220,261 bytes are missing!

We spotted that the apparent sizes of some files are smaller than their disk sizes (allocated sizes)
(in the metadata DB or the Rucio DB the file size corresponds to the apparent size; the same is measured by the EOS quota system).

This is due to speculative preallocation, which allocates blocks past EOF.
It can be verified as follows:
connect to any FST,
go to any partition,
find some large files with a difference between du and du --apparent-size,
and run:

xfs_bmap -pvv /fspool/disk06/0000083c/0141cc0d
/fspool/disk06/0000083c/0141cc0d:
 EXT: FILE-OFFSET        BLOCK-RANGE              AG AG-OFFSET                 TOTAL FLAGS
   0: [0..7]:            41432754000..41432754007 25 (15653200..15653207)         8 001111
   1: [8..603999]:       43036269056..43036873047 25 (1619168256..1619772247) 603992 000111
   2: [604000..1026047]: 43036873048..43037295095 25 (1619772248..1620194295) 422048 011111
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent
    0001000 Doesn't begin on stripe unit
    0000100 Doesn't end on stripe unit
    0000010 Doesn't begin on stripe width
    0000001 Doesn't end on stripe width

The last extent of 422048 blocks * 512 bytes = 216,088,576 bytes corresponds to an unwritten preallocated extent.
For each affected file this space can be recovered only if the system reclaims the inode (e.g. the file is deleted) or the defragmentation process is run on the file.

The root cause is: within a timeout window of 5 minutes the inodes of some files appear "dirty" in the VFS cache, so the kernel cannot remove the preallocated extent, and it stays there forever after the expiration of the timeout. We are running CentOS Stream release 8 with kernel 4.18.0-383.el8.x86_64. We did not see anything similar on DPM with CentOS Linux release 7.9.2009 (Core) and kernel 3.10.0-1160.45.1.el7.x86_64.

For CentOS 7, see "3.10. Migrating from ext4 to XFS" in the Red Hat Enterprise Linux 7 documentation (Red Hat Customer Portal).

(e.g. on DPM xfs partitions)

Further comments:

a) I do not think that the parameters from the XFS FAQ (xfs.org) for aligning the XFS filesystem with the underlying RAID volume could help (in some preliminary tests we did not see any difference in sizes), but we could redo some tests.

b) By switching off dynamic speculative preallocation and using a fixed allocation size with the 'allocsize=' mount option (e.g. 256k), we do not have the problem with the size, but some files appear to be highly fragmented (up to 40 extents).

c) With the command /usr/sbin/xfs_db -r -c "frag" /dev/sdxx
we get the level of fragmentation of a partition;
the command can run with the partition mounted,
and the -r ensures read-only operation.

d) We can run the defragmentation process on a device, directory, or file
(e.g. xfs_fsr /dev/sdxx); the good news is that this can run with the partition mounted, the bad news is that for a partition of 25 TB and 100K files it took 12 hours (see the sketch after this list).
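
Combining c) and d), a minimal sketch with illustrative device and file names:

```
# check the fragmentation level; -r keeps it read-only, safe while mounted
/usr/sbin/xfs_db -r -c "frag" /dev/sdb1

# defragment a single large file instead of a whole 25 TB partition
xfs_fsr -v /fspool/disk06/0000083c/0141cc0d
```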

As I wrote these lines, I realized that we can run xfs_fsr on files in a partition where the difference between allocated size and apparent size is greater than some threshold, in order to recover the lost space. Our partitions do not really suffer from fragmentation; 2-3 extents per file on average appears to be normal. We just need a way to reclaim the inodes without deleting the files.

We have to find and eliminate the root cause related to the VFS cache,
otherwise the issue will return after the defragmentation campaign (especially when we add new nodes).
Using a static allocation size might be an option, but it could cause serious performance issues due to the large fragmentation of the files.

Any comment on the above analysis would be very helpful.

best
e.v.

Dear Andreas, below you can find the XFS Linux developer's reply to a question about how we could get back the preallocated space on XFS. As I understand it, we have to investigate how eos/xrootd closes and updates the inode metadata on XFS, and which operation blocks the XFS triggers that delete the preallocated space at the end of the file (EOF):

b) for the speculative preallocation beyond EOF of my files, as I understood, I have to run xfs_fsr to get the space back.

No, you don't need to do anything, and you *most definitely* do
*not* want to run xfs_fsr to remove it. If you really must remove
specualtive prealloc, then run:

# xfs_spaceman -c "prealloc -m 0" <mntpt>

And that will remove all specualtive preallocation that is current
on all in-memory inodes via an immediate blockgc pass.

If you just want to remove post-eof blocks on a single file, then
find out the file size with stat and truncate it to the same size.
The truncate won't change the file size, but it will remove all
blocks beyond EOF. (**EV: I found this proposition to be high risk.**)

*However*

You should not ever need to be doing this as there are several
automated triggers to remove it, all when the filesytem detects
there is no active modification of the file being performed. One
trigger is the last close of a file descriptor, another is the
periodic background blockgc worker, and another is memory reclaim
removing the inode from memory.

In all cases, these are triggers that indicate that the file is not
currently being written to, and hence the speculative prealloc is
not needed anymore and so can be removed.

So you should never have to remove it manually.
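
For reference, the stat + truncate trick described above would look roughly like this (a sketch with an illustrative path; as noted in my comment above, I consider it risky):

```
# remove post-EOF preallocated blocks by truncating the file to its own
# apparent size; size and content are unchanged, only the extra blocks
# beyond EOF are released
f=/fspool/disk06/0000083c/0141cc0d
truncate -s "$(stat -c %s "$f")" "$f"
```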

thank you in advance
best e.v.

p.s.
original thread at

https://lore.kernel.org/all/20220804225554.GD3600936@dread.disaster.area/

I think the proper way to do this is to add the preallocation to http/tpc and xrdhttp/PUT, which is missing in both, because we know the file size upfront. There is no need for speculative preallocation!

Hello Andreas
thank you for your reply. I suppose that we are soon going to have a version with the correct preallocation for http/tpc and xrdhttp/PUT?
In the meanwhile, it is not possible to get the preallocated space back from XFS (or the XFS blockgc thread is too slow); I tried various configuration hints (playing with XFS parameters and cache parameters, syncing the VFS cache, unmounting and remounting the filesystems, using xfs_spaceman -c "prealloc -m 0").
The XFS developers propose to find the apparent size with stat and "truncate" the file; I found this method intrusive and dangerous.

Another way is to run xfs_fsr to defragment the FS, which practically re-copies the files, but it is too slow (12 h per 20 TB with 100K files) and the XFS developers do not recommend this option (though I might run it for big files).

Is there a way to drain only files of a specific size from a filesystem (files > 1G)?
If the big files are re-copied to another machine with the eoscp tools, the preallocated space will be freed again.

For the moment, on the new nodes that we are going to add, I will switch off speculative preallocation with a fixed allocsize (e.g. 32k); of course this will lead to large fragmentation, but we cannot afford losing the space.
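
A sketch of the corresponding mount options, with an illustrative device and mount point:

```
# /etc/fstab: a fixed 32k allocsize replaces dynamic speculative preallocation
/dev/sdb1  /fspool/disk06  xfs  defaults,allocsize=32k  0 0
```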

any comments and hints are welcome

Best Regards

e.v

There is now a ticket open to get the content size passed from the TPC plug-in into the EOS layer. Cedric is looking into that. In theory it is easy to solve; the only complication is that this plug-in has a separate release process in XRootD.

For the existing files, you can run:

eos file convert --rewrite /eos/…/bigfile

You have to have the converter enabled (eos space config default space.converter=on)

The convert --rewrite will trigger an asynchronous rewrite of that file to a new location, deleting the previous one.

To find files bigger than 1 GB you can run:
eos-ns-inspect … scan-files --only-sizes

and then you filter the output and create async conversions using ‘eos file convert --rewrite …’.
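
Putting this together, the filtering step could look like this (a sketch: the eos-ns-inspect connection arguments are elided as above, and the assumed "<path> <size>" output format must be adapted to what your version actually prints):

```
# list files with sizes, keep those above 1 GiB, queue async rewrites
eos-ns-inspect ... scan-files --only-sizes |
awk '$2 > 1073741824 { print $1 }' |
while read -r path; do
  eos file convert --rewrite "$path"
done
```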

Otherwise you can also put your filesystems into RO mode in EOS and then do the 'stat' + 'truncate' operation as the XFS developers recommended.
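
A sketch of the EOS side, with an illustrative filesystem id (see eos fs ls for yours):

```
# set filesystem 42 read-only in EOS before the stat + truncate pass
eos fs config 42 configstatus=ro
```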

Another option would be to create a smaller space where files are written first and to configure the LRU engine to pick up large files and rewrite them into a new space based on a minimum file size policy, or to use the EOS space policy configuration to rewrite files into a new space after they have been closed. These things work only for newly arriving files …