Ceph rbd du shows usage 2-4x higher than inside VM

SteveITS

Renowned Member
Feb 6, 2025
I've noticed VMs that show much higher usage via rbd du than in the VM, for example:

Code:
NAME            PROVISIONED  USED
vm-119-disk-0       500 GiB  413 GiB
vm-122-disk-0       140 GiB  131 GiB

Inside the VM, df shows 95G and 63G used space, respectively. Both of these are Debian 12, which has fstrim.timer running, and the VMs have had ssd=1, discard=on since creation. I haven't looked into many other VMs, but from skimming the usage column, others may be similar.

I can run fstrim -av manually inside the VM which does trim (18 and 25 GB) but the usage in rbd du doesn't immediately drop...in fact it increased by 1 GB each.

It's not a problem for our Ceph capacity, but it seems unexpected to be that different, and I'm not finding much online other than discussions of running a trim. I'd expect a decent difference in usage until a trim ran.

Also, the totals from rbd don't seem to add up to an expected number...the total of the USED column from rbd du is about 70% of the PVE Ceph dashboard "Usage" and about 200% of the PVE storage entry's usage line (which shows numbers 1/3 of the dashboard/summary page, due to the 3/2 replication). Cephfs exists but has almost nothing in it.

What am I missing?

Thanks,
 
All the ones I've looked at have the standard weekly fstrim timer/Windows disk optimize, and I can run it manually. But as noted above trimming 18 GB leaves a wide gulf between 95 GB "on disk" and ~400 GB USED.

I've found a few other threads around the Internet asking a similar question. Best I've got so far (without trying resize2fs yet) is that "it can take some time" for USED to update. :-/

Given it doesn't match either usage number in PVE I'm starting to wonder if it's just wrong, or maybe not the same "usage" it sounds like.
 
The problem could be that your "trim" stopped working at some point for some reason (e.g. discard wasn't ticked in the VM's disk configuration), and even if fstrim tries to discard the whole free space, the underlying storage stack will only discard "new free space since the last discard". Try these steps; I've always managed to recover thin provisioning with them, as long as discard works in the whole storage stack (it should on Ceph if the VM/guest OS is properly configured).

To force a full discard:
  • Remount the mount point
  • Or, restart the VM
Then run fstrim -v. It should show the whole free space as processed, and the space should get released on Ceph, though it may take a while.

Alternatively, to avoid service interruption:
  • On Windows, try zeroing the whole disk with sdelete -z.
  • On Linux, try zeroing free space with something like dd if=/dev/zero of=/zero.file bs=1M (note that a small count such as bs=1024 count=102400 only writes ~100 MiB; to zero all free space, let dd run until the filesystem fills) or cat /dev/zero > zero.file, then rm /zero.file.
If discard works on your stack, removing the "zerofile" (or a subsequent fstrim) should issue discards to the storage and free the space on PVE's storage.

If this procedure works and your VMs have QEMU Agent, you can also loop through all VMs and issue qm agent VMID fstrim to trim all mounted disks.
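A minimal sketch of that loop, assuming a PVE node shell (the DRY_RUN guard and the awk column number are my assumptions; check them against your qm list output):

```shell
# Hedged sketch: ask the QEMU guest agent to fstrim every running VM.
# DRY_RUN=1 (the default) only prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

trim_vm() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: qm agent $1 fstrim"
    else
        qm agent "$1" fstrim
    fi
}

# Column 3 of "qm list" is the status; adjust if your output differs.
for vmid in $(qm list 2>/dev/null | awk 'NR>1 && $3 == "running" {print $1}' || true); do
    trim_vm "$vmid"
done
```

Set DRY_RUN=0 once the printed commands look right.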

On RBD there's also rbd sparsify --image poolName/vm-ID-disk-X, which will try to re-sparsify the image. This is quite fast, but usually less effective, especially if the filesystem in the disk is fragmented (it needs bigger contiguous free areas to be able to trim them).
 
577 PGs.
PVE Datacenter "Ceph" view usage shows: 4.48 TiB of 18.38 TiB
Each node's storage entry shows: Usage: 26.41% (1.63 TB of 6.18 TB)
(which is 1/3 of the Datacenter view)
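Incidentally, the per-node figure is consistent with the Datacenter figure divided by the 3x replication, plus a unit conversion (assuming the Datacenter view reports binary TiB and the storage entry decimal TB, which is my reading of the numbers):

```shell
# 4.48 TiB of 3x-replicated raw usage, per copy, converted to decimal TB.
awk 'BEGIN { printf "%.2f TB\n", 4.48 * 1024^4 / 1e12 / 3 }'
# -> 1.64 TB, close to the 1.63 TB the storage entry shows
```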

Code:
# ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    18 TiB  14 TiB  4.5 TiB   4.5 TiB      24.37
TOTAL  18 TiB  14 TiB  4.5 TiB   4.5 TiB      24.37

Code:
# rbd -p ceph_cluster2 du
NAME            PROVISIONED  USED
(...)
<TOTAL>             9.4 TiB  3.3 TiB

The 3.3T looks like a valid total of the column but doesn't match up with the other numbers, which are at least roughly 1.6*3 = 4.5 (edit: and 6.1*3 = 18). Hence my confusion.

Ceph usage does go up and down over time, which I had always assumed was the trimming. For instance per PVE graphs Ceph usage dropped from 1.68T to 1.59T from Sunday night to Monday.

We restart the VMs relatively regularly for updates; in particular the one with the ~300 GB difference was last restarted 15 days ago. For most of these we created new Debian 12 VMs to migrate them, and our process was to check SSD+Discard during VM creation. So it should have been enabled from the start on all those VMs.

I picked one, and creating 11 GB of zero files, deleting, and trimming (14 GB trimmed) yields no immediate change from rbd du:
Code:
# fstrim -av
/boot: 0 B (0 bytes) trimmed
/: 14.6 GiB (15699951616 bytes) trimmed

Code:
NAME            PROVISIONED  USED
vm-1002-disk-0       43 GiB   37 GiB
 
One of the smaller VMs is off so I ran rbd sparsify and it worked. The syntax I found was a bit different, rbd sparsify --pool poolname diskname, but it did immediately reduce the rbd du usage from 11 GB to 8.8 GB, and only took a few seconds to run.

Do you happen to know if it can be run on a running VM? I didn't find much but it sounded like that might not be a good idea...just means I'd need to shut them down after hours.

Edit:
after running on 1002 above:
Code:
vm-1002-disk-0       43 GiB   12 GiB

Edit 2: script I didn't try, to loop through all disks: https://gist.github.com/p-v-a/67401eb8bcdbf971092276b3e1004e82
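For reference, a minimal sketch of such a loop (dry-run by default; the pool name is the one from this thread, adjust to yours):

```shell
# Hedged sketch: run "rbd sparsify" over every image in a pool.
POOL=${POOL:-ceph_cluster2}
DRY_RUN=${DRY_RUN:-1}

sparsify_image() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: rbd sparsify $POOL/$1"
    else
        rbd sparsify "$POOL/$1"
    fi
}

for img in $(rbd -p "$POOL" ls 2>/dev/null || true); do
    sparsify_image "$img"
done
```

Given the caveats below about running sparsify on live VMs, leaving DRY_RUN=1 and reviewing the list first seems prudent.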
 
Yes, it can be run on a live VM in the sense that the command runs, i.e. it doesn't check whether the given RBD image has an owner/lock, so we can suppose it's safe. As you mention, the Ceph docs don't explicitly say whether it can be run live or not. That said, I use it from time to time on lab and training environments, but the last time I had to run it on some production VMs I did stop them, just in case.

If sparsify works after zeroing, it means that there really is empty space from the guest OS filesystem's point of view, but it was still stored as zeros instead of sparse space in the disk image: something is dropping those discards from fstrim on their way to Ceph. If there were still data in the disk image, sparsify would have had no effect.
 
> it means that there's really empty space

That sounds right of course, however ~14 hours later Ceph "usage" hasn't dropped 400 GB...in fact it went up slightly. Also, I'm not sure what would be out of sync with discard, considering it's enabled in the VMs and trim runs automatically in Linux and Windows. Implies a bug somewhere between the two?

And let's say I erase all drives and sparsify literally everything out, Ceph usage ought to be negative 1.3 TB which makes no sense.

At this point I've progressed to a different question, "why is 'rbd du' USED different than Ceph usage"? It is 2.9T, after yesterday's sparsify runs, and not the 4.5T or 1.6T numbers in PVE. Starting to think I'm chasing my tail...
 
Compression doesn't seem to be enabled or used.

Code:
--- POOLS ---
POOL             ID  PGS   STORED   (DATA)  (OMAP)  OBJECTS     USED   (DATA)  (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
.mgr              1    1  217 MiB  217 MiB     0 B       55  652 MiB  652 MiB     0 B      0    4.1 TiB            N/A          N/A    N/A         0 B          0 B
ceph_cluster2     4  512  1.5 TiB  1.5 TiB  28 KiB  673.64k  4.5 TiB  4.5 TiB  84 KiB  26.59    4.1 TiB            N/A          N/A    N/A         0 B          0 B

The default for bluestore_compression_mode is "not set" so presumably "none," currently?

Edit:
the number of OBJECTS has decreased by about 80,000 since the last time I checked, before some sparsifying, but USED and %USED are the same or increased slightly since yesterday. It seems like sparsify reduces OBJECTS but not the actual storage usage.
 
I asked on the ceph-users@ceph.io list and two answers so far were to not run it on an active disk, and to not run it at all...from a Croit email account, "I'd advise against using sparsify, especially on any mapped RBD (running VM). We had a support case once where this operation on a live VM caused an OSD lockup, halting the operation for other clients."

So, seems safest to just ignore all this.

Notably one of the answers said, "rbd du estimates size cheaply by counting the number of objects and multiplying that count by the set object size. Therefore it diverges from the size you see of the guest's filesystem." So perhaps it is just counting empty objects.
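If that explanation is right, the estimate should be reproducible as object count times object size (4 MiB by default; the rbd info / rados ls pipeline in the comments is my assumption about how to count an image's objects, not something from the list answer):

```shell
# Hedged sketch: "rbd du"-style USED as object count x object size.
OBJ_MIB=4   # default RBD object size is 4 MiB

estimate_mib() { echo $(( $1 * OBJ_MIB )); }

# On a real cluster (not run here), something like:
#   prefix=$(rbd info ceph_cluster2/vm-1002-disk-0 | awk '/block_name_prefix/ {print $2}')
#   count=$(rados -p ceph_cluster2 ls | grep -c "^$prefix")
#   estimate_mib "$count"     # MiB; compare with "rbd du" USED
estimate_mib 3072   # e.g. 3072 objects -> 12288 MiB = 12 GiB
```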

I know that ext4 had problems with discard in the past (not about fragmentation, but discard not always working).
Personally, I'm using XFS in production, and I have never had this problem (on 4000 VMs).
That mailing list thread mentioned "EXT4 uses an in-memory bitmap that does not trim blocks if they haven't been overwritten before." It also linked to https://forum.proxmox.com/threads/help-with-trim-on-virtio-scsi-single.123819/ which in turn links to https://serverfault.com/questions/1...return-same-value-unlike-ext4/1113129#1113129 stating EXT4 trims on VM reboot...which doesn't seem to be the case from what I can see, or at least it doesn't change "rbd du" USED. The threads also mention XFS. But from what little I can tell it seems like TRIM works, it just doesn't change "rbd du".

It will be interesting/odd if rbd du USED ever gets higher than the cluster capacity...