Ceph rbd du shows usage 2-4x higher than inside VM

SteveITS

Renowned Member
Feb 6, 2025
I've noticed VMs that show much higher usage via rbd du than in the VM, for example:

Code:
NAME            PROVISIONED  USED
vm-119-disk-0       500 GiB  413 GiB
vm-122-disk-0       140 GiB  131 GiB

Inside the VM, df shows 95G and 63G used space, respectively. Both of these are Debian 12, which has fstrim.timer running, and both VMs have had ssd=1, discard=on since creation. I haven't looked into many other VMs, but from skimming the usage column, others may be similar.

I can run fstrim -av manually inside the VM which does trim (18 and 25 GB) but the usage in rbd du doesn't immediately drop...in fact it increased by 1 GB each.

It's not a problem for our Ceph capacity, but it seems unexpected to be that different, and I'm not finding much online other than discussions of running a trim. I'd expect a decent difference in usage until a trim ran.

Also, the totals from rbd don't seem to add up to an expected number...the total of the USED column from rbd du is about 70% of the PVE Ceph dashboard "Usage" and about 200% of the PVE storage entry's usage line (which shows numbers 1/3 of the dashboard/summary page, due to the 3/2 replication). Cephfs exists but has almost nothing in it.

What am I missing?

Thanks,
 
All the ones I've looked at have the standard weekly fstrim timer/Windows disk optimize, and I can run it manually. But as noted above trimming 18 GB leaves a wide gulf between 95 GB "on disk" and ~400 GB USED.

I've found a few other threads around the Internet asking a similar question. Best I've got so far (without trying resize2fs yet) is that "it can take some time" for USED to update. :-/

Given it doesn't match either usage number in PVE I'm starting to wonder if it's just wrong, or maybe not the same "usage" it sounds like.
 
The problem could be that your trim stopped working at some point for some reason (e.g. discard wasn't ticked in the VM's disk configuration), and even if fstrim tries to discard the whole free space, the underlying storage stack will only discard "new free space since the last discard". Try these steps; I've always managed to recover thin provisioning with them, as long as discard works in the whole storage stack (it should on Ceph if the VM/guest OS is properly configured).

To force a full discard:
  • Remount the mount point
  • Or, restart the VM
Then run fstrim -v. It should show the whole free space as processed, and the space should get released on Ceph, though it may take a while.

Alternatively, to avoid service interruption:
  • On Windows, try zeroing the whole disk with sdelete -z.
  • On Linux, try zeroing free space with something like dd if=/dev/zero of=/zero.file bs=1024 count=102400 or cat /dev/zero > zero.file, then rm /zero.file.
If discard works on your stack, then when the zero file is removed, the guest should issue discards to the storage and free up space on PVE's storage.
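The Linux zero-fill procedure can be sketched end to end. This is a bounded version that is safe to try (the file path and 100 MiB size are illustrative, not from the thread); a real pass would drop count= so dd runs until the filesystem is full:

```shell
# Illustrative, bounded version of the zero-fill trick; in real use, drop
# count= so dd fills all free space. /tmp/zero.file is just an example path.
dd if=/dev/zero of=/tmp/zero.file bs=1M count=100 2>/dev/null
sync                              # make sure the zeros hit the backing storage
size=$(wc -c < /tmp/zero.file)    # 100 MiB of zeros written
rm /tmp/zero.file                 # deleting the file marks its extents as free
# fstrim -v /                     # then (as root) discard the freed extents
```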

If this procedure works and your VMs have QEMU Agent, you can also loop through all VMs and issue qm agent VMID fstrim to trim all mounted disks.
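That loop might look like the following. It's a dry-run sketch built around sample `qm list` output (the sample rows and VM names are made up; on a real PVE node you would pipe `qm list` in directly and drop the `echo`):

```shell
# Dry-run sketch: pick the running VMIDs out of (sample) 'qm list' output
# and build the 'qm agent VMID fstrim' commands. On a real node, replace
# the sample text with the actual 'qm list' output.
sample='      VMID NAME       STATUS     MEM(MB)    BOOTDISK(GB) PID
       119 web01      running    8192       500.00       1234
       122 db01       stopped    4096       140.00       0'
running=$(printf '%s\n' "$sample" | awk 'NR > 1 && $3 == "running" {print $1}')
for vmid in $running; do
    echo "qm agent $vmid fstrim"    # drop the echo to actually trim
done
```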

On RBD there's also rbd sparsify --image poolName/vm-ID-disk-X, which will try to re-sparse the image. This is quite fast but usually less effective, especially if the filesystem in the disk is fragmented (it needs bigger contiguous free areas to be able to trim them).
 
577 PGs.
PVE Datacenter "Ceph" view usage shows: 4.48 TiB of 18.38 TiB
Each node's storage entry shows: Usage: 26.41% (1.63 TB of 6.18 TB)
(which is 1/3 of the Datacenter view)

Code:
# ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    18 TiB  14 TiB  4.5 TiB   4.5 TiB      24.37
TOTAL  18 TiB  14 TiB  4.5 TiB   4.5 TiB      24.37

Code:
# rbd -p ceph_cluster2 du
NAME            PROVISIONED  USED
(...)
<TOTAL>             9.4 TiB  3.3 TiB

The 3.3T looks like a valid total for the column, but it doesn't match up with the other numbers, which at least roughly agree with each other (1.6 × 3 ≈ 4.5; edit: and 6.1 × 3 ≈ 18). Hence my confusion.
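As a quick sanity check of that ×3 replication math, using the figures quoted above:

```shell
# Multiply the per-node numbers by the replication factor (size=3) and
# compare against the cluster-wide figures quoted above.
awk 'BEGIN {
    printf "used:  1.63 x 3 = %.2f  (dashboard shows 4.48)\n", 1.63 * 3
    printf "total: 6.18 x 3 = %.2f  (raw total is 18.38)\n", 6.18 * 3
}'
```

Note the per-node storage line is labeled TB while the dashboard shows TiB, which may account for some of the residual mismatch.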

Ceph usage does go up and down over time, which I had always assumed was the trimming. For instance per PVE graphs Ceph usage dropped from 1.68T to 1.59T from Sunday night to Monday.

We restart the VMs relatively regularly for updates; in particular the one with the ~300 GB difference was last restarted 15 days ago. For most of these we created new Debian 12 VMs to migrate them, and our process was to check SSD+Discard during VM creation. So it should have been enabled from the start on all those VMs.

I picked one, and creating 11 GB of zero files, deleting, and trimming (14 GB trimmed) yields no immediate change from rbd du:
Code:
# fstrim -av
/boot: 0 B (0 bytes) trimmed
/: 14.6 GiB (15699951616 bytes) trimmed

Code:
NAME            PROVISIONED  USED
vm-1002-disk-0       43 GiB   37 GiB
 
One of the smaller VMs is off so I ran rbd sparsify and it worked. The syntax I found was a bit different, rbd sparsify --pool poolname diskname, but it did immediately reduce the rbd du usage from 11 GB to 8.8 GB, and only took a few seconds to run.

Do you happen to know if it can be run on a running VM? I didn't find much but it sounded like that might not be a good idea...just means I'd need to shut them down after hours.

Edit:
after running on 1002 above:
Code:
vm-1002-disk-0       43 GiB   12 GiB

Edit 2: script I didn't try, to loop through all disks: https://gist.github.com/p-v-a/67401eb8bcdbf971092276b3e1004e82
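A simple version of such a loop might amount to this (my own sketch, not the linked gist's code; shown dry-run with a sample image list, and `ceph_cluster2` is the pool name from earlier in this thread):

```shell
# Dry-run sketch of looping 'rbd sparsify' over every image in a pool.
# On a real cluster, replace the sample list with:  rbd -p "$pool" ls
pool=ceph_cluster2
images='vm-119-disk-0
vm-122-disk-0'
cmds=$(for img in $images; do echo "rbd sparsify $pool/$img"; done)
printf '%s\n' "$cmds"    # remove the echo indirection to actually run them
```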
 