Proxmox Ceph: RBD running out of space, usage does not match VM disk sizes

Rainerle

Hi,
it seems I am running out of space on the OSDs of a five-node hyperconverged Proxmox Ceph cluster:

Code:
root@proxmox07:~# rbd du --pool ceph-proxmox-VMs
NAME                            PROVISIONED  USED
vm-100-disk-0                         1 GiB     1 GiB
vm-100-disk-1                         1 GiB  1020 MiB
vm-101-disk-0                         1 GiB     1 GiB
vm-101-disk-1                         1 GiB  1016 MiB
vm-102-disk-0                         1 GiB     1 GiB
vm-102-disk-1                         1 GiB  1020 MiB
vm-103-disk-0                         1 GiB     1 GiB
...
vm-158-disk-2                        32 GiB    32 GiB
vm-158-disk-3                        32 GiB    32 GiB
vm-159-disk-0                       128 MiB     4 MiB
vm-159-disk-1                         2 GiB   2.0 GiB
vm-160-disk-0                       128 MiB     4 MiB
vm-160-disk-1                         2 GiB   2.0 GiB
vm-161-disk-0@2022-05-05_04:15        8 GiB     8 GiB
vm-161-disk-0@2022-05-06_04:15        8 GiB     8 GiB
vm-161-disk-0@2022-05-08_04:15        8 GiB       0 B
vm-161-disk-0@2022-05-09_04:15        8 GiB       0 B
vm-161-disk-0@2022-05-10_04:15        8 GiB       0 B
vm-161-disk-0@2022-05-11_04:15        8 GiB       0 B
vm-161-disk-0@2022-05-12_04:15        8 GiB       0 B
vm-161-disk-0@2022-05-13_04:15        8 GiB       0 B
vm-161-disk-0@2022-05-15_04:15        8 GiB       0 B
vm-161-disk-0                         8 GiB       0 B
vm-161-disk-1@2022-05-05_04:15       16 GiB   4.6 GiB
vm-161-disk-1@2022-05-06_04:15       16 GiB   5.6 GiB
vm-161-disk-1@2022-05-08_04:15       16 GiB   4.5 GiB
vm-161-disk-1@2022-05-09_04:15       16 GiB   4.5 GiB
vm-161-disk-1@2022-05-10_04:15       16 GiB   3.1 GiB
vm-161-disk-1@2022-05-11_04:15       16 GiB   2.9 GiB
vm-161-disk-1@2022-05-12_04:15       16 GiB   2.4 GiB
vm-161-disk-1@2022-05-13_04:15       16 GiB   2.8 GiB
vm-161-disk-1@2022-05-15_04:15       16 GiB   4.3 GiB
vm-161-disk-1                        16 GiB   3.5 GiB
vm-162-disk-0@2022-05-05_05:15        8 GiB     8 GiB
vm-162-disk-0@2022-05-06_05:15        8 GiB     8 GiB
vm-162-disk-0@2022-05-08_05:15        8 GiB       0 B
vm-162-disk-0@2022-05-09_05:15        8 GiB       0 B
vm-162-disk-0@2022-05-10_05:15        8 GiB       0 B
vm-162-disk-0@2022-05-11_05:15        8 GiB       0 B
vm-162-disk-0@2022-05-12_05:15        8 GiB       0 B
vm-162-disk-0@2022-05-13_05:15        8 GiB       0 B
vm-162-disk-0@2022-05-15_05:15        8 GiB       0 B
vm-162-disk-0                         8 GiB       0 B
vm-162-disk-1@2022-05-05_05:15       16 GiB   4.3 GiB
vm-162-disk-1@2022-05-06_05:15       16 GiB   5.4 GiB
vm-162-disk-1@2022-05-08_05:15       16 GiB   4.5 GiB
vm-162-disk-1@2022-05-09_05:15       16 GiB   4.6 GiB
vm-162-disk-1@2022-05-10_05:15       16 GiB   2.9 GiB
vm-162-disk-1@2022-05-11_05:15       16 GiB   2.7 GiB
vm-162-disk-1@2022-05-12_05:15       16 GiB   2.7 GiB
vm-162-disk-1@2022-05-13_05:15       16 GiB   2.6 GiB
vm-162-disk-1@2022-05-15_05:15       16 GiB   4.2 GiB
vm-162-disk-1                        16 GiB   3.5 GiB
...
vm-519-disk-1                        48 GiB    48 GiB
vm-520-disk-0                        16 GiB    16 GiB
vm-520-disk-1                        32 GiB    32 GiB
vm-521-disk-0                        16 GiB    16 GiB
vm-521-disk-1                        32 GiB    32 GiB
vm-522-disk-0                        16 GiB    16 GiB
vm-522-disk-1                        32 GiB    32 GiB
vm-523-disk-0                        16 GiB    16 GiB
vm-523-disk-1                        32 GiB    32 GiB
<TOTAL>                             4.8 TiB   3.8 TiB
root@proxmox07:~#

So that is 4.8 TiB of provisioned VM disks with 3.8 TiB actually used. Times three, as we keep 3 copies on Ceph, that comes out to 14.4 TiB provisioned / 11.4 TiB used.
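
As a cross-check, the used figure can be pulled straight from the JSON output of rbd du and multiplied by the pool's replica count (a quick sketch; assumes jq is installed, and the JSON field names may vary slightly between Ceph releases):
Code:
# total bytes actually used by all images and their snapshots in the pool
rbd du --pool ceph-proxmox-VMs --format json | jq '.total_used_size'
# replica count of the pool (3 here) -- expected raw usage = used * size
ceph osd pool get ceph-proxmox-VMs size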

But on the rados usage it looks like this:
Code:
root@proxmox07:~# rados df
POOL_NAME                 USED   OBJECTS    CLONES     COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED       RD_OPS       RD       WR_OPS       WR  USED COMPR  UNDER COMPR
ceph-proxmox-VMs        49 TiB  13031234  11844514   39093702                   0        0         0  74678059494  1.5 PiB  51951708296  1.6 PiB     4.8 TiB       15 TiB
cephfs_data             18 TiB  72881660  29035966  218644980                   0        0         0    478582658   39 TiB    196941115  6.3 TiB      11 TiB       23 TiB
cephfs_metadata         12 GiB   4305902         0   12917706                   0        0         0    287980760  792 GiB    451931553  113 TiB     2.2 GiB      4.3 GiB
device_health_metrics   36 MiB        15         0         45                   0        0         0        12189   92 MiB         8676   35 MiB         0 B          0 B
nfs-ganesha            5.2 MiB        35         0        105                   0        0         0       292586  147 MiB          394  399 KiB         0 B          0 B

total_objects    90218846
total_used       69 TiB
total_avail      18 TiB
total_space      87 TiB
root@proxmox07:~#

49TiB!!! What is going on here?!?!?

Ceph RBD trash is empty by the way...

Code:
root@proxmox07:~# rbd trash ls ceph-proxmox-VMs
root@proxmox07:~#
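
Snapshots are another place the space could be hiding, since RBD only frees snapshot data once the snapshot itself is deleted -- a quick sketch to list them per image:
Code:
# list the snapshots of every image in the pool
for img in $(rbd ls ceph-proxmox-VMs); do
  echo "== $img"
  rbd snap ls "ceph-proxmox-VMs/$img"
done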

Best regards
Rainer
 
I moved all VM disks from the Ceph RBD pool to a host-local ZFS mirror, and after migrating it looks like this:

Code:
root@proxmox07:~# rbd disk-usage --pool ceph-proxmox-VMs
root@proxmox07:~# ceph df detail
--- RAW STORAGE ---
CLASS    SIZE   AVAIL    USED  RAW USED  %RAW USED
ssd    87 TiB  19 TiB  68 TiB    68 TiB      78.39
TOTAL  87 TiB  19 TiB  68 TiB    68 TiB      78.39

--- POOLS ---
POOL                   ID   PGS   STORED   (DATA)   (OMAP)  OBJECTS     USED   (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
cephfs_data             2   256   11 TiB   11 TiB      0 B   72.90M   18 TiB   18 TiB      0 B  80.66    1.5 TiB            N/A          N/A    N/A      11 TiB       23 TiB
cephfs_metadata         3   128  4.8 GiB  1.3 GiB  3.4 GiB    4.31M   12 GiB  2.0 GiB   10 GiB   0.27    1.5 TiB            N/A          N/A    N/A     2.0 GiB      3.9 GiB
ceph-proxmox-VMs        4  1280   45 TiB   45 TiB   56 MiB   12.27M   49 TiB   49 TiB  168 MiB  91.73    1.5 TiB            N/A          N/A    N/A     3.6 TiB       13 TiB
device_health_metrics   6     1   12 MiB      0 B   12 MiB       15   36 MiB      0 B   36 MiB      0    1.5 TiB            N/A          N/A    N/A         0 B          0 B
nfs-ganesha             9     1  1.6 MiB   20 KiB  1.6 MiB       35  5.2 MiB  408 KiB  4.8 MiB      0    1.5 TiB            N/A          N/A    N/A         0 B          0 B
root@proxmox07:~#

No disks left, but still 45 TiB used?!?!

What is wrong here???
 
Looking at
https://forum.proxmox.com/threads/ceph-storage-usage-confusion.94673/

So here are the results of the four counts:
Code:
root@proxmox07:~# rbd ls ceph-proxmox-VMs | wc -l
0
root@proxmox07:~# rados ls -p ceph-proxmox-VMs | grep rbd_data | sort | awk -F. '{ print $2 }' |uniq -c |sort -n |wc -l
16
root@proxmox07:~# rados ls -p ceph-proxmox-VMs | grep rbd_data | wc -l
420230
root@proxmox07:~# rados -p ceph-proxmox-VMs ls | grep -v rbd_data
rbd_directory
rbd_children
rbd_info
rbd_trash
root@proxmox07:~#
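
So rbd ls sees no images at all, while 420230 rbd_data objects from 16 different image prefixes are still lying around. To double-check that none of those prefixes belongs to a registered image, the rbd_directory omap (which maps image names to ids and back) can be dumped -- a sketch:
Code:
# distinct image prefixes that still own data objects
rados ls -p ceph-proxmox-VMs | grep rbd_data | awk -F. '{ print $2 }' | sort -u
# rbd_directory maps image names <-> ids; an empty omap means no images are registered
rados -p ceph-proxmox-VMs listomapvals rbd_directory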
 
Now trying to get rid of that broken RBD pool:
Code:
root@proxmox07:~# rados -p ceph-proxmox-VMs ls | head -10
rbd_data.7763f760a5c7b1.00000000000013df
rbd_data.76d8c7e94f5a3a.00000000000102c0
rbd_data.7763f760a5c7b1.0000000000000105
rbd_data.f02a4916183ba2.0000000000013e45
rbd_data.48639154204b80.0000000000004aea
rbd_data.f02a4916183ba2.000000000001f9eb
rbd_data.48639154204b80.000000000000ec2c
rbd_data.f02a4916183ba2.0000000000018ca7
rbd_data.48639154204b80.000000000001a395
rbd_data.48639154204b80.000000000001625d
^C
root@proxmox07:~# rados -p ceph-proxmox-VMs rm --force-full rbd_data.7763f760a5c7b1.00000000000013df
error removing ceph-proxmox-VMs>rbd_data.7763f760a5c7b1.00000000000013df: (2) No such file or directory
root@proxmox07:~# rados -p ceph-proxmox-VMs stat rbd_data.7763f760a5c7b1.00000000000013df
 error stat-ing ceph-proxmox-VMs/rbd_data.7763f760a5c7b1.00000000000013df: (2) No such file or directory
root@proxmox07:~# rados -p ceph-proxmox-VMs lssnap
0 snaps
root@proxmox07:~#
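
lssnap only lists pool-level snapshots, though -- RBD uses self-managed snapshots, so any leftover snapshot clones (the huge CLONES column in rados df above points in that direction) have to be checked per object. A sketch against one of the leftover objects; if only clones remain, that would also explain why rm/stat on the head object return ENOENT:
Code:
# show which snapshot clones of this object still exist and hold data,
# even though the head object itself no longer does
rados -p ceph-proxmox-VMs listsnaps rbd_data.7763f760a5c7b1.00000000000013df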

Is Ceph really production ready?
 
Hi Rainerle ... we ran into the same problem, but have not fixed it yet.
I saw that we had forgotten to enable trim in our KVM VMs.
After a manual "fstrim -av" we got almost 1.5 TB of data back.

But we are now running into the problem that, with the journaling feature activated, most of our Windows VMs do not boot after a shutdown.
The error: timed out connecting qmp socket

We had to disable journaling on the images and everything is fine ... 2 sleepless nights.

We will test again next week
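
In case it helps anyone, this is roughly what the trim setup looks like (the VM ID, bus and storage/volume names below are just placeholders, adjust them to your own config):
Code:
# enable discard on the virtual disk so guest trims actually reach Ceph
# (re-specify your existing volume plus discard=on; 101/scsi0 are placeholders)
qm set 101 --scsi0 ceph-proxmox-VMs:vm-101-disk-0,discard=on
# then inside a Linux guest:
fstrim -av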
 
We had RBD image snapshots which we deleted, but rados objects relating to those snapshots were left behind.

We moved all VM disk images to local disk storage and had to delete the RBD pool, which caused further problems. Do not delete stuff on Ceph - just add disks... ;-)
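
For reference, destroying the pool boils down to the commands below (destructive, so only once every image has been migrated away; pool deletion also has to be allowed on the monitors first):
Code:
# pool deletion is disabled by default, allow it on the monitors
ceph config set mon mon_allow_pool_delete true
# removes the pool and ALL objects in it
ceph osd pool delete ceph-proxmox-VMs ceph-proxmox-VMs --yes-i-really-really-mean-it
# or via the Proxmox tooling (newer pveceph versions)
pveceph pool destroy ceph-proxmox-VMs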
 
