Proxmox Ceph: RBD running out of space, usage does not match VM disk sizes

Rainerle

Hi,
it seems I am running out of space on the OSDs of a five-node hyperconverged Proxmox Ceph cluster:

Code:
root@proxmox07:~# rbd du --pool ceph-proxmox-VMs
NAME                            PROVISIONED  USED
vm-100-disk-0                         1 GiB     1 GiB
vm-100-disk-1                         1 GiB  1020 MiB
vm-101-disk-0                         1 GiB     1 GiB
vm-101-disk-1                         1 GiB  1016 MiB
vm-102-disk-0                         1 GiB     1 GiB
vm-102-disk-1                         1 GiB  1020 MiB
vm-103-disk-0                         1 GiB     1 GiB
...
vm-158-disk-2                        32 GiB    32 GiB
vm-158-disk-3                        32 GiB    32 GiB
vm-159-disk-0                       128 MiB     4 MiB
vm-159-disk-1                         2 GiB   2.0 GiB
vm-160-disk-0                       128 MiB     4 MiB
vm-160-disk-1                         2 GiB   2.0 GiB
vm-161-disk-0@2022-05-05_04:15        8 GiB     8 GiB
vm-161-disk-0@2022-05-06_04:15        8 GiB     8 GiB
vm-161-disk-0@2022-05-08_04:15        8 GiB       0 B
vm-161-disk-0@2022-05-09_04:15        8 GiB       0 B
vm-161-disk-0@2022-05-10_04:15        8 GiB       0 B
vm-161-disk-0@2022-05-11_04:15        8 GiB       0 B
vm-161-disk-0@2022-05-12_04:15        8 GiB       0 B
vm-161-disk-0@2022-05-13_04:15        8 GiB       0 B
vm-161-disk-0@2022-05-15_04:15        8 GiB       0 B
vm-161-disk-0                         8 GiB       0 B
vm-161-disk-1@2022-05-05_04:15       16 GiB   4.6 GiB
vm-161-disk-1@2022-05-06_04:15       16 GiB   5.6 GiB
vm-161-disk-1@2022-05-08_04:15       16 GiB   4.5 GiB
vm-161-disk-1@2022-05-09_04:15       16 GiB   4.5 GiB
vm-161-disk-1@2022-05-10_04:15       16 GiB   3.1 GiB
vm-161-disk-1@2022-05-11_04:15       16 GiB   2.9 GiB
vm-161-disk-1@2022-05-12_04:15       16 GiB   2.4 GiB
vm-161-disk-1@2022-05-13_04:15       16 GiB   2.8 GiB
vm-161-disk-1@2022-05-15_04:15       16 GiB   4.3 GiB
vm-161-disk-1                        16 GiB   3.5 GiB
vm-162-disk-0@2022-05-05_05:15        8 GiB     8 GiB
vm-162-disk-0@2022-05-06_05:15        8 GiB     8 GiB
vm-162-disk-0@2022-05-08_05:15        8 GiB       0 B
vm-162-disk-0@2022-05-09_05:15        8 GiB       0 B
vm-162-disk-0@2022-05-10_05:15        8 GiB       0 B
vm-162-disk-0@2022-05-11_05:15        8 GiB       0 B
vm-162-disk-0@2022-05-12_05:15        8 GiB       0 B
vm-162-disk-0@2022-05-13_05:15        8 GiB       0 B
vm-162-disk-0@2022-05-15_05:15        8 GiB       0 B
vm-162-disk-0                         8 GiB       0 B
vm-162-disk-1@2022-05-05_05:15       16 GiB   4.3 GiB
vm-162-disk-1@2022-05-06_05:15       16 GiB   5.4 GiB
vm-162-disk-1@2022-05-08_05:15       16 GiB   4.5 GiB
vm-162-disk-1@2022-05-09_05:15       16 GiB   4.6 GiB
vm-162-disk-1@2022-05-10_05:15       16 GiB   2.9 GiB
vm-162-disk-1@2022-05-11_05:15       16 GiB   2.7 GiB
vm-162-disk-1@2022-05-12_05:15       16 GiB   2.7 GiB
vm-162-disk-1@2022-05-13_05:15       16 GiB   2.6 GiB
vm-162-disk-1@2022-05-15_05:15       16 GiB   4.2 GiB
vm-162-disk-1                        16 GiB   3.5 GiB
...
vm-519-disk-1                        48 GiB    48 GiB
vm-520-disk-0                        16 GiB    16 GiB
vm-520-disk-1                        32 GiB    32 GiB
vm-521-disk-0                        16 GiB    16 GiB
vm-521-disk-1                        32 GiB    32 GiB
vm-522-disk-0                        16 GiB    16 GiB
vm-522-disk-1                        32 GiB    32 GiB
vm-523-disk-0                        16 GiB    16 GiB
vm-523-disk-1                        32 GiB    32 GiB
<TOTAL>                             4.8 TiB   3.8 TiB
root@proxmox07:~#

So that is 4.8 TiB of provisioned VM disks with 3.8 TiB actually used. Times three, as we keep 3 copies on Ceph, that comes out to 14.4 TiB provisioned / 11.4 TiB used.
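
As a cross-check, the used figure can be pulled straight from the JSON output of rbd du and multiplied by the pool's replica count (a quick sketch; assumes jq is installed, and the JSON field names may vary slightly between Ceph releases):
Code:
# total bytes actually used by all images and their snapshots in the pool
rbd du --pool ceph-proxmox-VMs --format json | jq '.total_used_size'
# replica count of the pool (3 here) -- expected raw usage = used * size
ceph osd pool get ceph-proxmox-VMs size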

But on the rados usage it looks like this:
Code:
root@proxmox07:~# rados df
POOL_NAME                 USED   OBJECTS    CLONES     COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED       RD_OPS       RD       WR_OPS       WR  USED COMPR  UNDER COMPR
ceph-proxmox-VMs        49 TiB  13031234  11844514   39093702                   0        0         0  74678059494  1.5 PiB  51951708296  1.6 PiB     4.8 TiB       15 TiB
cephfs_data             18 TiB  72881660  29035966  218644980                   0        0         0    478582658   39 TiB    196941115  6.3 TiB      11 TiB       23 TiB
cephfs_metadata         12 GiB   4305902         0   12917706                   0        0         0    287980760  792 GiB    451931553  113 TiB     2.2 GiB      4.3 GiB
device_health_metrics   36 MiB        15         0         45                   0        0         0        12189   92 MiB         8676   35 MiB         0 B          0 B
nfs-ganesha            5.2 MiB        35         0        105                   0        0         0       292586  147 MiB          394  399 KiB         0 B          0 B

total_objects    90218846
total_used       69 TiB
total_avail      18 TiB
total_space      87 TiB
root@proxmox07:~#

49TiB!!! What is going on here?!?!?

Ceph RBD trash is empty by the way...

Code:
root@proxmox07:~# rbd trash ls ceph-proxmox-VMs
root@proxmox07:~#
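
Snapshots are another place the space could be hiding, since RBD only frees snapshot data once the snapshot itself is deleted -- a quick sketch to list them per image:
Code:
# list the snapshots of every image in the pool
for img in $(rbd ls ceph-proxmox-VMs); do
  echo "== $img"
  rbd snap ls "ceph-proxmox-VMs/$img"
done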

Best regards
Rainer
 
I moved all VM disks from the Ceph RBD pool to a host-local ZFS mirror, and after migrating it looks like this:

Code:
root@proxmox07:~# rbd disk-usage --pool ceph-proxmox-VMs
root@proxmox07:~# ceph df detail
--- RAW STORAGE ---
CLASS    SIZE   AVAIL    USED  RAW USED  %RAW USED
ssd    87 TiB  19 TiB  68 TiB    68 TiB      78.39
TOTAL  87 TiB  19 TiB  68 TiB    68 TiB      78.39

--- POOLS ---
POOL                   ID   PGS   STORED   (DATA)   (OMAP)  OBJECTS     USED   (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
cephfs_data             2   256   11 TiB   11 TiB      0 B   72.90M   18 TiB   18 TiB      0 B  80.66    1.5 TiB            N/A          N/A    N/A      11 TiB       23 TiB
cephfs_metadata         3   128  4.8 GiB  1.3 GiB  3.4 GiB    4.31M   12 GiB  2.0 GiB   10 GiB   0.27    1.5 TiB            N/A          N/A    N/A     2.0 GiB      3.9 GiB
ceph-proxmox-VMs        4  1280   45 TiB   45 TiB   56 MiB   12.27M   49 TiB   49 TiB  168 MiB  91.73    1.5 TiB            N/A          N/A    N/A     3.6 TiB       13 TiB
device_health_metrics   6     1   12 MiB      0 B   12 MiB       15   36 MiB      0 B   36 MiB      0    1.5 TiB            N/A          N/A    N/A         0 B          0 B
nfs-ganesha             9     1  1.6 MiB   20 KiB  1.6 MiB       35  5.2 MiB  408 KiB  4.8 MiB      0    1.5 TiB            N/A          N/A    N/A         0 B          0 B
root@proxmox07:~#

No disks left, but still 45 TiB used?!?!

What is wrong here???
 
Looking at
https://forum.proxmox.com/threads/ceph-storage-usage-confusion.94673/

So here are the results of the four counts:
Code:
root@proxmox07:~# rbd ls ceph-proxmox-VMs | wc -l
0
root@proxmox07:~# rados ls -p ceph-proxmox-VMs | grep rbd_data | sort | awk -F. '{ print $2 }' |uniq -c |sort -n |wc -l
16
root@proxmox07:~# rados ls -p ceph-proxmox-VMs | grep rbd_data | wc -l
420230
root@proxmox07:~# rados -p ceph-proxmox-VMs ls | grep -v rbd_data
rbd_directory
rbd_children
rbd_info
rbd_trash
root@proxmox07:~#
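
So rbd ls sees no images at all, while 420230 rbd_data objects from 16 different image prefixes are still lying around. To double-check that none of those prefixes belongs to a registered image, the rbd_directory omap (which maps image names to ids and back) can be dumped -- a sketch:
Code:
# distinct image prefixes that still own data objects
rados ls -p ceph-proxmox-VMs | grep rbd_data | awk -F. '{ print $2 }' | sort -u
# rbd_directory maps image names <-> ids; an empty omap means no images are registered
rados -p ceph-proxmox-VMs listomapvals rbd_directory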
 
Now trying to get rid of that broken RBD pool:
Code:
root@proxmox07:~# rados -p ceph-proxmox-VMs ls | head -10
rbd_data.7763f760a5c7b1.00000000000013df
rbd_data.76d8c7e94f5a3a.00000000000102c0
rbd_data.7763f760a5c7b1.0000000000000105
rbd_data.f02a4916183ba2.0000000000013e45
rbd_data.48639154204b80.0000000000004aea
rbd_data.f02a4916183ba2.000000000001f9eb
rbd_data.48639154204b80.000000000000ec2c
rbd_data.f02a4916183ba2.0000000000018ca7
rbd_data.48639154204b80.000000000001a395
rbd_data.48639154204b80.000000000001625d
^C
root@proxmox07:~# rados -p ceph-proxmox-VMs rm --force-full rbd_data.7763f760a5c7b1.00000000000013df
error removing ceph-proxmox-VMs>rbd_data.7763f760a5c7b1.00000000000013df: (2) No such file or directory
root@proxmox07:~# rados -p ceph-proxmox-VMs stat rbd_data.7763f760a5c7b1.00000000000013df
 error stat-ing ceph-proxmox-VMs/rbd_data.7763f760a5c7b1.00000000000013df: (2) No such file or directory
root@proxmox07:~# rados -p ceph-proxmox-VMs lssnap
0 snaps
root@proxmox07:~#
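
lssnap only lists pool-level snapshots, though -- RBD uses self-managed snapshots, so any leftover snapshot clones (the huge CLONES column in rados df above points in that direction) have to be checked per object. A sketch against one of the leftover objects; if only clones remain, that would also explain why rm/stat on the head object return ENOENT:
Code:
# show which snapshot clones of this object still exist and hold data,
# even though the head object itself no longer does
rados -p ceph-proxmox-VMs listsnaps rbd_data.7763f760a5c7b1.00000000000013df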

Is Ceph really production ready?
 
Hi Rainerle ... we ran into the same problem, but have not fixed it yet.
I saw that we had forgotten to enable trim in our KVM VMs.
After a manual "fstrim -av" we got almost 1.5 TB of data back.

But we are now running into the problem that, with the journaling feature activated, most of our Windows VMs do not boot after a shutdown.
The error: timed out connecting qmp socket

We had to disable journaling on the images and everything is fine ... 2 sleepless nights.

We will test again next week
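
In case it helps anyone, this is roughly what the trim setup looks like (the VM ID, bus and storage/volume names below are just placeholders, adjust them to your own config):
Code:
# enable discard on the virtual disk so guest trims actually reach Ceph
# (re-specify your existing volume plus discard=on; 101/scsi0 are placeholders)
qm set 101 --scsi0 ceph-proxmox-VMs:vm-101-disk-0,discard=on
# then inside a Linux guest:
fstrim -av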
 
We had RBD image snapshots which we deleted, but rados objects relating to those snapshots were left behind.

We moved all VM disk images to local disk storage and had to delete the RBD pool, which caused further problems. Do not delete stuff on Ceph - just add disks... ;-)
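
For reference, destroying the pool boils down to the commands below (destructive, so only once every image has been migrated away; pool deletion also has to be allowed on the monitors first):
Code:
# pool deletion is disabled by default, allow it on the monitors
ceph config set mon mon_allow_pool_delete true
# removes the pool and ALL objects in it
ceph osd pool delete ceph-proxmox-VMs ceph-proxmox-VMs --yes-i-really-really-mean-it
# or via the Proxmox tooling (newer pveceph versions)
pveceph pool destroy ceph-proxmox-VMs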
 
