[SOLVED] CEPH Storage Usage Confusion

Tmanok

Renowned Member
Hi Everyone,

I've started using the Ceph rbd command because I need to identify storage usage in my Ceph cluster. When I issue the command below, it almost appears as though my snapshots are using an immense amount of space:
Code:
rbd -p pool du

Excerpt of one VM:
Code:
NAME                                 PROVISIONED  USED
vm-111-disk-1@Aug5thWindowsUpdates   245 GiB      213 GiB
vm-111-disk-1@PostReboot             245 GiB      213 GiB
vm-111-disk-1                        245 GiB      41 GiB

Code:
rados -p pool df
Code:
total_objects    520496
total_used       3.8 TiB
total_avail      3.4 TiB
total_space      7.3 TiB

So I tried issuing two commands: one to total the usage of everything including snapshots, and the other to total all images without snapshots (both ignore the MiB lines, because only a couple of small disks are that small).

Code:
rbd -p pool du | grep GiB | awk '{print $4}' | awk '{s+=$1} END {print s}'
4615.8 GiB

Code:
rbd -p pool du | grep GiB | grep -v -e Aug -e Post -e Pre | awk '{print $4}' | awk '{s+=$1} END {print s}'
2419.7 GiB

Neither of those seemed quite right, so I then tried a command to total the "provisioned" column, in case the images are not thin-provisioned like on VMware:
Code:
rbd -p pool du | grep GiB | grep -v -e Aug -e Post -e Pre | awk '{print $2}' | awk '{s+=$1} END {print s}'
2654 GiB
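
In hindsight, the grep/awk juggling may not even be necessary. A minimal sketch of a unit-safe alternative, assuming a Ceph release whose rbd supports --format json and that jq is installed (the exact field names may differ between versions):
Code:
# Sum the used size (in bytes) of all base images, skipping snapshot rows.
# select(.snapshot == null) assumes snapshot rows carry a "snapshot" field.
rbd -p pool du --format json \
    | jq '[.images[] | select(.snapshot == null) | .used_size] | add'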

And now I'm simply not sure how my cluster is using 3.8 TiB when all of my disks only add up to about 2.6 TiB (2,654 GiB). I'm fairly certain the rest is either being used by Ceph for redundancy or is possibly the snapshots, but I don't feel I have a clear enough understanding of what is using my storage.
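
If I understand the docs correctly, ceph df should help separate replication overhead from logical data, so noting it here as a sketch (column names vary between Ceph releases):
Code:
# Per-pool breakdown; in recent releases the POOLS section shows
# STORED (logical data) next to USED (raw, i.e. after replication),
# so a replicated size=3 pool should show USED at roughly 3 x STORED.
ceph df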
Thanks for any guidance from more experienced members of the community or staff.

Tmanok
 
Did you do some benchmarks on the cluster with rados bench? If so, it could be that you told the benchmark not to clean up afterwards and the objects it created are still around.

If you run rados -p <pool> ls you will get a long list of all the objects. The most common objects should be the ones starting with rbd_data.; those contain the actual data of the disk images. If you filter those away, you will be left with some others (rbd_id., rbd_header., rbd_object_map., and so forth); those hold the metadata needed for the disk images.

If you see a lot of benchmark_data... objects, you will most likely have found out why your pool is using more space than expected.
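
If you dump the listing to a file first (objects.txt below is just a placeholder name), a rough per-prefix breakdown could look like this, assuming the standard object naming:
Code:
# Count the objects belonging to each common prefix.
for prefix in rbd_data rbd_header rbd_id rbd_object_map benchmark_data; do
    printf '%s: ' "$prefix"
    grep -c "^$prefix" objects.txt
done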

The other option could be orphaned objects, but the chances of that happening are slim, especially if it is a rather new cluster.
You can check whether there are orphaned disk image objects with the following snippets:
First run:
Code:
rbd ls <pool> | wc -l

and then
Code:
rados ls -p <pool> | grep rbd_data | sort | awk -F. '{ print $2 }' | uniq -c | sort -n | wc -l

If the numbers match, then everything should be okay. If the second snippet gives you a higher number, then there seem to be orphaned disk image objects in the pool.
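
And if the numbers ever do diverge in the orphan direction, here is a sketch of how you could pin down which image IDs are orphaned, assuming plain replicated RBD images where rbd info reports the block_name_prefix that the data objects are named after:
Code:
# IDs Ceph should have: one block_name_prefix per known image
for img in $(rbd ls <pool>); do
    rbd info <pool>/"$img" | awk '/block_name_prefix/ {print $2}'
done | sed 's/^rbd_data\.//' | sort > expected_ids.txt

# IDs actually present in the pool's rbd_data objects
rados -p <pool> ls | grep '^rbd_data\.' | awk -F. '{print $2}' | sort -u > actual_ids.txt

# Anything printed here exists only in the pool, i.e. is likely orphaned
comm -13 expected_ids.txt actual_ids.txt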
 
Did you do some benchmarks on the cluster with rados bench?
Have not yet.

If you run rados -p <pool> ls you will get a long list of all the objects.
No kidding, the list took nearly 5 minutes to produce, so I sent it to a text file the second time I ran it.

rados ls -p <pool> | grep rbd_data | sort | awk -F. '{ print $2 }' | uniq -c | sort -n | wc -l
39
rbd ls <pool> | wc -l
40
Is it bad if the rados command returns one fewer than the rbd command?

Thanks. By the way, about 149 line items are not rbd_data (headers, object maps, and IDs), while the other 382,779 lines are all rbd_data.


Tmanok
 
Is it bad if the rados command returns one fewer than the rbd command?
Interesting, because if there were orphans, you would get a lower number from rbd ls. But you got a higher one. Was a VM disk or a snapshot created in the meantime?
Thanks, about 149 line items are not rbd_data (headers, object maps & ids) while the other 382,779 lines were all rbd_data btw.
That should be okay and contain the metadata of the images.
 
Interesting, because if there were orphans, you would get a lower number from rbd ls. But you got a higher one. Was a VM disk or a snapshot created in the meantime?
No snapshots, in fact if I run the commands again right now they output:
rados ls... 42
rbd ls... 43

Maybe there is something stuck? No ongoing snapshots or backups. Just 2 new CTs and 1 new VM compared to last time.
Thanks, should I be concerned?

Tmanok
 
Hi Aaron,

I was incorrect and misinterpreted your question about snapshots. While performing maintenance on our largest VMs, I found there actually were snapshots from August 5th and from September (21 days prior to my last post claiming there were no snapshots). After removing a grand total of around 16 of them, we saw our SSD Ceph pool usage drop from 1.86 TB to 1.12 TB. Or, to look at it another way (in the Node > Ceph menu), we are at 2.8 TB of 7.28 TB.
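
In case it helps anyone else hunting for forgotten snapshots, this is roughly how they can be listed per image (the pool name is a placeholder):
Code:
# List every snapshot of every image in the pool
for img in $(rbd ls <pool>); do
    echo "== $img =="
    rbd snap ls <pool>/"$img"
done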

Thanks Aaron!
 
