RBD pool size

sc3705

Member
Jul 3, 2020
Hi,

I'm very new to Ceph and I'm trying, but this thing is confusing. I'm trying to build our new Proxmox cluster with Ceph for storage. When I created the RBD pool, it came out at 4.2TB even though the raw capacity is 20TB. Probably not my best moment, but I wasn't very concerned at the time because I thought it was just an initial size that could be extended later. I have 3.8TB of data to copy to the pool, so I figured I'd start that process since it's going to take a while (the data is coming from very slow disks) and grow the pool later.

This was on Friday; today I checked the pool and its capacity is 3.25TB, so not enough to store all the data. Additionally, checking on the file transfer, only 680GB have been copied, but Proxmox is reporting almost 2TB in use for the storage. I can't see where that data is, as all the files I can locate total 1.7TB. I've also noticed that over the weekend the total capacity fluctuated a few times but never went back to the original pool size.

I looked up how to grow the pool, but I don't think I found the correct information, or I don't understand what I'm looking at. Is it the quota that sets the pool capacity? I found this: "ceph osd pool set-quota pool-name max_objects obj-count max_bytes bytes", but I don't know what to put for max_objects and I don't really want to issue the command without knowing what it actually does. Since the capacity is clearly fluctuating on its own, I was hoping it would auto-grow as usage grew, but I'm currently at 91% used and counting, so I've abandoned that thought.
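
From what I could tell while reading, you can at least look at the current quota without changing anything (just a sketch; "rbd" here stands in for the actual pool name):

Code:
# Show whatever quota is currently set on the pool; "N/A" means no quota.
ceph osd pool get-quota rbd

# A quota only exists if someone sets one; setting max_bytes back to 0
# removes the limit again.
# ceph osd pool set-quota rbd max_bytes 0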

I'd like to grow the pool before it runs out of space in the next couple of hours so the weekend file transfer wasn't for nothing. Can someone please point me in the right direction?
 
I reread that and confused myself with the numbers so it's probably not very clear.

Raw capacity: 20TB
Initial file transfer last Tuesday: 1.1TB
File transfer over the weekend: 680GB
Total used as reported by Proxmox: 2.99TB; by my math it should be around 1.8TB
Total pool size: 3.24TB, down from 4.2TB

Could thin provisioning cause this? Is Proxmox reporting the total disk space provisioned or the actual space being used? I know about redundancy, but 4TB usable out of 20TB leads me to believe I did something wrong.
 
Typically Ceph uses 3-way replication, so you should have had around 6 TB or so. What does ceph status show, and what does ceph osd tree show?
 
Ran the 2 commands. The output is below.

Some background:
I'm in a tough spot between business goals and finance. The new cluster is to consist of 1 new server and 2 existing production servers being repurposed. They're roadmapped for replacement in 2 phases, with the addition of a 4th server by this time next year. The pve server is new and the T310 is repurposed. The 3rd server for the cluster is where the data is coming from now. Once that's done, I can repurpose it to play the 3rd node. There are 8 disks in the 3rd server: 2 in a RAID 1 (where the data is coming from currently; these will be removed once done), one of the remaining 6 disks will be moved to the T310, and the other 5 will remain to play their role in the cluster. It's a risky plan, but I don't have many options. As you can probably guess, my plan was to run degraded while I bring up the 3rd node (a week at most). I was expecting reduced capacity and performance, but not by this much (performance actually isn't bad at all), and I certainly never expected the total capacity to change by itself over time. That was a curve ball for sure :-). I added the 5th disk to the pve host early in an attempt to gain a little extra space, but that only led to reduced performance. Since my last post, the total capacity is down to 3.20TB from 3.24TB.



Code:
root@pve:~# ceph status
  cluster:
    id:     fb367ae5-7735-4067-b3bb-ce6e0e77ebf9
    health: HEALTH_WARN
            1 backfillfull osd(s)
            2 pool(s) backfillfull
            Degraded data redundancy: 68432/2389191 objects degraded (2.864%), 22 pgs degraded, 66 pgs undersized
            2 daemons have recently crashed
 
  services:
    mon: 2 daemons, quorum pve,T310 (age 6m)
    mgr: pve(active, since 2d), standbys: T310
    mds:  2 up:standby
    osd: 9 osds: 9 up (since 23h), 9 in (since 23h); 254 remapped pgs
 
  data:
    pools:   2 pools, 320 pgs
    objects: 796.40k objects, 3.0 TiB
    usage:   11 TiB used, 9.3 TiB / 20 TiB avail
    pgs:     68432/2389191 objects degraded (2.864%)
             591146/2389191 objects misplaced (24.743%)
             207 active+clean+remapped
             47  active+clean
             44  active+undersized
             22  active+undersized+degraded
 
  io:
    client:   179 KiB/s rd, 18 MiB/s wr, 30 op/s rd, 227 op/s wr



Code:
root@pve:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME     STATUS REWEIGHT PRI-AFF
-1       20.01286 root default                         
-5       10.00638     host T310                         
 0   hdd  2.00130         osd.0     up  1.00000 1.00000
 1   hdd  2.00130         osd.1     up  1.00000 1.00000
 2   hdd  3.00189         osd.2     up  1.00000 1.00000
 3   hdd  3.00189         osd.3     up  1.00000 1.00000
-3       10.00648     host pve                         
 4   hdd  2.00130         osd.4     up  1.00000 1.00000
 5   hdd  2.00130         osd.5     up  1.00000 1.00000
 6   hdd  2.00130         osd.6     up  1.00000 1.00000
 7   hdd  2.00130         osd.7     up  1.00000 1.00000
 8   hdd  2.00130         osd.8     up  1.00000 1.00000
 
If you look at this line, "objects: 796.40k objects, 3.0 TiB", it's showing that you have 3 TB of actual data in the pool. It sounds like Proxmox and Ceph are reporting the actual usage, so now it's just a matter of figuring out what's using more space than we're expecting. Ceph pools use thin provisioning by default, so it shouldn't be that. When you go to the Ceph pool under Storage under one of your nodes, does it show 3 TB of 6 TB used or something similar? Or does it show 3 TB of 4 TB used?
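
You can also check from the CLI; something like this shows per-pool usage and what Ceph thinks each pool can still take (exact columns vary a bit between Ceph releases):

Code:
# Cluster-wide and per-pool usage. Roughly: STORED is the client data,
# USED is after replication, MAX AVAIL is what Ceph thinks the pool can
# still hold given the CRUSH rule and the fullest OSD.
ceph df detail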
 
I swear the file transfer has sped up since I started looking at this lol.

[Screenshot 1595869861002.png: RBD storage usage in Proxmox]


I'm using 3.04TB. I think all would have been fine if the total capacity hadn't dropped from 4.2TB. I'm not sure why that happened or why it's still dropping. From my last post to this one, I'm down another 0.1TB. It looks like as the usage goes up, the overall capacity goes down. I can't say I've ever seen this behavior in a storage system. It's a little confusing when the goal posts move. :)
 
I'm wondering if that screen is miscalculating the free space for some reason, because ceph status shows that you're only at around 50% total usage. If you click on Ceph under Data Center, what does it show under usage? Another option to potentially get you more space is to create a pool that's only 2x replication, but that would be a bit risky. If ceph status continues to show free space, I might keep copying and see if it actually runs out of space or not.
 
The Data Center view agrees with you; there's 54% used of 20.01TB. I assumed that this was the raw capacity of the cluster, including redundancy, and therefore not something to measure the usable space in a pool by. My assumption was that, like ZFS, there's a pool and a dataset, and the available space of the pool is not necessarily the available space in the dataset. I'm just reading, clicking, and trying to figure this out, so my assumption appears incorrect.

I thought about 2x replicas when I first set all of this up, but I figured that if the immediate goal is 3 nodes, it would be harder to change the replication than to add another node, at least for me. I figured that on a temporary basis my risk was the same between a 3-node cluster running with 2 nodes and a healthy 2-node cluster. For the life of me I couldn't start the cluster with only one node, but it fired right up with zero effort with 2 nodes, so I assumed 2 was the absolute minimum for a 3-node setup.

I don't think I have a choice but to let it run. It would suck to kill it only to find out you're correct and it would have finished. I started full system-state backups early this morning in case it runs out of space and corrupts system files in the VMs or something (had that happen before). Fingers crossed; I'm at 96.45% and climbing, and the total capacity is 3.17TB and dropping. The business knows, and they started shaking the dice before I could finish my report... Wish me luck, lol.

I'm pretty sure this is going to fail, I don't have that kind of luck.... ;-)
 
The capacity under Data Center is the raw capacity of the disks; with redundancy, you'd divide that by 3 if you have 3x replication. The only way to restrict the size of a Ceph pool is via quotas, and you'd have had to do that yourself. It's probably just messing up the math because you don't have enough nodes for 3x replication across 3 nodes.
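
Back-of-the-envelope for your numbers (rough, and assuming the default 3x replication):

Code:
#   20 TB raw / 3 copies          ~ 6.7 TB usable on a healthy 3-node cluster
#   3.0 TiB of client data x 3    ~ 9 TiB raw, in the same ballpark as the
#                                   11 TiB "used" that ceph status reports
# With only 2 hosts, some copies end up doubled on the same host, and the
# pool's "max avail" is derived from the fullest OSD, so the capacity
# Proxmox displays keeps shrinking as the OSDs fill unevenly.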
 
I just got halfway down the hall and had a thought. I can't be all wrong about the Data Center storage view. That's reporting 10TB in use, and we don't have 10TB in use anywhere in this environment (except the backup server, different conversation :-) ).
 
The capacity under Data Center is the raw capacity of the disks; with redundancy, you'd divide that by 3 if you have 3x replication. The only way to restrict the size of a Ceph pool is via quotas, and you'd have had to do that yourself. It's probably just messing up the math because you don't have enough nodes for 3x replication across 3 nodes.

Posted my last before seeing this.

Correct, I most certainly did not create a quota, HA. That command doesn't look like something you could do accidentally either :). Okay, that's a solid theory; I can see how that could happen. Well, I have 0.9TB left to see if you're correct, and I really hope you are!! However it works out, thank you for responding and assisting me!!! Thank you, thank you!
 
Two things you may want to be aware of:
- If you have the default pool setting of 3/2 replicas, you're just wasting space, since you only have 2 OSD nodes; every write will be written twice to the same node. It will also stop writes if either node goes down, unless you change the rule to 3/1.
- You have one node with 5 drives and one with 4 drives. Assuming they're all the same size, you will only be able to use 4 drives' worth of space on the second node, since there are insufficient PGs available on the other.

All in all this isn't a very efficient configuration.
edit: didn't notice your weights on the OSDs. carry on.
 
I have to admit, I didn't consciously do anything with the weights; I'm assuming what's set is good for this. I'll only have 2 nodes with 4/5 disks each for about a week. Once I get the data off of the 3rd node, I'll have 6 disks available and another node to even everything out. I was hoping to run it degraded for a few days to get the data into the environment, since I don't have a 4th server at the moment to stage it on.

Based on what you just wrote, the data is probably doubling up on the 2 existing nodes to make up for the missing 3rd node. This is unfortunate; I have to get back to the drawing board because this isn't going to work. The file transfer failed when Proxmox started showing 100% on the RBD pool. Shout out to Proxmox, as it appears all the VMs paused when that happened; they were still showing as running but nothing was happening. I deleted the virtual disk I was pumping data into and everything just started working again on its own.

Now that I deleted the virtual disk, the total capacity in the pool went back up. This is what's messing me up. If it didn't change, I'd have enough space to move into the environment, repurpose the 3rd server, and grow the space to get back a buffer. This entire plan was created using the numbers reported by Proxmox; then we execute and the numbers start changing. Honestly, that's kind of frustrating. At 4.20TB I can move in with 0.20TB to spare; tight but doable.

I really don't want to resort to USB drives. I'll need 2 USB drives for safety, and that will likely slow down the overall file transfer significantly. I'm not sure what to do, where the 4TB limit is coming from, or why the capacity drops when I add data to the pool.

After deleting the vdisk. Note the capacity vs. my screenshot above:
[Screenshot 1595879874603.png: RBD pool capacity after deleting the vdisk]
 
You can do what you ask, although I'd suggest that you keep your osd nodes balanced with drive count AND capacity, as differences will either result in poor(er) performance in case of imbalanced osd count, or unusable capacity in case of imbalanced capacity per node.

create a new replication rule with max size 2 and min size 1. this will allow you to function in a space efficient manner for the time being (DISCLAIMER: this is NOT a safe manner of operation; make sure you're fully backed up as it is possible for you to lose data in case of a downed node.) Once you bring the third node online, simply reassign the replication rule to a 3/2 rule and ceph will do the rest.
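
Roughly, from the CLI it would look something like this (just a sketch; "rbd" stands in for your actual pool name):

Code:
# Temporary 2/1 operation (NOT safe long term; a single downed node can lose data):
ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1

# Once the 3rd node and its OSDs are in, go back to 3/2 and let Ceph
# re-replicate in the background:
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2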
 
You can do what you ask, although I'd suggest that you keep your osd nodes balanced with drive count AND capacity, as differences will either result in poor(er) performance in case of imbalanced osd count, or unusable capacity in case of imbalanced capacity per node.

create a new replication rule with max size 2 and min size 1. this will allow you to function in a space efficient manner for the time being (DISCLAIMER: this is NOT a safe manner of operation; make sure you're fully backed up as it is possible for you to lose data in case of a downed node.) Once you bring the third node online, simply reassign the replication rule to a 3/2 rule and ceph will do the rest.


I swear I have the weirdest experiences. I'm going to chalk it up to the missing node and monitor again when the cluster is healthy. I started copying data into the environment from a USB drive (the transfer was MUCH faster than I thought it would be) and now the capacity is going up as I put data into the pool. Why the heck couldn't it do this a couple of days ago, LOL! Capacity is still moving around, so it'll probably drop again. I figured I'd get as much data as I could off the USB drives before I blow away the server that will become the 3rd node. The thought of blowing away a perfectly good server in production is terrifying. The business wants it........


Great tip, Thank you!

I did ask the question in the forum about the mismatched OSDs. Performance is subjective and I didn't get a whole lot to go on to liken it to my situation. As a result, I was expecting a performance hit, but I didn't know how hard it would be. Now I know that unbalanced OSDs are definitely not worth it. It's not enough to just say "poor performance", because there are people around who take high performance to unnecessary lengths. A perfect example is when our array here gets degraded: I notice the performance hit while it's rebuilding, but most folks here are oblivious to it. It's still fast to them; if I didn't tell them there was a storage issue, they would never know. I've also accepted the loss of capacity. Of the 15 disks, they're pretty much half 2TB and half 3TB. The only option is to upgrade the 2TB disks to 3TB over time. The kicker is that 5 of the 2TB disks are brand new SAS drives. They weren't cheap, so that's not going to be an easy sell upstairs....

You make changing the replication sound easy; I may have selected the wrong route with the initial cluster configuration. I honestly figured the risk was the same either way. A single node failure will bring it all down in both cases, right? Then again, if I had done what you're suggesting I would have been done copying the data already. Live and learn....

I truly appreciate all your help with this! May I ask you a slightly off-topic question? I'm thinking, for a separate project, how silly of an idea is a 3-node, 3-disk cluster? Say 3 nodes with a 2TB NVMe in each?
 
I notice the performance hit while it's rebuilding, but most folks here are oblivious to it. It's still fast to them; if I didn't tell them there was a storage issue, they would never know.
Yes, performance is subjective, and what is acceptable for one use case is inadequate for another. The way to think about performance degradation is not by what's acceptable to you but rather by how much less than optimum it is. If you don't need optimal performance from a given solution, you're overdesigning it, which means you may be spending too much by deploying too much equipment.

Notice I said MAY, because performance isn't the only factor in a storage solution; hell, it may not even be the most important. In the case of the storage you're describing (4/5/6 OSDs, three nodes), consider that the TOTAL possible usage for this pool is 12 drives' worth, since there are insufficient PGs to fit within your pool definition (3/2). In case it isn't clear, 3/2 means every write needs 3 copies, ONE PER NODE. Since it's not possible to guarantee placements beyond the smallest node's 4 drives, you wouldn't be able to use the excess capacity that way.
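
As a rough illustration of where that 12-drive figure comes from (assuming equal-size drives and a 3/2 pool):

Code:
#   nodes:  A = 4 drives, B = 5 drives, C = 6 drives
#   every object needs one copy on each node, so the smallest node caps it:
#     usable raw   = 3 nodes x 4 drives' worth = 12 drives' worth
#     usable data  = 12 / 3 replicas           = 4 drives' worth
#   the extra drives in B and C hold capacity the pool can never fill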

I truly appreciate all your help with this! May I ask you a slightly off-topic question? I'm thinking, for a separate project, how silly of an idea is a 3-node, 3-disk cluster? Say 3 nodes with a 2TB NVMe in each?

It's not silly, but as mentioned before, it's all dependent on your use case. I would feel better with more OSDs, but that's up to you. In any case, since NVMes have more performance headroom than a single OSD queue would allow, if you DO run with this config, consider splitting your NVMes into multiple OSDs (e.g. partition them and then create multiple OSDs on the partitions).
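
One way to do that (just a sketch; the device path is only an example) is to let ceph-volume carve the drive up for you:

Code:
# Create two OSDs on a single NVMe so more than one OSD can drive it.
# WARNING: this destroys any existing data on the device.
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1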
 
This is great information; Thank you!!! I get your point about performance, and I agree, but there's a reason.

It's just not a perfect world, unfortunately. We're coming from a RAID solution, where performance, safety, and capacity are a little more interconnected. As I understand it, if you need more capacity, you add more drives for the data to be striped across. By adding more striped disks you also increase performance (to a fault). This is where my org lives. We needed the capacity and safety, so we mostly ran RAID 10. We then needed more capacity, so we added more disks in pairs. This naturally increased performance along with capacity for us, and it was measurable by the network monitor. Maybe there's a way to properly increase capacity and not performance, but I don't know it :-).

What's driving the Ceph cluster is the desire to maintain 99.99% uptime and data safety. We'd like to keep all the VMs running if hardware or hypervisor maintenance is required; that's a serious pain point we're trying to fix. Unfortunately, we need 3 nodes to do that safely and there isn't budget to buy new servers all at once. So we have to repurpose servers, and this is the hardware we're left with. We're left with unusable capacity and performance higher than we actually need. We'll solve the unused capacity over time.

I can tell you there's a drive for more safety. As a result, we'll be adding a 4th node by December. My understanding is that with Ceph this will result in more safety but not more capacity or performance. I think distributed storage systems are a different world from RAID. I'm seeing this now.
 
