Proxmox CEPH multiple storage classes questions

dmulk
Hey there,

I currently have an existing 10-node Proxmox 5.4-5 and Ceph Luminous configuration. Each of the 10 nodes runs Proxmox and Ceph side by side (meaning I have VMs running on the same nodes that serve the RBD pool they run from). Each node has a number of SSDs that serve as OSDs.

I'm looking to add an additional 5-7 nodes to expand both my compute (Proxmox) and Ceph storage.

In order to add some storage depth to my existing Ceph cluster, I'm considering using HDDs backed by either NVMe or SSD devices.

I'd like to leave the existing pool that I have, which is comprised of all SSDs, alone and add these additional nodes into the cluster.

My initial question is: What's the best way to do this using Proxmox's implementation of CEPH?

Currently I *think* I need to add these additional nodes to the existing Proxmox cluster, install Ceph, and then create OSDs with the data on the HDDs and the RocksDB/WAL on the SSDs/NVMe devices, segmenting them into a separate pool...

Is this the correct way to go about this? If so, can this all be accomplished using the GUI, or do I need to do this from the CLI?

Thanks,
<D>
 
My initial question is: What's the best way to do this using Proxmox's implementation of CEPH?
I want to clarify this a little bit. We use upstream Ceph with a couple of cherry-picked patches on top, and Proxmox VE comes with the management tools. But it is possible to use any aspect of upstream Ceph.

Currently I *think* I need to add these additional nodes to the existing Proxmox cluster, install Ceph, and then create OSDs with the data on the HDDs and the RocksDB/WAL on the SSDs/NVMe devices, segmenting them into a separate pool...
Yes, to get better OSD performance with spinners, (NVMe) SSDs serving as DB/WAL are a good option. To easily create a pool with only this type of OSD, you can use the device class feature of Ceph. This needs some CLI usage.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_device_classes
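
In case it helps later, the general shape of the device-class commands is roughly the following; everything in angle brackets is a placeholder:

ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <device-class>
ceph osd pool create <pool-name> <pg_num> <pgp_num> replicated <rule-name>

A rule bound to the hdd class plus a pool created against that rule gives you a spinner-only pool next to your existing SSD pool.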
 
Alwin,
Thank you for your quick response and clarification. Much appreciated. I think what I was trying to ask with the "proxmox implementation" (although poorly worded on my part) was if there were specific ways I needed to do things mainly related to the gui (as I know not all things can be accomplished with the gui). You've confirmed that I need to use the CLI to make this happen.

Related to the device class feature in Ceph: I had my workmate (who runs spinners backed by Intel Optane cards in a homogeneous config) run:

"ceph osd crush tree --show-shadow" and his OSDs showed up as the "HDD" device class.

Is this expected? (I mean, I know I can edit the device class and change it, but should I?) Since I intend to have only two classes of storage in my cluster, pure SSD and HDDs backed by SSD or NVMe, it seems it would be OK to leave the class as "HDD".

Also, since I currently have all SSDs in my cluster and I want to preserve the (RBD) data in that pool, and then add and mount a second pool backed by the new nodes, where SSDs serve DB/WAL for the HDDs I'm adding: what are the high-level steps to accomplish this safely? (This is a production environment and the first time I'm tackling this, so I really need the help to avoid making a CLM... :) ) I'm struggling with understanding the commands and steps to make this happen.

It seems like I need to:

1) Add the new SSD/HDD nodes to the Proxmox cluster
2) Install Ceph on the new nodes
3) Create the second pool with the SSD-backed HDDs

Thanks again for the guidance.
 
Questions related to leveraging an SSD to back the HDDs:

If I'm going to leverage a single SSD to back multiple HDDs, what SSD-to-HDD ratio would you recommend for HDDs likely being 8 TB (SATA connected) and SSDs likely being ~3.5 TB (SATA connected)?

Am I correct to guess that if I plan to use a single SSD to back multiple HDDs, I'll want to manually configure/size multiple partitions on the SSD first and then point each OSD at a dedicated partition on the SSD, or can I simply point, say, 8 HDDs at a single SSD device? (e.g. /dev/sdb1, /dev/sdb2, /dev/sdb3... vs. all at /dev/sdb)

Thank you.
 
I have a similar configuration to yours; however, I began with HDDs with their journals on a small (200 GB) SSD. When I added some SSD storage, I created the following rules and applied them to the appropriate pools.

ceph osd crush rule create-replicated ssdrule default host ssd
(applied to pool ssdstor)

ceph osd crush rule create-replicated hddrule default host hdd
(applied to pool hddstor)

Now data headed for the pool ssdstor is stored on device class ssd disks. Data headed for pool hddstor is stored on device class hdd disks.
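
For anyone following along, pointing an existing pool at one of these rules is done with something like this (pool and rule names as above):

ceph osd pool set ssdstor crush_rule ssdrule
ceph osd pool set hddstor crush_rule hddrule

Ceph then migrates any data that no longer matches the rule onto OSDs of the corresponding device class.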

This has worked well for us. So my current cluster looks like this:

8 Proxmox 5.4 nodes: 5 with Ceph installed and 3 compute-only nodes.

Ceph nodes have three 2 TB SSDs with colocated WAL/DB, and five 4 TB HDDs with WAL/DB on a 200 GB SSD.

Total capacity is 117 TB.

Hope this helps.


Bonus question for Alwin:

I would like to increase my HDD storage capacity. Can I take my time and just drain my 4 TB disks and replace them with 8 TB or 12 TB disks? My intention would be to end up with all 8s or 12s.

Thanks!
 
Thank you Kmgish! Much appreciated.

So here are another couple of questions related to your response:

When you first implemented Ceph with the SSD-backed HDDs, did you only have that one pool, with no special ruleset for it, because it was homogeneous at the time? (I originally implemented my cluster with Jewel and then upgraded... although I haven't moved from Filestore to Bluestore yet; at the time I was reading that SSD performance could potentially be better with Filestore...)

Did you then have to do BOTH of those rules because you added the second pool?

(If so, with my one pool, could I create the first rule now and then the second rule at the time that I create the second pool with the different node configurations?)

Thanks again for the response and input.
 
"ceph osd crush tree --show-shadow" and his OSD's showed up as the "HDD" device class.

Is this expected? (I mean, I know I can edit the device class and change it, but should I?) Since I intend to have only two classes of storage in my cluster, pure SSD and HDDs backed by SSD or NVMe, it seems it would be OK to leave the class as "HDD".
Yes, this is expected. The OSD has the data portion still on spinners, while only the DB/WAL is located on another device.
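
As a side note, the device class can be overridden if you ever want to tag a set of OSDs differently; the OSD ID and class below are only placeholders:

ceph osd crush rm-device-class osd.7
ceph osd crush set-device-class nvme osd.7

For your two-class setup, keeping the automatically detected hdd class is fine, as you already concluded.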

Also, since I currently have all SSDs in my cluster and I want to preserve the (RBD) data in that pool, and then add and mount a second pool backed by the new nodes, where SSDs serve DB/WAL for the HDDs I'm adding: what are the high-level steps to accomplish this safely?
As @Kmgish points out, first add the CRUSH rule for the SSDs and set the existing pool to use it. Some data migration might happen at this point; this step normally runs fine and does not really interfere with client IO. Afterwards, add the new nodes with their OSDs and continue with the second rule and pool.
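
Spelled out as commands, that sequence might look roughly like this; rule/pool names and pg_num are only examples, and <existing-rbd-pool> stands for whatever your current pool is called:

ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd pool set <existing-rbd-pool> crush_rule replicated_ssd
# join the new nodes to the cluster, install Ceph and create their OSDs, then:
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd pool create hddpool 512 512 replicated replicated_hdd
ceph osd pool application enable hddpool rbd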

Am I correct to guess that if I plan to use a single SSD to back multiple HDDs, I'll want to manually configure/size multiple partitions on the SSD first and then point each OSD at a dedicated partition on the SSD, or can I simply point, say, 8 HDDs at a single SSD device? (e.g. /dev/sdb1, /dev/sdb2, /dev/sdb3... vs. all at /dev/sdb)
Judging from the quote below, I need to clarify that I am talking about Bluestore, as Filestore is the legacy backend for Ceph. In my opinion, Bluestore gives you better control over its resources and removes the constraints of a filesystem for data storage. Bluestore also checksums (CRCs) the data, so corrupt objects are recognized and repaired if good copies are available.

Yes, size matters. :rolleyes: The Bluestore DB will spill over to the data disk if it no longer fits into its partition. Ceph states it should be at least 4% of the data device size (1 TB disk, 40 GB DB).

Another point is to not overtax the SSD, as it will become the bottleneck if it cannot serve the DB fast enough. As a rule of thumb: 4 HDDs per SATA SSD; depending on the protocol, more are possible.
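
As a rough worked example with the numbers from above: 4% of an 8 TB HDD is about 320 GB of DB space per OSD, so four such OSDs need roughly 1.28 TB, which a ~3.5 TB SATA SSD can hold with plenty of headroom, as long as its throughput keeps up with the four spinners.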

When you first implemented Ceph with the SSD-backed HDDs, did you only have that one pool, with no special ruleset for it, because it was homogeneous at the time? (I originally implemented my cluster with Jewel and then upgraded... although I haven't moved from Filestore to Bluestore yet; at the time I was reading that SSD performance could potentially be better with Filestore...)
If you don't configure any CRUSH rules at the beginning, then every pool created will just use the default rule, which includes all available OSDs.
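
You can check which rule a given pool is currently using with something like (pool name is a placeholder):

ceph osd pool get <pool-name> crush_rule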

Did you then have to do BOTH of those rules because you added the second pool?
Yes, this is needed; see my comment above.
 
I would like to increase my HDD storage capacity. Can I take my time and just drain my 4 TB disks and replace them with 8 TB or 12 TB disks? My intention would be to end up with all 8s or 12s.
If you distribute them evenly, then yes. If you have a free slot, you can do the draining and the addition at the same time and save some rebalance time.
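
A rough sketch of draining and removing one OSD, with osd.12 as a placeholder ID:

ceph osd out 12
# wait until the rebalance finishes and the cluster is back to HEALTH_OK, then on that node:
systemctl stop ceph-osd@12
ceph osd purge 12 --yes-i-really-mean-it

Afterwards the physical disk can be swapped and a new, larger OSD created in its place.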
 
Thanks Alwin, that's what I thought. My DR cluster is now replaying journals of 30 images, just over 4 TB, across a 150 Mbit internet circuit via SD-WAN. Hence the need for a little capacity upgrade. Thanks again for the rbd-mirror instructions!

dmulk:
We never had/allowed a "homogeneous" environment. We created the rules ahead of adding the SSDs for storage. I would create and apply a rule for your current class of disks and apply it to the appropriate pools. Let things settle, and then create the rule/pools for the new class of disks and then add your new hardware.
 
@Alwin Thank you for your detailed response and time. Much appreciated.

One final (probably obvious) question related to storing the metadata (DB/WAL) on SSD:

If I am going to do a 4-to-1 ratio (assuming the disk's bandwidth can handle the throughput for this), is it required to manually create 4 individual partitions and point each OSD to its respective partition, or can I point the 4 OSDs at the whole disk and Ceph will handle how it's used (up to its maximum capacity)?

The reason I ask is that there were some posts in another thread on here, which I didn't fully understand, that I think were related to this topic... it seemed there were multiple ways to "skin a cat", if you will.

Thanks again for your time!

<D>
 
Thanks Alwin, that's what I thought. My DR cluster is now replaying journals of 30 images, just over 4 TB, across a 150 Mbit internet circuit via SD-WAN. Hence the need for a little capacity upgrade. Thanks again for the rbd-mirror instructions!

dmulk:
We never had/allowed a "homogeneous" environment. We created the rules ahead of adding the SSDs for storage. I would create and apply a rule for your current class of disks and apply it to the appropriate pools. Let things settle, and then create the rule/pools for the new class of disks and then add your new hardware.

Makes total sense. Thanks again for the time and input on this! :)
 
If I am going to do a 4-to-1 ratio (assuming the disk's bandwidth can handle the throughput for this), is it required to manually create 4 individual partitions and point each OSD to its respective partition, or can I point the 4 OSDs at the whole disk and Ceph will handle how it's used (up to its maximum capacity)?
Ceph will use its rather small default partition size when you just specify a disk. You can either change that size to something more sensible or create the partitions beforehand. You just need to make sure the partitions have a proper UUID, otherwise ceph-disk might link by device name and not by UUID.
https://forum.proxmox.com/threads/where-can-i-tune-journal-size-of-ceph-bluestore.44000/#post-210638
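
For the "change the size" option, the ceph.conf setting is bluestore_block_db_size (in bytes), set before the OSDs are created; the value here is only an example for roughly 300 GB per DB partition:

[global]
bluestore_block_db_size = 322122547200

Otherwise, partition the SSD yourself and hand each OSD its own partition, as discussed above.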
 
Ceph will use its rather small default partition size when you just specify a disk. You can either change that size to something more sensible or create the partitions beforehand. You just need to make sure the partitions have a proper UUID, otherwise ceph-disk might link by device name and not by UUID.
https://forum.proxmox.com/threads/where-can-i-tune-journal-size-of-ceph-bluestore.44000/#post-210638

Hmmm...I guess where I'm confused is if I point, say, each OSD I'm creating at the same drive, will the creation process be smart enough to create partitions automatically without overwriting other partitions? (I've never been through the process before...).

Apologies for the elementary questions. Thanks for the patience. :)
 
Hmmm...I guess where I'm confused is if I point, say, each OSD I'm creating at the same drive, will the creation process be smart enough to create partitions automatically without overwriting other partitions?
It will do that.

(I've never been through the process before...).
You can create a virtual Proxmox VE + Ceph cluster and play around with it before you do this on the production system. ;)
 
