Proxmox CEPH multiple storage classes questions

dmulk

Member
Jan 24, 2017
Hey there,

I currently have an existing 10 node Proxmox 5.4-5 and CEPH Luminous configuration. Each of the 10 nodes is running Proxmox and CEPH side by side (meaning I have VMs running on the same nodes that serve the RBD pool they run from). Each node has a number of SSDs that serve as OSDs.

I'm looking to add an additional 5-7 nodes to expand both my compute (proxmox) and CEPH storage.

In order to add some storage depth to my existing CEPH cluster, I'm considering using HDDs backed by either NVMe devices or SSDs.

I'd like to leave the existing all-SSD pool that I have alone and add these additional nodes into the cluster.

My initial question is: What's the best way to do this using Proxmox's implementation of CEPH?

Currently I *think* I need to add these additional nodes to the existing Proxmox cluster, install CEPH, and then create OSDs with the data on the HDDs and the RocksDB/WAL on the SSDs/NVMe devices, and segment them into a separate pool...

Is this the correct way to go about this? If so....can this all be accomplished using the GUI or do I need to do this from the CLI?

Thanks,
<D>
 
My initial question is: What's the best way to do this using Proxmox's implementation of CEPH?
I want to clarify this a little bit. We use the upstream Ceph with a couple of cherry-picked patches on top and Proxmox VE comes with management tools. But it is possible to use any aspect of the upstream Ceph.

Currently I *think* I need to add these additional nodes to the existing Proxmox cluster, install CEPH, and then create OSDs with the data on the HDDs and the RocksDB/WAL on the SSDs/NVMe devices, and segment them into a separate pool...
Yes, to get better OSD performance with spinners, (NVMe) SSDs serving as DB/WAL are a good option. To easily create a pool with only this type of OSD, you can use the device class feature of Ceph. This needs some CLI usage.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_device_classes
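As a rough sketch (rule name, pool name and PG count below are placeholders, adjust them to your setup):

ceph osd crush rule create-replicated hdd_rule default host hdd
ceph osd pool create hddpool 128 128 replicated hdd_rule

The last argument of the rule limits it to OSDs of that device class, and the new pool is then created on top of that rule.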
 
Alwin,
Thank you for your quick response and clarification. Much appreciated. I think what I was trying to ask with the "proxmox implementation" (although poorly worded on my part) was if there were specific ways I needed to do things mainly related to the gui (as I know not all things can be accomplished with the gui). You've confirmed that I need to use the CLI to make this happen.

Related to the device class feature in CEPH: I had a workmate (who runs spinners backed by Intel Optane cards in a homogeneous config) run:

"ceph osd crush tree --show-shadow" and his OSD's showed up as the "HDD" device class.

Is this expected? (I mean, I know I can edit the device class and change it, but should I?) Since I intend to have only two classes of storage in my cluster, pure SSD and SSD/NVMe-backed HDDs, it seems it would be OK to leave the class as HDD.

Also, since I currently have all SSDs in my cluster and I want to preserve the (RBD) data in that pool and add a second pool on the new nodes with SSDs serving DB/WAL for the HDDs that I'm adding, what are the high-level steps to accomplish this safely? (This is a production environment and the first time I'm tackling this, so I really need the help to avoid making a CLM... :) ). I'm struggling with understanding the commands and steps to make this happen.

It seems like I need to:

1) Add the new SSD/HDD nodes to the Proxmox cluster
2) Install CEPH on the new nodes
3) Create the second pool with the SSD/HDDs (rough command guesses just below)
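If I'm reading the docs right, that would be roughly something like this on each new node (command names guessed from the pvecm/pveceph documentation, so please correct me if I'm off):

pvecm add <ip-of-an-existing-cluster-node>
pveceph install
pveceph createosd /dev/sd<hdd> --journal_dev /dev/sd<ssd>

...followed by a device-class rule and a new pool for the HDD class.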

Thanks again for the guidance.
 
Questions related to leveraging an SSD to back the HDD's:

If I'm going to leverage a single SSD to back multiple HDDs, what SSD-to-HDD ratio would you recommend, with the HDDs likely being 8TB (SATA connected) and the SSDs likely being ~3.5TB (SATA connected)?

Am I correct to guess that if I plan to use a single SSD to back multiple HDDs, I'll want to manually configure/size multiple partitions on the SSD first and then point each OSD at a dedicated partition on the SSD, or can I simply point, say, 8 HDDs at a single SSD device? (ex: /dev/sdb1, 2, 3 vs. all at /dev/sdb)

Thank you.
 
I have a similar configuration to yours; however, I began with hdd's with their journals on a small (200G) ssd. When I added some ssd storage I created the following rules and applied them to the appropriate pools.

ceph osd crush rule create-replicated ssdrule default host ssd
(applied to pool ssdstor)

ceph osd crush rule create-replicated hddrule default host hdd
(applied to pool hddstor)

Now data headed for the pool ssdstor is stored on device class ssd disks. Data headed for pool hddstor is stored on device class hdd disks.
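For reference, applying a rule to an existing pool was just (if I recall the exact syntax correctly):

ceph osd pool set ssdstor crush_rule ssdrule
ceph osd pool set hddstor crush_rule hddrule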

This has worked well for us. So my current cluster looks like this:

8 Proxmox 5.4 nodes, 5 with ceph installed and 3 compute nodes.

Ceph nodes have 3x 2T ssd's with colocated wal/db, and 5x 4T hdd's with wal/db on a 200G ssd.

Total capacity is 117t.

Hope this helps.


Bonus question for Alwin:

I would like to increase my hdd storage capacity. Can I take my time and just drain my 4t disks and replace them with 8t or 12t disks? My intention would be to end up with all 8's or 12's.

Thanks!
 
Thank you Kmgish! Much appreciated.

So here's another couple questions related to your response:

When you first implemented CEPH with the ssd-backed hdds, did you only have that one pool and no special ruleset for it because it was, at the time, homogeneous? (I originally implemented my cluster with Jewel and then upgraded... although I haven't upgraded from filestore to bluestore yet... at the time I was reading things about SSD performance potentially being better with filestore...)

Did you then have to do BOTH of those rules because you added the second pool?

(If so, with my one pool, could I create the first rule now and then the second rule at the time that I create the second pool with the different node configuration?)

Thanks again for the response and input.
 
"ceph osd crush tree --show-shadow" and his OSD's showed up as the "HDD" device class.

Is this expected? (I mean, I know I can edit the device class and change it, but should I?) Since I intend to have only two classes of storage in my cluster, pure SSD and SSD/NVMe-backed HDDs, it seems it would be OK to leave the class as HDD.
Yes, this is expected. The OSD has the data portion still on spinners, while only the DB/WAL is located on another device.
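If you ever did want to change a class (not needed in your case), it would go roughly like this, with N being the OSD id and the class name just an example:

ceph osd crush rm-device-class osd.N
ceph osd crush set-device-class nvme osd.N

For your two-tier setup, leaving the new OSDs as class hdd and the existing ones as class ssd is fine.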

Also, since I currently have all SSDs in my cluster and I want to preserve the (RBD) data in that pool and add a second pool on the new nodes with SSDs serving DB/WAL for the HDDs that I'm adding, what are the high-level steps to accomplish this safely?
As @Kmgish points out, first add the crush rule for the SSDs and set the existing pool to use it. Some data migration might happen at this point; this step normally runs fine and does not really interfere with client IO. Afterwards you add the new nodes with their OSDs and continue with the second rule and pool.
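In your case that would be something like the following, assuming your existing RBD pool is named rbd (replace with your actual pool name):

ceph osd crush rule create-replicated ssd_rule default host ssd
ceph osd pool set rbd crush_rule ssd_rule

Then watch ceph -s until the cluster is back to HEALTH_OK before adding the new nodes and creating the hdd rule and pool.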

Am I correct to guess that if I plan to use a single SSD to back multiple SSD's I'll want to manually configure/size multiple partitions on the SSD first and then point each OSD at a dedicated partition on the SSD or can I simply point say 8 HDD's at a single SSD device? ex: /dev/sdb1,2,3 vs all at /dev/sdb).
Judging from the quote below, I need to clarify that I am talking about Bluestore, as Filestore is the legacy backend for Ceph. In my opinion, Bluestore gives you better control over its resources and removes the constraints of a filesystem for data storage. Bluestore also checksums the data, so corrupt objects are recognized and repaired if good copies are available.

Yes, size matters. :rolleyes: The Bluestore DB will spill over to the data disk if it doesn't fit into the partition anymore. Ceph states it should be at least 4% of the data device size (e.g. 1 TB disk -> 40 GB DB).

Another point is not to overtax the SSD, as it will become the bottleneck if it cannot serve the DB fast enough. As a rule of thumb, 4:1 HDD to SATA SSD; depending on the interface, more are possible.
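To put numbers on it for the disks you mentioned: 4% of an 8 TB HDD is roughly 320 GB of DB space, and a ~3.5 TB SSD split between 4 OSDs gives about 875 GB per DB partition, so a 4:1 ratio leaves comfortable headroom. At 8:1 you would still have ~437 GB per partition, but a single SATA SSD will likely become the bandwidth bottleneck for 8 spinners.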

When you first implemented CEPH with the ssd-backed hdds, did you only have that one pool and no special ruleset for it because it was, at the time, homogeneous? (I originally implemented my cluster with Jewel and then upgraded... although I haven't upgraded from filestore to bluestore yet... at the time I was reading things about SSD performance potentially being better with filestore...)
If you don't configure any crush rules in the beginning, then every pool created will just use the default rule, which includes all available OSDs.
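You can check which rule a pool currently uses with:

ceph osd pool get <poolname> crush_rule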

Did you then have to do BOTH of those rules because you added the second pool?
Yes this is needed, see my comment above.
 
I would like to increase my hdd storage capacity. Can I take my time and just drain my 4t disks and replace them with 8t or 12t disks? My intention would be to end up with all 8's or 12's.
If you distribute them evenly, then yes. If you have a free slot, you can do the draining and addition at the same time and spare some rebalance time.
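Draining a single OSD would look roughly like this (N being the OSD id; these are the generic ceph commands, the pveceph destroyosd helper should cover the removal part as well):

ceph osd crush reweight osd.N 0
(wait for the rebalance to finish, watch ceph -s)
ceph osd out N
systemctl stop ceph-osd@N
ceph osd purge N --yes-i-really-mean-it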
 
Thanks Alwin, that's what I thought. My DR cluster is now replaying journals of 30 images, just over 4t across a 150m internet circuit via sdwan. Hence the need for a little capacity upgrade. Thanks again for the rbd-mirror instructions!

dmulk:
We never had/allowed a "homogeneous" environment. We created the rules ahead of adding the ssd's for storage. I would create and apply a rule for your current class of disks and apply to the appropriate pools. Let things settle and then create the rule/pools for the new class of disks and then add your new hardware.
 
@Alwin Thank you for your detailed response and time. Much appreciated.

One final (probably obvious) question related to storing the metadata (DB/WAL) on SSD:

If I am going to do a 4-to-1 ratio (assuming the disk's bandwidth can handle the throughput for this), is it required to manually create 4 individual partitions and point each OSD to its respective partition, or can I point the 4 OSDs to the whole disk and CEPH will handle how it's used (up to its maximum capacity)?

The reason I ask is that there were some posts in another thread on here that I didn't fully understand, which I think were related to this topic... it seemed there were multiple ways to "skin a cat", if you will.

Thanks again for your time!

<D>
 
Thanks Alwin, that's what I thought. My DR cluster is now replaying journals of 30 images, just over 4t across a 150m internet circuit via sdwan. Hence the need for a little capacity upgrade. Thanks again for the rbd-mirror instructions!

dmulk:
We never had/allowed a "homogeneous" environment. We created the rules ahead of adding the ssd's for storage. I would create and apply a rule for your current class of disks and apply to the appropriate pools. Let things settle and then create the rule/pools for the new class of disks and then add your new hardware.

Makes total sense. Thanks again for the time and input on this! :)
 
If I am going to do a 4-to-1 ratio (assuming the disk's bandwidth can handle the throughput for this), is it required to manually create 4 individual partitions and point each OSD to its respective partition, or can I point the 4 OSDs to the whole disk and CEPH will handle how it's used (up to its maximum capacity)?
Ceph will use its rather small default partition size when you just specify a disk. You can either change the size to a more sensible one or create the partitions beforehand. You just need to make sure the partitions have a proper UUID, otherwise ceph-disk might link to the device name and not the UUID.
https://forum.proxmox.com/threads/where-can-i-tune-journal-size-of-ceph-bluestore.44000/#post-210638
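If you let ceph-disk create the partitions, the size it uses comes from the bluestore_block_db_size option (value in bytes) in ceph.conf, set before the OSDs are created, for example:

[osd]
bluestore_block_db_size = 322122547200

That would be ~300 GiB; pick whatever matches your sizing target.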
 
Ceph will use its rather small default partition size when you just specify a disk. You can either change the size to a more sensible one or create the partitions beforehand. You just need to make sure the partitions have a proper UUID, otherwise ceph-disk might link to the device name and not the UUID.
https://forum.proxmox.com/threads/where-can-i-tune-journal-size-of-ceph-bluestore.44000/#post-210638

Hmmm...I guess where I'm confused is if I point, say, each OSD I'm creating at the same drive, will the creation process be smart enough to create partitions automatically without overwriting other partitions? (I've never been through the process before...).

Apologies for the elementary questions. Thanks for the patience. :)
 
Hmmm...I guess where I'm confused is if I point, say, each OSD I'm creating at the same drive, will the creation process be smart enough to create partitions automatically without overwriting other partitions?
It will do that.
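So on a new node it can be repeated invocations against the same DB device, e.g. (the flag name is from the PVE 5.x pveceph documentation, double-check it on your version):

pveceph createosd /dev/sdc --journal_dev /dev/sdb
pveceph createosd /dev/sdd --journal_dev /dev/sdb
pveceph createosd /dev/sde --journal_dev /dev/sdb
pveceph createosd /dev/sdf --journal_dev /dev/sdb

ceph-disk then adds one DB partition per OSD on /dev/sdb.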

(I've never been through the process before...).
You can create a virtual Proxmox VE + Ceph cluster and play around with it before you do this on the production system. ;)
 
