PVE Ceph Rules for HDD Pools of Different Sizes

Illydth

New Member
Aug 23, 2022
I apologize, as I am sure this question has been answered a thousand times, but I cannot find the appropriate documentation, and I'm apparently still too new on my Proxmox/Ceph journey to come up with the right Google terms.

I have a 3-node Proxmox cluster with all HDD disks and an OSD/pool already set up: the Ceph cluster uses the default replicated rule with 9x 1T HDD disks and 3 replicas...so ~3T of usable storage for the first pool.

I am now adding 3 additional 360G disks (again HDD). I would like these 3 HDD disks to be in their own Ceph Pool. What I'd ACTUALLY like is to remove CephFS from the first OSD pool (which is for VM images only) and set it up on the second OSD pool (the smaller one, intended for small docker/K8s volumes using CephFS).

I can find a TON of information and posts on how to create CRUSH rules for differing disk types (SSD vs. HDD), but all of that information relies on using the device class to automatically separate the OSDs into the separate rules. Even in the one post I found where the person had all devices listed as SSD, he eventually just set some of the devices to NVMe so that he could use the standard methods of segregating his drive sets into two different rules.

I've found no information on how to create a crush rule and assign specific OSDs to that rule for disks of the same type without using different disk classes.

HELP? I'm happy to read documentation. I've looked at https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_device_classes, which keeps getting linked, but everything there talks about setting CRUSH rules based upon DIFFERENT device classes, whereas I just want to separate same-classed OSDs into distinct pools for distinct use cases.

I'm not sure where to start creating a second crushmap rule and then telling Ceph to assign OSD 9, 10, and 11 to it.

Again, I apologize my Google Fu is obviously failing me.
 
6 hours into my search and there are no answers that don't boil down to "I have HDDs and SSDs, how do I separate them?" with the answer:

ceph osd crush rule create-replicated hddpool <root> <failure-domain> hdd
ceph osd crush rule create-replicated ssdpool <root> <failure-domain> ssd

And, magically, Ceph knows all about the disk types and moves everything around for me...except I don't have different disk types.

Excuse the frustration, but this feels like somewhere in the neighborhood of 20 seconds of work if only I had some post from someone who's done something even similar.
 
The answer is correct, but ssd and hdd are not the only possible names for disk types!

I tested this a while ago but never used it in production. Take a look at the raw write-up below. It's not complete or 100% correct and contains other things too, but it gives an idea of how it works.
Creating your own device classes looks like the way to get this working.
Set the device class of the disks to HDDTYPE1 and HDDTYPE2, attach the OSDs to these new device classes, and create CRUSH rules for the two types.

CLASS, RULENAME, OWNCLASS, YOUR_POOL, ... are names given by the user.

Code:
# use more pools, beta
ceph osd crush rule create-replicated replicated_rule_hdd default host hdd
ceph osd crush rule create-replicated replicated_rule_ssd default host ssd
ceph osd pool rename ssd ssd500MB
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
ceph osd tree
ceph osd crush rm

# here starts the interesting part
#ceph osd crush rm-device-class osd.$OSD
#ceph osd crush set-device-class CLASS osd.$OSD   # CLASS can be a user-defined type
ceph osd crush rm-device-class OSD-ID
ceph osd crush set-device-class OWNCLASS OSD-ID
ceph osd crush rule create-replicated RULENAME default host OWNCLASS
ceph osd pool set YOUR_POOL crush_rule RULENAME
ceph osd pool set device_health_metrics crush_rule replicated_rule   # correct?
ceph osd crush rule rm RULENAME   # to get rid of it?
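To check the result there are a few read-only commands; this is only a sketch using the same placeholder names as above (OWNCLASS, RULENAME, YOUR_POOL), so substitute your own:

Code:
# list all device classes Ceph currently knows about (custom ones included)
ceph osd crush class ls
# list the OSDs carrying a given class
ceph osd crush class ls-osd OWNCLASS
# show the per-class "shadow" hierarchies CRUSH builds internally
ceph osd crush tree --show-shadow
# confirm the rule exists and see which rule a pool is using
ceph osd crush rule ls
ceph osd pool get YOUR_POOL crush_rule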
 
So it sounds like I could do something like the following (if someone could please confirm, I would be most appreciative):

Code:
# Create a new crush rule for a new pool.
ceph osd crush rule create-replicated replicated_rule_hddpool2 default host hdd2
# Remove the current device class on the OSDs I want to move to the new pool.
ceph osd crush rm-device-class osd.[9|10|11]
# Add new device classes to the OSDs to move.
ceph osd crush set-device-class hdd2 osd.[9|10|11]
# Create a new CEPH Pool associated with the new CRUSH Rule.
ceph osd pool set hddpool2 crush_rule replicated_rule_hddpool2

Is that the gist of the command line?

* I'm creating a new rule called "replicated_rule_hddpool2" and a new device class called "hdd2".
* This rule says "any OSD with device class 'hdd2' should be placed by the new 'replicated_rule_hddpool2' CRUSH rule".
* I'm removing the device class (hdd) from the OSDs I want to move.
* I'm adding the new device class (hdd2) to the OSDs I want to move, thus pointing those OSDs at hdd2/replicated_rule_hddpool2.
* I finally create a pool named "hddpool2" which uses the new replicated_rule_hddpool2.

By creating the pool hddpool2, the OSDs I tagged with "hdd2" end up backing the hddpool2 pool.

Confirmation? It makes sense, obviously, but is "device class" really just an identifier? All the documentation makes it sound like "hdd", "ssd", and "nvme" are special names that mean something to Ceph and that those are the only 3 possible values...and further that these values set some kind of internal "performance" tuning metrics that allow SSDs or HDDs to act properly given their differences in speed.
 
Apologies for the length of this, but if I had come across a post like this when I was looking for answers, I would have FULLY understood the solution to my problem instead of spending almost 10 hours reading documentation and searching the internet and still NOT finding the answer.

Huge thanks to Toranaga above for leading me down the right path. And whoever is maintaining the docs on the Proxmox wiki might consider reading the semi-rant at the bottom of this post and deciding whether it merits a blurb or two in the way-over-linked https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_device_classes article.

------------

So following what I wrote above gives the following error:

# ceph osd crush rule create-replicated replicated_rule_hddpool2 default host hdd2
Error EINVAL: device class hdd2 does not exist

HOWEVER, that's because you apparently need at least one device assigned to the device class BEFORE you can create a rule for that class...i.e. the "rm-device-class/set-device-class" and "crush rule create" lines above are in the wrong order.

Code:
# Remove the current device class on the OSDs I want to move to the new pool.
$> ceph osd crush rm-device-class osd.$OSDNUM
# Add new device classes to the OSDs to move.
$> ceph osd crush set-device-class hdd2 osd.$OSDNUM
# Create a new crush rule for a new pool.
$> ceph osd crush rule create-replicated replicated_rule_hdd2 default host hdd2
# Associate the new Ceph pool with the new CRUSH rule (the pool itself must already exist, e.g. created via the GUI).
$> ceph osd pool set hddpool2 crush_rule replicated_rule_hdd2
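To confirm the OSDs actually picked up the new class and that the pool follows the new rule (and to watch the data movement the rule change kicks off), a few read-only checks should be enough; the names match the example above:

Code:
# The CLASS column should now show hdd2 for osd.9, osd.10 and osd.11.
$> ceph osd tree
# Which CRUSH rule is the pool using?
$> ceph osd pool get hddpool2 crush_rule
# Watch cluster health / data movement after the change.
$> ceph -s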

So, because this was impossible for me to find "out there", here is the breakdown for anyone who runs across this while trying to do the same thing I have:

In the Code Above:
  • $OSDNUM is the OSD identifier. When you run "ceph osd tree" it will show the OSDs on your hosts; each OSD is named "osd.#" where # is a consecutive identifier for the OSD. Probably didn't need to mention that, but let's call this "comprehensive" documentation.
  • hdd2 is a user-defined label for a new device class. As noted below, this can be ANYTHING you'd like it to be. This value is arbitrary and carries NO significance within Ceph at all. (See below.)
  • There must be AT LEAST one OSD known to Ceph on the new device class before running the "ceph osd crush rule" command. Otherwise you will get "Error EINVAL: device class <CLASSNAME> does not exist". This error DOES NOT mean that the device class names are a list of known values; it means that Ceph couldn't find an OSD with that device class already in the cluster. Run "rm-device-class" and "set-device-class" first.
  • replicated_rule_hdd2 is a user-defined name for a new CRUSH rule. Without modification, you will likely have the rule "replicated_rule" already defined in your CRUSH map...you can use anything you want in place of this text EXCEPT the name of any existing rule in your CRUSH map.
  • hddpool2 is another arbitrarily defined name, this time the name of a new pool in Ceph which will be set to use the new CRUSH rule.
The first two commands simply remove and add a distinct label on each OSD you want to put into the new pool.
The third command creates a CRUSH rule that places data only on OSDs carrying that distinct label.
The fourth command points a pool at the new CRUSH rule created by the third command.

Thus this boils down to:
  • Create a Label
  • Assign the Label to a new Rule
  • Assign the Rule to a new Pool
Note that the 4th command can be replaced by using the Proxmox GUI for Ceph to create the new pool: after running the "ceph osd crush rule" command, the new rule immediately shows up in the CRUSH rule dropdown when you click "Create" in the Ceph Pool interface.
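For completeness, here is the whole thing end to end for my specific case (OSDs 9, 10 and 11 going into the new pool), done entirely on the CLI instead of the GUI. The PG count of 32 is just an example value, not a recommendation:

Code:
# Re-tag the three 360G OSDs with the custom device class.
$> for OSDNUM in 9 10 11; do ceph osd crush rm-device-class osd.$OSDNUM; ceph osd crush set-device-class hdd2 osd.$OSDNUM; done
# Create the rule that only places data on hdd2 OSDs.
$> ceph osd crush rule create-replicated replicated_rule_hdd2 default host hdd2
# Create the pool and bind it to the new rule at creation time (32 PGs is an example value).
$> ceph osd pool create hddpool2 32 32 replicated replicated_rule_hdd2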

And that's it.

The most important lesson I learned from this exercise:

The Device Class is NOT a sacrosanct value. It is nothing more than a text "tag" you can apply to and remove from OSD devices. I could have called my new device class "fred" or "new-device-class" or "I_hate_world_of_warcraft"; it has no meaning to Ceph whatsoever. The terms HDD, SSD and NVMe DO have meaning in the technical world and SOUND like they are important to "get right", but that is simply not the case here. These tags DO NOT set any hidden tuning parameters within Ceph or cause Ceph to deal with the OSDs any differently.
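You can see this for yourself by decompiling the CRUSH map; the device class is nothing more than a text token on each device line (the /tmp paths below are just wherever you want to dump the files):

Code:
# Dump and decompile the binary CRUSH map (crushtool ships with Ceph).
$> ceph osd getcrushmap -o /tmp/crushmap.bin
$> crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
# The devices section contains lines like "device 9 osd.9 class hdd2":
# the class is just a label attached to the device entry.
$> grep "^device" /tmp/crushmap.txt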

The problem with ALL of the documentation on the net with regard to "device classes" is that it talks about separating OSDs by speed and thus makes the tags "hdd" and "ssd" sound like they have some inherent importance or meaning...after all, HDDs and SSDs are two VERY different devices with VERY different performance profiles, so it must be important that one device's class is set to "hdd" and another device's class is set to "ssd", right? In fact, no, the device class has no importance AT ALL other than to group OSDs with class "$A" separately from OSDs with class "$B"...the fact that "$A" in this case is named "hdd" is utterly irrelevant.

Even the term "device class" is a misnomer that creates confusion: it gives what actually amounts to a text tag a misleading importance by calling it a "class" and then using names for those tags that have REAL technical implications with very distinct and different profiles. Because the term "device class" and the tag text "hdd", "ssd" and "nvme" DO match industry definitions of distinct devices, AND we refer to these categories as "classes" of storage devices, one is incorrectly led to assume an importance for this text tag that simply does not exist.

The only reason to tag an HDD with the "hdd" device class or an SSD with the "ssd" device class is that Ceph does some automatic magic for you: when you bring an OSD online it detects the ACTUAL device type and assigns a matching device class tag. So by default your HDDs will get the tag "hdd" assigned to them by Ceph...however, it is still important to understand that just because Ceph assigned the text label "hdd" to a disk that reports itself as a spinning disk with platters using magnetism to store data, this is simply correlation...in the same way that "all criminals drink water, so drinking water causes criminality" is a correlation. Ceph could have assigned the text "NVME" to that same spinning disk and it wouldn't matter in the slightest to how Ceph handles the OSD.

If you decide to assign "john" to your HDD devices and "sara" to your SSD devices, there's nothing stopping you from doing so; however, it's possible that upon a reboot/restart of Ceph it will re-tag your devices back to "hdd" and "ssd"...NOT because it's important to Ceph that a spinning disk is labeled "hdd", but because it MIGHT be important and convenient to you, the Ceph user, that your spinning rust and your "storage on chip" disks end up in different pools...the fact that the labels used to separate your "fast" disks from your "slow" disks match industry-standard device names is unfortunate and confusing.

To stop that, the following appears to be the defined method: modify your local /etc/ceph/ceph.conf to include an [osd] entry:

Code:
[osd]
osd_class_update_on_start = false
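Apparently the same option can also be set through Ceph's centralized config database instead of editing ceph.conf; I haven't tried this path myself, so treat it as a pointer rather than a tested recipe:

Code:
# Set the option for all OSDs via the monitors' config store.
$> ceph config set osd osd_class_update_on_start false
# Verify the value.
$> ceph config get osd osd_class_update_on_start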
 
Thanks for the info about reboot.
I never tried a reboot. I added it to my raw documentation.

By the way: maybe it is better to name it "hdd2pool" instead of "hddpool2". That gives you a more consistent naming convention.
 
