[TUTORIAL] Ceph - Classes, Rules and Pools

Seed

Renowned Member
Oct 18, 2019
Index:
1. Create Classes
2. Crushmap
3. Creating Rules
4. Adding Pools

--------

Piecing together information I've obtained from various threads, I thought I'd compile it here for, hopefully, easier consumption and use.

If you're visiting this topic you probably already know what Ceph is and what it can do for you. Ceph has been really cool for me so far, and Proxmox overall is pretty cool too. A couple of things we'll address in this small tutorial:
  • OSD classes in Ceph are, by default, HDD and SSD.
    • Even if you have NVMe drives, they'll get associated with the SSD class. That's kind of crappy; it's like defaulting SSDs to the HDD class. We really need three class types here.
    • Classes give us a way to assign VMs or LXC containers to specific device types, which means you can put some things on HDD, others on SSD, and still others on super fast NVMe storage.
    • Me, for example: I want large media storage on HDD, but the containers and VMs processing that data on NVMe drives. SSD runs other things that aren't storage or speed intensive, like a distil instance checking prices of things for me around the internet.
  • By default, any pool you create spans all of your OSDs.
    • There's no way to isolate devices to a specific pool out of the box.
    • If you're like some of us, you have multiple nodes with multiple device types; take ssd, hdd and nvme as examples.
    • As mentioned above, we want to group these into separate pools.

Creating a New Class
Classes, if you don't know, are a way to categorize devices, essentially by their speed. You could also create classes based on size, I suppose. There are a few useful ways to leverage classes. For this tutorial, we're going to add an nvme class.
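
If you want to see which device classes your cluster already knows about, you can list them first:
ceph osd crush class ls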

List OSDs and their classes:
ceph osd crush tree --show-shadow

Code:
ID  CLASS WEIGHT    TYPE NAME
-16  nvme   4.54498 root default~nvme
-10        37.75299     host host1
13   hdd   9.09499         osd.13
14   hdd   9.09499         osd.14
15   hdd   9.09499         osd.15
16   hdd   9.09499         osd.16
20   ssd   0.90900         osd.20     (these 0.90900-weight drives, 1 TB each, are the NVMe disks, yet they show as ssd; note the OSD number, 20 in this case)
17   ssd   0.46500         osd.17
-7        37.74199     host host2
  8   hdd   9.09499         osd.8
  9   hdd   9.09499         osd.9
10   hdd   9.09499         osd.10
11   hdd   9.09499         osd.11
  7   ssd   0.90900         osd.7       ***
18   ssd   0.45399         osd.18
-3        39.56000     host host3
  0   hdd   9.09499         osd.0
  1   hdd   9.09499         osd.1
  2   hdd   9.09499         osd.2
  3   hdd   9.09499         osd.3
  4   ssd   0.90900         osd.4       ***
  5   ssd   0.90900         osd.5       ***
  6   ssd   0.90900         osd.6       ***
19   ssd   0.45399         osd.19

*** Note the IDs of the devices you want to change, then unset the old class and set the new one on each device:

Unset the class ssd for device 20:
ceph osd crush rm-device-class osd.20

Set the class to nvme for device 20:
ceph osd crush set-device-class nvme osd.20

Do the above for all OSD IDs that you want to change to a particular class.
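
For example, to reclassify all the OSDs marked *** in the output above (osd.20, osd.7, osd.4, osd.5 and osd.6 here; substitute your own IDs), a small shell loop saves some typing:

Code:
# reclassify the NVMe-backed OSDs from ssd to nvme (IDs taken from the example output above)
for id in 20 7 4 5 6; do
    ceph osd crush rm-device-class osd.$id
    ceph osd crush set-device-class nvme osd.$id
done

# verify the classes and the shadow tree afterwards
ceph osd crush class ls
ceph osd crush tree --show-shadow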

Now we need to make sure Ceph doesn't re-assign the default classes, so we disable the class update on start. This keeps our custom classes from being overwritten when the OSDs come back up.

Modify your local /etc/ceph/ceph.conf to include an osd entry:
Code:
[osd]
osd_class_update_on_start = false

You probably need to restart the OSD service for this to take effect (to be honest, I'm not 100% sure it's required):
systemctl restart ceph-osd.target
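
On Ceph Mimic or newer you can also set this cluster-wide through the monitor's central config store instead of editing ceph.conf on every node; a sketch (verify the option name with ceph config get on your version):

Code:
# set the option for all OSDs via the config database
ceph config set osd osd_class_update_on_start false

# confirm it took effect
ceph config get osd osd_class_update_on_start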


Crushmap

You can see the crush map in the UI at Host >> Ceph >> Configuration.

To work with a crush map we need to do the following:
  1. Export and decompile the crush map
  2. Read and edit the crush map
  3. Recompile and import the crush map
  4. Bind rules to pools

Export the crush map to a compiled file:
ceph osd getcrushmap -o crushmap.compiled

Decompile the crush map:
crushtool -d crushmap.compiled -o crushmap.text

Recompile the edited crush map into a new file, keeping the original compiled file intact (manage the files however you prefer):
crushtool -c crushmap.text -o crushmap.compiled.new

Import the new crush map:
ceph osd setcrushmap -i crushmap.compiled.new
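
Keep the original crushmap.compiled around; if the edited map misbehaves, you can roll back by simply re-importing it:
ceph osd setcrushmap -i crushmap.compiled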

Now that we know how to edit a crushmap, let's edit a crushmap!


Creating New Rules

Alright, we have a new class created, but it isn't doing anything yet. Now we need to associate the class with a rule.

Using the crush map instructions above, decompile your crush map and edit the rules section. The default rule with ID 0 is usually named replicated_rule.

I decided to delete the default rule and replace it with three rules. Rules are what link classes to pools, which we'll do in the last step. Delete the original rule (id 0) and replace it with the three below. That works for this example because we have three device types; use your discretion for your own environment. Most setups would create just an SSD and an HDD rule, for example.

Code:
# rules
rule replicated_hdd {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated_ssd {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default class ssd
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated_nvme {
    id 2
    type replicated
    min_size 1
    max_size 10
    step take default class nvme
    step chooseleaf firstn 0 type host
    step emit
}

Don't forget to compile and import your new crush file!
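
After importing, you can confirm the new rules are active without decompiling anything:

Code:
# list all crush rules the cluster knows about
ceph osd crush rule ls

# dump a single rule, e.g. the nvme one, to check its steps
ceph osd crush rule dump replicated_nvme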

You can test the crush file like so, where --show-X stands for one of crushtool's --show-* options (e.g. --show-statistics); the --show-* output fails for me for some reason as of this writing:

crushtool -i crushmap.compiled.new --test --show-statistics
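
A more targeted test simulates placements for a single rule with a given replica count (rule 2 is replicated_nvme in the example above):

Code:
# simulate placements for rule id 2 with 3 replicas and print the mappings
crushtool -i crushmap.compiled.new --test --rule 2 --num-rep 3 --show-mappings

# or just print summary statistics for the same rule
crushtool -i crushmap.compiled.new --test --rule 2 --num-rep 3 --show-statistics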

Create Pool
Go into Proxmox UI Host >> Ceph >> Pools

Create a pool with a name like pool-ssd and bind the new rule to that pool. Any OSD carrying the matching device class will then automagically back that pool, based on your needs. I think this is pretty damn cool. Don't you?

Your new rules are available in the pool creation:
Screen Shot 2019-11-09 at 9.05.25 PM.png

Screen Shot 2019-11-09 at 9.14.15 PM.png

Screen Shot 2019-11-09 at 9.14.39 PM.png
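
If you prefer the command line over the GUI, the same thing can be done with ceph or pveceph; a sketch, assuming a pool named pool-nvme and 128 placement groups (pick a PG count that fits your cluster, and check pveceph pool create --help for the exact options on your version):

Code:
# plain Ceph: create a replicated pool bound to the replicated_nvme rule
ceph osd pool create pool-nvme 128 128 replicated replicated_nvme

# or point an existing pool at the new rule
ceph osd pool set pool-nvme crush_rule replicated_nvme

# Proxmox wrapper, which can also add the storage definition for you
pveceph pool create pool-nvme --crush_rule replicated_nvme --add_storages 1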


Now you can do the following:
  • Upload ISOs to a pool resource so ALL nodes can install from it. Woohoo!
    • Correction: Ceph here is RBD storage only and therefore doesn't support the ISO Image content type, only Container and Disk Image.
  • I believe HA is now feasible, since you can push VMs onto these shared pools. That's what I'm going to test out tomorrow.
  • Install VMs on the media that best fits your needs (a CLI sketch for moving an existing disk follows below).
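
A minimal sketch of that last point from the CLI, assuming a VM with ID 100, its disk on scsi0, and a Proxmox storage named pool-nvme (the "Move disk" button in the GUI does the same):

Code:
# move VM 100's scsi0 disk onto the pool-nvme storage and delete the old copy
qm move_disk 100 scsi0 pool-nvme --delete 1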
Enjoy!
 
Excellent, it worked fine for me, congratulations on a great contribution. However, one more thing to add: these values can no longer be used when compiling.
1675787704675.png
 
I am not seeing here how to create a class, only how to assign an existing one?

I want to add new SSDs, but I don't want them to share the default replicated rule, so I created a new replicated rule. But how do I create a new class, let's say ssd_group2?

Is that possible?
 
I want to add new SSDs, but I don't want them to share the default replicated rule, so I created a new replicated rule. But how do I create a new class, let's say ssd_group2?
You can just specify it. If you create a new OSD in the GUI, make sure that the "Advanced" checkbox is enabled. Then either select an existing class, or just type in the name you want to call it.
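
For the CLI route, pveceph can take the class at OSD-creation time; a sketch, where /dev/sdX and osd.<id> are placeholders (check pveceph osd create --help on your version for the exact option name):

Code:
# create the OSD with a custom device class straight away
pveceph osd create /dev/sdX --crush-device-class ssd_group2

# or create it normally and re-class it afterwards, as in the tutorial above
ceph osd crush rm-device-class osd.<id>
ceph osd crush set-device-class ssd_group2 osd.<id>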
 
From a testing perspective, I also have nvme, ssd, and SAS HDD, so I want to set this up.

Quick question though, because I like to tinker: instead of replacing the crush rules completely, couldn't I just add these three new rules as additional options?

I.e., leave the large OSD pool that was created by default and add these, so that I have 4 install options? I could then do some easy side-by-side testing.

EDIT:
Well, went ahead and tested this out with some interesting results:
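
The stats below follow the rados bench output format; to run a similar 20-second write test yourself (the pool name here is just an example):

Code:
# 20-second 4 MiB-object write benchmark against a pool; keep the objects for a follow-up read test
rados bench -p pool-nvme 20 write --no-cleanup

# optional: sequential read test against the same objects, then clean up
rados bench -p pool-nvme 20 seq
rados -p pool-nvme cleanup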

NVME - Spread across the cluster
3x Samsung Evo 500gb
2x Hynix 250gb

Total time run: 20.1506
Total writes made: 2777
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 551.249
Stddev Bandwidth: 107.437
Max bandwidth (MB/sec): 692
Min bandwidth (MB/sec): 288
Average IOPS: 137
Stddev IOPS: 26.8592
Max IOPS: 173
Min IOPS: 72
Average Latency(s): 0.115974
Stddev Latency(s): 0.0759412
Max latency(s): 0.457073
Min latency(s): 0.0163699

SSD - Spread across the cluster - 2 drives per node

Total time run: 21.554
Total writes made: 810
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 150.32
Stddev Bandwidth: 45.7521
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 28
Average IOPS: 37
Stddev IOPS: 11.438
Max IOPS: 56
Min IOPS: 7
Average Latency(s): 0.412821
Stddev Latency(s): 0.547319
Max latency(s): 3.28206
Min latency(s): 0.0228456

HDD - ALL on one machine
Didn't expect it to be a close 2nd to the NVMe
This is an 8-drive SAS array (2x 15k, 6x 10k)

Total time run: 20.2257
Total writes made: 2023
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 400.085
Stddev Bandwidth: 104.448
Max bandwidth (MB/sec): 536
Min bandwidth (MB/sec): 116
Average IOPS: 100
Stddev IOPS: 26.1119
Max IOPS: 134
Min IOPS: 29
Average Latency(s): 0.158852
Stddev Latency(s): 0.239267
Max latency(s): 3.03524
Min latency(s): 0.0337089
 
After you change a pool to a new rule, did you also see the "Optimal # of PGs" go to "n/a"?

Screenshot from 2023-11-15 16-24-43.png

Package Versions:
Code:
proxmox-ve: 7.4-1 (running kernel: 5.15.126-1-pve)
pve-manager: 7.4-17 (running version: 7.4-17/513c62be)
pve-kernel-5.15: 7.4-7
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.126-1-pve: 5.15.126-1
pve-kernel-5.15.111-1-pve: 5.15.111-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.4.27-1-pve: 5.4.27-1
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: not correctly installed
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-network-perl: 0.7.3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
openvswitch-switch: not correctly installed
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-6
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 
After you change a pool to a new rule, did you also see the "Optimal # of PGs" go to "n/a"?
Once you start to assign device-specific rules, you should assign each pool to one. Otherwise you will most likely have an overlap: some rules target specific device classes, while the default replicated_rule does not make that distinction. This results in an unsolvable situation for the autoscaler, and therefore the "n/a" optimal PG count.
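
A sketch of the fix, using example pool names; point every pool at one of the device-class rules and the autoscaler can compute optimal PG counts again:

Code:
# bind each pool to a device-class-specific rule
ceph osd pool set pool-hdd crush_rule replicated_hdd
ceph osd pool set pool-ssd crush_rule replicated_ssd

# check that the autoscaler is happy again
ceph osd pool autoscale-status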
 
