[SOLVED] crushmap rule for nvme

RobFantini

Hello

I am trying to add a Ceph CRUSH map rule for nvme.

I added this:
Code:
rule replicated_nvme {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default class nvme
    step chooseleaf firstn 0 type host
    step emit
}

Then I ran:
Code:
crushtool -c crushmap.txt -o crushmap.new

No error is returned; however, crushmap.new is not created.
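
The full round trip, as I understand it from the docs, is roughly this (file names are just examples):
Code:
# dump and decompile the current map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt, then recompile
crushtool -c crushmap.txt -o crushmap.new
# inject it back into the cluster (only after testing)
ceph osd setcrushmap -i crushmap.new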

I've tried a few things and have been reading docs like http://docs.ceph.com/docs/master/rados/operations/crush-map/, searching, etc.

I am hoping to set up a rule that groups the nvme OSDs together,
and perhaps something that ceph-volume --crush-device-class could use. I am still researching.

We are using ceph version 12.2.12-pve1.

Does anyone have a suggestion on how to set up a rule for nvme?
 
I was able to add a rule; however, I could not specify the class type.
The following broke Ceph, do not do it like this:
Code:
# rules
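# note: neither rule below has a 'step take ...' line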
rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step chooseleaf firstn 0 type host
        step emit
}
rule replicated_nvme {
        id 1
        type replicated
        min_size 1
        max_size 10
        step chooseleaf firstn 0 type host
        step emit
}

We have 3 NVMe drives on order. When I add OSDs to use replicated_nvme, perhaps specifying 'step take default class' will work.
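
Once the map compiles, I should be able to sanity-check which OSDs the rule actually picks before injecting it, something like this (rule id and replica count are just examples):
Code:
crushtool -i crushmap.new --test --rule 1 --num-rep 3 --show-mappings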
 
How would I go about setting up a Ceph pool that uses only OSDs set up on NVMe?

From searching this forum, it seems a rule needs to be set up first.

Any suggestions on where to look to figure this out? I'll of course do more research later on.
 
First try:
Code:
# ceph osd crush rule create-replicated replicated_nvme  default  host nvme
Error EINVAL: device class nvme does not exist

So I will figure out how to add device class 'nvme'.

It looks like an OSD needs to be created on an NVMe first? Then the class will get made automatically? The drives come in tomorrow.
 
RobFantini said:
It looks like an OSD needs to be created on an NVMe first? Then the class will get made automatically?

Yes, that is my understanding. You can see your currently recognized classes with:
Code:
ceph osd crush class ls
[
    "hdd"
]

It is possible to "re-class" an existing OSD such that you could create your CRUSH rule, but I don't recommend it.
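
For reference, re-classing would look roughly like this (osd.0 is just a placeholder), but again, I would not do it on a production cluster:
Code:
ceph osd crush rm-device-class osd.0
ceph osd crush set-device-class nvme osd.0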
 

Thank you for the reply!

I think reclassifying could break an existing pool, so I'll stay away from that. We have had Ceph in production since 2017 and do not have a testing cluster.

Also, I'll try this to create OSDs on the NVMe; let's see if the class gets created at the same time:
Code:
ceph-volume lvm batch --osds-per-device 2  --crush-device-class  nvme  /dev/nvme0n1
 
OK, this worked:
Code:
# ceph-volume lvm batch --osds-per-device 2  --crush-device-class  nvme  /dev/nvme0n1

Total OSDs: 2

  Type            Path                                                    LV Size         % of device
----------------------------------------------------------------------------------------------------
  [data]          /dev/nvme0n1                                            447.13 GB       50%
----------------------------------------------------------------------------------------------------
  [data]          /dev/nvme0n1                                            447.13 GB       50%
--> The above OSDs would be created if the operation continues
--> do you want to proceed? (yes/no) yes
Running command: vgcreate --force --yes ceph-dbd94557-9eaf-4dda-b4cd-ccd985e0152b /dev/nvme0n1
 stdout: Physical volume "/dev/nvme0n1" successfully created.
 stdout: Volume group "ceph-dbd94557-9eaf-4dda-b4cd-ccd985e0152b" successfully created
Running command: lvcreate --yes -l 114464 -n osd-data-6f088d28-a0be-4186-a1b6-29dc7e1de375 ceph-dbd94557-9eaf-4dda-b4cd-ccd985e0152b
 stdout: Logical volume "osd-data-6f088d28-a0be-4186-a1b6-29dc7e1de375" created.
...
Running command: systemctl enable --runtime ceph-osd@4
 stderr: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@4.service → /lib/systemd/system/ceph-osd@.service.
Running command: systemctl start ceph-osd@4
--> ceph-volume lvm activate successful for osd ID: 4
--> ceph-volume lvm create successful for: ceph-dbd94557-9eaf-4dda-b4cd-ccd985e0152b/osd-data-410ce862-7653-473a-bf9a-dfa867982b41
And there is an nvme class:
Code:
# ceph osd crush class ls
[
    "ssd",
    "nvme"
]

ceph -s shows the new OSDs seem to have been added to the existing pool (the default rule does not filter by device class, so data is rebalancing onto them):
Code:
# ceph -s
  cluster:
    id:     220b9a53-4556-48e3-a73c-28deff665e45
    health: HEALTH_WARN
            65352/731910 objects misplaced (8.929%)
  services:
    mon: 3 daemons, quorum pve3,pve10,pve14
    mgr: pve3(active), standbys: sys8, pve14, pve10
    osd: 44 osds: 44 up, 44 in; 221 remapped pgs
  data:
    pools:   1 pools, 1024 pgs
    objects: 243.97k objects, 901GiB
    usage:   2.21TiB used, 13.9TiB / 16.2TiB avail
    pgs:     0.195% pgs not active
             65352/731910 objects misplaced (8.929%)
             798 active+clean
             219 active+remapped+backfill_wait
             5   active+remapped+backfilling
             2   activating
  io:
    client:   0B/s rd, 1.67MiB/s wr, 0op/s rd, 83op/s wr
    recovery: 562MiB/s, 157objects/s

So next: how do I make a new pool that uses just class nvme?
 
Hello RokaKen, thanks for the reply. I was in the middle of writing the following when you posted.
I'll also check the post you just wrote:

This is what I intend to try:
Code:
ceph osd crush rule create-replicated replicated_nvme  default  host nvme

Then make a new pool in PVE using the new rule.
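
From the CLI I think the equivalent would be something like this (pool name and PG count are just examples):
Code:
ceph osd pool create rbd_nvme 128 128 replicated replicated_nvme
ceph osd pool application enable rbd_nvme rbd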
 
OK, getting close. I have an issue with the OSDs created:

The nvme pool shows 0 storage available [I tried to move a VM's disk, it showed 0 bytes available].
From PVE when trying to move the disk:
Code:
create full clone of drive scsi1 (ceph_vm:vm-9109-disk-1)
TASK ERROR: storage migration failed: error with cfs lock 'storage-nvme_vm': rbd error: rbd: list: (95) Operation not supported

df shows strange mount info:
Code:
# df
Filesystem                     Type      Size  Used Avail Use% Mounted on
/dev/sdj1                      xfs        94M  5.5M   89M   6% /var/lib/ceph/osd/ceph-41
/dev/sdb1                      xfs        94M  5.5M   89M   6% /var/lib/ceph/osd/ceph-25
...
tmpfs                          tmpfs     103G   48K  103G   1% /var/lib/ceph/osd/ceph-3
tmpfs                          tmpfs     103G   48K  103G   1% /var/lib/ceph/osd/ceph-4

I had done the following to create the OSDs:
Code:
ceph-volume lvm batch --osds-per-device 2  /dev/nvme0n1


Suggestions?
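
Things I plan to check next, unless someone sees something obvious (rbd_nvme is a placeholder for the actual pool name):
Code:
# which CRUSH rule is the pool actually using?
ceph osd pool get rbd_nvme crush_rule
# is the rbd application enabled on the pool?
ceph osd pool application get rbd_nvme
# can rbd list the pool outside of PVE?
rbd ls -p rbd_nvme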
 
So I will need to rethink how to create 2 OSDs on a disk.
Why would you like to create two OSDs on the same NVMe? For performance? In my tests, an NVMe-only pool with 4x OSDs/NVMe didn't give me better performance. Sadly, I never got an NVMe that actually supports namespaces, which might get things going.
 
With one OSD per NVMe, I still have an issue using the nvme pool.

After removing the 2 OSDs per NVMe, zapping the OSDs and rebooting the nodes [had to], I made one OSD per NVMe.
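
For anyone following along, the removal goes roughly like this per OSD (the ID and device path are examples, not my exact commands):
Code:
ceph osd out 4
systemctl stop ceph-osd@4
ceph osd purge 4 --yes-i-really-mean-it
ceph-volume lvm zap --destroy /dev/nvme0n1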

df shows:
Code:
/dev/nvme0n1p1                 xfs        97M  5.5M   92M   6% /var/lib/ceph/osd/ceph-5
That is good.

Storage setup:
Code:
rbd: nvme-vm
        content images
        krbd 0
        pool rbd_nvme

However, PVE shows 0 space when I tried to move a disk, and there is a '?' next to rbd_nvme on the left in the PVE GUI.
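
Next I will check the storage from the PVE side, something like:
Code:
pvesm status
pvesm list nvme-vm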
 