Created an erasure-coded pool in Ceph, but cannot work with it in Proxmox

danielc

Hello,

I created an erasure-coded pool in Ceph, but I cannot work with it in Proxmox. I simply used RBD (PVE) to mount it. The pool shows up correctly under Proxmox, with its size, but I cannot move a disk to it:

create full clone of drive virtio0 (hdd:vm-100-disk-1)
error adding image to directory: (95) Operation not supported
TASK ERROR: storage migration failed: error with cfs lock 'storage-backup_erasure': rbd create vm-100-disk-1' error: error adding image to directory: (95) Operation not supported

I also cannot create a VM on it:
rbd: create error: (22) Invalid argument2018-07-04 16:57:35.778203 7fc5627fc700 -1 librbd::image::CreateRequest: 0x561876d2cb90 handle_validate_overwrite: pool missing required overwrite support
TASK ERROR: create failed - error with cfs lock 'storage-backup_erasure': rbd create vm-107-disk-1' error: rbd: create error: (22) Invalid argument2018-07-04 16:57:35.778203 7fc5627fc700 -1 librbd::image::CreateRequest: 0x561876d2cb90 handle_validate_overwrite: pool missing required overwrite support


But if I run a rados benchmark against it, everything works fine:
root@ceph1:~# rados -p backup_erasure bench 20 write -t 32 -b 4096 --no-cleanup
INFO: op_size has been rounded to 12288
hints = 1
Maintaining 32 concurrent writes of 12288 bytes to objects of size 12288 for up to 20 seconds or 0 objects
Object prefix: benchmark_data_ceph1_1265629
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 32 700 668 7.82797 7.82812 0.0178338 0.0462036
2 32 1392 1360 7.96808 8.10938 0.0725665 0.046
3 32 2065 2033 7.94071 7.88672 0.0169972 0.0467111
.....

What can I do in this case?
Thanks
 
Hi Daniel,
you need a cache tier pool, or you need to enable additional settings. See here: http://docs.ceph.com/docs/mimic/rados/operations/erasure-code/#erasure-coding-with-overwrites
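The relevant setting from that page is allow_ec_overwrites, e.g. (a sketch, using the pool name from your error message; erasure-coded overwrites also require BlueStore OSDs):
Code:
ceph osd pool set backup_erasure allow_ec_overwrites true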

Udo

Hello Udo,

Thanks for the information. I noticed the tier option while creating the erasure pool, but I did not know it was a requirement. Now I have created the tier pool, but I am not sure whether the result is correct. Say I move a 50G image to this pool:

It looks like the 50G stays in the tier pool and is never moved to the erasure pool. Is this the expected result?
Thank you.
 

Attachment: abc.png (screenshot of pool usage)
I'm pretty sure as of Luminous a cache tier is no longer a requirement:
https://ceph.com/community/new-luminous-erasure-coding-rbd-cephfs/

However, the issue, I think, is that the header and metadata must still be stored in a replicated pool, with only the data in the erasure-coded pool. An example RBD image creation is:
rbd create rbd/myimage --size 1T --data-pool ec42

To my knowledge, Proxmox has not yet enabled this level of sophistication, so you'd have to manually migrate the images and then update the VM configuration directly.
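A rough sketch of what that manual route could look like (the image, pool and storage names here are only placeholders taken from the thread):
Code:
# create the image with its metadata in a replicated pool and its data in the EC pool
rbd create rbd/vm-100-disk-1 --size 50G --data-pool backup_erasure
# copy the old disk contents across, then point the VM at the new image by editing
# /etc/pve/qemu-server/100.conf, e.g.  virtio0: <storage>:vm-100-disk-1,size=50G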
 
Looks like you simply have to enable overwriting. Herewith my notes from when I set up an erasure-coded pool (8 months of production use) and a compressed erasure-coded pool (3 months of production use):

NB: We run the pool with a min_size of 4 (3 data and 2 parity shards) and subsequently require a minimum of 6 hosts, of which 3 are monitors.

Create erasure-coded pool:
Code:
ceph osd erasure-code-profile set ec32_nvme \
  plugin=jerasure k=3 m=2 technique=reed_sol_van \
  crush-root=default crush-failure-domain=host crush-device-class=nvme \
  directory=/usr/lib/ceph/erasure-code;
ceph osd pool create ec_nvme 16 erasure ec32_nvme;
ceph osd pool set ec_nvme allow_ec_overwrites true;
ceph osd pool application enable ec_nvme rbd;
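The min_size of 4 mentioned above can also be set explicitly per pool if needed (a sketch):
Code:
ceph osd pool set ec_nvme min_size 4;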


Optionally enable compression (on a second erasure-coded pool, ec_compr_nvme):
Code:
ceph osd pool set ec_compr_nvme compression_algorithm snappy;
ceph osd pool set ec_compr_nvme compression_mode aggressive;
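Note: ec_compr_nvme is that second erasure-coded pool; a sketch of creating it the same way as ec_nvme above (assuming the same ec32_nvme profile):
Code:
ceph osd pool create ec_compr_nvme 16 erasure ec32_nvme;
ceph osd pool set ec_compr_nvme allow_ec_overwrites true;
ceph osd pool application enable ec_compr_nvme rbd;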


Create a replicated pool for metadata:
Code:
ceph osd crush rule create-replicated replicated_nvme default host nvme;
ceph osd pool create rbd_nvme 64 64 replicated replicated_nvme;
ceph osd pool application enable rbd_nvme rbd;


Create RBD image in replicated pool (metadata) but place data in erasure coded pool:
Code:
rbd create rbd_nvme/test_ec --size 100G --data-pool ec_nvme;
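To confirm where the data will land, check the image afterwards (a sketch; the output should include a data_pool line, as shown further down):
Code:
rbd info rbd_nvme/test_ec;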


Update Proxmox storage configuration (/etc/pve/storage.cfg):
Code:
rbd: rbd_nvme
        monhost 10.254.1.3;10.254.1.4;10.254.1.5
        content images,rootdir
        krbd 1
        pool rbd_nvme
        username admin


PS: Don't forget to copy the 'admin' Ceph keyring:
Code:
cp /etc/pve/priv/ceph.client.admin.keyring /etc/pve/priv/ceph/rbd_nvme.keyring;


Fastest way to manually transfer images (skips unused or trimmed sections):
Code:
qemu-img convert -f raw -O raw -t unsafe -T unsafe -nWp rbd:rbd_hdd/vm-213-disk-1 rbd:rbd_nvme/vm-213-disk-1_new
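Once copied, the image still has to be renamed and the VM configuration updated by hand, for example (a sketch; do this with the VM shut down):
Code:
rbd rename rbd_nvme/vm-213-disk-1_new rbd_nvme/vm-213-disk-1;
# then update the disk entry in /etc/pve/qemu-server/213.conf to reference the rbd_nvme storage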


I've used the following Perl monster for almost 20 years; it essentially reads two block devices in 4 MB chunks and only transfers the chunks that don't match. Great for incremental snapshot backups or for copying between any two block devices:
Code:
export dev1=`rbd map rbd_hdd/vm-213-disk-1 --name client.admin -k /etc/pve/priv/ceph.client.admin.keyring;`;
export dev2=`rbd map rbd_nvme/vm-213-disk-1_new --name client.admin -k /etc/pve/priv/ceph.client.admin.keyring;`;

perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\4194304};print md5($_)' $dev2 |
  perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\4194304};$b=md5($_);
    read STDIN,$a,16;if ($a eq $b) {print "s"} else {print "c" . $_}' $dev1 |
      perl -ne 'BEGIN{$/=\1} if ($_ eq"s") {$s++} else {if ($s) {
        seek STDOUT,$s*4194304,1; $s=0}; read ARGV,$buf,4194304; print $buf}' 1<> $dev2;
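When done, unmap the devices again (a sketch):
Code:
rbd unmap $dev1;
rbd unmap $dev2;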



 
Hi Alwin,

We manage erasure-coded and compressed erasure-coded Ceph pools via the CLI and subsequently edit the VM configuration files manually. It would be nice if the GUI obtained information on the images (rbd info <pool>/<image>) so it knows that the data is located in an alternate pool.


The following is a screenshot of the pool data utilisation, showing no usage in the metadata rbd pool (rbd_nvme):
ceph-pools.jpg
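The same utilisation figures can also be pulled on the CLI (a sketch):
Code:
ceph df detail;
rados df;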


Sample view of RBD images stored in rbd_nvme (data for all images is in ec_nvme, apart from vm-172-disk-3 and vm-213-disk-3, which store their data in the ec_compr_nvme pool):
Code:
[root@kvm5a priv]# rbd ls rbd_nvme -l
NAME                       SIZE PARENT                            FMT PROT LOCK
base-210-disk-1           4400M                                     2
base-210-disk-1@__base__  4400M                                     2 yes
base-210-disk-2          30720M                                     2
base-210-disk-2@__base__ 30720M                                     2 yes
base-210-disk-3          20480M                                     2
base-210-disk-3@__base__ 20480M                                     2 yes
vm-100-disk-1            81920M                                     2      excl
vm-101-disk-1            81920M                                     2      excl
vm-172-disk-1             4400M rbd_nvme/base-210-disk-1@__base__   2      excl
vm-172-disk-2            61440M rbd_nvme/base-210-disk-2@__base__   2      excl
vm-172-disk-3             3072G                                     2      excl
vm-211-disk-1             4400M rbd_nvme/base-210-disk-1@__base__   2      excl
vm-211-disk-2            61440M rbd_nvme/base-210-disk-2@__base__   2      excl
vm-211-disk-3            20480M rbd_nvme/base-210-disk-3@__base__   2      excl
vm-212-disk-1             4400M rbd_nvme/base-210-disk-1@__base__   2      excl
vm-212-disk-2            61440M rbd_nvme/base-210-disk-2@__base__   2      excl
vm-212-disk-3            20480M rbd_nvme/base-210-disk-3@__base__   2      excl
vm-213-disk-1             4400M rbd_nvme/base-210-disk-1@__base__   2      excl
vm-213-disk-2            30720M rbd_nvme/base-210-disk-2@__base__   2      excl
vm-213-disk-3              750G                                     2      excl
vm-238-disk-1            81920M                                     2      excl


rbd info rbd_nvme/vm-172-disk-3
Code:
rbd image 'vm-172-disk-3':
        size 3072 GB in 786432 objects
        order 22 (4096 kB objects)
        data_pool: ec_compr_nvme
        block_name_prefix: rbd_data.17.279de52ae8944a
        format: 2
        features: layering, exclusive-lock, data-pool
        flags:
        create_timestamp: Fri Jul  6 22:12:29 2018




 
Looks like you simply have to enable overwriting. Herewith my notes from when I set up an erasure-coded pool (8 months of production use) and a compressed erasure-coded pool (3 months of production use):
[David's full notes, quoted from the post above]
David, this is really cool. I'm trying to replicate this on my setup. I have 7 servers, each with a single 1 TB NVMe drive that I'm using for Ceph. I know it's not the ideal setup, but I'm limited by the hosting company and cost.

I did the following:
Code:
ceph osd erasure-code-profile set CephEC \
  plugin=jerasure k=2 m=2 technique=reed_sol_van \
  crush-root=default crush-failure-domain=host crush-device-class=nvme \
  directory=/usr/lib/ceph/erasure-code;

ceph osd pool create CephEC 128 erasure CephEC;
ceph osd pool set CephEC allow_ec_overwrites true;
ceph osd pool application enable CephEC rbd;

ceph osd crush rule create-replicated CephRep default host nvme;
ceph osd pool create CephECMeta 64 64 replicated CephRep;
ceph osd pool application enable CephECMeta rbd;

When I get to:
rbd create CephECMeta/test_ec --size 100G --data-pool CephEC;

it just hangs. Any ideas?

Also, what does the line in the VM configuration file look like? Is it something like this?
virtio0: CephEC:test_ec,discard=on,size=100G

And lastly, has anyone tested this much with LXC containers, or used an EC pool as the destination for a Kubernetes storage class?
 
It's 2024, still not working.
It works for me.

I installed the ceph-mgr-dashboard, and I was able to create the erasure-coded CRUSH rule, and I've been successful in creating both an erasure-coded RBD and an erasure-coded CephFS mount.

I think that the metadata has to use a replicated CRUSH rule, but the data pool itself can be erasure coded.
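For reference, a rough sketch of enabling the dashboard (exact steps vary by Ceph release; the password file below is just an example):
Code:
apt install ceph-mgr-dashboard
ceph mgr module enable dashboard
ceph dashboard create-self-signed-cert
# hypothetical password file for the initial admin user
echo 'ChangeMe123' > /root/dashboard-pass.txt
ceph dashboard ac-user-create admin -i /root/dashboard-pass.txt administrator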
 
It works for me.

I installed the ceph-mgr-dashboard, and I was able to create the erasure-coded CRUSH rule, and I've been successful in creating both an erasure-coded RBD and an erasure-coded CephFS mount.

I think that the metadata has to use a replicated CRUSH rule, but the data pool itself can be erasure coded.

And how do you use this in Proxmox without manually editing the config files afterwards?
 
And how do you use this in Proxmox without manually editing the config files afterwards?
0.o???

I don't have to manually edit the config files ex post facto.

Once I have ceph-mgr-dashboard installed, I log in to its admin screen, create the erasure-coded rules, and create the pool(s) and/or CephFS pools. I then have Ceph RBD for block storage (which is where VMs and LXCs reside), and I can use CephFS for storing, for example, installation ISOs. (I don't actually use it for that, but it's just an example. In my actual deployment, I have a script that creates symbolic links to a central repository of installation ISOs, so that I only need to store each ISO once and can use it many times across all of my Proxmox installations/nodes. But that's a separate topic.)

Once the erasure-coded pools are created, you can mount them and/or add them as a storage location.

(I forget exactly how I have it set up, whether it's a directory or something else, and I'm not at home right now so I can't check my config at the moment, but I can do that when I get back home from work later tonight.)

So, for example, I think my mount point ended up being something like /mnt/pve/ceph-erasure, with the last "name" being whatever you want to call it.

(I have it separated out as ceph-replicate and ceph-erasure because the metadata for erasure-coded pools needs to live on a pool with a replicated rule, but the actual data can live on the erasure-coded pool.)

But once that is all set up, and it is available as a storage option, then when you go to create a VM/LXC (or move a VM/LXC disk), it will show up as an option in the storage location drop down.
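For reference, the resulting /etc/pve/storage.cfg entries look roughly like this (a sketch only; the pool name is a placeholder, and note that the RBD images still need their data placed in the EC pool, e.g. via rbd's --data-pool or, on newer Proxmox releases, a data-pool option on the RBD storage):
Code:
cephfs: ceph-erasure
        content iso,vztmpl,backup

rbd: ceph-replicate
        content images,rootdir
        pool <replicated-metadata-pool>
        krbd 1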

There is no editing of config files ex post facto.

I also didn't use the Ceph command line interface (as shown above) to create my pools either. I did it all via the ceph-mgr-dashboard.

The apalrd's adventures YouTube video is what taught me how to do this (basically). So LOTS of kudos to him for teaching me how to do it.
 
Once the erasure-coded pools are created, you can mount them and/or add them as a storage location.

...okay... that part is not working for me right now. I can see the pools, but I always get an error when trying to create a container or VM:
Code:
2025-03-26T22:06:15.287+0100 781d3ffff6c0 -1 librbd::image::CreateRequest: 0x594f70dee0f0 handle_add_image_to_directory: error adding image to directory: (95) Operation not supported
TASK ERROR: unable to create VM 103 - rbd create 'vm-103-disk-0' error: 2025-03-26T22:06:15.287+0100 781d3ffff6c0 -1 librbd::image::CreateRequest: 0x594f70dee0f0 handle_add_image_to_directory: error adding image to directory: (95) Operation not supported

The pool itself works fine when not used with Proxmox, though...

It seems I got something wrong with the metadata/data configuration.
 
...okay... that part is not working for me right now. I can see the pools, but I always get an error when trying to create a container or VM:
Code:
2025-03-26T22:06:15.287+0100 781d3ffff6c0 -1 librbd::image::CreateRequest: 0x594f70dee0f0 handle_add_image_to_directory: error adding image to directory: (95) Operation not supported
TASK ERROR: unable to create VM 103 - rbd create 'vm-103-disk-0' error: 2025-03-26T22:06:15.287+0100 781d3ffff6c0 -1 librbd::image::CreateRequest: 0x594f70dee0f0 handle_add_image_to_directory: error adding image to directory: (95) Operation not supported

The pool itself works fine when not used with Proxmox, though...

It seems I got something wrong with the metadata/data configuration.
Without knowing how you deployed it, it is difficult for me to try and assist.

But what I can say is that these are the steps that I took when I deployed Ceph in my cluster:

1) I'm using Ceph version 17.2.7 because, at the time I deployed it and watched the apalrd's adventures video, there was an issue with Ceph 18.2. I would imagine that issue has been resolved by now (and probably quite some time ago), but I haven't bothered to upgrade because things are working for me, and I don't want to break something that works.

2) Per said apalrd's adventures video, I installed ceph-mgr-dashboard via the command line.

3) Once that was installed, I logged in to the admin panel at <<Server_IP>>:8443.

4) Once I was in the Ceph admin panel, I started clicking through each of the options to look at the settings. When I got to Pools, I created a pool where I set the pool type to "erasure". From there, under CRUSH, I created a new erasure-code profile where I specified the number of data chunks, k, and the number of coding chunks, m. And for the application, I set it to rbd, as I was preparing this pool to be a block-storage pool.

5) I then repeated the process and created another new pool, but this time the application is cephfs. Because I only have three nodes in my cluster, and therefore only three storage devices, the number of data chunks (k) for me = 2, and the number of coding chunks (m) for me = 1 (a CLI sketch of an equivalent profile follows this list).

What this also meant was that when I created the CephFS pool, I used the same erasure coding profile for CephFS as I did for RBD.

6) From there, save and wait for all of the changes to apply in the Ceph admin panel/dashboard. Once it is all done, I added the storages to Proxmox (Datacenter --> Storage --> Add --> RBD for the RBD pool, and CephFS for the CephFS pool).
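For completeness, a CLI sketch of an erasure-code profile equivalent to the k=2/m=1 setup from step 5 (the profile name ec21 is just an example; I actually created mine through the dashboard):
Code:
ceph osd erasure-code-profile set ec21 k=2 m=1 crush-failure-domain=host;
ceph osd erasure-code-profile get ec21;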

So these are roughly, the high level steps that I took to deploy Ceph, using an erasure coded pool, for Proxmox.

The only time that I really used the command line was to install ceph-mgr-dashboard. That's it.

Everything else was pretty much done via the Ceph admin panel/dashboard.

Again though, as a friendly reminder, the metadata for both RBD and CephFS needs to reside on a pool that uses a replicated rule rather than an erasure-coded rule. (You might already know this, but I didn't when I was creating it, so it took a little experimenting to get it all up and running.)

But I did ultimately get it all up and running.

If you're having trouble with it, and since you can't put VM/LXC disks on it yet anyway, I would politely suggest trying to revert back to a point in time where Proxmox is installed but you are about to start the Ceph install and configuration process.

Then try setting it up with the GUI admin panel/dashboard/manager instead.

There's a chance that it might work better for you. I'm not sure how you set up your current system, but whatever method you used, it doesn't appear to be working, so again, I would politely suggest that you try this method instead and see if it works better for you.

Thanks.