Created an erasure code pool in Ceph, but cannot work with it in Proxmox

danielc

Member
Feb 28, 2018
Hello

I created an erasure coded pool in Ceph, but I cannot work with it in Proxmox.
I simply used RBD (PVE) to mount it.
The pool shows up correctly under Proxmox, size included, but I cannot move a disk to it:

create full clone of drive virtio0 (hdd:vm-100-disk-1)
error adding image to directory: (95) Operation not supported
TASK ERROR: storage migration failed: error with cfs lock 'storage-backup_erasure': rbd create vm-100-disk-1' error: error adding image to directory: (95) Operation not supported

I also cannot create VM to it:
rbd: create error: (22) Invalid argument2018-07-04 16:57:35.778203 7fc5627fc700 -1 librbd::image::CreateRequest: 0x561876d2cb90 handle_validate_overwrite: pool missing required overwrite support
TASK ERROR: create failed - error with cfs lock 'storage-backup_erasure': rbd create vm-107-disk-1' error: rbd: create error: (22) Invalid argument2018-07-04 16:57:35.778203 7fc5627fc700 -1 librbd::image::CreateRequest: 0x561876d2cb90 handle_validate_overwrite: pool missing required overwrite support


But if I run a rados bench against it, it works fine:
root@ceph1:~# rados -p backup_erasure bench 20 write -t 32 -b 4096 --no-cleanup
INFO: op_size has been rounded to 12288
hints = 1
Maintaining 32 concurrent writes of 12288 bytes to objects of size 12288 for up to 20 seconds or 0 objects
Object prefix: benchmark_data_ceph1_1265629
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 32 700 668 7.82797 7.82812 0.0178338 0.0462036
2 32 1392 1360 7.96808 8.10938 0.0725665 0.046
3 32 2065 2033 7.94071 7.88672 0.0169972 0.0467111
.....

What can I do in this case?
Thanks
 
Hi Daniel,
you need a cache tier pool, or you need to enable additional settings. See here: http://docs.ceph.com/docs/mimic/rados/operations/erasure-code/#erasure-coding-with-overwrites
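
For reference, the overwrite route from that page would look roughly like this on your pool (a sketch only; the replicated pool and image names are placeholders, and allow_ec_overwrites requires BlueStore OSDs):
Code:
# allow partial overwrites on the existing EC pool
ceph osd pool set backup_erasure allow_ec_overwrites true
# RBD metadata cannot live in an EC pool, so the image is created in a
# replicated pool and only its data is placed in the EC pool
rbd create <replicated_pool>/<image> --size 50G --data-pool backup_erasure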

Udo

Hello Udo

Thanks for the information. I was aware of the cache tier option when creating the erasure pool, but I did not know it was a requirement.
Now I have created the tier pool, but I am not sure whether the result is correct. Say I move a 50G image to this pool:

It looks like the 50G stays in the tier pool and is not moved to the erasure pool. Is this the expected result?
Thank you.
 

Attachment: abc.png
I'm pretty sure as of Luminous a cache tier is no longer a requirement:
https://ceph.com/community/new-luminous-erasure-coding-rbd-cephfs/

However, I think the issue is that the header and metadata must still be stored in a replicated pool, with only the data in the erasure coded pool. An example RBD image creation is:
rbd create rbd/myimage --size 1T --data-pool ec42

To my knowledge, Proxmox has not yet enabled this level of sophistication, so you'd have to manually migrate the images and then update the VM configuration directly.
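
For illustration only (storage and image names here are hypothetical): after migrating an image by hand, the disk line in /etc/pve/qemu-server/<vmid>.conf is changed from the old storage to the new one, along these lines:
Code:
# old entry, disk on the original storage
virtio0: hdd:vm-100-disk-1,size=50G
# new entry, image re-created on the replicated metadata pool with --data-pool set
virtio0: rbd_replicated:vm-100-disk-1,size=50G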
 
Looks like you simply have to enable overwriting. Herewith my notes from when I set up an erasure coded pool (8 months of production use) and a compressed erasure coded pool (3 months of production use):

NB: We run the pool with a min_size of 4 (the profile uses 3 data and 2 parity shards) and subsequently require a minimum of 6 hosts, of which 3 are monitors.

Create erasure coded pool:
Code:
ceph osd erasure-code-profile set ec32_nvme \
  plugin=jerasure k=3 m=2 technique=reed_sol_van \
  crush-root=default crush-failure-domain=host crush-device-class=nvme \
  directory=/usr/lib/ceph/erasure-code;
ceph osd pool create ec_nvme 16 erasure ec32_nvme;
ceph osd pool set ec_nvme allow_ec_overwrites true;
ceph osd pool application enable ec_nvme rbd;
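
As an optional sanity check, the profile and the overwrite flag can be read back afterwards:
Code:
ceph osd erasure-code-profile get ec32_nvme;
ceph osd pool get ec_nvme allow_ec_overwrites;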


Optionally enable compression:
Code:
ceph osd pool set ec_compr_nvme compression_algorithm snappy;
ceph osd pool set ec_compr_nvme compression_mode aggressive;
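
The compressed pool ec_compr_nvme referenced above is a second erasure coded pool; a minimal sketch of creating it (the profile and PG count shown here are assumptions) would be:
Code:
ceph osd pool create ec_compr_nvme 16 erasure ec32_nvme;   # profile/PG count assumed to match ec_nvme
ceph osd pool set ec_compr_nvme allow_ec_overwrites true;
ceph osd pool application enable ec_compr_nvme rbd;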


Create a replicated pool for metadata:
Code:
ceph osd crush rule create-replicated replicated_nvme default host nvme;
ceph osd pool create rbd_nvme 64 64 replicated replicated_nvme;
ceph osd pool application enable rbd_nvme rbd;


Create RBD image in replicated pool (metadata) but place data in erasure coded pool:
Code:
rbd create rbd_nvme/test_ec --size 100G --data-pool ec_nvme;


Update Proxmox storage configuration (/etc/pve/storage.cfg):
Code:
rbd: rbd_nvme
        monhost 10.254.1.3;10.254.1.4;10.254.1.5
        content images,rootdir
        krbd 1
        pool rbd_nvme
        username admin


PS: Don't forget to copy the 'admin' Ceph keyring:
Code:
cp /etc/pve/priv/ceph.client.admin.keyring /etc/pve/priv/ceph/rbd_nvme.keyring;


Fastest way to manually transfer images (skips unused or trimmed sections):
Code:
qemu-img convert -f raw -O raw -t unsafe -T unsafe -nWp rbd:rbd_hdd/vm-213-disk-1 rbd:rbd_nvme/vm-213-disk-1_new
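
Note that -n tells qemu-img to skip creating the target, so the destination image is created beforehand with its data pool set (the size below is only an example and must match the source image):
Code:
# target must exist because of -n; match the size of rbd_hdd/vm-213-disk-1
rbd create rbd_nvme/vm-213-disk-1_new --size 50G --data-pool ec_nvme;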


I've used the following Perl monster for almost 20 years; it essentially reads two block devices in 4 MB chunks and only transfers the chunks that don't match. Great for incremental snapshot backups or for copying between any two block devices:
Code:
export dev1=`rbd map rbd_hdd/vm-213-disk-1 --name client.admin -k /etc/pve/priv/ceph.client.admin.keyring;`;
export dev2=`rbd map rbd_nvme/vm-213-disk-1_new --name client.admin -k /etc/pve/priv/ceph.client.admin.keyring;`;

perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\4194304};print md5($_)' $dev2 |
  perl -'MDigest::MD5 md5' -ne 'BEGIN{$/=\4194304};$b=md5($_);
    read STDIN,$a,16;if ($a eq $b) {print "s"} else {print "c" . $_}' $dev1 |
      perl -ne 'BEGIN{$/=\1} if ($_ eq"s") {$s++} else {if ($s) {
        seek STDOUT,$s*4194304,1; $s=0}; read ARGV,$buf,4194304; print $buf}' 1<> $dev2;
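
When the copy has finished, the mappings can be released again:
Code:
rbd unmap $dev1;
rbd unmap $dev2;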



 
Hi Alwin,

We manage erasure coded and compressed erasure coded Ceph pools via the CLI and subsequently manually edit the VM configuration files. It would be nice if the GUI obtained information on the images (rbd info <pool>/<image>) to know that data is located in an alternate pool.


The following is a screenshot of the pool data utilisation, showing no usage in the metadata rbd pool (rbd_nvme):
[Screenshot: ceph-pools.jpg]
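
The same per-pool utilisation can also be pulled via the CLI, e.g.:
Code:
ceph df detail;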


Sample view of the RBD images stored in rbd_nvme (data for all images is in ec_nvme, apart from vm-172-disk-3 and vm-213-disk-3, which store their data in the ec_compr_nvme pool):
Code:
[root@kvm5a priv]# rbd ls rbd_nvme -l
NAME                       SIZE PARENT                            FMT PROT LOCK
base-210-disk-1           4400M                                     2
base-210-disk-1@__base__  4400M                                     2 yes
base-210-disk-2          30720M                                     2
base-210-disk-2@__base__ 30720M                                     2 yes
base-210-disk-3          20480M                                     2
base-210-disk-3@__base__ 20480M                                     2 yes
vm-100-disk-1            81920M                                     2      excl
vm-101-disk-1            81920M                                     2      excl
vm-172-disk-1             4400M rbd_nvme/base-210-disk-1@__base__   2      excl
vm-172-disk-2            61440M rbd_nvme/base-210-disk-2@__base__   2      excl
vm-172-disk-3             3072G                                     2      excl
vm-211-disk-1             4400M rbd_nvme/base-210-disk-1@__base__   2      excl
vm-211-disk-2            61440M rbd_nvme/base-210-disk-2@__base__   2      excl
vm-211-disk-3            20480M rbd_nvme/base-210-disk-3@__base__   2      excl
vm-212-disk-1             4400M rbd_nvme/base-210-disk-1@__base__   2      excl
vm-212-disk-2            61440M rbd_nvme/base-210-disk-2@__base__   2      excl
vm-212-disk-3            20480M rbd_nvme/base-210-disk-3@__base__   2      excl
vm-213-disk-1             4400M rbd_nvme/base-210-disk-1@__base__   2      excl
vm-213-disk-2            30720M rbd_nvme/base-210-disk-2@__base__   2      excl
vm-213-disk-3              750G                                     2      excl
vm-238-disk-1            81920M                                     2      excl


rbd info rbd_nvme/vm-172-disk-3
Code:
rbd image 'vm-172-disk-3':
        size 3072 GB in 786432 objects
        order 22 (4096 kB objects)
        data_pool: ec_compr_nvme
        block_name_prefix: rbd_data.17.279de52ae8944a
        format: 2
        features: layering, exclusive-lock, data-pool
        flags:
        create_timestamp: Fri Jul  6 22:12:29 2018




 
(quoted David's erasure coded pool notes from the post above)
David, this is really cool. I'm trying to replicate this on my setup. I have 7 servers, each with a single 1 TB NVMe drive that I'm using for Ceph. I know it's not the ideal setup, but I'm limited by the hosting company and cost.

I did the following:
Code:
ceph osd erasure-code-profile set CephEC \
  plugin=jerasure k=2 m=2 technique=reed_sol_van \
  crush-root=default crush-failure-domain=host crush-device-class=nvme \
  directory=/usr/lib/ceph/erasure-code;

ceph osd pool create CephEC 128 erasure CephEC;
ceph osd pool set CephEC allow_ec_overwrites true;
ceph osd pool application enable CephEC rbd;

ceph osd crush rule create-replicated CephRep default host nvme;
ceph osd pool create CephECMeta 64 64 replicated CephRep;
ceph osd pool application enable CephECMeta rbd;

When I get to:
rbd create CephECMeta/test_ec --size 100G --data-pool CephEC;

it just hangs. Any ideas?

Also, what does the line in the VM configuration file look like? Is it something like this?
virtio0: CephEC:test_ec,discard=on,size=100G

And lastly, has anyone tested this much with LXC containers, or used an EC pool as the destination for a Kubernetes storage class?
 
