Ceph: Erasure coded pools planned?

gkovacs

Ceph has provided erasure coded pools for several years now (they were introduced in 2013), and according to many sources the technology is quite stable. (Erasure coded pools provide much more efficient storage utilization for the same number of drives that can fail in a pool, much as RAID5 relates to RAID1, at the price of increased CPU usage.)

More info:
http://ceph.com/planet/erasure-coding-in-ceph/
http://ceph.com/geen-categorie/ceph-erasure-coding-overhead-in-a-nutshell/
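
For reference, on the Ceph side creating such a pool only takes a couple of commands. This is a rough sketch with example values (profile name, pool name, k/m, PG count), not a recommendation; on pre-Luminous releases the crush-failure-domain option is called ruleset-failure-domain:

# define an erasure code profile: 4 data chunks + 2 coding chunks,
# i.e. usable capacity k/(k+m) = 4/6 ≈ 67%, and any 2 chunks may be lost
ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host

# create an erasure coded pool using that profile (128 PGs here, purely as an example)
ceph osd pool create ecpool 128 128 erasure ec-4-2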

I wonder when Proxmox will catch up to this and provide the possibility to create erasure coded pools from the web interface?
 
There's not really anything to "catch up" with.

If you look deeper into the capabilities of Ceph's EC pools, you find that they are not currently suitable for use with RBD (virtual block devices). To use EC pools with RBD you need to put a "cache tier" of regular replicated pools in front of them - but Ceph never really got this working well and has pretty much abandoned it.
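
For completeness, that cache-tier workaround looked roughly like this - a sketch only, with example pool names, and explicitly not a recommendation:

# put a small replicated pool in front of the EC pool as a writeback cache tier
ceph osd pool create cachepool 64 64 replicated
ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool
# plus hit-set and target-size tuning on cachepool, omitted here;
# RBD images are created against ecpool, all client I/O then goes through cachepool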

Since Proxmox only exposes RBD for VM images, EC pools are not yet useful here. Today EC pools are really only useful for the "regular" object store and CephFS, neither of which is the target use case for Proxmox.

This should change with the final release of Luminous. There are capabilities in Luminous and BlueStore that resolve the issues with using EC pools for RBD. Assuming they deliver as promised, I'm sure the Proxmox team will do what it takes to facilitate their use.
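
(The specific capability is EC overwrites, which Luminous only allows on pools whose OSDs run BlueStore - roughly, for an existing EC pool called ecpool:)

# Luminous and later: allow partial overwrites on the EC pool so RBD/CephFS can use it directly
ceph osd pool set ecpool allow_ec_overwrites true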

Also - for what it's worth - you have to have a pretty large cluster before EC pools give you much benefit. To use them effectively and still retain resiliency you need 7 or more nodes - and from a practical perspective they don't really add much value until you are at 13 or more separate servers running OSDs (a long, detailed debate would be needed here, so I won't start it).
 
Thanks for your detailed answer. I didn't know that the cache tier was mandatory for RBD. However, CephFS would be very useful for Proxmox - think backups or installation images.

I'm not familiar with your quoted figures, even though I have read many articles about EC pools. In theory, 3 nodes each running 4 OSDs (so 12 altogether) lose half their capacity with a replicated pool of size=2 and can stay online when one node fails. With an EC pool of k=8 and m=4, you would only lose one third of the capacity (usable fraction k/(k+m) = 8/12 ≈ 67%) and could also tolerate the failure of one node. Why would you need 7 or even 13 nodes? Please give me some details on that...
 
Because you have to be able to describe the fault domains in Ceph to ensure that a failure happens exactly that way. Your choices for the fault domain are the OSD, the host (a collection of OSDs) or a rack (a collection of hosts, etc.). Your example works using the OSD as the fault domain, but only if you have exactly the right combination of hosts and OSDs so that a fault still ends up in a working state (e.g., your example of 8+4 with 3 servers and exactly 4 OSDs per host). It starts to break down as soon as you need to add capacity with additional hosts or OSDs. Also - you don't really maintain resilience if a failure cannot re-balance the pool back into stable operation (i.e., you need enough spare capacity that the PGs can be recovered).

If you are going to use "OSD" as your failure domain then you ultimately only protect against OSD faults (and then you might as well be using ZFS). If you use "host" as your fault domain then you need k + m hosts (or really k + m + 1 to maintain resilience), which in your example would mean 13 hosts.
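
Roughly, in erasure-code-profile terms (example profile names, 8+4 as in your example):

# host as the failure domain: the 12 chunks of each PG go to 12 different hosts,
# so you need at least k+m = 12 hosts (and a 13th to re-balance onto after a loss)
ceph osd erasure-code-profile set ec-host-8-4 k=8 m=4 crush-failure-domain=host

# osd as the failure domain: placement works on 3 hosts, but each host then holds 4 of the
# 12 chunks, so losing one host leaves zero spare chunks and nowhere to recover them to
ceph osd erasure-code-profile set ec-osd-8-4 k=8 m=4 crush-failure-domain=osd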

I would argue pretty strongly that while this configuration can be made to work, it is not one that should be depended upon. For the same reason I argue, just as strongly, that while the minimum for a replicated pool is 3 nodes, the smallest cluster that actually maintains resilience is 5. Not everything that can be done should be done...
 
Hi,
one thing about EC pools: I had a cluster with an EC pool and an SSD cache tier. Normal Ceph performance could already be better - and RBD on EC pools is even much slower than normal RBD volumes.
In that case it was for an archive server (64 TB filesystem), so it was OK... but if you offer such things in the GUI, I guess users will expect higher performance.

Perhaps performance is better with a smaller set of re-read data and a much bigger cache tier... it depends on the usage, I guess.

Udo
 
I have tried an erasure coded pool with a cache tier on Proxmox, and it is too slow.
I searched for an explanation and it is very simple: the cache works with very big chunks, so if the cache misses even one byte it reads at least 1 MB from the erasure coded pool.
End of story.
But now Ceph Luminous has overwrite support for erasure coded pools.
The problem is that it is not transparent to Proxmox: when you create an RBD image you have to run

rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

to specify a replicated pool that keeps the metadata.
Will you support this in the RBD creation menu?
This feature seems quite easy to add to me.
Thanks,
Mario
 
Well it's actually the other way around:

To use erasure coded pools for RBD, you create the RBD image on a replicated pool and specify a separate data pool (which is erasure coded). This ensures all metadata is written to the replicated pool, while the actual data is written to the erasure coded pool.
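
For what it's worth, you can check where the data actually ends up after creating an image that way - a quick sketch using Mario's example names:

# the image and its metadata live in the replicated pool, its data objects in the EC pool
rbd info replicated_pool/image_name
# the output should contain a line like "data_pool: ec_pool"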

I'm curious what the devs think about the current state of erasure coded pools in Ceph Luminous.

I was looking into erasure coding too, because we need a large (~750 TB) archive VM which we would like to host on Ceph storage. Replicating 750 TB three times is fairly expensive, so in this case it would be worth being able to use erasure coded pools.

Adding this feature is actually more difficult than Mario states, I think, since there is no "RBD creation menu"; it would have to be added to the VM creation wizard / Add hard disk wizard.

Another (probably better) way is to add an extra "data pool" field to the RBD (PVE) storage entry.
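
As a sketch of that idea (purely hypothetical - the data-pool line is exactly the field that does not exist yet), such an entry in /etc/pve/storage.cfg could look something like:

rbd: ceph-ec
        pool replicated_pool
        data-pool ec_pool
        content images
        krbd 0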
 
I see in Ceph/Pools that I can create a pool.
In Datacenter/Storage I can create an RBD storage and specify a pool. That is where you could add a second pool choice for the data pool.
 
Any update on this?

I am struggling to see anywhere in the GUI to set this.

I have tried manually creating an RBD image with the data pool set and referencing it in a KVM config, but the VM won't boot.
 
I hope to see this feature available too.
Tested on an old machine: RBD directly on an EC pool without a cache in front; rados bench showed it 30% - 40% faster than the pool with cache tiering.
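
(For reference, such a comparison can be run with plain rados bench against each pool - a sketch with example names and runtime:)

# 60 s write benchmark against the EC pool, keeping the objects for a follow-up read test
rados bench -p ecpool 60 write --no-cleanup
rados bench -p ecpool 60 seq
# then repeat against the cache-tiered pool and compare the throughput numbers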
 