Observations and questions concerning erasure coding

DKTL

Hi,

Our cluster consists of 8 servers, each with 6 disks of 450 GB, for an aggregate raw storage capacity of about 21 TB.

The overall design idea is to use erasure coding with the jerasure plugin and the liber8tion technique, thus going with k=6 and m=2.
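
As a back-of-the-envelope expectation (my own rough reckoning, so take it with a grain of salt): the raw capacity is 8 x 6 x 450 GB ≈ 21.6 TB, and with k=6 and m=2 every object is stored as 8 chunks of which 6 carry data, so roughly 6/8 = 75% of the raw capacity, i.e. about 16 TB, should end up usable before any full-ratio headroom is subtracted.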

When creating the pool with plain ceph tools, things look as follows.

Code:
root@hugin-1:~# ceph osd erasure-code-profile set cephtest k=6 m=2 plugin=jerasure technique=liber8tion
root@hugin-1:~# ceph osd erasure-code-profile get cephtest
crush-device-class=
crush-failure-domain=host
crush-root=default
k=6
m=2
packetsize=2048
plugin=jerasure
technique=liber8tion
w=8
root@hugin-1:~# ceph osd pool create cephtest erasure cephtest
pool 'cephtest' created
root@hugin-1:~# ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 13 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 13 'cephtest' erasure profile cephtest size 8 min_size 7 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 423 flags hashpspool stripe_width 393216
root@hugin-1:~# ceph df detail
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    21 TiB  21 TiB  2.3 GiB   2.3 GiB       0.01
TOTAL  21 TiB  21 TiB  2.3 GiB   2.3 GiB       0.01
 
--- POOLS ---
POOL                   ID  PGS  STORED  (DATA)  (OMAP)  OBJECTS  USED  (DATA)  (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
device_health_metrics   1    1     0 B     0 B     0 B        0   0 B     0 B     0 B      0     10 TiB            N/A          N/A    N/A         0 B          0 B
cephtest               13   32     0 B     0 B     0 B        0   0 B     0 B     0 B      0     15 TiB            N/A          N/A    N/A         0 B          0 B

The output of "ceph osd pool ls detail", in particular the min_size=7 setting, puzzles me a bit, but the output of "ceph df detail" with its MAX AVAIL limit of 15 TiB seems as expected with k=6 and m=2.
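
As far as I can tell, min_size is an ordinary pool setting, so it should be possible to inspect it and, if appropriate, lower it to k with the plain ceph tools. A sketch of what I have in mind (not yet tried on this cluster):

Code:
# show the current value (7 according to "ceph osd pool ls detail" above)
ceph osd pool get cephtest min_size
# lower it to k=6, if that turns out to be the right thing to do
ceph osd pool set cephtest min_size 6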

Now I try to create the same pool setup with the pveceph tool, which also nicely integrates the pool into the PVE cluster so that it can be used for VMs and CTs. This gives the following.

Code:
root@hugin-1:~# ceph osd erasure-code-profile set cephtest k=6 m=2 plugin=jerasure technique=liber8tion
root@hugin-1:~# ceph osd erasure-code-profile get cephtest
crush-device-class=
crush-failure-domain=host
crush-root=default
k=6
m=2
packetsize=2048
plugin=jerasure
technique=liber8tion
w=8
root@hugin-1:~# pveceph pool create cephtest --erasure-coding profile=cephtest
400 Parameter verification failed.
erasure-coding: invalid format - format error
erasure-coding.k: property is missing and it is not optional
erasure-coding.m: property is missing and it is not optional


pveceph pool create <name> [OPTIONS]
root@hugin-1:~# pveceph pool create cephtest --erasure-coding k=6,m=2,profile=cephtest
pool cephtest-data: applying allow_ec_overwrites = true
pool cephtest-data: applying application = rbd
pool cephtest-data: applying pg_autoscale_mode = warn
pool cephtest-data: applying pg_num = 128
pool cephtest-metadata: applying size = 3
pool cephtest-metadata: applying application = rbd
pool cephtest-metadata: applying min_size = 2
pool cephtest-metadata: applying pg_autoscale_mode = warn
pool cephtest-metadata: applying pg_num = 32
root@hugin-1:~# ceph df detail
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    21 TiB  21 TiB  2.4 GiB   2.4 GiB       0.01
TOTAL  21 TiB  21 TiB  2.4 GiB   2.4 GiB       0.01
 
--- POOLS ---
POOL                   ID  PGS  STORED  (DATA)  (OMAP)  OBJECTS  USED  (DATA)  (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
device_health_metrics   1    1     0 B     0 B     0 B        0   0 B     0 B     0 B      0     10 TiB            N/A          N/A    N/A         0 B          0 B
cephtest-data          16  128     0 B     0 B     0 B        0   0 B     0 B     0 B      0     15 TiB            N/A          N/A    N/A         0 B          0 B
cephtest-metadata      17   32     0 B     0 B     0 B        0   0 B     0 B     0 B      0    6.6 TiB            N/A          N/A    N/A         0 B          0 B
root@hugin-1:~# ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 13 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 16 'cephtest-data' erasure profile cephtest size 8 min_size 7 crush_rule 2 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 446 flags hashpspool,ec_overwrites stripe_width 393216 application rbd
pool 17 'cephtest-metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 452 flags hashpspool stripe_width 0 application rbd
root@hugin-1:~# pvesm status
Name             Type     Status           Total            Used       Available        %
cephtest          rbd     active      7123198464               0      7123198464    0.00%
local             dir     active        57225328         3471220        50814820    6.07%
local-lvm     lvmthin     active       147238912               0       147238912    0.00%

This raises a number of observations and questions concerning pveceph.
  • Although my erasure coding profile already contains k=6 and m=2, pveceph does not appear to pick these values up from the profile, so I have to specify them explicitly on the command line.
  • Compared with the result of the plain ceph commands, the pg number is set to 128, which "ceph health detail" warns should be 32 instead, just as it is when the pool is created with the plain ceph commands (a possible fix is sketched after this list).
  • I am very unsure how to interpret the MAX AVAIL limits of 15 TiB and 6.6 TiB for the pools cephtest-data and cephtest-metadata, respectively (I attempt a rough check after this list).
  • The pvesm command reports rather unexpected availability numbers, roughly one third of the total aggregated storage, which could suggest that pvesm does not treat the pool as an erasure coded pool but merely as a replicated one.
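To make my questions a bit more concrete, here is what I assume would address the pg warning, together with my rough arithmetic for the MAX AVAIL and pvesm numbers (all of this is my own guesswork, not something I have verified):

Code:
# pveceph set pg_autoscale_mode=warn and pg_num=128 on the data pool; either
# hand the decision back to the autoscaler ...
ceph osd pool set cephtest-data pg_autoscale_mode on
# ... or set the suggested value directly
ceph osd pool set cephtest-data pg_num 32

# Rough MAX AVAIL check, assuming ceph subtracts the 0.95 full-ratio headroom:
#   EC pool (k=6, m=2):  21 TiB * 0.95 * 6/8 ~= 15 TiB   -> matches cephtest-data
#   replicated, size=3:  21 TiB * 0.95 / 3   ~= 6.6 TiB  -> matches cephtest-metadata
# pvesm total, if the unit is KiB: 7123198464 KiB ~= 6.6 TiB,
# i.e. the figure of cephtest-metadata rather than cephtest-data.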
Confessing to being a novice in this game, I hope someone can spare a few comments on the above items, so that I can be reasonably sure that I have indeed arrived at my intended setup.

Thanks.

Best regards.

Thomas.
 
With k=6 and m=2 you will get 8 chunks for every object, hence size=8 and min_size=7 so that no data is lost.

The pveceph command created the metadata pool as a replicated pool with a size of 3. The CephFS metadata cannot be erasure coded, which is why its MAX AVAIL is 6.6 TiB (21 TiB / 3).
 
Hi,

My understanding is that with k=6 and m=2 I can indeed survive the outage of up to two OSD drives, hence my bewilderment about min_size=7.

Note that I am not doing CephFS. I confess my ceph skills are not high enough to know what the two pools "cephtest-data" and "cephtest-metadata" are for, but it appears to me that "cephtest-data" is where my data is actually written, whereas "cephtest-metadata" only contains a few objects holding metadata for my containers. In view of the pvesm output, I would therefore expect it to report on the "cephtest-data" pool and not the "cephtest-metadata" pool.

Best regards.

Thomas.
 
