Erasure Code and failure-domain=datacenter

Vasilisc

Please help me with some advice. In my test setup with three datacenters, I need to create an Erasure Code pool for cold data.
I followed the documentation:
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#pve_ceph_ec_pools

I chose k=6,m=3 (I also tried k=4,m=2 from the documentation) in the command:
Bash:
# pveceph pool create ECpool --erasure-coding k=6,m=3,failure-domain=datacenter
Rich (BB code):
created new erasure code profile 'pve_ec_ECpool'
pool ECpool-data: applying allow_ec_overwrites = true
pool ECpool-data: applying application = rbd
pool ECpool-data: applying pg_autoscale_mode = warn
pool ECpool-data: applying pg_num = 128
pool ECpool-metadata: applying size = 3
pool ECpool-metadata: applying application = rbd
pool ECpool-metadata: applying min_size = 2
pool ECpool-metadata: applying pg_autoscale_mode = warn
pool ECpool-metadata: applying pg_num = 32

The cluster state changed as follows:

Rich (BB code):
HEALTH_WARN Reduced data availability: 128 pgs inactive, 128 pgs incomplete
[WRN] PG_AVAILABILITY: Reduced data availability: 128 pgs inactive, 128 pgs incomplete
    pg 8.40 is stuck inactive for 85s, current state creating+incomplete, last acting [28,NONE,NONE,6,13,NONE,NONE,NONE,NONE] (reducing pool ECpool-data min_size from 7 may help; search ceph.com/docs for 'incomplete')
    pg 8.41 is creating+incomplete, acting [21,12,28,NONE,NONE,NONE,NONE,NONE,NONE] (reducing pool ECpool-data min_size from 7 may help; search ceph.com/docs for 'incomplete')
....
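For reference, the erasure code profile that pveceph generated and the CRUSH rules it created can be inspected like this (the profile name pve_ec_ECpool is taken from the output above):

Bash:
# show k, m and crush-failure-domain of the generated profile
ceph osd erasure-code-profile get pve_ec_ECpool
# list all CRUSH rules and dump them to see how the EC rule places chunks
ceph osd crush rule ls
ceph osd crush rule dump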

Rich (BB code):
  cluster:
    id:     dfce5bc5-428f-4ede-af8d-2d801e84578e
    health: HEALTH_WARN
            Reduced data availability: 128 pgs inactive, 128 pgs incomplete

  services:
    mon: 3 daemons, quorum pn1,pn2,pn3 (age 85m)
    mgr: pn1(active, since 85m), standbys: pn2, pn3
    osd: 34 osds: 34 up (since 77m), 34 in (since 45h)

  data:
    pools:   4 pools, 289 pgs
    objects: 1.10k objects, 4.2 GiB
    usage:   14 GiB used, 116 GiB / 131 GiB avail
    pgs:     44.291% pgs not active
             161 active+clean
             128 creating+incomplete

Rich (BB code):
    "active": true,
    "last_optimize_duration": "0:00:00.000194",
    "last_optimize_started": "Fri Mar 15 11:27:46 2024",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Some PGs (0.442907) are inactive; try again later",
    "plans": []

Rich (BB code):
ID   CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
 -1         0.12711  root default                                 
-33         0.04410      datacenter dc1                           
 -3         0.01556          host pn1                             
  0    hdd  0.00389              osd.0       up   1.00000  1.00000
  1    hdd  0.00389              osd.1       up   1.00000  1.00000
  2    hdd  0.00389              osd.2       up   1.00000  1.00000
  3    hdd  0.00389              osd.3       up   1.00000  1.00000
-23         0.00130          host pn13                            
 25    hdd  0.00130              osd.25      up   1.00000  1.00000
-25         0.00389          host pn16                            
 26    hdd  0.00389              osd.26      up   1.00000  1.00000
-27         0.00389          host pn17                            
 27    hdd  0.00389              osd.27      up   1.00000  1.00000
-29         0.00389          host pn18                            
 28    hdd  0.00389              osd.28      up   1.00000  1.00000
-11         0.01556          host pn5                             
 16    hdd  0.00389              osd.16      up   1.00000  1.00000
 17    hdd  0.00389              osd.17      up   1.00000  1.00000
 18    hdd  0.00389              osd.18      up   1.00000  1.00000
 19    hdd  0.00389              osd.19      up   1.00000  1.00000
-34         0.04799      datacenter dc2                           
-17         0.00778          host pn10                            
 22    hdd  0.00389              osd.22      up   1.00000  1.00000
 32    hdd  0.00389              osd.32      up   1.00000  1.00000
-19         0.00778          host pn11                            
 23    hdd  0.00389              osd.23      up   1.00000  1.00000
 33    hdd  0.00389              osd.33      up   1.00000  1.00000
-21         0.00389          host pn12                            
 24    hdd  0.00389              osd.24      up   1.00000  1.00000
 -5         0.01556          host pn2                             
  4    hdd  0.00389              osd.4       up   1.00000  1.00000
  5    hdd  0.00389              osd.5       up   1.00000  1.00000
  6    hdd  0.00389              osd.6       up   1.00000  1.00000
  7    hdd  0.00389              osd.7       up   1.00000  1.00000
-13         0.00908          host pn6                             
 20    hdd  0.00130              osd.20      up   1.00000  1.00000
 30    hdd  0.00389              osd.30      up   1.00000  1.00000
 31    hdd  0.00389              osd.31      up   1.00000  1.00000
-15         0.00389          host pn9                             
 21    hdd  0.00389              osd.21      up   1.00000  1.00000
-37         0.03502      datacenter dc3                           
-31         0.00389          host pn19                            
 29    hdd  0.00389              osd.29      up   1.00000  1.00000
 -7         0.01556          host pn3                             
  8    hdd  0.00389              osd.8       up   1.00000  1.00000
  9    hdd  0.00389              osd.9       up   1.00000  1.00000
 10    hdd  0.00389              osd.10      up   1.00000  1.00000
 11    hdd  0.00389              osd.11      up   1.00000  1.00000
 -9         0.01556          host pn4                             
 12    hdd  0.00389              osd.12      up   1.00000  1.00000
 13    hdd  0.00389              osd.13      up   1.00000  1.00000
 14    hdd  0.00389              osd.14      up   1.00000  1.00000
 15    hdd  0.00389              osd.15      up   1.00000  1.00000
 
It may be worth actually explaining what you're trying to do; to @gurubert's point, you don't have enough DCs to do EC the way you're describing.

You also have a very lopsided OSD distribution. This is problematic regardless of how you set your CRUSH rules. Make sure every node that serves OSDs is the SAME SIZE AND WEIGHT, or you risk running out of space when an overlarge node fails, never mind the performance issues. Lastly, you have a relatively small OSD count for EC to perform anywhere near adequately. Just a thought.
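If it helps, the imbalance per host and per datacenter is easy to see with weights and utilization side by side:

Bash:
# weight, size and use% broken down by the CRUSH hierarchy
ceph osd df tree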
 
I need to implement fault tolerance at the datacenter level in a hyperconverged Proxmox VE cluster (pve-manager/8.1.4/ec5affc9e41f1d79, running kernel 6.5.13-1-pve) with Ceph Reef 18.2.1.

To test future changes, I created a virtual test bench in VirtualBox that closely mimics my production cluster. I reproduced the exact number of servers (17) and fully recreated the network configuration and the IP addresses on the network interfaces. I could not reproduce the real disk sizes due to the limitations of my work computer, so the real 4 TB disks are simulated by 4 GB ones. With a replicated RBD pool, everything is clear to me, and fault tolerance has been verified for the failure of one datacenter.

For cold data, I would like to create an Erasure Code pool with fault tolerance at the datacenter level. I need to understand, for a given k+m, how many servers and OSDs should be in each datacenter. I naively believed that k=6,m=3 would place two data chunks (k) and one coding chunk (m) in each of the three datacenters.

Thank you for your comment about the uneven OSD distribution. That is a historical artifact and will be corrected by purchasing servers and hard drives. But I still need to understand exactly what has to be done to create the cold data pool.
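Since this is a test bench, I assume placement rules can be checked offline with crushtool before touching the cluster (the numeric rule id below is a placeholder; it has to be read from `ceph osd crush rule dump` first):

Bash:
# export the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
# simulate placement of k+m = 9 chunks with the EC rule;
# mappings that cannot fill all 9 slots are reported as bad mappings
crushtool -i crushmap.bin --test --rule 1 --num-rep 9 --show-mappings --show-bad-mappings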
 
If you do erasure coding with k and m chunks, you need at least k+m failure zones to distribute these parts; better yet, k+m+2 failure zones.
I don't have enough knowledge, and I could not find examples in the official documentation. I don't understand how to build enough redundancy into the Erasure Code pool so that one datacenter out of three can fail.

On the test bench, I have now achieved an even distribution of servers: five in each of the three datacenters.
Rich (BB code):
ID   CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
 -1         0.12711  root default                                 
-33         0.04280      datacenter dc1                           
 -3         0.01556          host pn1                             
  0    hdd  0.00389              osd.0       up   1.00000  1.00000
  1    hdd  0.00389              osd.1       up   1.00000  1.00000
  2    hdd  0.00389              osd.2       up   1.00000  1.00000
  3    hdd  0.00389              osd.3       up   1.00000  1.00000
-25         0.00389          host pn16                            
 26    hdd  0.00389              osd.26      up   1.00000  1.00000
-27         0.00389          host pn17                            
 27    hdd  0.00389              osd.27      up   1.00000  1.00000
-29         0.00389          host pn18                            
 28    hdd  0.00389              osd.28      up   1.00000  1.00000
-11         0.01556          host pn5                             
 16    hdd  0.00389              osd.16      up   1.00000  1.00000
 17    hdd  0.00389              osd.17      up   1.00000  1.00000
 18    hdd  0.00389              osd.18      up   1.00000  1.00000
 19    hdd  0.00389              osd.19      up   1.00000  1.00000
-34         0.04410      datacenter dc2                           
-17         0.00778          host pn10                            
 22    hdd  0.00389              osd.22      up   1.00000  1.00000
 32    hdd  0.00389              osd.32      up   1.00000  1.00000
-19         0.00778          host pn11                            
 23    hdd  0.00389              osd.23      up   1.00000  1.00000
 33    hdd  0.00389              osd.33      up   1.00000  1.00000
-21         0.00389          host pn12                            
 24    hdd  0.00389              osd.24      up   1.00000  1.00000
 -5         0.01556          host pn2                             
  4    hdd  0.00389              osd.4       up   1.00000  1.00000
  5    hdd  0.00389              osd.5       up   1.00000  1.00000
  6    hdd  0.00389              osd.6       up   1.00000  1.00000
  7    hdd  0.00389              osd.7       up   1.00000  1.00000
-13         0.00908          host pn6                             
 20    hdd  0.00130              osd.20      up   1.00000  1.00000
 30    hdd  0.00389              osd.30      up   1.00000  1.00000
 31    hdd  0.00389              osd.31      up   1.00000  1.00000
-37         0.04021      datacenter dc3                           
-23         0.00130          host pn13                            
 25    hdd  0.00130              osd.25      up   1.00000  1.00000
-31         0.00389          host pn19                            
 29    hdd  0.00389              osd.29      up   1.00000  1.00000
 -7         0.01556          host pn3                             
  8    hdd  0.00389              osd.8       up   1.00000  1.00000
  9    hdd  0.00389              osd.9       up   1.00000  1.00000
 10    hdd  0.00389              osd.10      up   1.00000  1.00000
 11    hdd  0.00389              osd.11      up   1.00000  1.00000
 -9         0.01556          host pn4                             
 12    hdd  0.00389              osd.12      up   1.00000  1.00000
 13    hdd  0.00389              osd.13      up   1.00000  1.00000
 14    hdd  0.00389              osd.14      up   1.00000  1.00000
 15    hdd  0.00389              osd.15      up   1.00000  1.00000
-15         0.00389          host pn9                             
 21    hdd  0.00389              osd.21      up   1.00000  1.00000

I was able to create an Erasure Code pool with k=2,m=1, but I could not achieve fault tolerance when simulating a failure of the dc3 datacenter.
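If I read the CRUSH documentation correctly, placing three chunks in each of the three datacenters needs a custom erasure rule that first picks the datacenters and then several hosts inside each of them. This is only a sketch of what I think is required (the rule name ec_by_dc and id 99 are placeholders, and I have not verified it beyond the test bench):

Code:
rule ec_by_dc {
    id 99
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 3 type datacenter
    step chooseleaf indep 3 type host
    step emit
}

The rule would be added to the decompiled CRUSH map and the data pool switched over to it:

Bash:
# export, decompile, edit (add the rule above), recompile and load back
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# ... add the rule to crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
# point the EC data pool at the new rule
ceph osd pool set ECpool-data crush_rule ec_by_dc

As far as I understand, with k=6,m=3 this layout only protects the data itself when a datacenter fails: exactly k chunks remain, so the PGs stay inactive unless min_size is lowered from k+1 to k, which leaves no redundancy at all during the outage. A profile like k=4,m=5 (still three chunks per datacenter) would keep k+1 chunks after losing a site and stay writable, at the cost of more overhead. Corrections welcome.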
 
