Ceph: change size/min replica on existing pool | issue

vongehlens

New Member
Sep 5, 2021
Hello Proxmox/Ceph admins,

Well, I'm not a newbie anymore with Proxmox, but I still like to learn and now need a little help with Ceph pools.
I've got three 3-node Ceph clusters here, all separate and in different sites. All nodes are on Ceph 12.2.13 and PVE 6.4-13.
Each cluster has one pool with a 3/2 size/min_size config, 128 PGs, 5 TB of data, and 12 OSDs.
But I would like to have a 5/3 replica size.

If I change to 5/3, Ceph tells me that 40% of the objects are degraded and all 128 PGs are undersized:
~# ceph health
HEALTH_WARN Degraded data redundancy: 959332/2398330 objects degraded (40.000%), 128 pgs degraded, 128 pgs undersized
~#

It does not matter what I configure; only a 3/2 size leads to "green" health.

This is how I start implementing the change (screenshot attached):

Could somebody give me a hint to understand what I'm doing wrong, please? (I've done a lot of reading already but have not found a way forward.)

Thank you for helping me out.
Stephan
 

It does not make sense to have more than 3 replicas in a 3-node setup. Ceph usually replicates objects at the host level, which means every host gets one replica: 3 servers, 3 copies of each object. That's what the default CRUSH rule looks like:

Code:
# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
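
If you want to double-check what a pool is currently using, something like this should show it ("yourpool" is just a placeholder name here):

Code:
# current replica settings of the pool
ceph osd pool get yourpool size
ceph osd pool get yourpool min_size
# which CRUSH rule the pool is assigned to
ceph osd pool get yourpool crush_rule
# dump the rule itself
ceph osd crush rule dump replicated_rule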

What's the reason you think you need 5/3, especially on a 3-node Ceph cluster? I'm not sure that even works with Ceph.
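
By the way, your health output already lines up with this: with size 5 but only 3 hosts to place replicas on, 2 of every 5 copies cannot be created. 2398330 requested placements / 5 = 479666 objects, and 479666 x 2 missing copies = 959332 degraded, which is exactly the 40.000% Ceph reports.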

all separate and in different sites. All nodes are on Ceph 12.2.13 and PVE 6.4-13.

You should definitely upgrade your Ceph and PVE to the current release. I don't think Ceph 12.2.13 is still supported on PVE 6.4 (Ceph 12 is from the PVE 5.4 era!). You should also have cabling from each site to the others, so if site 2 fails, site 3 can still see site 1, for example.
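
To see exactly what you are running before the upgrade, these two should do (both are standard commands on a PVE/Ceph node):

Code:
# Proxmox VE and related package versions
pveversion -v
# Ceph daemon versions across the whole cluster
ceph versions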
 
Hi Jsterr,
I would like to have 5 copies so I have a chance of surviving the failure of up to 3 OSDs... that was my intention.

cheers
Stephan

So 3 OSDs (one per node) would fail at the same time without a chance to recover, is that the thought behind this? Normally Ceph would recover the objects onto the remaining OSDs in the node. I'm not sure how to modify the CRUSH rule for this case. I guess best practice would be to increase the replica count by adding more hosts (+ OSDs) and then turning up size/min_size (see the commands below). Maybe someone from the Proxmox staff can answer this.
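
For reference, once enough hosts are available, changing those values on an existing pool is just two CLI commands ("yourpool" is a placeholder):

Code:
# raise the replica count and the minimum copies required for I/O
ceph osd pool set yourpool size 5
ceph osd pool set yourpool min_size 3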
 
...hmmm. I understood the concept like this: the replication rule stores 3 replicas spread across all 12 OSDs in the cluster.
If my interpretation is wrong, then I'm fine. I'd just like some confirmation / explanation from a more experienced Ceph admin than me.


However, Ceph is quite interesting and I'll start digging deeper :)
 

Nope. If Ceph handled replica placement like you describe, it could end up storing all 3 replicas on a single node (because the chosen OSDs might all sit in one node). That would cause a storage outage if you lose the node that holds more than one replica.

That's why Ceph places ONE replica per host: you can lose a complete host without downtime, and two nodes without data loss (but with downtime).
 
OK, thanks.
So 5 nodes with OSDs would make sense for running a 5/3 replica setup, as each host then carries one replica. Well, or I would need to deal with different CRUSH rules, failure domains and such... not wanted at the moment.

thanks for your guidance on the original request.
Stephan
 

That totally depends on where you place the nodes and how many are in each location, because if one site fails you need to make sure the majority of servers is still online.
 
@vongehlens I asked on Stack Exchange and there was a good answer to it. In short: it's possible, but that doesn't mean it's a common use case. With the rule steps below, CRUSH chooses a host first and then puts up to 2 replicas on OSDs within that host. It does that for as long as it can, meaning with size 5 it will put 2 replicas each on 2 of the nodes and only one on the last node.

Code:
step choose firstn 0 type host
step chooseleaf firstn 2 type osd

But don't forget to test that, maybe in a virtual environment first.
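
For anyone who wants to try it, a complete rule built around those two steps might look roughly like this (only a sketch, untested, assuming the default root and a made-up rule name):

Code:
# sketch: pick hosts first, then up to 2 OSDs per chosen host
rule replicated_5over3 {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type host
    step chooseleaf firstn 2 type osd
    step emit
}

To apply it you would export and decompile the CRUSH map (ceph osd getcrushmap / crushtool -d), add the rule, compile and inject it again (crushtool -c / ceph osd setcrushmap), and then assign it to the pool with ceph osd pool set <pool> crush_rule replicated_5over3.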
 
