Ceph Expansion: From Single Host w/OSD-Level Replication to Multiple Hosts w/Host-Level Replication

jimbothigpen

New Member
Feb 5, 2024
Currently, I have a two-node PVE cluster, and one of those two nodes (srv00) has 5 HDDs devoted to a Ceph RBD and CephFS. The second node (srv01) now has 5 identical disks that I'd like to add to the cluster. By some time next week (barring any shipping delays), I'll have a third node to add to the PVE cluster, with another 5 disks just like the first two.

The current ceph.conf:

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.200.254/24
     fsid = b2ad983c-5b4d-443d-93f8-a4be22300341
     mon_allow_pool_delete = true
     mon_host = 192.168.3.254 192.168.3.253
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 2
     public_network = 192.168.3.254/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.srv00]
     host = srv00
     mds_standby_for_name = pve

[mds.srv01]
     host = srv01
     mds_standby_for_name = pve

[mon.srv00]
     public_addr = 192.168.3.254

[mon.srv01]
     public_addr = 192.168.3.253

And the current crush map:

Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host srv00 {
    id -3        # do not change unnecessarily
    id -4 class hdd        # do not change unnecessarily
    # weight 36.38280
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 10.91408
    item osd.1 weight 10.91408
    item osd.2 weight 5.45798
    item osd.3 weight 5.45798
    item osd.4 weight 3.63869
}
root default {
    id -1        # do not change unnecessarily
    id -2 class hdd        # do not change unnecessarily
    # weight 36.38286
    alg straw2
    hash 0    # rjenkins1
    item srv00 weight 36.38286
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type osd
    step emit
}

# end crush map

What I'd like to do is the following, in two stages:

Right now, I'd like to create OSDs on the disks in srv01 and add them to the pools, while switching from OSD-level to host-level replication and changing the pools to a size of 2 with a min_size of 1. I am aware that, long-term, this is a "Very Bad Idea" -- but it's temporary, and the data that lives in those pools is backed up (though it would be a major PITA to recover from a catastrophic failure). To achieve this, how should my crush map and/or config file be changed, assuming I want a resiliency target that allows one of the two hosts to be down at any time (maintenance, or whatever) AND one of the OSDs on the surviving host to be offline? If I'm understanding Ceph replication correctly, this should give me the same amount of available storage I have currently, but with somewhat better resilience.
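
If it helps to see what I had in mind, this is roughly the set of commands I was planning to run for stage 1 -- the pool and device names are just placeholders, and I'm not at all sure this is the right approach:

Code:
# create an OSD on each of the 5 new disks on srv01 (device names are examples)
pveceph osd create /dev/sdb

# add a replicated rule that uses host as the failure domain
ceph osd crush rule create-replicated replicated_host default host

# move each existing pool onto the new rule and switch it to 2/1
ceph osd pool set <pool> crush_rule replicated_host
ceph osd pool set <pool> size 2
ceph osd pool set <pool> min_size 1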

When the remaining parts of the third host arrive, srv02 will be added to the PVE cluster, and its 5 identical disks will then be available to add as OSDs to the pools. At that point, I would change min_size back to 2, leaving size at 2 -- which (again, if I'm understanding Ceph replication correctly) should double my available storage, leaving my resiliency targets at one host down AND one OSD offline on each surviving host. What crush map and/or config changes would need to be made at this stage?
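
Again, just to show my thinking, this is roughly what I expect stage 2 to look like once srv02 has joined the PVE cluster (pool and device names are placeholders):

Code:
# on srv02: install Ceph and create an OSD on each of the 5 disks
pveceph install
pveceph osd create /dev/sdb
# (optionally) add a third monitor on srv02 for better quorum
pveceph mon create

# once the cluster has rebalanced and is healthy again,
# raise min_size back to 2 on each pool
ceph osd pool set <pool> min_size 2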

And finally, can changes to the crush map and/or configuration files be made while the pools are in use, or should I expect some downtime?
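
For what it's worth, the way I'd expect to edit the crush map is the usual decompile/edit/recompile cycle against the running cluster, something like:

Code:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt, e.g. change "step chooseleaf firstn 0 type osd"
# to "step chooseleaf firstn 0 type host"
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin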
 
As you already noticed, you should avoid min_size = 1. Quoting from [1]:
Do not set a min_size of 1. A replicated pool with min_size of 1 allows I/O on an object when it has only 1 replica, which could lead to data loss, incomplete PGs or unfound objects.

With size 2, dropping to a single replica would also happen whenever one node is merely shut off (for maintenance, for example), not only when something fails!

A much more resilient setup would be to also deploy Ceph on the 3rd PVE node and distribute the disks and OSDs across all three nodes.
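
If the disks and OSDs are spread over all three nodes, the pools can then use the recommended 3/2 settings, for example (pool name is a placeholder):

Code:
ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2
# note: osd_pool_default_size / osd_pool_default_min_size in ceph.conf
# only affect newly created pools; existing pools must be changed with
# "ceph osd pool set"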

[1] https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_create_and_edit_pools
 
