Impact of Changing Ceph Pool hdd-pool size from 2/2 to 3/2

rtsx

Feb 19, 2025

Scenario


I have a Proxmox VE 8.3.1 cluster with 12 nodes, using Ceph as distributed storage. The cluster consists of 96 OSDs, distributed across 9 servers with SSDs and 3 with HDDs. Initially, my setup had only two servers with HDDs; I now need to add a third HDD node so the hdd-pool can be moved from 2/2 to 3/2 and stay consistent.


However, I’m not sure about the impact of this change; my google-fu wasn’t strong enough to make me feel confident.
(Note: I have production VMs running.)

Questions and Help Request:
  1. What is the impact of this change?
  2. What would be the recommendation for my scenario?

Pool:
Bash:
Pool # | Name     | Size | # of PGs | Optimal # of PGs | Autoscaler Mode | CRUSH Rule (ID)         | Used (%)
3      | ssd-pool | 3/2  | 1024     | N/A              | On              | ssd-replicated-rule (1) | 20.78 TiB (36.21%)
4      | hdd-pool | 2/2  | 512      | 512              | On              | hdd-replicated-rule (1) | 114.83 TiB (48.47%)
6      | .mgr     | 3/2  | 1        | N/A              | On              | replicated_rule (0)     | 444.39 MiB (0.00%)

OSD Tree and Crushmap in attachment.
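For anyone who wants to cross-check these figures on the CLI, roughly the following should reproduce them (standard Ceph commands, nothing here changes anything):
Bash:
# Per-pool size/min_size, PG count and usage:
ceph df detail
ceph osd pool get hdd-pool size
ceph osd pool get hdd-pool min_size
ceph osd pool get hdd-pool pg_num

# Per-OSD fill level ("ceph osd df tree class hdd" filters to the HDD class on newer releases):
ceph osd df tree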

Configuration:
Bash:
[global]
    auth_client_required = cephx
    auth_cluster_required = cephx
    auth_service_required = cephx
    cluster_network = X.X.X.X/24
    fsid = 52d10d07-2f32-41e7-b8cf-7d7282af69a2
    mon_allow_pool_delete = true
    mon_host = X.X.X.X X.X.X.X X.X.X.X X.X.X.X X.X.X.X X.X.X.X
    ms_bind_ipv4 = true
    ms_bind_ipv6 = false
    osd_pool_default_min_size = 2
    osd_pool_default_size = 3
    public_network = X.X.X.X/24

[client]
    keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
    keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.pve114]
    public_addr = X.X.X.X

[mon.pve115]
    public_addr = X.X.X.X

[mon.pve117]
    public_addr = X.X.X.X

[mon.pve118]
    public_addr = X.X.X.X

[mon.pve119]
    public_addr = X.X.X.X

[mon.pve142]
    public_addr = X.X.X.X
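Side note on the config above: as far as I understand, the osd_pool_default_* values only apply to pools created after they are set; the hdd-pool keeps its explicit 2/2 until it is changed. To double-check what a monitor is actually running with (run on the node hosting that mon):
Bash:
ceph daemon mon.pve114 config get osd_pool_default_size
ceph daemon mon.pve114 config get osd_pool_default_min_size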

Configuration Database: (screenshot attached)

Used (%) HDDs: (screenshot attached)

Used (%) SSDs: (screenshots attached)

References:
https://docs.ceph.com/en/reef/rados/operations/pools/#setting-the-number-of-rados-object-replicas

Any help from the community would be greatly appreciated!

I can provide logs or additional command outputs if needed.

Thanks in advance for your support!
 

Changing the HDD pool to 3/2 will create a third replica of every object, so used disk space will increase by roughly 50%. With ~114 TiB currently used, copying the "additional" ~57 TiB will probably take a while to complete.
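Back-of-envelope with the numbers from your pool table (rough estimate only):
Bash:
# Live STORED vs USED figures per pool, for redoing the estimate:
ceph df detail
# hdd-pool: ~114.83 TiB used at size 2  =>  ~57.4 TiB actually stored
# at size 3: 57.4 TiB * 3 = ~172 TiB raw, i.e. ~57 TiB extra to backfill onto the HDD OSDs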

> 12 nodes
Do you have another node available to make that an odd number, or can you add a QDevice? An odd number is recommended: otherwise, if 6 nodes go down (dead switch, etc.), neither side has more than 50% of the votes for quorum and all 12 will reboot trying to recover.
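To see the current vote situation (and to re-check it after adding a node or QDevice), the standard tooling is enough; a quick sketch:
Bash:
# Corosync membership, votes and quorum (run on any cluster node):
pvecm status
# Ceph monitor quorum is tracked separately from corosync:
ceph quorum_status --format json-pretty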
 
1 - I did a quick test in a lab environment with a small 3-node cluster (each node having 3 OSDs).
When I changed the pool from 2/2 to 3/2, the cluster froze for a short moment right after applying the new size.

I’m pretty sure it happened because of poor SSD performance; I had sliced one SSD into small pieces across the 3 VMs just for testing.

After a while, the cluster recovered by itself and the VM continued to run normally.

I’m now setting up a more reliable lab (no slow OSD alerts) to compare results and will also fix the quorum setup before making any changes in production.
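For reference, the change I'm testing is just a single command on the pool, followed by watching the recovery from the CLI (pool name as in production; in the lab it was a throwaway test pool):
Bash:
# Raise the replica count; min_size stays at 2:
ceph osd pool set hdd-pool size 3
# Follow recovery/backfill progress and the cluster log:
ceph -s
ceph -w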

Do you have another node available to make that an odd number, or can you add a QDevice?
2 - Yes, I do have another node.
About the QDevice, can it be just a simple Linux host running the quorum service only (to make the cluster odd = 13) and nothing else?
I’d like to keep things as simple as possible.

Thanks again for the great insights!
 
About the QDevice, can it be just a simple Linux host running the quorum service only (to make the cluster odd = 13) and nothing else?
This would work for corosync:
https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_external_vote_support

I'm not sure about Ceph, though; its quorum is independent from corosync.

The easiest route would probably be to add the Linux host as another Proxmox VE node, but without any VMs or OSDs on it, so basically just for maintaining quorum.
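If you do go the plain QDevice route, the wiki steps boil down to roughly this (the IP is a placeholder):
Bash:
# On the external QDevice host (a plain Debian box is fine):
apt install corosync-qnetd

# On every cluster node:
apt install corosync-qdevice

# Then, from one cluster node:
pvecm qdevice setup <QDEVICE-IP>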
 
The QDevice can be anything. Note adding it allows passwordless SSH to it from cluster nodes, so maybe not a PBS server. It doesn't have to be local either IIRC. Note the doc bits about removing it before adding or removing cluster nodes.

Proxmox recommends 3 Ceph Monitors, with Managers on the Monitor nodes. I would keep an odd number, so 3 or 5.
https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster#pve_ceph_monitors
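Checking and trimming the monitor count is straightforward from any node; a sketch (the mon ID is a placeholder, pick which ones to drop based on your topology):
Bash:
# List current monitors and their quorum state:
ceph mon stat
# Remove one via the Proxmox tooling:
pveceph mon destroy <mon-id>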
 
Note adding it allows passwordless SSH to it from cluster nodes, so maybe not a PBS server. It doesn't have to be local either IIRC. Note the doc bits about removing it before adding or removing cluster nodes.
@aaron described the setup for his private infrastructure here: he installs PBS in parallel with a single-node PVE that is NOT added to the cluster, so he can run the QDevice in a Debian container.

The result is that passwordless SSH from the cluster can only reach the QDevice LXC, not the PBS host. I think that's quite an elegant solution.
 
I’m pleased to update this topic by confirming that the resize to 3/2 was successfully completed.
When applying the configuration, the behavior was the same as in the lab scenario: the Ceph cluster entered recovery/rebalance mode and, after a long process (around 5 days, mainly due to some OSDs showing latency above 500 ms), it finally completed successfully.


The utilization of the HDD OSDs increased from 40–49% to around 60–69%.
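For anyone doing the same, these were enough to keep an eye on the slow OSDs and the fill level while it ran:
Bash:
# Per-OSD commit/apply latency (this is where the >500 ms OSDs showed up):
ceph osd perf
# Fill level per OSD once the extra replica is written:
ceph osd df tree
# Overall recovery/backfill state:
ceph -s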


I’m also sharing a useful calculator to help estimate the required space (always keep at least an additional 8 TB of headroom for safety):
https://www.virtualizationhowto.com/2024/09/ceph-storage-calculator-to-find-capacity-and-cost/
As a rule of thumb, never apply any changes if the OSD utilization (even on a single node) exceeds 80–85% — you’ll thank yourself later.
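And a quick back-of-envelope check for that rule, with this thread's numbers (adjust to your own ceph df output):
Bash:
# ~57.4 TiB stored in hdd-pool * 3 replicas = ~172 TiB raw needed on the HDD OSDs
# Compare against total HDD capacity and the fullest single OSD before committing:
ceph df detail
ceph osd df tree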