Proxmox/Ceph - Disable OSD rebalancing

makwa

New Member
Mar 23, 2026
Hello,

I have several Proxmox clusters of 3 dedicated servers each. Previously, I used GlusterFS to replicate the VM disks between the three servers in the cluster so that if one server went down, the VMs would restart on the two remaining servers without data loss. This method worked very well, but GlusterFS is no longer maintained, so I need to switch to Ceph to achieve equivalent functionality.

I have therefore installed a Proxmox cluster of 3 dedicated servers with a pair of disks in RAID 1 for the system and two additional 1TB disks for Ceph, for a total of 2TB per server and 6TB in total across the cluster. The cluster is working correctly. I therefore have 6TB in total for Ceph, theoretically 2TB of which is usable.

If one of the hypervisors encounters a problem, resulting in two OSDs being unavailable, Ceph will rebalance the PGs that were on those missing OSDs to the remaining OSDs. This will generate a lot of I/O and consume a lot of disk space unnecessarily. Therefore, I should theoretically use a maximum of 1.3 TB for Ceph to have enough space to rebalance the missing PGs (i.e., a "Safe near-full ratio" of 0.67).
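
One way to arrive at that 0.67 figure, sketched as arithmetic (assuming the failed node's 2 TB would have to be re-replicated onto the remaining 4 TB of raw capacity):

```shell
# After a node failure, 4 TB of the original 6 TB raw capacity remains,
# so all data must fit in 4/6 of the cluster's raw capacity
awk 'BEGIN { total_tb = 6; lost_tb = 2; printf "%.2f\n", (total_tb - lost_tb) / total_tb }'
# prints 0.67
```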

Given that:
- These three servers are strictly identical in terms of hardware.
- The number of servers on the cluster will always remain the same (3). No servers will be added or removed during the cluster's lifetime.
- The number of OSDs on the cluster will always remain the same (6). No OSDs will be added or removed during the cluster's lifetime.
- If one of the three servers fails, it will generally be repaired within an hour, or at worst, within 3 days.
- If the server is truly beyond repair, a new cluster will be created with three new servers, and the VMs will be migrated to it.

How can I disable the function that rebalances PGs to the remaining OSDs if one server fails, resulting in the loss of two OSDs?

I would prefer not to completely disable the "balancer" module, as it's important for rebalancing PGs when the cluster is functioning normally, if I understand correctly. I've seen that it's possible to set a specific date and time for the balancer to run, defining a maximum rebalancing percentage per execution. If the rebalancing amount is very limited, this could give me time to either repair the down server or create a new cluster.

Thank you for your help.
 
The "noout" option is reserved for maintenance. Using it puts the cluster into warning mode. I want the cluster status to be "HEALTH_OK" when everything is working correctly. Furthermore, I'll never know if a disk is experiencing a problem if all OSDs are in "noout".

Won't changing "mon_osd_down_out_interval" to 1 hour risk delaying cluster alerts for a faulty OSD? Also, 1 hour seems too short given my needs. As mentioned, a server can be down for several days, although this is rare.

In my configuration, I don't need the rebalancing of missing OSDs onto the remaining ones. It's only this functionality that I want to disable, or at the very least, severely limit.
 
In a 3-node cluster with size=3, min_size=2 and the default replica rule, Ceph won't rebalance anything if one node fails. To comply with the default rule ("three replicas on three OSDs located in three different servers") you need 3 servers; if only 2 are available, your PGs will stay undersized until the third server is back.

A completely different scenario happens if a single disk of a server fails: after 10 minutes, the replicas that were on the failed disk will be recreated on the remaining OSD of that host, with the risk of filling it up. Your effective "Safe near-full ratio" is around 0.40. Two OSDs per host is the worst design: either use a single OSD or use at least 3 to reduce the failure domain size and raise the "Safe near-full ratio".
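
A rough sketch of where that ~0.40 comes from, assuming two 1 TB OSDs per host and the default nearfull ratio of 0.85: after one disk fails, everything the host held must fit on the surviving OSD while staying below nearfull.

```shell
# Hypothetical numbers: 2 x 1 TB OSDs per host, default nearfull ratio 0.85.
# After one OSD fails, the surviving 1 TB OSD must absorb the whole host's
# data and stay below nearfull, so the safe fraction of the host's 2 TB raw
# capacity is (1 TB * 0.85) / 2 TB:
awk 'BEGIN {
  osd_size_tb = 1.0; osds_per_host = 2; nearfull = 0.85
  printf "%.3f\n", (osd_size_tb * nearfull) / (osd_size_tb * osds_per_host)
}'
# prints 0.425
```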

You can fully disable Ceph's self-healing using ceph config set mon mon_osd_min_in_ratio 1.01 or, as mentioned, raise mon_osd_down_out_interval as much as needed.
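
For reference, the two options would be applied like this (mon_osd_down_out_interval is in seconds; 259200 s is 3 days, matching the worst-case repair time mentioned above):

```shell
# Option A: never mark OSDs out automatically (disables all self-healing)
ceph config set mon mon_osd_min_in_ratio 1.01

# Option B: wait up to 3 days (259200 seconds) before marking a down OSD out
ceph config set mon mon_osd_down_out_interval 259200
```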
 
Thanks VictorSTS. I understand better now, I think.

If I set "mon_osd_min_in_ratio" to 1.01, will I still get an alert if an OSD becomes unavailable?

Actually, mon_osd_min_in_ratio is 0.750000. If I lose one disk out of six, the ratio is 0.83333, which is greater than 0.75. Why would Ceph rebalance the PGs from the faulty disk to the functional disk of the same node in this case?
 
Okay. So if I understand correctly, if I set mon_osd_min_in_ratio to 1.01, my OSDs will never go OUT. Is that it?
I'll get an alert from Ceph if an OSD is faulty?
 
Okay. That could be it.

From what I understand, it is simply not possible to disable the balancer only in the case of a broken OSD?

Can't I just adjust the balancer's date and time by setting a very low percentage per run? For example, with:
- target_max_misplaced_ratio set to 0.01
- mgr/balancer/begin_weekday set to 6
- mgr/balancer/end_weekday set to 6
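
For what it's worth, those settings would be applied as config options on the MGR (check the balancer docs for how the weekday window is interpreted; 0 is Sunday, 6 is Saturday):

```shell
# Throttle the balancer: at most 1% of PGs misplaced per run,
# and restrict the days on which it is allowed to run
ceph config set mgr target_max_misplaced_ratio 0.01
ceph config set mgr mgr/balancer/begin_weekday 6
ceph config set mgr mgr/balancer/end_weekday 6
```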
 
Seems you are mixing concepts here: the balancer MGR module doesn't do the recovery/backfill when an OSD goes IN/OUT; that is a core feature of Ceph managed by the MONs and OSDs, not by a MGR module. The balancer's function is to spread PGs among all available OSDs and try to assign a similar number of PGs to each OSD, so that both the used space and the number of PGs each OSD is primary for are similar.

In fact, the balancer won't act if the cluster isn't healthy [1].

IMHO, you should install a test cluster using nested virtualization and practice every setting and scenario so you can get Ceph to do exactly what you need (although ceph config set mon mon_osd_min_in_ratio 1.01 is all that you need to get what you're asking for).

[1] https://docs.ceph.com/en/squid/rados/operations/balancer/#throttling
 
A few clarifications that might help, building on @VictorSTS's points:

Your node-failure concern is already handled by CRUSH

With `size=3` and the default `chooseleaf host` rule, each PG requires one copy on each of 3 distinct hosts. When a full node goes down and its OSDs are eventually marked out, Ceph tries to restore 3 copies — but with only 2 remaining hosts it simply can't. PGs go degraded (2/3) and stay that way until the node comes back. No cross-node data movement happens. So the rebalancing you're worried about for the node-failure case doesn't occur in this topology.

The single-disk case is the real concern

With 2 OSDs per node, if one disk fails, the surviving OSD on that same node is still up. CRUSH keeps the host's replica assignment and shifts it to the surviving OSD — after `mon_osd_down_out_interval` (default 10 min), the failed OSD goes out and the surviving OSD absorbs all its PGs. That's a real intra-node data movement that can bring the surviving OSD close to full.

A more targeted knob: `mon_osd_down_out_subtree_limit`

There's a setting that distinguishes between the two cases automatically:

Bash:
ceph config set mon mon_osd_down_out_subtree_limit host

Before marking any OSD out, the monitor checks whether the OSD's entire containing CRUSH bucket of that type is down. Set to `host`:

- Full node failure (both OSDs down): the entire host bucket is down → the out-timer is reset rather than expiring → OSDs never go OUT → no recovery triggered
- Single-disk failure (one OSD down, other still up): the host bucket is only partially down → timer proceeds normally → OSD goes OUT after the interval → intra-node recovery runs as usual

The default is `rack`, which has no effect in a cluster without explicit rack buckets. Changing it to `host` is a runtime config change, no restart needed.
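
You can confirm the active value afterwards with:

```shell
ceph config get mon mon_osd_down_out_subtree_limit
```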

`mon_osd_min_in_ratio 1.01` (as suggested above) also prevents out-marking, but it's cluster-wide — it would also suppress the single-disk recovery you want to keep. `mon_osd_down_out_subtree_limit = host` is more precise for your use case.

See https://docs.ceph.com/en/squid/rado...ction/#confval-mon_osd_down_out_subtree_limit .

On alerts

`mon_osd_down_out_interval` and `mon_osd_min_in_ratio` only affect the OUT transition, not the DOWN detection. An OSD going DOWN is reflected in `ceph status` and the PVE WebUI immediately — the health warning appears regardless of any of these settings. For email/pager alerts you'd need to configure the MGR alerts module, but you won't "miss" a failure just because out-marking is disabled.
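
If you do want email notifications, a minimal alerts-module setup looks roughly like this (the SMTP host and addresses are placeholders; adjust for your environment):

```shell
# Enable the MGR alerts module and point it at an SMTP relay
ceph mgr module enable alerts
ceph config set mgr mgr/alerts/smtp_host smtp.example.com
ceph config set mgr mgr/alerts/smtp_destination admin@example.com
ceph config set mgr mgr/alerts/smtp_sender ceph@example.com
# Trigger an immediate send to verify the configuration
ceph alerts send
```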
 