Proxmox/Ceph - Disable OSD rebalancing

makwa

New Member
Mar 23, 2026
Hello,

I have several Proxmox clusters of 3 dedicated servers each. Previously, I used GlusterFS to replicate the VM disks between the three servers in the cluster so that if one server went down, the VMs would restart on the two remaining servers without data loss. This method worked very well, but GlusterFS is no longer maintained, so I need to switch to Ceph to achieve equivalent functionality.

I have therefore installed a Proxmox cluster of 3 dedicated servers with a pair of disks in RAID 1 for the system and two additional 1TB disks for Ceph, for a total of 2TB per server and 6TB in total across the cluster. The cluster is working correctly. I therefore have 6TB in total for Ceph, theoretically 2TB of which is usable.

If one of the hypervisors encounters a problem, resulting in two OSDs being unavailable, Ceph will rebalance the PGs that were on those missing OSDs to the remaining OSDs. This will generate a lot of I/O and consume a lot of disk space unnecessarily. Therefore, I should theoretically use a maximum of 1.3 TB for Ceph to have enough space to rebalance the missing PGs (i.e., a "Safe near-full ratio" of 0.67).
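
One way to arrive at that 0.67 figure, sketched as arithmetic (assuming the failed node's 2 TB would have to be re-replicated onto the remaining 4 TB of raw capacity):

```shell
# After a node failure, 4 TB of the original 6 TB raw capacity remains,
# so all data must fit in 4/6 of the cluster's raw capacity
awk 'BEGIN { total_tb = 6; lost_tb = 2; printf "%.2f\n", (total_tb - lost_tb) / total_tb }'
# prints 0.67
```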

Given that:
- These three servers are strictly identical in terms of hardware.
- The number of servers on the cluster will always remain the same (3). No servers will be added or removed during the cluster's lifetime.
- The number of OSDs on the cluster will always remain the same (6). No OSDs will be added or removed during the cluster's lifetime.
- If one of the three servers fails, it will generally be repaired within an hour, or at worst, within 3 days.
- If the server is truly beyond repair, a new cluster will be created with three new servers, and the VMs will be migrated to it.

How can I disable the function that rebalances PGs to the remaining OSDs if one server fails, resulting in the loss of two OSDs?

I would prefer not to completely disable the "balancer" module, as it's important for rebalancing PGs when the cluster is functioning normally, if I understand correctly. I've seen that it's possible to set a specific date and time for the balancer to run, defining a maximum rebalancing percentage per execution. If the rebalancing amount is very limited, this could give me time to either repair the down server or create a new cluster.

Thank you for your help.
 
The "noout" option is reserved for maintenance. Using it puts the cluster into warning mode. I want the cluster status to be "HEALTH_OK" when everything is working correctly. Furthermore, I'll never know if a disk is experiencing a problem if all OSDs are in "noout".

Won't changing "mon_osd_down_out_interval" to 1 hour risk delaying cluster alerts for a faulty OSD? Also, 1 hour seems too short given my needs. As mentioned, a server can be down for several days, although this is rare.

In my configuration, I don't need the rebalancing of missing OSDs onto the remaining ones. It's only this functionality that I want to disable, or at the very least, severely limit.
 
In a 3-node cluster with size=3, min_size=2 and the default replica rule, Ceph won't rebalance anything if one node fails. To comply with the default rule ("three replicas on three OSDs located in three different servers") you need 3 servers; if only 2 are available, your PGs will stay undersized until the third server is back.

A completely different scenario happens if a single disk of a server fails: after 10 minutes, the replicas that were on the failed disk will be recreated on the remaining OSD of that host, with the risk of filling it up. Your effective "Safe near-full ratio" is around 0.40. Two OSDs per host is the worst design: either use a single OSD or use at least 3 to reduce the failure domain size and raise the "Safe near-full ratio".
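
A rough sketch of where that ~0.40 comes from, assuming two 1 TB OSDs per host and the default nearfull ratio of 0.85: after one disk fails, everything the host held must fit on the surviving OSD while staying below nearfull.

```shell
# Hypothetical numbers: 2 x 1 TB OSDs per host, default nearfull ratio 0.85.
# After one OSD fails, the surviving 1 TB OSD must absorb the whole host's
# data and stay below nearfull, so the safe fraction of the host's 2 TB raw
# capacity is (1 TB * 0.85) / 2 TB:
awk 'BEGIN {
  osd_size_tb = 1.0; osds_per_host = 2; nearfull = 0.85
  printf "%.3f\n", (osd_size_tb * nearfull) / (osd_size_tb * osds_per_host)
}'
# prints 0.425
```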

You can fully disable Ceph's self-healing using ceph config set mon mon_osd_min_in_ratio 1.01 or, as mentioned, raise mon_osd_down_out_interval as much as needed.
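
For reference, the two options would be applied like this (mon_osd_down_out_interval is in seconds; 259200 s is 3 days, matching the worst-case repair time mentioned above):

```shell
# Option A: never mark OSDs out automatically (disables all self-healing)
ceph config set mon mon_osd_min_in_ratio 1.01

# Option B: wait up to 3 days (259200 seconds) before marking a down OSD out
ceph config set mon mon_osd_down_out_interval 259200
```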
 
Thanks VictorSTS. I understand better now, I think.

If I set "mon_osd_min_in_ratio" to 1.01, will I still get an alert if an OSD becomes unavailable?

Actually, mon_osd_min_in_ratio is 0.750000. If I lose one disk out of six, the ratio is 0.83333, which is greater than 0.75. Why would Ceph rebalance the PGs from the faulty disk to the functional disk of the same node in this case?
 
Okay. So if I understand correctly, if I set mon_osd_min_in_ratio to 1.01, my OSDs will never go OUT. Is that it?
I'll get an alert from Ceph if an OSD is faulty?
 
Okay. That could be it.

From what I understand, it is simply not possible to disable the balancer only in the case of a broken OSD?

Can't I just adjust the balancer's date and time by setting a very low percentage per run? For example, with:
- target_max_misplaced_ratio set to 0.01
- mgr/balancer/begin_weekday set to 6
- mgr/balancer/end_weekday set to 6
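
For what it's worth, those settings would be applied as config options on the MGR (check the balancer docs for how the weekday window is interpreted; 0 is Sunday, 6 is Saturday):

```shell
# Throttle the balancer: at most 1% of PGs misplaced per run,
# and restrict the days on which it is allowed to run
ceph config set mgr target_max_misplaced_ratio 0.01
ceph config set mgr mgr/balancer/begin_weekday 6
ceph config set mgr mgr/balancer/end_weekday 6
```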
 
Seems you are mixing concepts here: the balancer MGR module doesn't do the recovery/backfill when an OSD goes IN/OUT; that is a core feature of Ceph managed by the MONs and OSDs, not by a MGR module. The balancer's function is to spread PGs among all available OSDs and try to assign a similar number of PGs to each OSD, so that both the used space and the number of PGs each OSD is primary for are similar.

In fact, the balancer won't act if the cluster isn't healthy [1].

IMHO, you should install a test cluster using nested virtualization and practice every setting and scenario so you can get Ceph to do exactly what you need (although ceph config set mon mon_osd_min_in_ratio 1.01 is all that you need to get what you're asking for).

[1] https://docs.ceph.com/en/squid/rados/operations/balancer/#throttling
 
A few clarifications that might help, building on @VictorSTS's points:

Your node-failure concern is already handled by CRUSH

With `size=3` and the default `chooseleaf host` rule, each PG requires one copy on each of 3 distinct hosts. When a full node goes down and its OSDs are eventually marked out, Ceph tries to restore 3 copies — but with only 2 remaining hosts it simply can't. PGs go degraded (2/3) and stay that way until the node comes back. No cross-node data movement happens. So the rebalancing you're worried about for the node-failure case doesn't occur in this topology.

The single-disk case is the real concern

With 2 OSDs per node, if one disk fails, the surviving OSD on that same node is still up. CRUSH keeps the host's replica assignment and shifts it to the surviving OSD — after `mon_osd_down_out_interval` (default 10 min), the failed OSD goes out and the surviving OSD absorbs all its PGs. That's a real intra-node data movement that can bring the surviving OSD close to full.

A more targeted knob: `mon_osd_down_out_subtree_limit`

There's a setting that distinguishes between the two cases automatically:

Bash:
ceph config set mon mon_osd_down_out_subtree_limit host

Before marking any OSD out, the monitor checks whether the OSD's entire containing CRUSH bucket of that type is down. Set to `host`:

- Full node failure (both OSDs down): the entire host bucket is down → the out-timer is reset rather than expiring → OSDs never go OUT → no recovery triggered
- Single-disk failure (one OSD down, other still up): the host bucket is only partially down → timer proceeds normally → OSD goes OUT after the interval → intra-node recovery runs as usual

The default is `rack`, which has no effect in a cluster without explicit rack buckets. Changing it to `host` is a runtime config change, no restart needed.
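
You can confirm the active value afterwards with:

```shell
ceph config get mon mon_osd_down_out_subtree_limit
```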

`mon_osd_min_in_ratio 1.01` (as suggested above) also prevents out-marking, but it's cluster-wide — it would also suppress the single-disk recovery you want to keep. `mon_osd_down_out_subtree_limit = host` is more precise for your use case.

See https://docs.ceph.com/en/squid/rado...ction/#confval-mon_osd_down_out_subtree_limit .

On alerts

`mon_osd_down_out_interval` and `mon_osd_min_in_ratio` only affect the OUT transition, not the DOWN detection. An OSD going DOWN is reflected in `ceph status` and the PVE WebUI immediately — the health warning appears regardless of any of these settings. For email/pager alerts you'd need to configure the MGR alerts module, but you won't "miss" a failure just because out-marking is disabled.
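
If you do want email notifications, a minimal alerts-module setup looks roughly like this (the SMTP host and addresses are placeholders; adjust for your environment):

```shell
# Enable the MGR alerts module and point it at an SMTP relay
ceph mgr module enable alerts
ceph config set mgr mgr/alerts/smtp_host smtp.example.com
ceph config set mgr mgr/alerts/smtp_destination admin@example.com
ceph config set mgr mgr/alerts/smtp_sender ceph@example.com
# Trigger an immediate send to verify the configuration
ceph alerts send
```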
 