Ceph OSD slow ops on HDD-backed pool after enabling balancer — normal or misconfig?

atlas32

Jun 15, 2026
Hi,

3-node hyperconverged cluster, PVE 8.x, Ceph Reef. Each node has:
- 2x NVMe (DB/WAL devices)
- 6x 4TB SAS HDD (OSDs, bluestore, db on NVMe partition)
- separate 10GbE cluster network

I enabled the ceph balancer in upmap mode a few days ago because PG distribution
was uneven (some OSDs at 65% used, some at 38%):

ceph balancer mode upmap
ceph balancer on

Since then I see periodic "slow ops" warnings during balancer activity, mostly on
the HDD OSDs:

HEALTH_WARN 1 slow ops, oldest one blocked for 34 sec, osd.7 has slow ops

Cluster is otherwise healthy, no scrub errors, network looks clean (no
retransmits, MTU 9000 end-to-end and verified).
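
When several OSDs show up in the warning at once, a small filter can pull out just the OSD names. This is only a sketch: `slow_osds` is a made-up helper name, and it simply greps for `osd.<number>` in whatever text it is given.

```shell
# Hypothetical helper: print the unique OSD names mentioned in health output.
slow_osds() {
    grep -oE 'osd\.[0-9]+' | sort -u
}

# On a cluster node this would be fed from the real command:
#   ceph health detail | slow_osds
# Here it is fed the warning line quoted above.
echo "HEALTH_WARN 1 slow ops, oldest one blocked for 34 sec, osd.7 has slow ops" | slow_osds
```

Run repeatedly during balancer activity, this shows whether the same few disks keep appearing or whether the slow ops wander across all HDD OSDs.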

Is this expected on HDD-backed OSDs during rebalancing, or am I missing a tuning
knob? VMs on the pool are not screaming yet but I'd rather fix it before they do.
 
I had the same when I first turned on the balancer, and occasionally when replacing a failed OSD. It never caused problems for the VMs, other than an occasional read taking a long time to return data. Normally the balancer makes just one or a few small changes to the upmap and then checks again a while later (the interval is mgr/balancer/sleep_interval, 60 seconds by default as far as I know). So it will sometimes look like it is done, and then it moves more data around again. Eventually it settles down.
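
To watch what the balancer is doing between rounds, the following commands may help. This is a sketch for a cluster node with an admin keyring; the output is not shown because it varies by cluster.

```shell
# Is the balancer active, in which mode, and is a plan currently executing?
ceph balancer status

# Score of the current distribution (lower is better)
ceph balancer eval

# Fraction of PGs the balancer allows to be misplaced at once before it pauses
ceph config get mgr target_max_misplaced_ratio
```

If the score keeps dropping between checks, the balancer is still working through its small steps rather than being stuck.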

You can make the cluster less aggressive about moving data during backfill and recovery, which leaves more I/O for regular workloads (VMs and such). Look at options such as osd_recovery_max_active_hdd and osd_recovery_sleep_hdd, which let you tune recovery and backfill operations. https://docs.ceph.com/en/latest/rad...nfig-ref/#confval-osd_recovery_max_active_hdd
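
A minimal sketch of what that tuning could look like. The values are illustrative, not recommendations, and they apply to the wpq queue; with osd_op_queue set to mclock_scheduler the recovery sleep options are disabled by the scheduler.

```shell
# Check the current values first
ceph config get osd osd_recovery_sleep_hdd
ceph config get osd osd_recovery_max_active_hdd

# Illustrative values: longer pause between recovery ops, fewer in parallel
ceph config set osd osd_recovery_sleep_hdd 0.2
ceph config set osd osd_recovery_max_active_hdd 1
```

Both can be removed again with `ceph config rm osd <option>` once the cluster has settled.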
 
Quick follow-up — solved (or rather, tamed) it. Posting the full picture for the
archives.

Root cause was a mix of three things:

1) Default balancer pace is fine for all-flash, too aggressive for spinning rust.
2) osd_max_backfills and osd_recovery_max_active defaults assume your OSDs can
keep up. HDDs can't.
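3) I had osd_op_queue still on "wpq". Switching to "mclock_scheduler" made a big
difference. (There is no HDD-specific mClock profile; the built-in ones are
balanced, high_client_ops and high_recovery_ops.)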

What I changed:

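# throttle backfill/recovery on HDD OSDs
# (once mclock_scheduler is active, these backfill/recovery limits are only
# honored if osd_mclock_override_recovery_settings is set to true)
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_op_priority 1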

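# use mclock scheduler (osd_op_queue only takes effect after an OSD restart)
# "balanced" splits capacity between client and recovery IO;
# "high_client_ops" is the profile that favors client IO more strongly
ceph config set osd osd_op_queue mclock_scheduler
ceph config set osd osd_mclock_profile balanced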

# slow down balancer
ceph config set mgr mgr/balancer/sleep_interval 120
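
For anyone repeating this, it is worth confirming that a running OSD actually picked up the values, since what is stored in the config database and what the daemon is using can differ until a restart. A sketch, using osd.7 only because it is the OSD from the warning above:

```shell
# What the running daemon is using right now
ceph config show osd.7 osd_op_queue
ceph config show osd.7 osd_mclock_profile
ceph config show osd.7 osd_max_backfills

# What is stored for the balancer module
ceph config get mgr mgr/balancer/sleep_interval
```

If osd_op_queue still reports the old value, the OSD has not been restarted since the change.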

After ~24h of letting it work in the background, slow ops stopped appearing and
the standard deviation across OSDs dropped from ~12% to ~3%.
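
`ceph osd df` already prints a STDDEV figure in its summary line. To compute the spread for a subset of OSDs (for example only the HDD ones), a small awk helper is enough. This is a sketch: `util_stddev` is a made-up name, it computes the population standard deviation, and the sample percentages at the bottom are illustrative, not my real numbers.

```shell
# Hypothetical helper: population standard deviation of numbers on stdin,
# one value per line.
util_stddev() {
    awk '{ sum += $1; sumsq += $1 * $1; n++ }
         END {
             if (n == 0) exit 1
             mean = sum / n
             var = sumsq / n - mean * mean
             if (var < 0) var = 0      # guard against rounding noise
             printf "%.2f\n", sqrt(var)
         }'
}

# On a cluster node (assumes jq is installed and that the JSON output has a
# per-OSD "utilization" field under "nodes"):
#   ceph osd df -f json | jq '.nodes[].utilization' | util_stddev

# Illustrative utilization percentages
printf '%s\n' 38 45 52 65 | util_stddev
```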

One more thing worth saying: on HDD pools, "fast rebalance" is the wrong goal.
You want "invisible rebalance" — slow, but never impacting client IO. The
defaults are tuned for SSD/NVMe and need to be relaxed on HDD-backed setups.

Reference for anyone going deeper:
- Ceph docs: mClock scheduler
- "ceph daemon osd.X dump_historic_slow_ops" is your friend for identifying
what's actually slow.
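
A sketch of how that looks in practice. `ceph daemon` talks to the local admin socket, so it has to be run on the node that hosts the OSD; osd.7 and the output path are just examples.

```shell
# Recent ops that exceeded the slow-op threshold, saved for later inspection
ceph daemon osd.7 dump_historic_slow_ops > /tmp/osd7-slow-ops.json

# Ops currently in progress on the same OSD
ceph daemon osd.7 dump_ops_in_flight
```

The per-op event timestamps in the dump show where the time went, for example waiting in the queue versus waiting on the disk.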