Ceph design (Number of OSDs per NVMe Disk), rebalance problems

Balancer still reports:
{
    "active": true,
    "last_optimize_duration": "0:00:00.020016",
    "last_optimize_started": "Fri Jun 16 14:21:11 2023",
    "mode": "upmap",
    "optimize_result": "Optimization plan created successfully",
    "plans": []
}

ceph osd get-require-min-compat-client reports jewel

and ceph versions
{
    "mon": {
        "ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)": 2
    },
    "osd": {
        "ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)": 140
    },
    "mds": {},
    "overall": {
        "ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)": 145
    }
}

Should I try and set ceph osd set-require-min-compat-client luminous?
 
The reason I ask is that I'm worried it might go crazy again and start rebalancing, killing all bandwidth again. It is Friday :)
Should I tune mclock settings in any way before doing this?
 
Should I try and set ceph osd set-require-min-compat-client luminous?
Yes.
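Before raising it, a quick sanity check (not strictly required, just reassuring) is to look at which feature releases the currently connected clients report, so nothing older than luminous gets locked out:

# Show the feature releases of all currently connected daemons and clients
ceph features

# If no client older than luminous shows up, raise the requirement
ceph osd set-require-min-compat-client luminous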

The reason I ask is that I'm worried it might go crazy again and start rebalancing, killing all bandwidth again. It is Friday :)
The balancer won't cause a lot of load and will move PGs (replicas thereof) slowly between the OSDs.

Changing the pg_num of the pool is what can cause a lot of load ;)
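If you want to see what the balancer would actually do before letting it run, you can build and inspect a plan by hand (a sketch; the plan name "myplan" is just an example):

ceph balancer eval                  # score of the current PG distribution
ceph balancer optimize myplan       # create a plan manually
ceph balancer show myplan           # list the pg-upmap changes it would apply
ceph balancer rm myplan             # discard it if you don't want to run it now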
 
Tried ceph osd set-require-min-compat-client luminous, but the number of backfills increased very quickly along with bandwidth usage, so I had to pause it by setting the global nobackfill flag.
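For reference, pausing and later resuming the data movement uses the standard cluster flags (norebalance added here as an optional extra, it wasn't mentioned above):

ceph osd set nobackfill       # pause backfill
ceph osd set norebalance      # optionally also stop rebalancing
# ... tune settings or wait for a quieter moment ...
ceph osd unset norebalance
ceph osd unset nobackfill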

I ended up tuning the mClock settings as posted earlier; again, this calmed things down and the rebalance was able to complete. The balance is now much better, a 10-12% difference compared to nearly 50% before.
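The exact mClock values from the earlier post aren't quoted here; as a rough sketch, in Quincy the mClock scheduler can also be steered via its built-in profiles rather than individual limits:

# Prioritise client I/O over recovery/backfill while the cluster is busy
ceph config set osd osd_mclock_profile high_client_ops

# Or prioritise recovery/backfill when you want a rebalance to finish faster
ceph config set osd osd_mclock_profile high_recovery_ops

# Check what an OSD is currently using
ceph config show osd.0 | grep mclock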

I'm a bit worried, though, that we'll have to do this manual mClock tuning every so often to prevent the cluster from going down.
 