Hey,
Doing some maintenance on our 5-node Ceph cluster running on 7.4-3 (Linux 5.15.104-1-pve #1 SMP PVE 5.15.104-2 (2023-04-12T11:23Z)) with ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable), where we have had some fun challenges.
Our NVMe cluster currently has 28 OSDs per node, where we split each NVMe into 4 OSDs. Cluster and public networks are 25Gb.
We have now added 2 disks to every node, which will mean 8 new OSDs per node. Given the challenges we've had with the cluster going a bit crazy just from outing and destroying one single OSD, I need to be a bit careful.
In the past we've used "ceph-volume lvm batch --osds-per-device 4 /dev/" for a new disk, which works fine, but the new OSDs come online and get marked in instantly, even if I set global flags like "noin" first. Not sure how well this will end if I do this on all 5 nodes.
I tried messing around with some scenarios in a lab cluster, but I couldn't really find a way to add all the new OSDs on all nodes as up but out. I tried things like ceph-volume lvm batch --osds-per-device 4 --prepare /dev/ and then issued ceph-volume lvm activate --all on each node. The only way I could get them to not start was with the noup flag, but I'm not sure that's the best approach; everything was down and about half got marked in with this flag...
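The lab sequence was roughly this (device path again a placeholder):

    ceph osd set noup
    # prepare the OSDs only, without activating or starting them
    ceph-volume lvm batch --osds-per-device 4 --prepare /dev/nvmeXn1
    # then, later on each node, activate everything that was prepared
    ceph-volume lvm activate --all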
Any suggestions on how to cause the least amount of pain and avoid having to rebalance, rebalance and rebalance again, which I guess will be the case if I do this one disk and one node at a time? The average load during a calmer period is about 900 MiB/s reads and 300 MiB/s writes.
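One thing I've been wondering about, as a rough sketch only (I haven't verified that this keeps the new OSDs out of the data path, and the reweight target is just a placeholder):

    # assumption: with an initial CRUSH weight of 0 the new OSDs come up and in,
    # but don't receive any PGs until they are reweighted
    ceph config set osd osd_crush_initial_weight 0
    ceph osd set norebalance
    ceph osd set nobackfill
    # ... create the 8 OSDs per node on all 5 nodes with ceph-volume lvm batch ...
    # then ramp each new OSD up gradually, e.g. towards its size in TiB
    ceph osd crush reweight osd.<id> <weight-in-TiB>
    ceph osd unset nobackfill
    ceph osd unset norebalance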
--Mats