Ceph OSD - Expand Current OSDs or Add New OSDs

beav

New Member
Aug 8, 2024
Hello, the existing Proxmox cluster is 3 nodes using 1TB NVMe drives, which were recently cloned onto 2TB replacements.

The partition table was cloned from the source drive and includes a 750GB partition on each node serving as a single OSD for the Ceph cluster.

Question is, do I:
1. Expand the existing 750GB partition, keeping (1x) OSD per host, or
2. Create a new, additional partition to be used for an additional OSD, making it (2x) OSDs per host?

Are there pros/cons for either?

For #1, will Ceph automatically recognize the additional space once I resize the partition?

Thanks ahead for any insight!
 
No need to do all that... If you are using the default replicated pool with 3/2 size/min_size, just add one 2TB drive to each of the 3 servers, add 3 OSDs and wait for the rebalance. Then remove the 1TB OSD from one node at a time (down, out, destroy), waiting for the cluster to rebalance in between.

If you have already replaced the drives with 2TB ones (as it seems), destroy one current OSD and recreate it using the whole drive. Wait for the rebalance to finish, then do the same for the other two, one by one, waiting for the rebalance to finish each time.
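Roughly, that replace-in-place cycle would look something like the sketch below on each node. Treat it as a sketch only: the OSD id and partition path are examples, and it assumes the OSD lives on a partition of the boot NVMe rather than on a dedicated disk.

Bash:
ceph osd out 0                        # mark the old OSD out ("down, out, destroy")
systemctl stop ceph-osd@0.service     # take it down
pveceph osd destroy 0 --cleanup       # remove it and wipe its volumes
# Grow (or recreate) the data partition so it covers the new space, then
# recreate the OSD on it, e.g.:
ceph-volume lvm create --data /dev/nvme0n1p4   # example partition; adjust to your layout
ceph -s                               # wait for recovery to finish (HEALTH_OK)
# Only then repeat on the next node, one OSD at a time.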
 
Unfortunately, I can only use one NVMe slot in this config.

Yes, I have already replaced them with the 2TB drives, so I will likely go with the method you described. But is there any advantage to having two OSDs per host over one per host in the Proxmox cluster? For example, is there a performance boost from additional spindles, like you get when adding physical drives to a storage array?

Thank you for the reply!!!
 
If your bottleneck is your CPU rather than your drive, having more than one OSD per drive may help you get more IOPS. It must be benchmarked, as some drives simply have a single-threaded controller and will not benefit at all from this. It also means using more RAM to run the OSD processes, and a harder-to-maintain Ceph cluster, as you have to monitor more drives/partitions for free space and balancing. IMHO, unless you really need those extra IOPS and have benchmarked them, it isn't worth it.

In your situation, adding a second partition is the safest procedure: you always keep 3 replicas of your data. Removing an OSD and adding back a 2TB one, on the other hand, means you are left with "just" 2 copies of your data for a short time while the replicas are recreated on the new OSD.
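For reference, the second-partition route might look roughly like this on each node; the device and partition numbers are examples, so double-check them against your own layout before running anything.

Bash:
sgdisk -n 0:0:0 /dev/nvme0n1                   # new partition in the remaining free space
partprobe /dev/nvme0n1                         # re-read the partition table
ceph-volume lvm create --data /dev/nvme0n1p5   # example partition; adds it as a second OSD
ceph osd df tree                               # check placement and watch the rebalance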
 
Perfect answer, thanks for the insight!
 
Nah, it's easily possible:

I've previously followed
https://stackoverflow.com/questions/68884564/how-to-expand-ceph-osd-on-lvm-volume

Bash:
# Resized 680GB OSDs on an 800GB SSD on pve2, following
# https://stackoverflow.com/questions/68884564/how-to-expand-ceph-osd-on-lvm-volume
# Expand the LV into the rest of the free space on the PV, take the OSD offline, then:
ceph-bluestore-tool bluefs-bdev-expand --path <osd path>
# Default <osd path> is /var/lib/ceph/osd/ceph-<ID>/
# Ignore these error messages (see https://github.com/rook/rook/issues/2997):
#   2022-08-28T21:00:43.313+0000 7f59c45ad700 -1 bluestore(/var/lib/ceph/osd/ceph-<ID>) _read_bdev_label failed to read from /var/lib/ceph/osd/ceph-<ID>: (21) Is a directory
#   2022-08-28T21:00:43.313+0000 7f59c45ad700 -1 bluestore(/var/lib/ceph/osd/ceph-<ID>) unable to read label for /var/lib/ceph/osd/ceph-<ID>: (21) Is a directory
# Then bring the OSD back online. Reported free space will be incorrect per
# https://tracker.ceph.com/issues/63858, so kill -9 the OSD process and restart it.
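Putting those notes together, the whole expansion as one sequence would look roughly like this; the OSD id and VG/LV names are examples, so take the real ones from `lvs` and `ceph osd df` first.

Bash:
ceph osd set noout                                  # don't rebalance while the OSD is briefly down
systemctl stop ceph-osd@5.service                   # example OSD id
lvextend -l +100%FREE /dev/ceph-block/osd-block-5   # example VG/LV names; grow into the free space
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-5
systemctl start ceph-osd@5.service
ceph osd unset noout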

For an OSD with a DB device that I had to expand, I followed this:

https://k0ste.ru/how-to-expand-bluefs-db-device-of-ceph-bluestore-osd.html
https://documentation.suse.com/fr-fr/ses/7.1/html/ses-all/bp-troubleshooting-status.html

Bash:
# lvextend -L+2G /dev/ceph-dbwal/ceph-slowssd-dbwal
# ceph tell osd.5 compact
# systemctl stop ceph-osd@5.service
# systemctl status ceph-osd@5.service
# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-5
# systemctl start ceph-osd@5.service
# ceph tell osd.5 compact

(er, I assume the second `ceph tell compact` was from when I was still hitting the same "spillover of db onto slow device" warning, before I discovered I had expanded the wrong DB)
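For what it's worth, whether the spillover warning actually cleared should show up in the health output after the restart, e.g.:

Bash:
ceph health detail    # the BLUEFS_SPILLOVER warning should be gone once the right DB device is expanded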

Fortunately, this time around I had notes from the previous attempt, because this old question was the first hit I got when searching for how to repeat the exercise!