Multiple OSDs for NVMe drives?

ozdjh

Oct 8, 2019
Hi

We'll be evaluating proxmox & ceph over the coming weeks and want to ensure we have a good starting point for benchmarking. We've been running a hyperconverged all-flash platform for about 7 years but it's not based on ceph. We're reading heaps trying to understand the best deployment model.

The tuning guide for all-flash deployments on the ceph.com site states that running a single OSD per physical NVMe device cannot take advantage of the performance available. We will be running 100% NVMe devices for storage (2TB drives) so this is important to us. That article was posted over 2 years ago so I'm wondering if it's still valid with the improvements to ceph?

The article recommends running 4 OSDs per device. If that's the best configuration I assume we'll have to set that up manually as I haven't seen any way to define an OSD through the GUI that doesn't reference the entire disk. Also, it looks like Ceph uses 2 partitions per OSD (metadata and storage). If we need to create 8 partitions to support 4 OSDs is there a defined size ratio between metadata and storage partitions?

Any feedback on getting the most out of an all NVMe platform would be appreciated.


Thanks

David
...
 
That article was posted over 2 years ago so I'm wondering if it's still valid with the improvements to ceph?

It can still help quite a bit, so if you have the time it would probably be best to test it and compare the results yourself for your specific setup.
But as you have probably seen in the article, there are quite a few other tunings possible too. I'd go easy at first; some of them can now be changed live anyway, so you can experiment a bit later on. Rounding up the PG/PGP numbers (pg_num and pgp_num) and adjusting the threads per shard (osd_op_num_threads_per_shard) config option can help too.
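
To illustrate the "rounding up" part, here is a small shell sketch that rounds a PG count up to the next power of two. The helper name next_pow2, the pool name "mypool" and the values shown are made up for illustration and are not recommendations; the commented commands are only a rough idea of how the result might be applied.

```shell
#!/bin/sh
# Round a placement-group count up to the next power of two.
# next_pow2 is a hypothetical helper, not part of Ceph.
next_pow2() {
    n=$1
    p=1
    while [ "$p" -lt "$n" ]; do
        p=$((p * 2))
    done
    echo "$p"
}

next_pow2 100   # prints 128

# Applying it could look something like this ("mypool" and the
# thread count are placeholders, test before using on a real cluster):
#   ceph osd pool set mypool pg_num  "$(next_pow2 100)"
#   ceph osd pool set mypool pgp_num "$(next_pow2 100)"
#   ceph config set osd osd_op_num_threads_per_shard 2
```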

Maybe @Alwin has some advice in mind; he has played around with some beefier ceph setups here.

The article recommends running 4 OSDs per device. If that's the best configuration I assume we'll have to set that up manually as I haven't seen any way to define an OSD through the GUI that doesn't reference the entire disk. Also, it looks like Ceph uses 2 partitions per OSD (metadata and storage). If we need to create 8 partitions to support 4 OSDs is there a defined size ratio between metadata and storage partitions?

You can also just use the ceph-volume lvm batch --osds-per-device <numberofosd> /dev/sdX command after you've installed and configured Ceph through the Proxmox VE interface. Everything should then be well integrated, and it's not much extra effort to do manually.
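
As a hedged sketch of how that might be tried safely: ceph-volume lvm batch has a --report option that only prints what it would do without changing anything. The helper build_batch_cmd and the device name /dev/nvme0n1 below are assumptions for illustration; the script only prints the command so it can be reviewed before being run by hand.

```shell
#!/bin/sh
# Print a dry-run ceph-volume command for N OSDs on one device.
# build_batch_cmd is a hypothetical helper; it does not touch any disk.
build_batch_cmd() {
    osds=$1   # number of OSDs to create on the device
    dev=$2    # block device, e.g. /dev/nvme0n1 (placeholder)
    echo "ceph-volume lvm batch --report --osds-per-device $osds $dev"
}

build_batch_cmd 4 /dev/nvme0n1
```

Once the report looks right, the same command without --report would create the OSDs.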
 
Thanks Thomas, I hadn't seen the '--osds-per-device' option to ceph-volume. That simplifies things a lot. We'll start running this up on Friday. Once we're happy with the configuration we'll contribute back to the ceph benchmark thread to share our results.


David
...
 
I'm about to go down the road of deploying a 3 node Proxmox VE HA system using ceph. I have 1 x 4TB NVMe drive in each node that is going to be used for the ceph pool. If I configure 4 OSDs per drive, i.e.
Code:
ceph-volume lvm batch --osds-per-device 4 /dev/nvmeX
does that mean my pool's total capacity across all 3 PVE nodes would be reduced to 1TB in size?
 
does that mean my pool's total capacity across all 3 PVE nodes would be reduced to 1TB in size?

I can't see why that would affect the pool size. You still have the same raw space across the OSDs, even if they share a physical device. You will burn more RAM per node by doing this, though, as each OSD targets about 4GB of RAM by default (osd_memory_target). We didn't end up doing this. We just run 1 OSD per device and it's been rock solid for many years.
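
A back-of-envelope sketch of those numbers for the 3 node example, in shell arithmetic. It assumes a replicated pool with 3 copies (that is an assumption, the question doesn't say) and ignores BlueStore overhead and the full ratios, so treat the figures as rough.

```shell
#!/bin/sh
# Rough capacity and RAM figures for 3 nodes with one 4TB NVMe each.
nodes=3
device_tb=4          # one 4TB NVMe per node
osds_per_device=4    # splitting the device does not change raw capacity
replicas=3           # assumption: replicated pool keeping 3 copies
osd_mem_gb=4         # default memory target per OSD

raw_tb=$((nodes * device_tb))             # total raw space in the cluster
usable_tb=$((raw_tb / replicas))          # space left after replication
ram_gb=$((osds_per_device * osd_mem_gb))  # OSD memory target per node

echo "raw=${raw_tb}TB usable=${usable_tb}TB osd_ram_per_node=${ram_gb}GB"
```

Note that osds_per_device only shows up in the RAM figure, not in the capacity figures.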

If you have a choice it'd be better to run multiple SSDs per node rather than multiple OSDs on the one SSD. Even 2 x 2TB per node would help reduce your failure domain. If you only run 1 device per node then having an SSD fail is the same as having the entire node fail (which may or may not be important to you).
 
