New all-flash Proxmox Ceph installation

Jul 1, 2022
Hi all,
for our lab project we want to install a hyperconverged Proxmox environment with Ceph.

We bought 3 servers with these specs; in brackets I noted how I thought about using them:
  • 2 x CPU AMD Rome 7452, 32 cores/64 threads @ 2.35GHz (https://www.amd.com/en/products/cpu/amd-epyc-7452)
  • 16 x 64GB RDIMM (1TB RAM total)
  • 2 x 480GB M.2 SATA (ZFS mirror for the OS, plus local storage for snippets and ISOs)
  • 2 x 3.2TB NVMe 2.5" high performance (6.4TB raw, intended for caching)
  • 10 x 7.68TB NVMe 2.5" (76.8TB raw, intended for storing VM disks)
  • 2 x 10Gbit RJ45 (LACP - Proxmox cluster and VM networks, VLAN aware)
  • 2 x 10Gbit SFP+ (LACP - Ceph cluster network)
  • 2 x 1Gbit RJ45 (LACP - Ceph public network)
  • 1200W redundant power supplies, Titanium level (96%)
I'd appreciate it if you could help me understand the best configuration for the storage, DB disks, and WAL disks.

I've read a lot of threads and watched a lot of video examples explaining how to configure a Ceph cluster, and I understand that the best practice is to create 1 OSD per disk, but I really don't understand how to use DB and WAL disks, and whether it's possible to use the 2 x 3.2TB high-performance NVMe drives for caching.

I also installed a Proxmox Ceph cluster (still there, powered off) in VirtualBox to try to understand more about DB/WAL disk configuration, but it didn't help much.

Can someone help me with some suggestions?
Thank you so much.
 
That's some serious gear for a lab!


Yes it is :). We have some Proxmox clusters running. My team wants to create a lab for internal projects and tests, and we thought that at the same time we could get familiar with the Proxmox hyperconverged infrastructure. That's why.
The premium subscription price is a lot cheaper than other vendors' licenses, and our clusters work wonderfully!
I just need to understand whether that hardware is OK and how to use those cache disks.
 
I don't think you will gain anything from having NVMe cache disks when you already have NVMe storage disks. There is both cache tiering in Ceph and the concept of using separate devices for the different parts of an OSD (DB + WAL + data). In both of those scenarios I think you will gain nothing and only introduce more complexity and more risk. NVMe for DB+WAL makes sense for mechanical drives and sometimes maybe for SATA/SAS SSDs. Other people may disagree, but you can read, for example, Red Hat's or SUSE's white papers on this.
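For reference only (the point above is that it's probably not worth it on an all-NVMe box), this is roughly how a separate DB/WAL device is specified when creating an OSD in Proxmox; the device paths are placeholders, not a recommendation:

  # Sketch only: splitting DB/WAL onto another device when creating an OSD.
  # /dev/nvme2n1 (data) and /dev/nvme0n1 (DB/WAL) are placeholder device names.
  pveceph osd create /dev/nvme2n1 --db_dev /dev/nvme0n1 --wal_dev /dev/nvme0n1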

Your Ceph public network is actually the network the hypervisor storage stack (the Ceph clients) will use to access the OSDs, so you should not use a 1GbE network for that; it will demolish your performance. I would use the 2 x 10GbE for both the Ceph public and private networks, or get another 10GbE (or 25GbE) NIC for that purpose. The 1GbE NICs could be used as a secondary PVE cluster network for corosync.
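As a rough sketch of that layout (the subnets are placeholders for whatever you assign to the 10GbE bonds):

  # Sketch: initialise Ceph with separate public and cluster networks.
  # 10.10.10.0/24 and 10.10.20.0/24 are placeholder subnets for the two bonds.
  pveceph init --network 10.10.10.0/24 --cluster-network 10.10.20.0/24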

Minor comments:
  • You could use CephFS for ISOs since it works very well (see the sketch after this list).
  • You could run 3-4 OSDs per NVMe disk to get more performance out of your drives, since Ceph OSDs are mostly CPU-bound due to heavy threading/buffering/locking in the current OSD code.
  • You will probably be a bit disappointed with current Ceph NVMe performance when it comes to single-threaded IOPS. This will improve when Ceph Crimson is ready for use, but that may still take a few years.
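A minimal sketch of the CephFS-for-ISOs idea on Proxmox (the storage name and content types are just examples):

  # Sketch: create an MDS, then a CephFS, and register it as a Proxmox storage.
  pveceph mds create                              # run on each node that should host an MDS
  pveceph fs create --name cephfs --add-storage   # creates the FS and adds the storage entry
  pvesm set cephfs --content iso,vztmpl           # limit it to ISOs and container templates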
 
Thanks for your answer.
I guess I can try to get one more 2 x 10Gbit NIC for the Ceph public network.
I understood what you told me about caching NVMe on top of other NVMe; it probably won't help at all.

At this point I have some questions:
  • You could run 3-4 OSDs per NVMe disk to get more performance out of your drives, since Ceph OSDs are mostly CPU-bound due to heavy threading/buffering/locking in the current OSD code.
How can I do that?
Regarding my 2 x high-performance NVMe, what can I do with those disks?
Regarding WAL/DB disks, how should I size those disks in this kind of environment?

Thank you
 
Drop the 10Gbps RJ45 and use SFP+ only (10 or 25Gbps variants, DAC or fibre). Power consumption and latency are better.
You can use the extra 2 x NVMe disks per node as a separate data pool (a sketch follows below).
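One way to keep those two drives in their own pool is a custom device class plus a matching CRUSH rule; the OSD IDs, class name, pool name and PG counts below are only examples:

  # Sketch: give the high-performance OSDs their own device class and CRUSH rule.
  ceph osd crush rm-device-class osd.30 osd.31            # drop the auto-assigned "nvme" class
  ceph osd crush set-device-class nvme-fast osd.30 osd.31
  ceph osd crush rule create-replicated fast-rule default host nvme-fast
  ceph osd pool create fastpool 128 128 replicated fast-rule
  pvesm add rbd fast-rbd --pool fastpool --content images,rootdir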
 
Sadly, I cannot drop the RJ45 NICs. That hardware has already been bought.
Any advice regarding DB/WAL sizing? Maybe I could use the high-performance NVMe as DB/WAL disks.
 
You can use the (very high?) performance drives as an additional pool. You may mix them in with the other drives but it is recommended to have drives of similar performance in each pool.

In order to have more than 1 OSD per disk you partition the disks and use each partition for the whole OSD (DB+WAL+Store). Depending on the performance of the NVMe and your workload, 2-3, maybe 4 OSDs per disk could yield improvements over 1 per drive. This is however not a configuration I would recommend to new Ceph users. You should probably start with 1 OSD per disk because it is unclear if your hardware setup and workload would provide benefits. Benchmark first so you have something to compare to. Ceph is very complex.
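If you do test multiple OSDs per device, ceph-volume can carve the drive up for you instead of manual partitioning; the device path and OSD count below are placeholders:

  # Sketch: create 2 OSDs on one NVMe device (run per node, per device).
  ceph-volume lvm batch --osds-per-device 2 --report /dev/nvme3n1   # dry run, shows the plan
  ceph-volume lvm batch --osds-per-device 2 /dev/nvme3n1            # actually creates the OSDs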
 
According to this configuration guide (https://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#Server-Tuning), it seems I could use 4 OSDs per NVMe disk. In that case I have to partition each disk. I'll try it this way.
 
