Sizing for an HA cluster with Ceph

Do let me know the inter-VM performance you get; try an iperf3 run between 2 VMs: a) on the same host, b) between VMs on different hosts.

We have a similar setup with dual 100G for Ceph and 10G for LAN. We too run EPYC 64-core x2 = 128 cores and 2TB of RAM on each host. We get only 2-3 Gbps for inter-VM traffic, which is sad.
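For reference, a minimal sketch of the suggested test, assuming two Linux VMs with VirtIO NICs; the IP address below is a placeholder:

```bash
# On VM 1: start the iperf3 server
iperf3 -s

# On VM 2 (same host as VM 1): run a 30-second test against VM 1's address
iperf3 -c 10.0.0.11 -t 30

# Repeat after migrating VM 2 to a different host; -P 4 adds parallel
# streams to rule out a single-TCP-stream limit
iperf3 -c 10.0.0.11 -t 30 -P 4
```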
 
Your drawing - do I understand it correctly that your servers will run the VMs with the OS, and the data drives will live on the storage?
 
Do let me know the inter-VM performance you get; try an iperf3 run between 2 VMs: a) on the same host, b) between VMs on different hosts.

We have a similar setup with dual 100G for Ceph and 10G for LAN. We too run EPYC 64-core x2 = 128 cores and 2TB of RAM on each host. We get only 2-3 Gbps for inter-VM traffic, which is sad.
What do you expect? In the mentioned configuration the network is the bus system, and the CPU and RAM resources get their data over the network. For ESX you can say that 1 VM needs about 1 Gbit of network bandwidth to perform acceptably (it is just one factor); I guess Proxmox/KVM has similar numbers. I do not understand why somebody still uses such a config. A while ago I compared a NetApp storage setup against a local storage configuration, and the performance difference was roughly 11x (as far as I can remember).
With Ceph the data redundancy is already given; for CPU and RAM redundancy you need additional pieces of hardware.
My cluster is just 4 nodes. I have no LACP, because if a network card fails I need to pull the server anyway. The switchover to a current copy of a VM from the failed node takes roughly 1.5 minutes (I am sure it can be tuned down to 30 seconds), and everything is running again (tested and it worked flawlessly).

Somebody here has a 4000-VM cluster running on a 10 Gbit network, and he says it runs smoothly.
 
Do let me know the inter-VM performance you get; try an iperf3 run between 2 VMs: a) on the same host, b) between VMs on different hosts.

We have a similar setup with dual 100G for Ceph and 10G for LAN. We too run EPYC 64-core x2 = 128 cores and 2TB of RAM on each host. We get only 2-3 Gbps for inter-VM traffic, which is sad.
BTW: 100 Gbit/s is 12500 MB/s. That means if you run 10 VMs, each gets 1250 MB/s of bandwidth; with 20 VMs, 625 MB/s; with 50 VMs, 250 MB/s (which would be 2 Gbit/s) - these are not very exact numbers, but they show the direction.
I have seen datacenters with storage systems costing 10 million dollars where VM copies run at 70 MB/s - the reason: overloaded.
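To make that arithmetic explicit, a tiny sketch of the same back-of-envelope division (it assumes all VMs transfer at full speed at the same time, which is the worst case):

```bash
# Share of a 100 Gbit/s link (~12500 MB/s) per concurrently transferring VM
LINK_MBS=12500
for N in 10 20 50; do
    echo "$N VMs -> $(( LINK_MBS / N )) MB/s each"
done
```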
 
The network is not your problem. CPU overprovisioning has a much bigger impact.
I have a customer with 1250 VMs and a 10 Gbit network. It runs smoothly.
We have added many additional CPU cores to the cluster, but next year we will change the network to 100 Gbit.
@pille99 When a node fails, you always need about 2 minutes. The HA daemon waits 60 seconds to make sure the failed node is not coming back. After that, an additional 60-second timer runs to make sure all VMs on the isolated node are stopped or killed.
 
Do let me know the inter-VM performance you get; try an iperf3 run between 2 VMs: a) on the same host, b) between VMs on different hosts.

We have a similar setup with dual 100G for Ceph and 10G for LAN. We too run EPYC 64-core x2 = 128 cores and 2TB of RAM on each host. We get only 2-3 Gbps for inter-VM traffic, which is sad.
What type of hard disks are you using?
What type of network card do these VMs have?
I can saturate a 25 Gbit link with VirtIO NICs and all-NVMe storage.
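As a hedged sketch of what that looks like on the Proxmox side (VM ID 101, bridge vmbr0 and the guest interface name are placeholders; the multiqueue setting is optional but helps with parallel streams):

```bash
# Attach a paravirtualized VirtIO NIC with 4 queues to VM 101
qm set 101 --net0 virtio,bridge=vmbr0,queues=4

# Inside the guest: confirm the queue count is visible
ethtool -l eth0
```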
 
We have a similar setup with dual 100G for Ceph and 10G for LAN. We too run EPYC 64-core x2 = 128 cores and 2TB of RAM on each host. We get only 2-3 Gbps for inter-VM traffic, which is sad.
While an inter-VLAN traffic test often isn't a useful metric, I'd be curious about your test methodology. Please include the exact benchmark string, VM core count, bridge type (Linux/OVS), link speed to the router, etc.
 
Sorry for the delayed response... the disks are 6.4TB WD SN640 NVMe PCIe Gen 3. We have about 8 disks per node * 3 nodes = 24 NVMe drives in the Ceph cluster, plus Mellanox CX455/456 100G cards, so it clearly is some driver issue or something of the sort.

A new cluster is being planned with 8x HP DL385 nodes with the config below:
AMD EPYC 7002 (64 cores * 2 CPUs) per node with 1TB RAM each (64GB * 16 DIMMs), a Mellanox CX456 dual 100G card (one port to each 100G switch), and 3x 15.36TB WD SN650 PCIe Gen 4 NVMe SSDs. Any suggestions on getting optimum performance out of the cluster?

Shall report benchmarks.

Any suggestions on enabling RDMA / DPDK etc. to improve performance?

The 100G will be dedicated to the Ceph network and the 10G to inter-VM / LAN / WAN communication.

Thanks in advance
 
BTW: 100 Gbit/s is 12500 MB/s. That means if you run 10 VMs, each gets 1250 MB/s of bandwidth; with 20 VMs, 625 MB/s; with 50 VMs, 250 MB/s (which would be 2 Gbit/s) - these are not very exact numbers, but they show the direction.
I have seen datacenters with storage systems costing 10 million dollars where VM copies run at 70 MB/s - the reason: overloaded.
Hi, having 200 VMs does not mean that all of them are transferring huge amounts of data at the same time. It was during benchmarking with iperf3 that we saw only 2-3 Gbps, which I felt was low and could be better.
 
While an inter-VLAN traffic test often isn't a useful metric, I'd be curious about your test methodology. Please include the exact benchmark string, VM core count, bridge type (Linux/OVS), link speed to the router, etc.
Hi Alex, it is inter-VM traffic I am referring to, not inter-VLAN traffic; the two are different. Both VMs in this case are in the same VLAN.
 
If you want to max out your NVMe drives with Ceph, you need to create multiple OSDs per NVMe disk:
https://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning

Current OSD performance is around 30-40k 4k IOPS per OSD.
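If you want to sanity-check that on your own pool, a rough sketch with rados bench (the pool name is a placeholder, and this measures the cluster as a whole, not a single OSD):

```bash
# 60-second 4 KiB write benchmark with 16 concurrent ops; keep the objects
rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup

# Random 4 KiB reads against the objects written above, then clean up
rados bench -p testpool 60 rand -t 16
rados -p testpool cleanup
```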

4 OSDs per NVMe is good. Be careful that you need around 4GB of memory per OSD.
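A hedged sketch of the approach from the link above, done with ceph-volume (device paths are placeholders, and the command will consume the listed devices, so double-check them first):

```bash
# Split each NVMe device into 4 OSDs instead of 1
ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1 /dev/nvme1n1

# Budget roughly 4 GiB of RAM per OSD daemon (this is also the default target)
ceph config set osd osd_memory_target 4294967296
```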

And if you need to do a lot of small IOPS, you need a lot of cores (try to use the highest frequencies possible for lower latency).

If your workload is more large-block (video streaming for example: fewer IOPS but higher throughput), the CPU is less critical.

The CPU is the main bottleneck depending on the number of IOPS (not their size), because the CRUSH algorithm needs to be computed both on the client side, in QEMU, and on the OSD side, for OSD->OSD replication.


Personally, I'm running my Ceph clusters with 2x25Gb or 2x40Gb, with web/database traffic, and I'm far from reaching the full bandwidth because I'm doing small IOPS. Only recovery/rebalancing can really push the network throughput to the max.

Reaching 2x100Gb is really not so easy without fine tuning, core pinning, etc. (even more difficult with dual-socket && NUMA), as you can hit the internal memory bus limit with copies going NVMe->CPU->NIC.
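A small sketch of how one might check that NUMA point (interface and device names are placeholders; sysfs paths can vary slightly between kernels):

```bash
# NUMA node that owns the 100G NIC
cat /sys/class/net/ens1f0np0/device/numa_node

# NUMA node that owns an NVMe controller
cat /sys/class/nvme/nvme0/device/numa_node

# Overall NUMA layout of the host
numactl --hardware
```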
 
Getting maximum performance for a single VM is not that easy and does not correspond to the basic idea of Ceph.
Ceph is meant for massive scale-out.
Each VM is an initiator and each OSD is a target; with vSAN the host is the target, which is better optimized for single workloads.
If you have a lot of VMs and a lot of OSDs, preferably distributed over a lot of hosts, it scales really well.
If you really only want to optimize throughput for one VM, then Ceph is not the optimal product and you have to resort to the tunings described above.
I like to use AMD single-socket systems to get around the NUMA problem, and I only use NVMe.
Furthermore, I only use 25 or 100 Gbit networking, because the latencies are much better than with 10 / 40 Gbit.
If you have more than 3 nodes and many VMs, you can easily saturate 100 Gbit.
 
If you have more than 3 nodes and many VMs, you can easily saturate 100 Gbit.
That hasn't been my experience. The only time I can get anywhere NEAR saturation is during a major fault when a bunch of OSDs are out. I would be curious to see both how you're measuring link utilization and what the workload is for those saturation events.
 
I am only talking about the Ceph public and Ceph cluster networks here.
I will leave the Proxmox cluster network (Corosync) and VM traffic out of consideration for now.
So, after testing, what we found was that Ceph is not going to work for us, at least as part of Phase 1.

Phase 1 will be the test phase and will have 3 hosts with AMD EPYC 9004-series processors (96 cores each) and 768GB of RAM.

For Phase 2, we will be adding 4 more hosts for a total of 7. At that point, we will reevaluate Ceph.
 
So, after testing, what we found was that Ceph is not going to work for us, at least as part of Phase 1.

Phase 1 will be the test phase and will have 3 hosts with AMD EPYC 9004-series processors (96 cores each) and 768GB of RAM.

For Phase 2, we will be adding 4 more hosts for a total of 7. At that point, we will reevaluate Ceph.
Hello, can I ask about your considerations? For what reason? From your point of view, what are the disadvantages? Which solution are you going with, and what was your main reason for choosing it?
Don't get me wrong / I don't want to question your decisions. Since you manage a bigger environment, it's always nice to learn something new and see things from different points of view. Did you check vSAN?
 
