3 bridges for management, Ceph and SDN/VMs over 1 bond with 2x physical 100 GbE NICs

Apr 19, 2022
Hello,

We are currently evaluating Proxmox as an alternative to our Microsoft HCI infrastructure, and we need to run it in an HA setup (two stackable switches, so that one can fail without causing a system failure). Our nodes have 2x physical 100 GbE NICs, which we would like to use in a team mode (bond?) as the base for multiple virtual network connections (vmbr). So we have Ceph, management and SDN for VMs, each of which needs a bridge, but all of it should run over a single network connection (a bond of the 2x 100 GbE NICs).

Microsoft implements this via a vSwitch that is created from the two network cards on each host (similar to a bond); any number of vNICs are then created on the host (not usable for VMs, only for host connections), which use the vSwitch as their connector to the physical network. You then create one vNIC for storage/S2D (Ceph) and one for management (vmbr), which also serves as the gateway for the SDN infrastructure. What is the best-practice approach in Proxmox for such an HA HCI infrastructure with only two physical network cards?

Network topology: As you can see in the schematic in the attachment, we use 2x 100 GbE NICs per node and run the management LAN (host connection), two storage networks and the SDN over them. With Microsoft, the two pNICs are combined into a switch-independent team via a vSwitch. Several vNICs are then created on this vSwitch, which are used for the individual functions (SDN + S2D). Since the video tutorials only cover lab environments without an HA network infrastructure, I need to evaluate in advance whether Proxmox can be used in our topology.

Thank you so far. Best regards!
 

Attachments: network topology schematic (referenced above)

Our nodes have 2x physical 100 GbE NICs, which we would like to use in a team mode (bond?)
If your stackable switches support LACP, use that. You can then use the bond as the base for a bridge and create VLANs on top, so that you can separate everything.
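As a rough sketch (the interface names ens1f0np0/ens1f1np1, VLAN IDs and IP addresses below are only placeholders, not taken from your setup), such a bond plus VLAN-aware bridge could look like this in /etc/network/interfaces:

auto bond0
iface bond0 inet manual
    bond-slaves ens1f0np0 ens1f1np1
    bond-mode 802.3ad                  # LACP; needs MLAG across the switch pair
    bond-xmit-hash-policy layer3+4
    bond-miimon 100

auto vmbr0
iface vmbr0 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

# management IP of the node on VLAN 10
auto vmbr0.10
iface vmbr0.10 inet static
    address 10.0.10.11/24
    gateway 10.0.10.1

# Ceph public network on VLAN 20
auto vmbr0.20
iface vmbr0.20 inet static
    address 10.0.20.11/24

# Ceph cluster network on VLAN 30
auto vmbr0.30
iface vmbr0.30 inet static
    address 10.0.30.11/24

VMs and SDN zones then attach to vmbr0 with their own VLAN tags.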

So we have Ceph, management and SDN for VMs, each of which needs a bridge, but all of it should run over a single network connection (a bond of the 2x 100 GbE NICs).
So, build VLANs. Is your management able to run over 100 GbE? I haven't seen that so far; most BMCs have 0.1 or 1 GbE. What hardware is that?

What is the best-practice approach in Proxmox for such an HA HCI infrastructure with only two physical network cards?
Open vSwitch is a similar SDN solution that may help you. If you don't want to use that, you can also use the plain Linux network stack, which is capable of almost everything there is.

Is your system currently using any type of traffic control / bandwidth control?

For a good proof of concept, just use 3 old servers with two NICs each and set everything up. I don't know if you can simulate everything in VMs, but you can try and play around with it. PVE runs great inside other hypervisors (better with nested virtualization).
 
Thank you for the fast and useful response.

Our stackable switches use MLAG for LACP, so I guess that would work fine.

The VLAN solution should be practical if we can use separate IP addresses. I will try it and get back to you.
We are using Arista DCS-7170-64C switches and Mellanox ConnectX-5 NICs. Our first test on 4x Intel M50CYP1UR212 nodes seems to work, but we are just at the beginning.
For traffic control we use RoCE in our MS infrastructures. RoCE would be great because it is well supported by the switches and NICs we use. Does RoCE work on Proxmox? What is the best practice to use it? Is there an alternative?
 
For traffic control we use RoCE in our MS infrastructures. RoCE would be great because it is well supported by the switches and NICs we use. Does RoCE work on Proxmox? What is the best practice to use it? Is there an alternative?
From what I read, yes, but I haven't used it. We don't limit the traffic in our setups.
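If you want to check whether the ConnectX-5 cards expose an RDMA/RoCE device on the host at all, a quick sanity check could look roughly like this (assuming rdma-core and the ibverbs utilities are installed; exact package names may differ):

apt install rdma-core ibverbs-utils           # userspace RDMA tooling
rdma link show                                # should list one mlx5 device per port
ibv_devinfo | grep -E 'hca_id|link_layer'     # "link_layer: Ethernet" means RoCE-capable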
 
I have now configured the setup as described, but I am getting very poor storage performance in the VM: 2,600 MB/s read and 1,400 MB/s write on a Windows Server 2022 VM. All drivers and VM settings are set to VirtIO as recommended. The Mellanox 100 GbE interfaces are supported and correctly installed in the nodes.
With the same hardware we got 5,500 MB/s read and 3,500 MB/s write with the MS S2D storage cluster.

We haven't used RoCE and we are on the standard MTU of 1500. Maybe that's the problem?
 
I have now configured the setup as described, but I am getting very poor storage performance in the VM: 2,600 MB/s read and 1,400 MB/s write on a Windows Server 2022 VM. All drivers and VM settings are set to VirtIO as recommended. The Mellanox 100 GbE interfaces are supported and correctly installed in the nodes.
With the same hardware we got 5,500 MB/s read and 3,500 MB/s write with the MS S2D storage cluster.

We haven't used RoCE and we are on the standard MTU of 1500. Maybe that's the problem?
Did you benchmark the local reads/writes inside the VM? I am assuming the read/write figures you give are transfer speeds via the 100 Gig Mellanox.
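For numbers that are easier to compare, something like fio on a Linux test VM or directly on the host is more telling than CrystalDiskMark; the file path, size and parameters below are only placeholders:

# sequential read, 1M blocks, queue depth 32; use a test size well above the VM's RAM
fio --name=seqread --filename=/root/fio.test --size=64G --bs=1M --rw=read --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --time_based

# same for sequential write
fio --name=seqwrite --filename=/root/fio.test --size=64G --bs=1M --rw=write --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --time_based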
 
You're right, I did the benchmark inside the VM via CrystalDiskMark. The Mellanox ConnectX-5 [EX] interfaces were detected by the Debian OS and the transfer speed on the host is reported as 100000. The VM only shows 10000, but that shouldn't affect the storage, which is managed on the host side.
I set up VLANs for the public and cluster (private) networks of the Ceph storage, so each of them runs on a dedicated VLAN. The VLANs sit on the one bridge on top of the bond, which holds the two NICs via LACP. On the other side we use Arista DCS-7170-64C switches with IPL and MLAG.
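For reference, the corresponding part of /etc/pve/ceph.conf looks roughly like this (the subnets here are placeholders, not our actual VLAN subnets):

[global]
    # Ceph public network (client/MON traffic), e.g. VLAN 20
    public_network = 10.0.20.0/24
    # Ceph cluster network (OSD replication traffic), e.g. VLAN 30
    cluster_network = 10.0.30.0/24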
 
Have you benchmarked the link speeds from host to host,
e.g. using this method? https://fasterdata.es.net/performan...ubleshooting-tools/iperf/multi-stream-iperf3/
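For example (iperf3 is single-threaded, so several instances on different ports run in parallel; host names and ports below are placeholders):

# on the receiving node: one server per port
iperf3 -s -p 5101 &
iperf3 -s -p 5102 &
iperf3 -s -p 5103 &

# on the sending node: one client per port, then add the results together
iperf3 -c node2 -p 5101 -t 30 &
iperf3 -c node2 -p 5102 -t 30 &
iperf3 -c node2 -p 5103 -t 30 &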

did you use "linux" bonds or did you use openvswitch based bonds ?

I ran into a similar issue a while back:
https://forum.proxmox.com/threads/mellanox-connectx-5-en-100g-running-at-40g.106095/#post-463271 (this was using ConnectX-5 in a 3-node mesh without switches, though)

We haven't used RoCE and we are on the standard MTU of 1500. Maybe that's the problem?
With an MTU of 1500 your frames have a 1460-byte payload.

To achieve 2,600 MB/s you are basically blasting about 1.78 million packets per second through your network connection.

With an MTU of 9000 your payload size is 8960 bytes, which makes 2,600 MB/s achievable with roughly 290k packets per second.

See https://en.wikipedia.org/wiki/Jumbo_frame for reference.
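If you want to try jumbo frames, a rough sketch (every hop, including the switch ports, must accept the larger MTU; the target IP is a placeholder): add mtu 9000 to bond0 and vmbr0 (and any VLAN interfaces) in /etc/network/interfaces, then verify end to end with a non-fragmenting ping:

# 8972 = 9000 - 20 bytes IP header - 8 bytes ICMP header
ping -M do -s 8972 10.0.20.12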
 
Hi Wolff,

We are using Linux bonds... is it better to use openvswitch-based bonds?
Ceph is running on the same network as the cluster, on the 100 GbE network cards.

I have installed cpupower, set the following configuration and ran the iperf test:

apt install linux-cpupower                 # install the cpupower tool
cpupower idle-set -d 2                     # disable the deeper C-state (lower latency)
cpupower frequency-set -g performance      # use the performance CPU governor

I also ran a benchmark test on the PVE host (benchmark.png).

I also ran a benchmark test in the VM; the VM has write-back cache enabled.

The Arista switches have the MTU maxed out at 9214, and the network interfaces (vmbr0 and bond0) on the hosts are set to MTU 9000.

Any suggestions on how to increase read/write performance?
 

Attachments

  • iperf3-test.png
  • benchmark host.png
  • benchmark host 2.png
  • benchmach VM.png
I am in no way an expert on the subject matter.
All I know is this:

1. When doing iperf tests you can find yourself CPU-bottlenecked unless you use multiple instances (see https://fasterdata.es.net/performan...ubleshooting-tools/iperf/multi-stream-iperf3/).

2. For us, Open vSwitch delivered much better performance with AMD CPUs and 100G Mellanox ConnectX-5s, but this might be due to our mesh setup (we don't use a switch in between).

3. I have always felt that Open vSwitch is less CPU-intensive than the standard "Linux networking" (and easier to read, but that is subjective). I cannot definitively answer this for you. The easiest way to tell is to make a backup copy of your network config and set it up using Open vSwitch; see the sketch below. My experience covers Proxmox >= 5.0 up to the current version, and for me the performance has always been better with Open vSwitch.
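As a rough sketch of what an Open vSwitch based equivalent could look like in /etc/network/interfaces (assuming the openvswitch-switch package is installed; NIC names, VLAN tag and addresses are placeholders):

auto vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond0 mgmt0

auto bond0
iface bond0 inet manual
    ovs_type OVSBond
    ovs_bridge vmbr0
    ovs_bonds ens1f0np0 ens1f1np1
    ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast

# internal port carrying the host's management IP, tagged with VLAN 10
auto mgmt0
iface mgmt0 inet static
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=10
    address 10.0.10.11/24
    gateway 10.0.10.1

VMs still attach to vmbr0 with their VLAN tags, as with the Linux bridge setup.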



Regarding your response:

A) I cannot see whether you are in fact using an AMD CPU (the clock speed would be interesting; we have 2.6 GHz without boost).

B) I also cannot tell whether you are CPU-core-bottlenecked during the iperf test; a simple htop readout during an iperf run will tell you.

C) I cannot tell whether you ran iperf on multiple threads and aggregated the results, as you cut the commands off the screenshot. Even if you did (with 3 streams), you would only end up at around 55-60 Gbit/s based on your screenshot.

D) You should ALWAYS split cluster communication onto a separate network, see: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network. It is more about Corosync needing low latency than about Corosync interfering with your Ceph communication.

E) The last time I worked with Ceph was on Proxmox 5.0, I think (sometime around 2015/16). If I remember correctly, the speed a VM sees is determined by the number of MONs and OSDs present in the network, as well as by how fast the drives are in relation to the network. Without knowing the number of MONs, the number of OSDs, the type of disks used and the network setup, I have no way of telling whether the results of your CrystalDiskMark run are to be expected or not.
In any case, unless you have only 512 MB of RAM assigned to that VM, your benchmark set is too small (you always want to use at least 2x the VM's RAM).
 