Very Slow iPerf performance from Proxmox VM to VMs on different host

carpenike

New Member
Jul 9, 2019
I started this thread on Reddit as well (can't post the link); hopefully someone has some thoughts!

I'm seeing a strange issue with cross-host VM-to-VM communication. First, my setup:
  • 3x Proxmox hosts (5.4-10)
  • 10 GbE networking
  • Unifi US-16-XG switch
  • QLogic BR-1860 NICs. A single port from each host connected to the switch, no LACP. Configured as NICs (not CNA)
  • Jumbo frames enabled on the Ceph ports, not enabled on the VLAN network
  • Networking configured with Open vSwitch
  • Ceph storage within the Proxmox cluster -- very slow; I suspect it's due to this slow communication between hosts
  • The only network connected to the hosts is the 10 GbE port
  • VMs all use virtio

When using iPerf host-to-host I get good speeds:
(screenshot: Screen Shot 2019-07-08 at 11.50.01 PM.png)

When using iPerf from a VM to its local host, speeds are good too:
(screenshot: Screen Shot 2019-07-08 at 11.54.32 PM.png)

However, going from a VM to a different host in the cluster, results are quite bad:
(screenshot: Screen Shot 2019-07-08 at 11.57.26 PM.png)

VM-to-VM iPerf on the same host is good:
(screenshot: Screen Shot 2019-07-09 at 12.03.06 AM.png)

VM-to-VM iPerf when they're on different hosts is not good:
(screenshot: Screen Shot 2019-07-09 at 12.05.10 AM.png)

Connectivity from outside the network into the VM is good, though (1 Gb line rate):
(screenshot: Screen Shot 2019-07-08 at 11.58.53 PM.png)
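For anyone wanting to reproduce this test matrix, a minimal iperf3 sketch follows. The addresses are hypothetical examples on the vlan20 subnet from the config below; substitute your own endpoints (host↔host, VM↔local host, VM↔remote host, VM↔VM):

```shell
# On the receiving endpoint (host or VM):
iperf3 -s

# From the sending endpoint (example address on the 10.20.0.0/16 vlan):
iperf3 -c 10.20.0.11 -t 10

# Reverse direction, and parallel streams, help separate a single-flow
# limit (offload/MTU issues) from an overall link problem:
iperf3 -c 10.20.0.11 -R
iperf3 -c 10.20.0.11 -P 4
```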

Here's one host's /etc/network/interfaces. They're all the same except for the network interface names:

Code:
# Loopback interface
auto lo
iface lo inet loopback

# Bridge for our bond and vlan interfaces (our VMs will also attach to this bridge)
auto vmbr0
allow-ovs vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports enp129s0f0 vlan20 vlan55
    mtu 9000

allow-vmbr0 vlan20
iface vlan20 inet static
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=20
    ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
    address 10.20.0.10
    netmask 255.255.0.0
    gateway 10.20.0.1
    mtu 1500

# Physical interface for traffic coming into the system. Retag untagged
# traffic into vlan 1, but pass through other tags.
auto enp129s0f0
allow-vmbr0 enp129s0f0
iface enp129s0f0 inet manual
    ovs_bridge vmbr0
    ovs_type OVSPort
    ovs_options tag=1 vlan_mode=native-untagged
    # Alternatively, to also restrict which vlans are allowed through:
    # ovs_options tag=1 vlan_mode=native-untagged trunks=10,20,30,40
    mtu 9000

# Ceph cluster communication vlan (jumbo frames)
allow-vmbr0 vlan55
iface vlan55 inet static
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=55
    ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
    address 10.55.0.10
    netmask 255.255.0.0
    mtu 9000
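Since the bridge and physical port run at MTU 9000 while vlan20 sits at 1500, one thing worth ruling out is an MTU mismatch somewhere on the jumbo-frame path; drops of large frames often look exactly like this (small-packet traffic fine, bulk throughput terrible). A diagnostic sketch, assuming 10.55.0.11 / 10.20.0.11 are a peer host's addresses (hypothetical, derived from the config above):

```shell
# Verify jumbo frames actually traverse the Ceph vlan end-to-end.
# Max ICMP payload = MTU - 20 (IP header) - 8 (ICMP header) = 8972 bytes.
ping -M do -s 8972 -c 3 10.55.0.11

# Same check on the standard-MTU vlan (1500 - 28 = 1472 bytes).
ping -M do -s 1472 -c 3 10.20.0.11

# Confirm the kernel and OVS agree on the MTUs the config requests.
ip link show vlan55 | grep -o 'mtu [0-9]*'
ovs-vsctl list interface vlan55 | grep -i mtu
```

If the 8972-byte ping fails with "message too long" or silently drops, some hop (NIC, OVS port, or switch) is not honoring MTU 9000.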
 
It certainly appears to be related to the 10 GbE NIC. I replaced the 10 GbE NIC with one of the hosts' onboard NICs, connected to the same switch through a copper port. Is there anything in particular that could be configured on the NIC itself?
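Not an answer from the thread, but a common culprit for exactly this symptom (host-to-host fast, VM-to-remote-host slow over virtio/OVS) is hardware offload behavior on the physical NIC. A diagnostic sketch, assuming the 10 GbE port is enp129s0f0 as in the config above; toggling offloads here is an experiment, not a recommendation:

```shell
# Inspect current offload settings on the 10 GbE port.
ethtool -k enp129s0f0 | grep -E 'segmentation|offload'

# As a test, disable the offloads most often implicated in poor
# bridged/OVS guest throughput, then re-run the cross-host VM iperf:
ethtool -K enp129s0f0 tso off gso off gro off lro off

# Re-enable afterwards if it makes no difference:
ethtool -K enp129s0f0 tso on gso on gro on lro on
```

Driver and firmware versions (`ethtool -i enp129s0f0`) are worth recording too, since adapter firmware bugs show up the same way.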
 
The best approach is to use two switches and no OpenvSwitch: one switch for the Proxmox cluster network and the other for the Ceph cluster; the two 10 Gb/s NICs must be physically separated.
Please check this guide; with that setup you should get full network speed.

best regards,
roman
 
Thanks!

In this case, though, I've got 2 VMs running in the cluster and abysmal performance on the 10 GbE NICs whenever a guest's traffic leaves the local vSwitch destined for another host. When I switch from the 10 GbE NIC to a 1 GbE onboard NIC with everything else the same, speeds consistently reach the expected 1 Gb line rate.
 
I know this is an old thread and probably dead, but Carpenike, did you resolve your issue?

I'm running into a very similar issue and would greatly benefit from any resolution you found.

I have two VLANs, one for the node cluster and one for Ceph, all on the same switch... the same Unifi US-16-XG. With a 3-node cluster, I am using six interfaces on the switch. Is this switch the issue?

Again, sorry to revive a dead post, but hoping a resolution was found.

Thanks!


Hi there!

Sorry I won’t be much help... I moved to bare metal and am now running everything in Kubernetes, which fixed my 10 Gb performance problems.
 
