iperf3 retries on VMs

barryyancey

I am running a 3-node PVE cluster on kernel 6.8.12-4. I have a Dell R730XD and two R630s. Two of the boxes have the Intel(R) 10GbE 4P X710 rNDC, and the other has the BRCM 10G/GbE 2+2P 57800 rNDC. iperf3 runs great between my NAS and the PVE hosts (and their containers), but not on the VMs running on those hosts. The throughput is good, but as you can see, the retries are through the roof!

Any ideas???

Performance between the NAS (10.0.1.100) and a PVE host (apollo - 10.0.1.102) is great.

[Screenshot: iperf3 results, NAS → apollo]

Performance between the PVE host (apollo - 10.0.1.102) and a VM (prod-plex - 10.0.1.105) is great. That VM is on the apollo PVE host.

[Screenshot: iperf3 results, apollo → prod-plex VM]

Performance between a VM (prod-plex - 10.0.1.105) and the NAS (10.0.1.100) is poor. The NAS is a Dell R720XD with an Intel(R) 10GbE 4P X710 rNDC.

[Screenshot: iperf3 results, prod-plex VM → NAS, showing high retries]

This is obviously not a switching issue as the PVE hosts and my NAS are communicating perfectly.
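If anyone wants to reproduce the comparison, plain iperf3 runs like these are enough (a sketch using the IPs above; the duration is just an example, not exactly what is in the screenshots):

Bash:
# On the NAS (10.0.1.100): start an iperf3 server
iperf3 -s

# From a PVE host (apollo, 10.0.1.102): baseline host -> NAS test
iperf3 -c 10.0.1.100 -t 30

# From the VM (prod-plex, 10.0.1.105): same test; watch the "Retr" column
iperf3 -c 10.0.1.100 -t 30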

Please see the usage on the PVE hosts.

[Screenshots: resource usage on the three PVE hosts]
 
I've also just started noticing this.

iperf3 shows very high retries when traffic transits between VMs in different VLANs, i.e. traffic leaving a VM in VLAN1, going out to the router, and coming back into a VM in VLAN2 has very high retries. All VMs are Debian-based and use virtio drivers.

iperf3 between VMs in the same VLAN on the same Proxmox host has 0 retries and great speed.

I am using a Mellanox X3 card with SFP+ Twinax 10G cabling.

I can't tell if this was always the case or just came up in recent kernels.


Going to upgrade to a Mellanox X4, upgrade the firmware and see what happens.
 
I found the same issue between my VMs and another host; however, I'm only seeing high retries when SENDING from the VM. Receiving to the VM from another host runs perfectly fine.
 
@john2069 @barryyancey - I've observed similar behaviour with my Debian VMs - specifically when initiating traffic from the VM to another host. Interestingly, running iperf3 in reverse mode (-R) shows no issue, which suggests the bottleneck is on the transmit path from the VM.
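If you want to check whether you're hitting the same asymmetry, comparing a normal run against a reverse run from inside the VM makes it obvious (a sketch; 10.0.1.100 stands in for whatever host is running iperf3 -s):

Bash:
# Normal mode: the VM transmits; this is where the high retry count shows up
iperf3 -c 10.0.1.100 -t 30

# Reverse mode (-R): the server transmits and the VM receives; clean in my case
iperf3 -c 10.0.1.100 -t 30 -R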

What helped significantly was tuning some TCP buffer and congestion control settings inside the VM itself. I'm not entirely sure if this is a true fix, as I don't fully understand the root cause of the issue - but it has definitely improved performance in my case:

Bash:
# Raise the maximum socket receive/send buffer sizes to 12 MB
sudo sysctl -w net.core.rmem_max=12582912
sudo sysctl -w net.core.wmem_max=12582912
# TCP autotuning ranges: min / default / max (bytes)
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 12582912"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 12582912"
# Switch to BBR congestion control (requires the tcp_bbr kernel module)
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
# Enable TCP path MTU probing (value 1: only after a black hole is detected)
sudo sysctl -w net.ipv4.tcp_mtu_probing=1

If this works for you too, you could persist the changes in a .conf file under /etc/sysctl.d/
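For example, something along these lines (the filename is arbitrary; the values mirror the commands above):

Bash:
# Write the settings to a drop-in file (name it whatever you like)
cat <<'EOF' | sudo tee /etc/sysctl.d/99-vm-network-tuning.conf
net.core.rmem_max = 12582912
net.core.wmem_max = 12582912
net.ipv4.tcp_rmem = 4096 87380 12582912
net.ipv4.tcp_wmem = 4096 65536 12582912
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_mtu_probing = 1
EOF

# Apply all sysctl config files without rebooting
sudo sysctl --system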
 
To close out my issue reported above: it turned out to be my MikroTik router, which has insufficient packet buffering at the interface hardware level. I enabled flow control between the router and switch, which has helped significantly with the packet drops.
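On the Linux side, if you want to check whether flow control (pause frames) is actually negotiated on the host NIC, ethtool can show and set it (a sketch; replace enp1s0f0 with your interface name, and note that pause statistics counters are driver-dependent):

Bash:
# Show the current flow-control (pause frame) settings on the interface
ethtool -a enp1s0f0

# Enable RX/TX pause frames on the NIC (only effective if the link partner supports it)
sudo ethtool -A enp1s0f0 rx on tx on

# Some drivers expose pause frame counters in the NIC statistics
ethtool -S enp1s0f0 | grep -i pause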