iperf3 slower in one direction

Springtime

Hello,
I have a 3-node PVE cluster, running on two DL380 Gen10 (Nodes 1 and 2) and one DL380 Gen9 (Node 3).
When I run the iperf3 server on Node 1 or 2 and the client on Node 3, I get the full 10 Gbit speed, about 1.10 GBytes per second.
If I use Node 3 as the server and Node 1 or 2 as the client, I only get around 700 MB/s, or roughly 70%.
The connection to the switch stack is LACP, so each node has one cable to each switch and the two ports are aggregated; in Proxmox the settings are also the same on every node, with bond0 and a Linux bridge.
Basically all 3 nodes are configured the same way.
Also, when checking with netstat -i, I am seeing some RX-ERR on the third node.
I checked the obvious things like swapping the cable and running with only one cable and/or port on the NIC.
I would suspect the NIC, if it weren't for the fact that I get full 10G in one direction and not in the other.
Do you have any recommendations on what I could try?
Thanks
 
Please post the iperf (iperf3) commands you are running on the server and clients.

Please share your /etc/network/interfaces file.

What model of NICs are you using on each host?

What model are the switches you are connecting your PVE hosts to?
 
iperf3 -B 10.xxx.xxx.61 -s (IP is the local NIC, there is only one network)
iperf3 -B 10.xxx.xxx.63 -c 10.xxx.xxx.61 -R (without -R I get full speed; with -R I get about 70%, and the result is the same if I reverse the server/client roles)

/etc/network/interfaces
Node1:
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

auto ens2f0np0
iface ens2f0np0 inet manual

auto ens2f1np1
iface ens2f1np1 inet manual

iface eno5np0 inet manual

iface eno6np1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves ens2f0np0 ens2f1np1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
#LACP-Bridge-to-CSW

auto VLANBridge
iface VLANBridge inet static
        address 10.xxx.xxx.61/24
        gateway 10.xxx.xxx.254
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
#All VLAN Bridge

Node3:
Code:
auto lo
iface lo inet loopback

auto eno49
iface eno49 inet manual

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

auto eno50
iface eno50 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves eno49 eno50
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
#LACP-Bridge-to-CSW

auto VLANBridge
iface VLANBridge inet static
        address 10.xxx.xxx.63/24
        gateway 10.xxx.xxx.254
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
#All VLAN Bridge

Node1, HP DL380 Gen10, HPE Eth 10Gb 2p 535T.
Node3, HP DL380 Gen9, HP FlexFabric 10Gb 2-port 533FLR-T.

Switches: two HPE 1950 (JH295A) in a stack.
 
Thanks for sharing this information.

Please run the following variations on the iperf3 command:

Code:
# Node1
iperf3 -B 10.xxx.xxx.61 -s

# Node3 with extra option "-P 8"
iperf3 -B 10.xxx.xxx.63 -c 10.xxx.xxx.61 -P 8
iperf3 -B 10.xxx.xxx.63 -c 10.xxx.xxx.61 -P 8 -R

# Node3 with extra options "-P 8 -w 64k"
iperf3 -B 10.xxx.xxx.63 -c 10.xxx.xxx.61 -P 8 -w 64k
iperf3 -B 10.xxx.xxx.63 -c 10.xxx.xxx.61 -P 8 -w 64k -R

If these give the same results as before, then I would install bmon on both hosts.

While the tests are running, open bmon in a new window with: bmon -b. You might want to add -t 600 to the iperf3 command to keep it running for longer so you can inspect what is going on with bmon.
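
For example, keeping the same roles as above (a sketch; bmon is in the Debian repos, apt install bmon):

Code:
# Node1: server, as above
iperf3 -B 10.xxx.xxx.61 -s

# Node3: same test, but run for 10 minutes so there is time to watch the links
iperf3 -B 10.xxx.xxx.63 -c 10.xxx.xxx.61 -P 8 -t 600 -R

# second terminal on each node: live per-interface throughput in bytes
bmon -b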

Current Working Theories

The CPU at one end is not able to keep up with iperf3. Using -P and -w can help with that. This is less likely, but easy enough to test.

The other theory is that you have an issue with a particular link. You are using layer2+3; I assume you are also using the same hash policy on the switches. This means the traffic will always go over the same link in a particular direction. The link is chosen by the sender, so each direction may end up on a different physical link. When you reverse the traffic, it may be using a link that has an issue.

With bmon, you will be able to see which of the two LAG members the traffic is using and whether the problem follows a specific link. If you identify a link that might be the issue, disable or unplug it and run the test again; the traffic will be forced over the remaining link. Again, watch bmon.
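
For example, to force everything over eno50 on Node3 for a test run (re-enable the link afterwards):

Code:
# Node3: temporarily take one LAG member down; the bond keeps running on eno50
ip link set eno49 down

# ... rerun the iperf3 tests and watch bmon ...

# bring the link back when finished
ip link set eno49 up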

Note: I am not aware of any issues with the NICs you are using.

Consideration

After you figure out the issue, I recommend you change from layer2+3 to layer3+4 on both your PVE hosts and your switches. With layer2+3 your traffic will always go down the same link of the LAG because it is hashed based on MAC and IP addresses, and traffic between your nodes will always have the same MAC and IP addresses.

If you use layer3+4 on your hosts and switches, the hash will also consider the port numbers. Because most TCP traffic has a random source port, this spreads your traffic over the two LAG members, giving you better load balancing.
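
On the PVE side that is a one-line change in the bond0 stanza, for example on Node3 (a sketch based on the config you posted; the switch side must be changed to match, then apply with ifreload -a or a reboot):

Code:
auto bond0
iface bond0 inet manual
        bond-slaves eno49 eno50
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4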
 
Hm, I believe there is a difference. Running:
iperf3 -B 10.x.x.63 -c 10.x.x.61 -P 8 -R
and
iperf3 -B 10.x.x.63 -c 10.x.x.61 -P 8 -w 64k -R
on Node3 actually results in full 10G speed, if I am interpreting correctly.
Code:
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   8.00-9.00   sec  75.7 MBytes   635 Mbits/sec               
[  7]   8.00-9.00   sec  76.3 MBytes   640 Mbits/sec               
[  9]   8.00-9.00   sec   225 MBytes  1.89 Gbits/sec               
[ 11]   8.00-9.00   sec   222 MBytes  1.86 Gbits/sec               
[ 13]   8.00-9.00   sec  75.9 MBytes   637 Mbits/sec               
[ 15]   8.00-9.00   sec   118 MBytes   988 Mbits/sec               
[ 17]   8.00-9.00   sec   220 MBytes  1.84 Gbits/sec               
[ 19]   8.00-9.00   sec   109 MBytes   912 Mbits/sec               
[SUM]   8.00-9.00   sec  1.09 GBytes  9.41 Gbits/sec               
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   9.00-10.00  sec  76.1 MBytes   638 Mbits/sec               
[  7]   9.00-10.00  sec  76.5 MBytes   642 Mbits/sec               
[  9]   9.00-10.00  sec   225 MBytes  1.89 Gbits/sec               
[ 11]   9.00-10.00  sec   221 MBytes  1.85 Gbits/sec               
[ 13]   9.00-10.00  sec  75.7 MBytes   635 Mbits/sec               
[ 15]   9.00-10.00  sec   119 MBytes   997 Mbits/sec               
[ 17]   9.00-10.00  sec   219 MBytes  1.84 Gbits/sec               
[ 19]   9.00-10.00  sec   108 MBytes   908 Mbits/sec               
[SUM]   9.00-10.00  sec  1.09 GBytes  9.40 Gbits/sec               
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   804 MBytes   675 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   804 MBytes   674 Mbits/sec                  receiver
[  7]   0.00-10.00  sec   795 MBytes   667 Mbits/sec    0             sender
[  7]   0.00-10.00  sec   795 MBytes   667 Mbits/sec                  receiver
[  9]   0.00-10.00  sec  2.16 GBytes  1.86 Gbits/sec    0             sender
[  9]   0.00-10.00  sec  2.16 GBytes  1.86 Gbits/sec                  receiver
[ 11]   0.00-10.00  sec  2.12 GBytes  1.82 Gbits/sec    0             sender
[ 11]   0.00-10.00  sec  2.12 GBytes  1.82 Gbits/sec                  receiver
[ 13]   0.00-10.00  sec   790 MBytes   663 Mbits/sec    0             sender
[ 13]   0.00-10.00  sec   790 MBytes   663 Mbits/sec                  receiver
[ 15]   0.00-10.00  sec  1.18 GBytes  1.02 Gbits/sec    0             sender
[ 15]   0.00-10.00  sec  1.18 GBytes  1.02 Gbits/sec                  receiver
[ 17]   0.00-10.00  sec  2.07 GBytes  1.78 Gbits/sec    0             sender
[ 17]   0.00-10.00  sec  2.07 GBytes  1.78 Gbits/sec                  receiver
[ 19]   0.00-10.00  sec  1.06 GBytes   910 Mbits/sec    0             sender
[ 19]   0.00-10.00  sec  1.06 GBytes   910 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  10.9 GBytes  9.39 Gbits/sec    0             sender
[SUM]   0.00-10.00  sec  10.9 GBytes  9.39 Gbits/sec                  receiver
I would interpret that as network throughput being okay (except I am wondering why it doesn't reach 10G with a single stream); however, it still doesn't really explain why I would have constant RX-ERR.
Code:
root@s02p00a3103:~# netstat -i | column -t
Kernel      Interface  table                                                              
Iface       MTU        RX-OK      RX-ERR  RX-DRP  RX-OVR  TX-OK      TX-ERR  TX-DRP  TX-OVR  Flg
VLANBridge  1500       16064722   0       0       0       12233259   0       0       0       BMRU
bond0       1500       115542201  52077   0       52077   133630932  0       0       0       BMmRU
eno49       1500       49818473   25687   0       25687   73476997   0       0       0       BMsRU
eno50       1500       65723728   26390   0       26390   60153935   0       0       0       BMsRU
lo          65536      822145     0       0       0       822145     0       0       0       LRU


All this troubleshooting is actually to make the network as stable and as good as possible for Ceph.
 
I would interpret that as network throughput being okay (except I am wondering why it doesn't reach 10G with a single stream),
Yes, it looks good. You would need to dig deeper; it might be hardware related.

still doesn't really explain why I would have constant RX-ERR
It does seem high. Might want to look into that.
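
For example, the detailed NIC counters usually show what kind of receive errors they are (run on Node3; the grep patterns are just a starting point):

Code:
# per-NIC statistics - look for CRC, alignment, missed or buffer related counters
ethtool -S eno49 | grep -iE 'err|drop|crc'
ethtool -S eno50 | grep -iE 'err|drop|crc'

# kernel messages mentioning the interfaces
dmesg | grep -i -e eno49 -e eno50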
 
Would there be a need to research a driver update in Proxmox for the 533FLR-T NIC? The firmware is up to date. The other option, to save myself possibly hours of troubleshooting, is to just order the same NIC the other servers have and be done with it. 99€...
 
Would there be a need to research a driver update in Proxmox for the 533FLR-T NIC? The firmware is up to date. The other option, to save myself possibly hours of troubleshooting, is to just order the same NIC the other servers have and be done with it. 99€...
I would try updating the drivers first.
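
Before hunting for vendor packages, it is worth checking which driver and firmware the kernel is currently using, for example:

Code:
# driver name, driver version and firmware version in use for the interface
ethtool -i eno49

# PCI devices and the kernel module bound to each NIC
lspci -nnk | grep -A3 -i ethernet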
 
Sorry, but I am totally failing to do this correctly, it seems.
The only drivers provided through HPE are for Windows, VMware, Red Hat, or SUSE.
I tried the Red Hat ones, converting them with alien to .deb and attempting to install with dpkg -i.
 
Sorry, but I am totally failing to do this correctly, it seems.
The only drivers provided through HPE are for Windows, VMware, Red Hat, or SUSE.
I tried the Red Hat ones, converting them with alien to .deb and attempting to install with dpkg -i.

Ah. Often see that with Dell and HP's older hardware. They released the hardware back before PVE was cool. :-)

Then you need to rely on the drivers shipped with the kernel.
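
If the 533FLR-T is handled by the in-kernel bnx2x driver (check with ethtool -i as above; the module name is my assumption), you can see which version ships with the PVE kernel:

Code:
# module version bundled with the running kernel (bnx2x assumed - verify with ethtool -i)
modinfo bnx2x | grep -i -e '^filename' -e '^version'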
 
Well, anyway, I tested Ceph today and moved the workloads onto it. The fio benchmark results are as follows:
randwrite 4k: 19.5k IOPS
seq write 4M: 1070 MiB/s
randread 4k: 41.9k IOPS
seq read 4M: 1900 MiB/s
Since the numbers are OK and the VMs seem to work fine for now (the load is really minimal...), I am probably going to ignore the missing throughput. On the next order from that shop, I might pick up the card though.
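
For anyone wanting to reproduce these numbers, fio invocations along these lines should be comparable (a sketch; the target file, size and iodepth are assumptions, not the exact commands used):

Code:
# 4k random write (adjust --filename, --size and --iodepth to the setup under test)
fio --name=randwrite --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=32 --numjobs=1 --size=4G --runtime=60 --time_based \
    --filename=/mnt/cephtest/fio.bin

# 4M sequential read
fio --name=seqread --rw=read --bs=4M --ioengine=libaio --direct=1 \
    --iodepth=16 --numjobs=1 --size=8G --runtime=60 --time_based \
    --filename=/mnt/cephtest/fio.bin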
 