Linux Bond slow

richieman

Member
Apr 16, 2021
Hi!
I'm looking for a way to speed up migration between 2 servers. I bought 2 quad-port GbE cards and wanted to bond the 4 ports. The setup in Proxmox was easy enough and I could get the connection established, but the resulting speed was only about 1.2 Gb/s, while I was expecting at least 3.5 Gb/s.

The way I set it up is as follows:

Linux Bond - balance-rr, I enter the 4 slaves: enp1s0f0 enp1s0f1 enp1s0f2 enp1s0f3. MTU 9000.

Then I add the bond to the bridge ports: eno1 bond0

I do the same on the other server. I connect directly with UTP cables (no switch).
I remove the cable that was on eno1.
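For reference, the setup described above would look roughly like this in /etc/network/interfaces. This is only a sketch: the bridge name vmbr0, the address and the miimon value are assumptions that do not come from the post, and it puts only bond0 into the bridge (the post lists eno1 next to bond0). The "iface ... inet manual" stanzas for the four member ports are left out for brevity.

```
auto bond0
iface bond0 inet manual
        bond-slaves enp1s0f0 enp1s0f1 enp1s0f2 enp1s0f3
        bond-miimon 100
        bond-mode balance-rr
        mtu 9000
#4-port round-robin bond, direct cables to the other node

auto vmbr0
iface vmbr0 inet static
        address 192.168.100.1/24
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        mtu 9000
#Bridge on top of the bond (name and address are placeholders)
```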

It works, but at 1.2 Gb/s it is barely faster than a single NIC.

I have also done bonding using VyOS inside virtual machines where I pass through the same NICs. Then I get around 2.2 Gb/s. That is still slower than expected, but almost twice the speed I get on the host machine, so the hardware should be capable.

Any ideas where to look for a solution?
 
Hi,

Which migration do you want to speed up? VM live migration, storage migration, ...?

If it's VM live migration and you are on a trusted, private network, you could disable encryption, as that could be your new bottleneck.

Edit /etc/pve/datacenter.cfg and add/adapt the following property:

Code:
migration: type=insecure

If a migration network is already configured, separate the two settings with a comma:
Code:
migration: type=insecure,network=CIDR

Remember that this means the whole memory of the guest is sent over in plain text, so anybody with access to the network can read out all secrets (encryption key/state, passwords, ...) from those VMs - so only do that if it really is a secure and trusted network.
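As a concrete illustration, with CIDR filled in the line could look like the one below. The subnet 192.168.122.0/24 is only a placeholder for a dedicated migration network; use whatever subnet your direct link actually has.

```
migration: type=insecure,network=192.168.122.0/24
```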
 
It is mostly for offline migrations, and also for backups to a NAS. I don't think the encryption is the issue here because with an offline migration it reached the same throughput as I got with iperf3.
 
Hello! I did not test those. In fact, they are not suitable for increasing the bandwidth of a single stream, according to this page: https://wiki.linuxfoundation.org/networking/bonding

balance-rr
  • This mode is the only mode that will permit a single TCP/IP connection to stripe traffic across multiple interfaces. It is therefore the only mode that will allow a single TCP/IP stream to utilize more than one interface's worth of throughput. This comes at a cost, however: the striping often results in peer systems receiving packets out of order, causing TCP/IP's congestion control system to kick in, often by retransmitting segments.
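The Linux bonding documentation also mentions the net.ipv4.tcp_reordering sysctl as a knob for making TCP more tolerant of the out-of-order delivery described above. A sketch of a sysctl.d fragment follows; the file name and the value 127 are assumptions for illustration, not tested recommendations (the usual kernel default is 3), and it would have to be set on both nodes.

```
# /etc/sysctl.d/90-bond-reorder.conf  (hypothetical file name)
# Allow more reordering before TCP treats a packet as lost.
# 127 is an illustrative value only; the usual default is 3.
net.ipv4.tcp_reordering = 127
```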
 
Yes, with 802.3ad it will not work, as migration currently uses only one TCP stream.

Technically, it would be possible to implement multiple streams for migration with QMP commands:

"
multifd on
multifd-channels=XX
"
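Spelled out as raw QMP messages, the two settings quoted above would look roughly like what this small helper prints. This is my reading of QEMU's QMP interface (the multifd capability and the multifd-channels parameter), not something Proxmox's migration code does; the helper name multifd_qmp is made up for the example, and as far as I know both source and destination QEMU need the same settings.

```shell
# Print the QMP messages that enable multifd with N channels.
# multifd_qmp is a hypothetical helper name, used only for this sketch.
multifd_qmp() {
    channels="$1"
    # Turn on the multifd migration capability.
    printf '{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"multifd","state":true}]}}\n'
    # Set how many parallel channels (TCP streams) to use.
    printf '{"execute":"migrate-set-parameters","arguments":{"multifd-channels":%d}}\n' "$channels"
}

multifd_qmp 4
```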
 
Have you tried enabling jumbo frames on the switch and on the node's network interface? I did so for the migration and Ceph networks:
Code:
auto enp130s0f1
iface enp130s0f1 inet static
        address 192.168.122.1/24
        mtu 9000
#Migration network

auto enp130s0f0
iface enp130s0f0 inet static
        address 192.168.121.1/24
        mtu 9000
#Ceph network
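To check that jumbo frames really work end to end, one option is a ping with the "don't fragment" bit set and a payload that exactly fills the MTU. The sketch below only computes the payload size (MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header) and prints the command; the peer address 192.168.122.3 is a placeholder and icmp_payload is a made-up helper name.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet.
icmp_payload() {
    echo $(( $1 - 20 - 8 ))   # MTU - IPv4 header - ICMP header
}

# Print the check for MTU 9000; -M do forbids fragmentation (iputils ping).
echo "ping -M do -c 3 -s $(icmp_payload 9000) 192.168.122.3"
```

If that ping fails while a normal ping works, something on the path is not passing MTU 9000.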
 
Also note that a bonding interface load-balances outgoing traffic only.

For incoming traffic, it is the switch that needs to load-balance its outgoing traffic (that is, the traffic incoming to the target host),
and AFAIK no switch is able to do round-robin balancing (only LACP).
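One way to see how traffic is actually spread over the links is to compare the per-port byte counters before and after a transfer. A minimal sketch, assuming the bond is called bond0 and reading the standard sysfs counters; bond_counters is a made-up helper name and the second argument exists only so the function can be pointed at a different directory tree.

```shell
# Print TX/RX byte counters for every member port of a bond.
bond_counters() {
    bond="$1"
    root="${2:-/sys/class/net}"
    for slave in $(cat "$root/$bond/bonding/slaves"); do
        tx=$(cat "$root/$slave/statistics/tx_bytes")
        rx=$(cat "$root/$slave/statistics/rx_bytes")
        echo "$slave tx_bytes=$tx rx_bytes=$rx"
    done
}

# Only run against the real system if bond0 exists.
if [ -d /sys/class/net/bond0/bonding ]; then
    bond_counters bond0
fi
```

If tx_bytes grows evenly on all ports but rx_bytes grows on only one, the sending side is balancing and the receiving side is not.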
 
Have you checked whether your storage subsystem is the problem?

If you install iperf3 on both nodes, you can run that to get the raw throughput information out of your network connection and find out whether that is the issue - try with both single and parallel streams to verify whether the bond mode is doing what you expect as well.

I've got a dual (active-passive) 10Gb connection - I get about 650MB/s for the memory segment of a live migration, but only about 150MB/s for the storage based part.
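Since the thread mixes MB/s (migration logs) and Gb/s (link speeds, iperf3), here is a small conversion sketch for the two figures above. It assumes decimal megabytes (1 MB = 1,000,000 bytes); if the tool reporting the number means MiB, the results are about 5% higher. The function name is made up.

```shell
# Convert MB/s to Gb/s, assuming decimal units (1 MB = 10^6 bytes).
mbytes_to_gbits() {
    awk -v mb="$1" 'BEGIN { printf "%.2f\n", mb * 8 / 1000 }'
}

mbytes_to_gbits 650   # prints 5.20
mbytes_to_gbits 150   # prints 1.20
```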
 
"I've got a dual (active-passive) 10Gb connection - I get about 650MB/s for the memory segment of a live migration"
Why so slow? I get much more on a single 10Gb link.
Code:
2021-04-19 02:12:54 use dedicated network address for sending migration traffic (192.168.122.1)
2021-04-19 02:12:54 starting migration of VM 110 to node 'asr1' (192.168.122.1)
2021-04-19 02:12:54 starting VM 110 on remote node 'asr1'
2021-04-19 02:12:57 start remote tunnel
2021-04-19 02:12:57 ssh tunnel ver 1
2021-04-19 02:12:57 starting online/live migration on tcp:192.168.122.1:60000
2021-04-19 02:12:57 set migration_caps
2021-04-19 02:12:57 migration speed limit: 8589934592 B/s
2021-04-19 02:12:57 migration downtime limit: 100 ms
2021-04-19 02:12:57 migration cachesize: 134217728 B
2021-04-19 02:12:57 set migration parameters
2021-04-19 02:12:57 spice client_migrate_info
2021-04-19 02:12:57 start migrate command to tcp:192.168.122.1:60000
2021-04-19 02:12:58 migration speed: 1024.00 MB/s - downtime 63 ms
2021-04-19 02:12:58 migration status: completed
2021-04-19 02:12:59 Waiting for spice server migration
2021-04-19 02:13:01 migration finished successfully (duration 00:00:07)
 
Measure the network performance first, so that you are not just guessing. Install iperf3 on both nodes:
Code:
apt install iperf3
On the node that acts as the test server, run:
Code:
iperf3 -s
On the node that acts as the client, run:
Code:
iperf3 -c 192.168.122.3
Here 192.168.122.3 is the server node's address on the migration network.
The output will look like this:
Code:
Connecting to host 192.168.122.3, port 5201
[  5] local 192.168.122.2 port 35636 connected to 192.168.122.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.15 GBytes  9.92 Gbits/sec   13   1.65 MBytes       
[  5]   1.00-2.00   sec  1.15 GBytes  9.90 Gbits/sec    7   1.65 MBytes       
[  5]   2.00-3.00   sec  1.15 GBytes  9.90 Gbits/sec    6   1.65 MBytes       
[  5]   3.00-4.00   sec  1.15 GBytes  9.90 Gbits/sec    0   1.65 MBytes       
[  5]   4.00-5.00   sec  1.15 GBytes  9.90 Gbits/sec    6   1.65 MBytes       
[  5]   5.00-6.00   sec  1.15 GBytes  9.90 Gbits/sec    0   1.65 MBytes
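The Retr column in that output counts TCP retransmissions, which is the number to watch with balance-rr, since out-of-order packets tend to show up there. A small sketch that adds up the per-interval Retr values from iperf3's text output (summary and [SUM] lines are skipped so nothing is counted twice); sum_retr is a made-up name, and the sample lines are copied from the output above.

```shell
# Sum the Retr column of iperf3 interval lines read from stdin.
sum_retr() {
    awk '$0 !~ /SUM|sender|receiver/ {
        for (i = 1; i < NF; i++)
            # Retr is the integer field right after the bitrate unit.
            if ($i ~ /bits\/sec$/ && $(i + 1) ~ /^[0-9]+$/)
                total += $(i + 1)
    } END { print total + 0 }'
}

# Demo on the first three intervals shown above; prints 26.
sum_retr <<'EOF'
[  5]   0.00-1.00   sec  1.15 GBytes  9.92 Gbits/sec   13   1.65 MBytes
[  5]   1.00-2.00   sec  1.15 GBytes  9.90 Gbits/sec    7   1.65 MBytes
[  5]   2.00-3.00   sec  1.15 GBytes  9.90 Gbits/sec    6   1.65 MBytes
EOF
```

On a live test it can be used as: iperf3 -c 192.168.122.3 | sum_retr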
 
Have you added "migration: type=insecure"? The SSH tunnel can't use more than one core, so it can limit the speed.
Not yet, I keep forgetting when I'm somewhere with access... OK, just did that and the only VM I still have replicated migrated at 700MB/s instead of 640MB/s. :) Given that it only has 2GB of RAM allocated, it still only takes a couple of seconds - and the outage is still sub 100ms. I've done it while my son is playing League of Legends (it's our firewall), and he didn't even notice the blip in the game.

@OP - give the iperf test a try, sidereus gave more complete instructions than I did, but missed out the parallel option.

Use '-P <n>' where 'n' is a multiple of the number of links you have bonded - e.g. for 2 links try 2 (or 4, or 6, etc). That will at least tell you whether you have any form of load balancing in use.
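As a sketch of that test series, the helper below prints iperf3 client commands for 1, N and 2N parallel streams, where N is the number of bonded links. The server address and the 10-second duration are placeholders, and iperf_cmds is a made-up name; it only prints the commands, so nothing touches the network until you run them.

```shell
# Print iperf3 client commands for 1, N and 2N parallel streams.
iperf_cmds() {
    server="$1"
    links="$2"
    for streams in 1 "$links" $((links * 2)); do
        echo "iperf3 -c $server -P $streams -t 10"
    done
}

# Example for a 4-link bond.
iperf_cmds 192.168.122.3 4
```

To actually run the series, pipe it to a shell: iperf_cmds 192.168.122.3 4 | sh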
 