1st core is overloaded by Intel NIC IRQs

Jun 8, 2016
We have a dual-socket Intel Wildcat Pass server with Intel E5-2640 processors. With Hyper-Threading enabled the system reports a total of 40 logical cores.

Reviewing individual core utilisation showed CPU 0 running at 100%, which was causing packet loss. We've subsequently used the taskset utility to prevent KVM and Ceph processes from running on core 0 and its Hyper-Threading sibling:
Code:
/etc/rc.local:
# Limit tasks to not run on core 0 and its Hyper-Threading sibling:
cpus='1-19,21-39';
for pid in `pidof kvm`; do
  taskset -a -cp $cpus $pid &> /dev/null;
  # Re-pin the matching vhost kernel threads (named vhost-<qemu pid>) as well:
  for vhostpid in `pidof vhost-$pid`; do
    taskset -a -cp $cpus $vhostpid &> /dev/null;
  done
done
for pid in `pidof ceph-fuse ceph-mon ceph-osd`; do
  taskset -a -cp $cpus $pid &> /dev/null;
done
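
To confirm the restriction took effect one can simply query a process afterwards; a minimal check (the PID below is only a placeholder):
Code:
# Query the affinity of all threads of one of the KVM processes (example PID):
taskset -a -cp 12345
# Expected output along the lines of:
#   pid 12345's current affinity list: 1-19,21-39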


Packet loss is now gone, but we've effectively dedicated an entire physical core (core 0 and its sibling) to handling interrupts from the network card. The resulting 'top' output now looks like this:
Code:
top - 10:45:42 up 49 days, 13:38,  1 user,  load average: 25.56, 24.59, 24.25
Tasks: 744 total,   9 running, 364 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.6 us,  6.7 sy,  0.0 ni, 54.3 id,  0.0 wa,  0.0 hi, 38.4 si,  0.0 st
%Cpu1  : 39.4 us, 24.9 sy,  0.0 ni, 31.3 id,  2.7 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu2  : 43.1 us, 23.7 sy,  0.0 ni, 32.5 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu3  : 51.0 us, 19.8 sy,  0.0 ni, 28.9 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu4  : 53.6 us, 19.0 sy,  0.0 ni, 26.1 id,  0.0 wa,  0.0 hi,  1.4 si,  0.0 st
%Cpu5  : 41.2 us, 24.9 sy,  0.0 ni, 33.2 id,  0.3 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu6  : 40.9 us, 25.2 sy,  0.0 ni, 31.5 id,  1.7 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu7  : 47.1 us, 23.2 sy,  0.0 ni, 28.3 id,  0.3 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu8  : 45.6 us, 22.4 sy,  0.0 ni, 27.6 id,  4.1 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu9  : 46.1 us, 25.4 sy,  0.0 ni, 27.8 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu10 : 23.9 us, 18.4 sy,  0.0 ni, 45.1 id, 11.6 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu11 : 21.2 us, 18.4 sy,  0.0 ni, 57.6 id,  2.4 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu12 : 20.1 us, 18.0 sy,  0.0 ni, 60.2 id,  1.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu13 : 20.4 us, 18.0 sy,  0.0 ni, 52.9 id,  7.3 wa,  0.0 hi,  1.4 si,  0.0 st
%Cpu14 : 20.4 us, 16.9 sy,  0.0 ni, 57.7 id,  3.9 wa,  0.0 hi,  1.1 si,  0.0 st
%Cpu15 : 22.1 us, 18.0 sy,  0.0 ni, 54.0 id,  5.5 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu16 : 20.7 us, 16.7 sy,  0.0 ni, 61.6 id,  0.4 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu17 : 22.8 us, 18.6 sy,  0.0 ni, 57.2 id,  1.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu18 : 23.3 us, 17.8 sy,  0.0 ni, 57.8 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 : 20.3 us, 19.9 sy,  0.0 ni, 54.2 id,  4.9 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu20 :  0.3 us,  9.2 sy,  0.0 ni, 87.8 id,  2.6 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 : 34.0 us, 27.2 sy,  0.0 ni, 37.4 id,  0.3 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu22 : 60.3 us, 16.8 sy,  0.0 ni, 22.6 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu23 : 49.5 us, 20.9 sy,  0.0 ni, 29.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu24 : 44.1 us, 23.7 sy,  0.0 ni, 30.9 id,  0.0 wa,  0.0 hi,  1.3 si,  0.0 st
%Cpu25 : 38.6 us, 23.9 sy,  0.0 ni, 36.8 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu26 : 48.0 us, 19.1 sy,  0.0 ni, 31.2 id,  0.3 wa,  0.0 hi,  1.3 si,  0.0 st
%Cpu27 : 41.9 us, 23.7 sy,  0.0 ni, 33.7 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu28 : 47.5 us, 23.1 sy,  0.0 ni, 28.8 id,  0.3 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu29 : 47.1 us, 22.9 sy,  0.0 ni, 29.0 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu30 : 21.8 us, 20.0 sy,  0.0 ni, 42.8 id, 14.7 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu31 : 17.7 us, 19.8 sy,  0.0 ni, 61.8 id,  0.4 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu32 : 22.8 us, 17.6 sy,  0.0 ni, 44.8 id, 14.1 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu33 : 23.2 us, 16.5 sy,  0.0 ni, 58.2 id,  1.8 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu34 : 21.1 us, 17.5 sy,  0.0 ni, 58.2 id,  1.8 wa,  0.0 hi,  1.4 si,  0.0 st
%Cpu35 : 19.1 us, 18.7 sy,  0.0 ni, 60.4 id,  1.4 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu36 : 22.5 us, 17.6 sy,  0.0 ni, 58.8 id,  0.0 wa,  0.0 hi,  1.1 si,  0.0 st
%Cpu37 : 19.8 us, 19.1 sy,  0.0 ni, 55.5 id,  4.9 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu38 : 18.7 us, 20.8 sy,  0.0 ni, 56.7 id,  3.5 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu39 : 18.6 us, 17.9 sy,  0.0 ni, 28.7 id, 33.3 wa,  0.0 hi,  1.4 si,  0.0 st
KiB Mem : 52828739+total, 40350712 free, 21839540+used, 26954128+buff/cache
KiB Swap: 26830438+total, 26819404+free,   110336 used. 31489411+avail Mem



Has anyone got experience with distributing IRQs over multiple cores? The system appears to automatically create Tx/Rx queue pairs equal to the number of cores and associates each pair with a given core. Most interrupts (by roughly a 10:1 margin) nonetheless occur on core 0:
Code:
[admin@kvm5c ~]# grep -e CPU -e eth0 /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15      CPU16      CPU17      CPU18      CPU19      CPU20      CPU21      CPU22      CPU23      CPU24      CPU25      CPU26      CPU27      CPU28      CPU29      CPU30      CPU31
 152: 1620602535          0          0          0          0          0          1          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670016-edge      eth0-TxRx-0
 153:          0  168146477          0          0          0          0          0          1          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670017-edge      eth0-TxRx-1
 154:          0          0  154649369          0          0          0          0          0          0          0          0          0          0          0          0          0          1          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670018-edge      eth0-TxRx-2
 155:          0          0          0  140435314          0          0          0          0          0          0          0          0          0          0          0          0          0          1          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670019-edge      eth0-TxRx-3
 156:          0          0          0          0  133984352          0          0          0          0          0          0          0          0          0          0          0          0          0          1          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670020-edge      eth0-TxRx-4
 157:          0          0          0          0          0  129604155          0          0          0          0          0          0          0          0          0          0          0          0          0          1          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670021-edge      eth0-TxRx-5
 158:          0          0          0          0          0          0  126145092          0          0          0          0          0          0          0          0          0          0          0          0          0          1          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670022-edge      eth0-TxRx-6
 159:          0          0          0          0          0          0          0  123962115          0          0          0          0          0          0          0          0          0          0          0          0          0          1          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670023-edge      eth0-TxRx-7
 160:          0          0          0          0          0          0          0          0   40980503          0          0          0          0          0          0          0          0          0          0          0          0          0          1          0          0          0          0          0          0          0          0          0   PCI-MSI 3670024-edge      eth0-TxRx-8
 161:          0          0          0          0          0          0          0          0          0   42662918          0          0          0          0          0          0          0          0          0          0          0          0          0          1          0          0          0          0          0          0          0          0   PCI-MSI 3670025-edge      eth0-TxRx-9
 162:          1          0          0          0          0          0          0          0          0          0   41846450          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670026-edge      eth0-TxRx-10
 163:          0          1          0          0          0          0          0          0          0          0          0   45015288          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670027-edge      eth0-TxRx-11
 164:          0          0          1          0          0          0          0          0          0          0          0          0   45015780          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670028-edge      eth0-TxRx-12
 165:          0          0          0          1          0          0          0          0          0          0          0          0          0   41972828          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670029-edge      eth0-TxRx-13
 166:          0          0          0          0          1          0          0          0          0          0          0          0          0          0   43592026          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670030-edge      eth0-TxRx-14
 167:          0          0          0          0          0          1          0          0          0          0          0          0          0          0          0   41824424          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670031-edge      eth0-TxRx-15
 168:          0          0          0          0          0          0          1          0          0          0          0          0          0          0          0          0   85380315          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670032-edge      eth0-TxRx-16
 169:          0          0          0          0          0          0          0          1          0          0          0          0          0          0          0          0          0  116711472          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670033-edge      eth0-TxRx-17
 170:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          1          0  115485583          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI 3670034-edge      eth0-TxRx-18
<snip>
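
For completeness, individual queue interrupts can also be re-pinned by hand through /proc/irq; a purely illustrative sketch (IRQ 152 is eth0-TxRx-0 in the listing above):
Code:
# Show which cores may currently service IRQ 152 (eth0-TxRx-0):
cat /proc/irq/152/smp_affinity_list
# Restrict IRQ 152 to core 2 only:
echo 2 > /proc/irq/152/smp_affinity_list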
 
The irqbalance daemon isn't recommended here, and attempting to use it to mitigate this had disastrous effects: packet loss, latency and service interruptions.

What I've learnt:
  • Running Open vSwitch (OvS) LACP bonding is not recommended; one should either run OvS bonds for VM traffic as active-backup bonds (what we were doing originally) or use Linux kernel LACP bond interfaces attached to an OvS bridge (reference: https://access.redhat.com/documenta...r_installation_and_usage/appe-bonding_options)
  • Whilst the above taskset commands work perfectly in mitigating the problem, booting the system with 'isolcpus=0,20' and then updating OvS's CPU affinity to use cores 0-39 (taskset -a -cp 0-39 `pidof ovs-vswitchd`) should in principle achieve the same thing. In practice, however, CPU utilisation on core 0 is several times higher and we experience the same load-related problems as when running OvS LACP bonds for VM traffic (a GRUB sketch for the isolcpus boot parameter follows after the code block below).
  • The OvS LACP issue does not affect the Ceph traffic bond; it only surfaces above a certain threshold of traffic when bridging traffic for VMs.
  • Reading this (https://www.kernel.org/doc/Documentation/networking/scaling.txt) had me trying aRFS (Accelerated Receive Flow Steering), but although CPU utilisation was more evenly distributed we still experienced packet drops and latency:
Code:
cores=`nproc --all`;
flows='2048';
# Global flow table size = per-queue flow count x number of queues/cores:
echo $[$flows*$cores] > /proc/sys/net/core/rps_sock_flow_entries;
for f in 0 1 2 3; do
  for g in `seq 0 $[cores-1]`; do
    # Per-queue flow count (scaling.txt documents rps_flow_cnt for this, not rps_cpus):
    echo $flows > /sys/class/net/eth$f/queues/rx-$g/rps_flow_cnt;
  done;
  # aRFS additionally requires ntuple filtering to be enabled on the NIC:
  ethtool -K eth$f ntuple on;
  sleep 10;  # Changing ntuple flaps the interface
done;
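
Regarding the isolcpus experiment mentioned above: the parameter is passed on the kernel command line at boot. A minimal sketch of the GRUB change on a Debian-based host (the existing 'quiet' option is an assumption):
Code:
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet isolcpus=0,20"
# Regenerate the GRUB configuration and reboot for it to take effect:
update-grub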



Our nodes have 2 x 10GbE interfaces for VM traffic and 2 x 10GbE SFP+ interfaces for Ceph. We create two OvS bridges: an active-backup bond is used for VM traffic and an LACP bond is configured for Ceph. Red Hat officially recommend against running OvS LACP bonds and instead recommend constructing a Linux kernel bond interface which is then attached to the OvS bridge; we have scheduled in-house testing of this layout prior to configuring it on a production node, to evaluate the impact (a rough sketch of the intended configuration follows below).
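
A rough sketch of the layout we intend to test (Linux kernel LACP bond handed to the OvS bridge as an ordinary port). This is untested on our side and the exact ifupdown ordering may need adjustment, so treat it as an assumption rather than a verified configuration:
Code:
# Kernel LACP bond (bonding driver / ifenslave):
auto bond0
iface bond0 inet manual
        bond-slaves eth0 eth1
        bond-mode 802.3ad
        bond-miimon 100
        bond-lacp-rate fast
        mtu 9216

# OvS bridge; the kernel bond is attached as a plain port:
auto vmbr0
allow-ovs vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        mtu 9216
        post-up ovs-vsctl --may-exist add-port vmbr0 bond0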

Herewith the current pure OvS configuration:
Code:
auto lo
iface lo inet loopback

allow-vmbr0 bond0
iface bond0 inet manual
        ovs_bridge vmbr0
        ovs_type OVSBond
        ovs_bonds eth0 eth1
        pre-up ( ifconfig eth0 mtu 9216 && ifconfig eth1 mtu 9216 )
        ovs_options bond_mode=active-backup tag=1 vlan_mode=native-untagged
        mtu 9216

auto vmbr0
allow-ovs vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports bond0 vlan1
        mtu 9216

allow-vmbr0 vlan1
iface vlan1 inet static
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        ovs_options tag=1
        ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
        address 192.168.240.66
        netmask 255.255.255.224
        gateway 192.168.240.1
        mtu 1500

allow-vmbr1 bond1
iface bond1 inet manual
        ovs_bridge vmbr1
        ovs_type OVSBond
        ovs_bonds eth2 eth3
        pre-up ( ifconfig eth2 mtu 9216 && ifconfig eth3 mtu 9216 )
        ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast tag=1 vlan_mode=native-untagged
        mtu 9216

auto vmbr1
allow-ovs vmbr1
iface vmbr1 inet manual
        ovs_type OVSBridge
        ovs_ports bond1 vlan33
        mtu 9216

allow-vmbr1 vlan33
iface vlan33 inet static
        ovs_type OVSIntPort
        ovs_bridge vmbr1
        ovs_options tag=33
        ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
        address 10.254.1.66
        netmask  255.255.255.0
        mtu 9212

Notes:
  • eth0 and eth1 are associated directly with the relevant VLANs that virtuals connect to. Management access (192.168.240.0/24) is untagged and all other guest VLANs are tagged on these ports.
  • LAG 66 is a LACP multi-chassis bond on the upstream switch which is exclusively a tagged member of the Ceph network (vlan 33).


The following yielded severe service degradation when passing guest VM traffic through an OvS LACP bond interface:
Code:
allow-vmbr0 bond0
iface bond0 inet manual
        ovs_bridge vmbr0
        ovs_type OVSBond
        ovs_bonds eth0 eth1
        pre-up ( ifconfig eth0 mtu 9216 && ifconfig eth1 mtu 9216 )
-       ovs_options bond_mode=active-backup tag=1 vlan_mode=native-untagged
+       ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast tag=1 vlan_mode=native-untagged
        mtu 9216



PS: OvS also allows one to construct a hybrid LACP bond interface which doesn't use recirculation (i.e. an active-backup bond interface which still exchanges LACPDU frames with the switch). Configuring such an OvS bond0 interface, attached to the bridge all VMs connect to, had the same effect: considerably higher CPU utilisation on core 0, leading to packet loss and latency problems:
Code:
allow-vmbr0 bond0
iface bond0 inet manual
        ovs_bridge vmbr0
        ovs_type OVSBond
        ovs_bonds eth0 eth1
        pre-up ( ifconfig eth0 mtu 9216 && ifconfig eth1 mtu 9216 )
        ovs_options bond_mode=active-backup lacp=active other_config:lacp-time=fast tag=1 vlan_mode=native-untagged
        mtu 9216



Not sure if this problem is due to 4.15.18-4-pve using the mainline ixgbe driver, instead of the Intel out-of-tree ixgbe driver. The problem can essentially be traced to the Intel Flow Director matching packets:
Code:
[root@kvm5a ~]# for f in 0 1 2 3; do echo "eth$f:"; ethtool -S eth$f | grep -P 'fdir|queue_(0|1)_packets'; done
eth0:
     fdir_match: 62107489
     fdir_miss: 27983350
     fdir_overflow: 2039
     tx_queue_0_packets: 294139
     tx_queue_1_packets: 94801807
     rx_queue_0_packets: 2760148155
     rx_queue_1_packets: 6445802
eth1:
     fdir_match: 0
     fdir_miss: 1588198
     fdir_overflow: 0
     tx_queue_0_packets: 0
     tx_queue_1_packets: 0
     rx_queue_0_packets: 67519165
     rx_queue_1_packets: 121575
eth2:
     fdir_match: 59439086
     fdir_miss: 21391627
     fdir_overflow: 3
     tx_queue_0_packets: 617019
     tx_queue_1_packets: 1526448
     rx_queue_0_packets: 884564
     rx_queue_1_packets: 1826231
eth3:
     fdir_match: 45470259
     fdir_miss: 16940877
     fdir_overflow: 3
     tx_queue_0_packets: 696103
     tx_queue_1_packets: 1583566
     rx_queue_0_packets: 494183
     rx_queue_1_packets: 1716153


When running bond0 with LACP:
Code:
     fdir_match: 66807
     fdir_miss: 107802
     fdir_overflow: 1140
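
Incidentally, one way to check whether the in-tree or the Intel out-of-tree ixgbe build is loaded (sketch; interface name assumed):
Code:
# Report the driver name and version string for the interface:
ethtool -i eth0
# The in-tree driver typically reports a kernel-coupled version (ending in "-k"),
# whereas Intel's out-of-tree build reports its own release number.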
 
We host virtual routers and firewalls, so we need the ability to QinQ a virtual guest to a VLAN tag. The equivalent of this action is configuring a Cisco switch port in 'dot1q-tunnel' mode. Any untagged or tagged packets are essentially wrapped in another VLAN tag on ingress from the VM and the VLAN tag popped on egress to the VM.

The VLAN-aware Linux bridge does not support this type of interface. The non-VLAN-aware Linux bridge does, but it can't differentiate the same MAC address in different VLANs, so the only proper solution for us is to run OvS with a tiny modification to the Proxmox script:
Code:
[root@pve-test ~]# diff -uNr /usr/share/perl5/PVE/Network.pm.orig /usr/share/perl5/PVE/Network.pm;
--- /usr/share/perl5/PVE/Network.pm.orig        2018-02-19 12:41:12.000000000 +0200
+++ /usr/share/perl5/PVE/Network.pm     2018-04-05 15:41:56.131434252 +0200
@@ -251,9 +251,9 @@
     $trunks =~ s/;/,/g if $trunks;

     my $cmd = "/usr/bin/ovs-vsctl add-port $bridge $iface";
-    $cmd .= " tag=$tag" if $tag;
-    $cmd .= " trunks=". join(',', $trunks) if $trunks;
-    $cmd .= " vlan_mode=native-untagged" if $tag && $trunks;
+    $cmd .= " vlan_mode=dot1q-tunnel tag=$tag other-config:qinq-ethtype=802.1q" if $tag;
+    $cmd .= " cvlans=". join(',', $trunks) if $trunks && $tag;
+    $cmd .= " trunks=". join(',', $trunks) if $trunks && !$tag;

     $cmd .= " -- set Interface $iface type=internal" if $internal;
     system($cmd) == 0 ||
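
With the patch above, the ovs-vsctl command that Proxmox ends up issuing for a tagged guest port should look roughly like this (the tap port name and VLAN tag are illustrative):
Code:
# Hypothetical guest tap interface, tunnelled into VLAN 941 (dot1q-tunnel / QinQ):
ovs-vsctl add-port vmbr0 tap100i0 vlan_mode=dot1q-tunnel tag=941 other-config:qinq-ethtype=802.1q
# Verify the resulting port record:
ovs-vsctl list port tap100i0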

NB: Running a non-VLAN aware OS such as Windows would simply require the guest to communicate using untagged frames which OvS would associate with the relevant VLAN. Configuration management is therefore unchanged.

Sample network configuration for a standard VM, virtual router or virtual firewall: [screenshot attachment: sample_vm.jpg]

Sample VLAN configuration of the router attached to VLAN 941: [screenshot attachment: sample_router.jpg]


PS: Perhaps you could bring the following forum post (https://forum.proxmox.com/threads/proxmox-5-0-and-ovs-with-dot1q-tunnel.34090/#post-211337) and feature request (https://bugzilla.proxmox.com/show_bug.cgi?id=1350) to the attention of the relevant staff? Proxmox 5.2 simply needs to have its OvS packages upgraded to those included in Debian Buster; we've been running them in production for 400+ VMs since April with zero side effects. Perhaps this could be added to the Proxmox release scheduled to be based on Debian Buster?
 
I did mention in my earlier post that I was going to follow Red Hat's official recommendation of avoiding OvS LACP bond interfaces by assembling a Linux kernel LACP bond and then attaching that to the OvS bridge for VM traffic. I'll provide feedback on whether or not this addresses the throughput and packet loss issues we experienced when we converted the OvS active-backup bond to LACP (pure OvS).

I hope to test this on an internal host this evening and thereafter trial one production cluster node this coming weekend...
 
Thanks.
Funny, I'm currently working on 802.1ad VLAN stacking support for the Proxmox /etc/network/interfaces (Linux VLAN-aware bridge and Linux interfaces), but not yet OvS. (I'll try to look at it together with your code.)


I'm planning to use ifupdown2 for this (a new package which replaces ifupdown) as it has more options. (apt-get install ifupdown2; note that OvS is not 100% supported yet, so only try it on a test server.)


For interfaces, it's possible to define something like:

iface eth0.10
vlan-protocol 802.1ad

iface eth0.10.100 (#802.1q by default)

(I haven't verified this, but it should work with Proxmox: with a default vmbr0 on eth0.10, it should create vmbr0v100 and eth0.10.100 when choosing a VLAN for the VM.)

iface eth0.10
vlan-protocol 802.1ad

iface vmbr0
bridge_ports eth0.10



Another way would be a VLAN-aware bridge for 802.1Q (the tag is then applied on the bridge):

iface eth0.10
vlan-protocol 802.1ad

iface vmbr0
bridge_ports eth0.10
bridge_vlan-aware yes

But if I understand correctly, you need to see the 802.1ad tag inside the bridge and inside the VM, not only when the packet leaves the physical server?



It's also possible to create a VLAN-aware bridge with 802.1ad VLANs, but that applies to the whole bridge and can't be selected per port, so this would require stacking two VLAN-aware bridges with some veth interfaces:

iface vmbr0
bridge-vlan-protocol 802.1ad
bridge_vlan-aware yes
..
(you can do the same with echo 0x88a8 > /sys/class/net/vmbr0/bridge/vlan_protocol)

Not sure if any of this helps you?
 
Interesting, I'll read up on ifupdown2.

My requirement is actually the exact opposite: I want frames ingressing a LACP bond on the Proxmox host to have the first VLAN tag (standard 802.1Q, aka TPID 0x8100) stripped when they are presented to the virtual guest, and anything and everything leaving the guest wrapped back into the same VLAN tag.

Sample frame entering or leaving the Proxmox host (triple tagged):
802.1Q:941_802.1Q:10_802.1Q:11
Should be presented to and accepted from the virtual guest as (double tagged):
802.1Q:10_802.1Q:11

The VLAN-aware Linux bridge wouldn't wrap tagged frames originating from a virtual guest router, and wouldn't pass frames through to the virtual guest router if they contained a second VLAN tag. Whilst the ifupdown2 method you describe may work, it requires constant manipulation of the /etc/network/interfaces file on all participating cluster nodes. Iperf with a 1500 byte MTU shows 7.5 Gbps when testing between VMs on the same node and when VMs are migrated to separate nodes.
 

Ok, got it! Indeed, I don't think that's possible with the VLAN-aware bridge. I'll try to see if we can update OvS for the next Proxmox release.

BTW, does MTU work fine at 1500 with double tags? (Shouldn't it need to increase a little, to something like 1508?)
 
MTU does need to be increased. Finding the maximum supported MTU requires stepping through values until the command stops reporting an error. We essentially start at the maximum allowed by our switches (ifconfig eth0 mtu 9216) and then step down until the command succeeds (Dell servers typically don't allow an interface MTU larger than 9000 bytes).
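
A rough sketch of that process (the intermediate values are arbitrary examples):
Code:
# Try progressively smaller MTUs until the driver accepts one:
for mtu in 9216 9100 9000; do
  ifconfig eth0 mtu $mtu 2>/dev/null && { echo "eth0 accepts MTU $mtu"; break; }
done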

Herewith a sample OvS network definition for a cluster node with two bonds. The first is an active-backup bond, to avoid the packet loss and latency issues we experienced with high volumes of traffic on an LACP bond, and the second is the LACP bond used for Ceph replication traffic. The management VLAN is presented untagged on the first bond, so we associate untagged traffic with a 'vlan1' interface, whilst Ceph is handed off tagged on the second bond.

MTU of the physical ports is set to the maximum available (9216), the bond interface MTU matches the physical slave interface limit, the Ceph replication VLAN interface is 4 bytes less as it leaves the bond and physical interfaces with a VLAN tag, and the management network keeps the default 1500 byte MTU.

Code:
/etc/network/interfaces
auto lo
iface lo inet loopback

allow-vmbr0 bond0
iface bond0 inet manual
        ovs_bridge vmbr0
        ovs_type OVSBond
        ovs_bonds eth0 eth1
        pre-up ( ifconfig eth0 mtu 9216 && ifconfig eth1 mtu 9216 )
        ovs_options bond_mode=active-backup tag=1 vlan_mode=native-untagged
        mtu 9216

auto vmbr0
allow-ovs vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports bond0 vlan1
        mtu 9216

allow-vmbr0 vlan1
iface vlan1 inet static
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        ovs_options tag=1
        ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
        address 192.168.240.66
        netmask 255.255.255.0
        gateway 192.168.240.1
        mtu 1500

allow-vmbr1 bond1
iface bond1 inet manual
        ovs_bridge vmbr1
        ovs_type OVSBond
        ovs_bonds eth2 eth3
        pre-up ( ifconfig eth2 mtu 9216 && ifconfig eth3 mtu 9216 )
        ovs_options bond_mode=balance-tcp lacp=active other_config:lacp-time=fast tag=1 vlan_mode=native-untagged
        mtu 9216

auto vmbr1
allow-ovs vmbr1
iface vmbr1 inet manual
        ovs_type OVSBridge
        ovs_ports bond1 vlan33
        mtu 9216

allow-vmbr1 vlan33
iface vlan33 inet static
        ovs_type OVSIntPort
        ovs_bridge vmbr1
        ovs_options tag=33
        ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
        address 10.254.1.66
        netmask  255.255.255.0
        mtu 9212


The following was a simpler starting point (4 x 10GbE LACP bond) but bridging traffic for multiple virtuals through the LACP trunk caused latency problems and severe packet loss:
Code:
/etc/network/interfaces
auto lo
iface lo inet loopback

allow-vmbr0 bond0
iface bond0 inet manual
        ovs_bridge vmbr0
        ovs_type OVSBond
        ovs_bonds eth0 eth1 eth2 eth3
        pre-up ( ifconfig eth0 mtu 9216 && ifconfig eth1 mtu 9216 && ifconfig eth2 mtu 9216 && ifconfig eth3 mtu 9216 )
        ovs_options bond_mode=active-backup tag=1 vlan_mode=native-untagged
        mtu 9216

auto vmbr0
allow-ovs vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports bond0 vlan1
        mtu 9216

allow-vmbr0 vlan1
iface vlan1 inet static
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        ovs_options tag=1
        ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
        address 192.168.240.66
        netmask 255.255.255.0
        gateway 192.168.240.1
        mtu 1500

allow-vmbr0 vlan33
iface vlan33 inet static
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        ovs_options tag=33
        ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif
        address 10.254.1.66
        netmask  255.255.255.0
        mtu 9212
 
Back on topic for this thread. This document (https://software.intel.com/en-us/articles/setting-up-intel-ethernet-flow-director) details that the Intel Flow Director will automatically program the NIC with hashes of outgoing flows so that return traffic is received back on the same CPU:

Using Intel Ethernet Flow Director
Intel Ethernet FD can run in one of two modes: externally programmed (EP) mode, and ATR mode. Once Intel Ethernet Flow Director is enabled, ATR mode is the default mode, provided that the driver is in multiple Tx queue mode. <cut> In either mode, fields are intelligently selected from the packets in the Rx queues to index into the Perfect-Match filter table. For more information on how Intel Ethernet FD works, see this whitepaper.

Application Targeting Routing
In ATR mode, Intel Ethernet FD uses fields from the outgoing packets in the Tx queues to populate the 8K-entry Perfect-Match filter table. The fields that are selected depend on the packet type; for example, fields to filter TCP traffic will be different than those used to filter user datagram protocol (UDP) traffic. Intel Ethernet FD then uses the Perfect-Match filter table to intelligently route incoming traffic to the Rx queues.


My assumption is that the mainline kernel ixgbe driver doesn't handle this the same way as the Intel out-of-tree ixgbe driver. Perhaps Proxmox should rather continue to replace the mainline driver with Intel's...
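
If the ATR heuristic is indeed the culprit, my reading of the Intel documentation is that programming Perfect-Match filters oneself (EP mode) takes over from ATR; a purely illustrative sketch using ethtool's ntuple interface (interface, port and queue numbers are made up):
Code:
# Enable ntuple filtering (externally programmed Flow Director mode):
ethtool -K eth0 ntuple on
# Steer an example flow (TCP destination port 5001) to RX queue 4:
ethtool -U eth0 flow-type tcp4 dst-port 5001 action 4
# List the currently programmed filters:
ethtool -u eth0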
 
