[SOLVED] communication issue between SRIOV VM VF and CT on PF bridge

ctr

Member
Dec 21, 2019
I'm passing a VF through to a VM using SR-IOV. The VM works fine: I can reach it from Proxmox (mgmt interface on vmbr0 of the PF) and from other hosts on the network.
However, a CT attached to that very same vmbr0 cannot reach the VM, and it cannot reach anything else on the network either. According to tcpdump, packets from the CT are getting to the VF and the VF is responding, but those response packets are then no longer visible on the PF and as a result never reach the CT.

Does that ring any bells?
 

Dominic

Proxmox Staff Member
Staff member
Mar 18, 2019
Have you been able to solve your problem already? If not, which version are you running (pveversion -v), and what does your /etc/network/interfaces look like?
 

ctr

Member
Dec 21, 2019
Yes, I still have the issue. I just tried again after some time (and a series of updates in the meantime).
Bash:
# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-6
pve-kernel-helper: 6.2-6
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-1
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-1
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1

Nothing special in the interfaces config:
Code:
# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto eno2
iface eno2 inet manual

auto eno1
iface eno1 inet manual

auto eno3
iface eno3 inet manual

iface eno4 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.12.12/24
    gateway 192.168.12.1
    bridge-ports eno2
    bridge-stp off
    bridge-fd 0
#LAN Bridge

auto vmbr1
iface vmbr1 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0
#unused

auto vmbr2
iface vmbr2 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0
    mtu 9000
#Firewall to Router Transit
 

ctr

Member
Dec 21, 2019
Some more info:
Even the ARP reply is not getting through.

When pinging the default gateway (192.168.12.1) from the container (192.168.12.13), this is what I see on the gateway:
Code:
11:33:21.000863 ARP, Request who-has 192.168.12.1 tell 192.168.12.13, length 28
11:33:21.000927 ARP, Reply 192.168.12.1 is-at de:ad:be:ef:21:00, length 28
But this ARP reply never arrives on the PF; there the container just keeps asking over and over:
Code:
11:33:20.998019 ARP, Request who-has 192.168.12.1 tell 192.168.12.13, length 28
11:33:22.000162 ARP, Request who-has 192.168.12.1 tell 192.168.12.13, length 28
11:33:23.024155 ARP, Request who-has 192.168.12.1 tell 192.168.12.13, length 28
consequently inside the container:
Bash:
# arp -na
? (192.168.12.1) at <incomplete>  on eth0

The same works fine from the host, just not from within the container.
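For reference, the captures above were taken with plain ARP filters, roughly like this (run on the Proxmox host for the PF side; inside the gateway VM the interface name is whatever the VF shows up as there):
Bash:
# tcpdump -ni eno2 arp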

Maybe this combination of bridge, SR-IOV PF/VF, and veth just doesn't play nicely together, but I really want the SR-IOV VF in the VM to be the gateway.

Would it be possible (as a workaround or solution) to use SR-IOV for the container as well? LXD supports nictype=sriov, but I don't know whether Proxmox does...
 

ctr

Member
Dec 21, 2019
This may not be specific to CTs. I just spun up the first VM that was supposed to go onto the bridge, and it faces the same communication issues as the container.
 

ctr

Member
Dec 21, 2019
Maybe this is not as uncommon as I thought it would be; the Intel DPDK documentation even describes exactly this use case.

The problem I'm having is that VMs on the left-hand side (attached via the VF driver) cannot talk to VMs on the right-hand side (attached via the bridge on the PF). Both sides can talk to the PF itself and to outside hosts.
[Attached image: single_port_nic.png, the DPDK diagram of a single NIC port shared between VFs and a bridge on the PF]
 

ctr

Member
Dec 21, 2019
Again answering myself...

This seems to be a known issue (or maybe a feature?) and there is a workaround available.
I got the information from
https://bugzilla.redhat.com/show_bug.cgi?id=1067802
and
https://community.intel.com/t5/Ethernet-Products/82599-VF-to-Linux-host-bridge/td-p/351802

A simple
Bash:
# bridge fdb add DE:AD:BE:EF:00:01 dev eno2
(with the MAC being the MAC of the container or bridged VM, and the dev being the PF interface that is attached to the bridge) enables communication between the VF and the virtual NICs on the PF bridge.
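To double-check that the entry actually landed in the forwarding DB, you can look up the guest's MAC in its config and then grep the FDB on the PF (the CT ID and MAC below are just placeholders; use qm config for a VM):
Bash:
# pct config <CTID> | grep hwaddr
# bridge fdb show | grep -i de:ad:be:ef:00:01
The new entry should show up as something like de:ad:be:ef:00:01 dev eno2 self permanent.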

I'd love to see some automation around this; I will look into the possibility of creating a hook script for this scenario.
 

Dominic

Proxmox Staff Member
Staff member
Mar 18, 2019
Thank you for posting the solution!

ctr said:
I'd love to see some automation around this; I will look into the possibility of creating a hook script for this scenario.
You can post an enhancement request at our Bugzilla. Forum threads might be forgotten after a while, but open Bugzilla requests are easy to find. If you script something yourself, it would be great if you could make it available to other users as well :)
 

kriss35

Member
Aug 5, 2015
ctr said:
This seems to be a known issue (or maybe a feature?) and there is a workaround available. [quote of the bridge fdb add workaround from post #7]

Thank you so much, ctr!
This problem was driving me crazy; you saved my week :)

I've now made a bash script (extension renamed from sh to txt to attach it to this post) which runs every minute and checks whether all MAC addresses of the containers and VMs are already in the forwarding DB of the interface; if they are not, it adds them. The stdout can be redirected to a log file.

Again, a big thank you.
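For anyone who wants to do the same, a crontab entry along these lines should work (the path and log file are just examples, adjust them to wherever you put the script):
Code:
* * * * * /usr/local/bin/mac_register.sh >> /var/log/mac_register.log 2>&1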
 

Attachments

  • mac_register.txt
    1.3 KB · Views: 40

kbumsik

New Member
Nov 1, 2020
Wow, thanks sooooo much ctr and kriss35 !!!

I use an ixgbe device (Intel X520-DA2) and had exactly the same problem. You guys saved my day :)

And thanks a lot for the script, it works out of the box for me!
 

Ramalama

Active Member
Dec 26, 2020
Bash:
#!/usr/bin/bash
#
# vf_add_maddr.sh  Version 1.1
# Based on the script by kriss35.
# Update by Rama: added the MAC address of the bridge itself, simplified the output, made it
# systemd-service (Restart=on-failure) compatible, and sped it up with a temp file (one FDB readout).
# Usage: run directly without arguments, create a systemd service for it, or add it to crontab to run every x minutes.
#
# Adjust the node name ("proxmox"), the PF interface and the bridge to your setup
CTCONFDIR=/etc/pve/nodes/proxmox/lxc
VMCONFDIR=/etc/pve/nodes/proxmox/qemu-server
IFBRIDGE=enp35s0f0
LBRIDGE=vmbr0
TMP_FILE=/tmp/vf_add_maddr.tmp

C_RED='\e[0;31m'
C_GREEN='\e[0;32m'
C_NC='\e[0m'

if [ ! -d $CTCONFDIR ] || [ ! -d $VMCONFDIR ]; then
        # /etc/pve (pmxcfs) is not mounted yet: exit non-zero so a systemd service can restart the script
        echo -e "${C_RED}ERROR: Not mounted, self restart in 5s!${C_NC}"
        exit 1
else
        # collect the MACs of all bridged VM NICs and all CT NICs from the guest configs (lower-cased) ...
        MAC_LIST_VMS=" $(cat ${VMCONFDIR}/*.conf | grep bridge | grep -Eo '([[:xdigit:]]{1,2}[:-]){5}[[:xdigit:]]{1,2}' | tr '[:upper:]' '[:lower:]') $(cat ${CTCONFDIR}/*.conf | grep hwaddr | grep -Eo '([[:xdigit:]]{1,2}[:-]){5}[[:xdigit:]]{1,2}' | tr '[:upper:]' '[:lower:]')"
        # ... plus the MAC address of the Linux bridge itself
        MAC_ADD2LIST="$(cat /sys/class/net/$LBRIDGE/address)"
        MAC_LIST="$MAC_LIST_VMS $MAC_ADD2LIST"
        # read the permanent FDB entries of the PF once into a temp file
        /usr/sbin/bridge fdb show | grep "${IFBRIDGE} self permanent" > $TMP_FILE

        # add every MAC that does not yet have a permanent entry on the PF
        for mactoregister in ${MAC_LIST}; do
                if ( grep -Fq $mactoregister $TMP_FILE ); then
                        echo -e "${C_GREEN}$mactoregister${C_NC} - Exists!"
                else
                        /usr/sbin/bridge fdb add $mactoregister dev ${IFBRIDGE}
                        echo -e "${C_RED}$mactoregister${C_NC} - Added!"
                fi
        done
        exit 0
fi

I've updated the script a bit:
+ The MAC address of the Linux bridge itself was missing. Probably that wasn't necessary earlier.
+ Sped the script up a lot, since it's not necessary to read the bridge FDB on every loop iteration; it is now read once into a temp file.
+ If you use a systemd service or crontab, the script sometimes runs before the pve folders are mounted. There is now an exit code for this, so that a systemd service can restart the script (you can use Type=simple and Restart=on-failure).
+ Simplified: less echo output; instead there are colors now.

If there is enough interest, we could make a proper systemd service out of this. In theory the script could be modified to run as a long-lived service in a constant while loop (sleeping between runs) and react to filesystem events instead of time-based intervals (so that a MAC address is added to the FDB immediately when a VM or an adapter is added), but that would additionally require incrond to be installed.
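For anyone who wants to try the systemd route right away, a minimal sketch of a service plus timer could look like this (the unit names, the script path and the one-minute interval are just examples; RestartSec matches the 5s mentioned in the script):
Code:
# /etc/systemd/system/vf-add-maddr.service
[Unit]
Description=Add guest MAC addresses to the PF forwarding DB
After=pve-cluster.service

[Service]
Type=simple
ExecStart=/usr/local/bin/vf_add_maddr.sh
Restart=on-failure
RestartSec=5

# /etc/systemd/system/vf-add-maddr.timer
[Unit]
Description=Periodically run vf-add-maddr

[Timer]
OnBootSec=1min
OnUnitActiveSec=1min

[Install]
WantedBy=timers.target
Enable it with systemctl daemon-reload && systemctl enable --now vf-add-maddr.timer.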

However, in theory this is a bug, either on Intel's side or on the Linux side, because other SR-IOV-capable NICs don't need this workaround.
So it's probably better to report it at https://bugzilla.kernel.org/; whoever wants to do this, feel free xD

Cheers
 

ctr

Member
Dec 21, 2019
I also created a hookscript with a little more check logic and opened a bug in Bugzilla.
The script has been in use on multiple Proxmox clusters for about a year without any issues.
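For anyone who wants to roll their own, the rough shape of such a hookscript could look like the sketch below. This is not the exact script mentioned above; the PF name, the config paths and the cleanup on post-stop are only meant as illustration:
Bash:
#!/bin/bash
# Sketch of a Proxmox guest hookscript: add the guest's bridge MACs to the PF's
# FDB when the guest starts, remove them again when it stops.
# Proxmox calls the script with two arguments: <vmid> and <phase>.
vmid="$1"
phase="$2"
PF="eno2"    # the PF interface that is attached to the bridge

# collect MAC-looking strings from the guest's config (VM netX= or CT hwaddr= lines)
macs=$(grep -hiEo '([0-9a-f]{2}:){5}[0-9a-f]{2}' \
        "/etc/pve/qemu-server/${vmid}.conf" "/etc/pve/lxc/${vmid}.conf" 2>/dev/null \
        | tr 'A-F' 'a-f' | sort -u)

case "$phase" in
    post-start)
        for mac in $macs; do
            # only add an entry if the PF does not have one for this MAC yet
            bridge fdb show | grep -qi "${mac}.*${PF}" \
                || bridge fdb add "$mac" dev "$PF"
        done
        ;;
    post-stop)
        for mac in $macs; do
            bridge fdb del "$mac" dev "$PF" 2>/dev/null || true
        done
        ;;
esac
exit 0
It would then be attached to a guest with something like qm set <VMID> --hookscript local:snippets/vf-fdb-hook.sh (or pct set for a container), assuming the script is stored on a storage that has the snippets content type enabled.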
 

murky51

New Member
Dec 16, 2020
Christian, thank you for documenting this and creating the hookscript.

I have the same issue with a Mellanox ConnectX-3 NIC, where enabling SR-IOV broke connectivity between containers and VMs on the PF bridge and external hosts on the LAN. VMs then added using VFs had IP connectivity to those external hosts and to the PF IP address, but not to the containers or VMs on the PF bridge.

As you found, the root cause appears to be that these NICs have a simple, statically configured switch between the PF, any VFs, and the physical port when in SR-IOV mode. The MAC addresses of any bridged containers or VMs downstream of the PF (and of the VFs, if bridges are attached to them) are not added to the switch forwarding table when these PF and VF interfaces are configured by the device driver. Perhaps newer NICs have implemented learning switches to avoid this problem?

As you noted in post #7 above, the workaround is a "bridge fdb add MAC-address dev upstream-IF" command (with the MAC address being the MAC of the bridged container or VM, and the dev being the upstream PF or VF interface of the bridge) for each bridged container or VM. This enables communication between the virtual NICs on the PF bridge, any VFs, and LAN-connected hosts.

A warning about this really should be added as a caveat to the "PCI(e) Passthrough" wiki entry for SR-IOV with NICs.

I have modified your hookscript by removing the "VF is on the VLAN" code to restore the functionality of adding the bridged interface MAC addresses to the PF; I am currently unable to test the VLAN functionality. The "present" test was also always failing due to a case mismatch in MAC addresses, which prevented the cleanup from running on post-stop.

This hookscript was tested on a dual-port Mellanox ConnectX-3 at 40GbE and 10GbE, with four VFs per port, a bridge attached to each port, containers and VMs on each bridge, and a VM with a VF. This is on the latest 5.15.17-1-pve kernel and its mlx4 driver, with the NIC flashed to current firmware.

Regards, Michael
 

Attachments

  • pf-bridge-fdb-mp.txt
    3.4 KB · Views: 25
