[SOLVED] communication issue between SRIOV VM VF and CT on PF bridge

ctr · Apr 19, 2020

I'm passing through a VF using SRIOV to a VM. This VM works fine, I can reach it from Proxmox (mgmt interface on vmbr0 of the PF) and other hosts on the network.
However, a CT attached to the very same vmbr0 cannot reach the VM, but also everything else on the network. According to tcpdump, packets from the CT are getting to the VF and the VF is responding, but those response packets are no longer visible on the PF and as a result do not reach the CT.

Does that ring any bells?

Dominic · May 14, 2020

Have you been able to solve your problem already? If not, what version are you running (pveversion -v) and what does your /etc/network/interfaces look like?

ctr · Sep 12, 2020

Yes I still have the issue. Just tried again after some time (and a series of updates in the meantime)

Bash:

# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-6
pve-kernel-helper: 6.2-6
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-1
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-1
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1

nothing special in interfaces config:

Code:

# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto eno2
iface eno2 inet manual

auto eno1
iface eno1 inet manual

auto eno3
iface eno3 inet manual

iface eno4 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.12.12/24
    gateway 192.168.12.1
    bridge-ports eno2
    bridge-stp off
    bridge-fd 0
#LAN Bridge

auto vmbr1
iface vmbr1 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0
#unused

auto vmbr2
iface vmbr2 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0
    mtu 9000
#Firewall to Router Transit

ctr · Sep 12, 2020

Some more info:
Even the ARP reply is not getting through.

When pinging the default gateway (192.168.12.1) from the container (192.168.12.13), I can see:

Code:

11:33:21.000863 ARP, Request who-has 192.168.12.1 tell 192.168.12.13, length 28
11:33:21.000927 ARP, Reply 192.168.12.1 is-at de:ad:be:ef:21:00, length 28

on the gateway, but this ARP reply never arrives on the PF, so the container keeps asking over and over:

Code:

11:33:20.998019 ARP, Request who-has 192.168.12.1 tell 192.168.12.13, length 28
11:33:22.000162 ARP, Request who-has 192.168.12.1 tell 192.168.12.13, length 28
11:33:23.024155 ARP, Request who-has 192.168.12.1 tell 192.168.12.13, length 28

consequently inside the container:

Bash:

# arp -na
? (192.168.12.1) at <incomplete>  on eth0

The same works fine from the host, just not from within the container.
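One way to compare two captures like the ones above is to count, per target IP, the ARP requests that never got a reply. This is only a minimal sketch, assuming tcpdump text output in the format shown in this post; `unanswered_arp` is a hypothetical helper name, not part of tcpdump.

```shell
#!/bin/sh
# unanswered_arp: read tcpdump ARP lines on stdin and print
# "<ip> <request-count>" for every IP that was asked for but never
# answered within the capture. (Hypothetical helper for illustration.)
unanswered_arp() {
    awk '
        /ARP, Request who-has/ {
            # The requested IP is the field right after "who-has".
            for (i = 1; i < NF; i++)
                if ($i == "who-has") requests[$(i + 1)]++
        }
        /ARP, Reply/ {
            # The answering IP is the field right after "Reply".
            for (i = 1; i < NF; i++)
                if ($i == "Reply") replied[$(i + 1)] = 1
        }
        END {
            for (ip in requests)
                if (!(ip in replied)) print ip, requests[ip]
        }
    '
}
```

It could be fed from a live capture, for example `tcpdump -n -i eno2 -c 20 arp | unanswered_arp`, once on the gateway and once on the PF.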

Maybe this combination of bridge, SRIOV PF/VF and veth doesn't play nice, but I really want the SRIOV VF in the VM to be the gateway.

Would it be possible (as a workaround or solution) to use SRIOV for the container as well? LXD supports nictype=sriov but I don't know if Proxmox does...

ctr · Sep 13, 2020

This may not be specific to CTs. I just spun up the first VM that was supposed to go onto the bridge and it faced the same communication issues as the container.

ctr · Sep 17, 2020

Maybe this is not as uncommon as I thought: the Intel DPDK documentation even describes exactly that use case.

The problem I'm having is that VMs from the left side (via VF driver) cannot talk to VMs on the right hand side (via bridge of PF). Both can talk to the PF itself and outside hosts.

ctr · Sep 19, 2020

Again answering to myself...

This seems to be a known issue (or maybe feature?) and there is a workaround available.
I've got the information from
https://bugzilla.redhat.com/show_bug.cgi?id=1067802
and
https://community.intel.com/t5/Ethernet-Products/82599-VF-to-Linux-host-bridge/td-p/351802

a simple:

Bash:

# bridge fdb add DE:AD:BE:EF:00:01 dev eno2

(with the mac in there being the mac of the container or bridged VM and the dev being the PF interface that is attached to the bridge)
Enables the communication between the VF and virtual NICs on the PF bridge.

I'd love to see some automation around this, will look into the possibility of creating a hook-script for this scenario.
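As a starting point for such automation, the add can be made idempotent by checking the PF's forwarding table first. This is a minimal sketch under stated assumptions: the function names are made up for illustration, and the check assumes that `bridge fdb show dev <PF>` prints one entry per line starting with a lower-case MAC and containing the word `self` for entries programmed into the NIC.

```shell
#!/bin/sh
# Hypothetical helpers; the function names are illustrative only.

# normalize_mac MAC: lower-case the address and use ':' as separator.
normalize_mac() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-/:/g'
}

# fdb_has_mac MAC: read `bridge fdb show dev <PF>` output on stdin and
# succeed if MAC already has a "self" entry there.
fdb_has_mac() {
    awk -v mac="$(normalize_mac "$1")" '
        $1 == mac {
            for (i = 2; i <= NF; i++)
                if ($i == "self") found = 1
        }
        END { exit !found }
    '
}

# ensure_fdb_entry MAC PF: add the entry only when it is missing.
ensure_fdb_entry() {
    if bridge fdb show dev "$2" | fdb_has_mac "$1"; then
        echo "$1 already present on $2"
    else
        bridge fdb add "$(normalize_mac "$1")" dev "$2" && echo "$1 added to $2"
    fi
}
```

Run as root, `ensure_fdb_entry DE:AD:BE:EF:00:01 eno2` would then be safe to call repeatedly.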

Dominic · Sep 28, 2020

Thank you for posting the solution!

I'd love to see some automation around this, will look into the possibility of creating a hook-script for this scenario.

You can post an enhancement request at our Bugzilla. Forum threads might be forgotten after a while, but open Bugzilla requests are easy to find. If you script something yourself, it would be great if you could make it available to other users as well.

kriss35 · Nov 6, 2020

Thank you so much, ctr!
This problem was driving me crazy, you saved my week.

I have now written a bash script (extension renamed from sh to txt to attach it to this post) which runs every minute and checks whether all MAC addresses of containers and VMs are already in the forwarding DB of the interface. If they are not, it adds them. STDOUT can be redirected to a log file.

Again, a big thank you.
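For readers without the attachment, the MAC collection step described above can be sketched roughly as follows. This is not kriss35's script; `macs_on_bridge` is a hypothetical helper, and it assumes the usual `netN:` lines in Proxmox guest configs with a `bridge=<name>` option.

```shell
#!/bin/sh
# macs_on_bridge BRIDGE: read Proxmox guest config text on stdin and
# print the lower-cased MAC of every netN line attached to BRIDGE.
# (Hypothetical helper; the attached script may work differently.)
macs_on_bridge() {
    grep -E '^net[0-9]+:' \
        | grep -E "bridge=$1(,|\$)" \
        | grep -Eo '([[:xdigit:]]{2}:){5}[[:xdigit:]]{2}' \
        | tr '[:upper:]' '[:lower:]' \
        | sort -u
}
```

A call such as `cat /etc/pve/qemu-server/*.conf /etc/pve/lxc/*.conf | macs_on_bridge vmbr0` would list the candidates for `bridge fdb add`. The `sort -u` is there because snapshot sections in a config repeat the same `netN:` lines.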

kbumsik · Jan 30, 2021

Wow, thanks sooooo much ctr and kriss35 !!!

I use an ixgbe device (Intel X520-DA2) and had exactly the same problem. You guys saved my day.

And thanks a lot for the script, it works out of the box for me!

Ramalama · Apr 1, 2021

Bash:

#!/usr/bin/bash
#
# vf_add_maddr.sh Version 1.1
# Script is based on kriss35
# Update by Rama: Added the MAC address of the vmbridge itself, simplified, made systemd-service (Restart=on-failure) compatible and sped up with a tmpfile (one readout).
# Usage: execute directly without arguments, make a systemd service or add it to crontab to run every x minutes.
#
CTCONFDIR=/etc/pve/nodes/proxmox/lxc
VMCONFDIR=/etc/pve/nodes/proxmox/qemu-server
IFBRIDGE=enp35s0f0
LBRIDGE=vmbr0
TMP_FILE=/tmp/vf_add_maddr.tmp

C_RED='\e[0;31m'
C_GREEN='\e[0;32m'
C_NC='\e[0m'

# /etc/pve may not be mounted yet right after boot; exit non-zero so a service manager can retry.
if [ ! -d "$CTCONFDIR" ] || [ ! -d "$VMCONFDIR" ]; then
        echo -e "${C_RED}ERROR: Not mounted, self restart in 5s!${C_NC}"
        exit 1
else
        # Collect the MACs of all bridged VM NICs and all container NICs, lower-cased.
        MAC_LIST_VMS=" $(cat "${VMCONFDIR}"/*.conf | grep bridge | grep -Eo '([[:xdigit:]]{1,2}[:-]){5}[[:xdigit:]]{1,2}' | tr '[:upper:]' '[:lower:]') $(cat "${CTCONFDIR}"/*.conf | grep hwaddr | grep -Eo '([[:xdigit:]]{1,2}[:-]){5}[[:xdigit:]]{1,2}' | tr '[:upper:]' '[:lower:]')"
        # Add the MAC of the Linux bridge itself.
        MAC_ADD2LIST="$(cat "/sys/class/net/$LBRIDGE/address")"
        MAC_LIST="$MAC_LIST_VMS $MAC_ADD2LIST"
        # Read the forwarding table of the PF once instead of once per MAC.
        /usr/sbin/bridge fdb show | grep "${IFBRIDGE} self permanent" > "$TMP_FILE"

        # MAC_LIST is intentionally unquoted so it splits into one MAC per iteration.
        for mactoregister in ${MAC_LIST}; do
                if grep -Fq "$mactoregister" "$TMP_FILE"; then
                        echo -e "${C_GREEN}$mactoregister${C_NC} - Exists!"
                else
                        /usr/sbin/bridge fdb add "$mactoregister" dev "${IFBRIDGE}"
                        echo -e "${C_RED}$mactoregister${C_NC} - Added!"
                fi
        done
        exit 0
fi

I've updated the script a bit:
+ The MAC address of the Linux bridge itself was missing. Probably that wasn't necessary earlier.
+ Sped the script up a lot, because it's not necessary to read the bridge table in a while loop. It is now read once into a temp file.
+ If you use a systemd service or crontab, the script sometimes runs before the PVE folders are mounted. There is an exit code for this now, so that a systemd service can restart the script. (You can use Type=simple & Restart=on-failure)
+ Simplified: less echo output, instead there are colors now.

If there is enough interest we could make a proper systemd service of this. In theory the script could be modified to run as a proper service in a constant (sleeping) while loop and react to filesystem events instead of time-based intervals (to instantly add a MAC address to the fdb when a VM is added or an adapter gets added to a VM), but this would additionally require incrond to be installed.
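The Type=simple & Restart=on-failure idea mentioned above could look roughly like the unit printed by the sketch below. The script path, the unit name, `RestartSec=5` (taken from the "self restart in 5s" message) and ordering after `pve-cluster.service` are all assumptions for illustration, not a tested service.

```shell
#!/bin/sh
# print_unit SCRIPT_PATH: write a unit file for the script to stdout.
# Path, description and RestartSec=5 are assumptions for this sketch.
print_unit() {
    cat <<EOF
[Unit]
Description=Add bridged guest MACs to the PF forwarding table
After=pve-cluster.service

[Service]
Type=simple
ExecStart=$1
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}
```

For example, `print_unit /usr/local/bin/vf_add_maddr.sh > /etc/systemd/system/vf-add-maddr.service`, followed by `systemctl daemon-reload` and `systemctl enable --now vf-add-maddr.service`. Since the script exits 0 after one pass, this only retries until the first successful run; periodic runs would still need a timer or cron.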

However, in theory this is a bug, either on Intel's side or on the Linux side, because every other SR-IOV capable NIC doesn't need this workaround.
So it's probably better to report it at https://bugzilla.kernel.org/. Whoever wants to do this, feel free xD

Cheers

ctr · Dec 7, 2021

I also created a hookscript with a little more check logic and opened a bug in Bugzilla.
The script has been in use on multiple Proxmox clusters for about a year without any issues.

murky51 · Feb 3, 2022

Christian, thank you for documenting this and creating the hookscript.

I have the same issue with a Mellanox ConnectX-3 NIC where enabling SR-IOV broke connectivity between containers and VMs on the PF bridge and external hosts on the LAN. VMs then added using VFs had IP connectivity to these external hosts and the PF IP address, but not to the PF connected bridged containers or VMs.

As you found, the root cause appears to be that these NICs have a simple statically configured switch between the PF, any VFs and the physical port when in SR-IOV mode. The MAC addresses of any bridged containers or VMs downstream of the PF (and VFs if bridges are attached to them) are not added to the switch forwarding table when these PF and VF interfaces are configured by the device driver. Perhaps newer NICs have now implemented learning switches to avoid this problem?

As you noted in post #7 above, the workaround is a “# bridge fdb add MAC-address dev upstream-IF” command (with the MAC address being the MAC of the bridged container or VM and the dev being the upstream PF or VF interface of the bridge), for each bridged container or VM. This enables communication between the virtual NICs on the PF bridge, any VFs and LAN connected hosts.

A warning about this really should be added as a caveat to the “PCI(e) Passthrough” wiki entry for SR-IOV for NICs.

I have modified your hookscript by removing the “vf is on the vlan” code to restore the functionality of adding bridged interface MAC addresses to the PF. I am currently unable to test the vlan functionality. The “present” test was also always failing due to a case mismatch in MAC addresses, which prevented the cleanup from functioning on post-stop.

This hookscript was tested on a dual port Mellanox ConnectX-3 at 40GbE and 10GbE, four VFs per port, with a bridge attached to each port, containers and VMs on each bridge and a VM with a VF. This is on the latest 5.15.17-1-pve kernel and its mlx4 driver with the NIC flashed with current firmware.
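The two points above (add on start, clean up on post-stop, and compare MACs without regard to case) can be sketched as below. This is not the attached hookscript: the helper names and the default `PF=eno2` are assumptions, and adding at post-start rather than pre-start is a choice made for this sketch. What is relied on is that Proxmox calls a hookscript with the guest ID and a phase (pre-start, post-start, pre-stop, post-stop).

```shell
#!/bin/sh
# Sketch of the decision logic for a guest hookscript, which Proxmox
# calls as: <script> <vmid> <phase>
# PF and all helper names are assumptions, not the attached script.
PF=${PF:-eno2}

# fdb_action PHASE: print the bridge fdb verb for a phase, or nothing.
fdb_action() {
    case "$1" in
        post-start) echo add ;;
        post-stop)  echo del ;;
    esac
}

# same_mac A B: compare two MAC addresses case-insensitively.
same_mac() {
    [ "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" = \
      "$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')" ]
}

# fdb_commands PHASE MAC...: print one command per MAC for review;
# phases without an action print nothing.
fdb_commands() {
    action=$(fdb_action "$1")
    [ -n "$action" ] || return 0
    shift
    for mac in "$@"; do
        echo "bridge fdb $action $mac dev $PF"
    done
}
```

Printing the commands instead of running them keeps the sketch side-effect free; a real hookscript would execute them and should use something like `same_mac` when checking whether an entry is already present.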

Regards, Michael

Sjoerdos92 · Jul 21, 2022

@murky51 Thank you so much! This was driving me crazy. I can confirm the hookscript works for Mellanox ConnectX-4 Lx as well.

Lefuneste · Jun 5, 2023

Hi there

@murky51, I believe that your script is not working anymore.

I am no expert in bash, but from trying to understand the flow of your code I believe that there has been some change in the following path:

/sys/class/net/${bridge}/brif/

This path does not exist anymore on the current (7.4-3) version of Proxmox; instead it seems to be:

/sys/class/net/${bridge}/subsystem/

I can confirm that changing the path seems to restore the functionality of your nice script.
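A lookup that does not depend on a single sysfs path could be sketched as follows. It is an untested assumption that this fits the hookscript: `bridge_ports` is a hypothetical helper that uses `brif/` when present and otherwise falls back to the `master` symlink that each enslaved interface carries.

```shell
#!/bin/sh
# bridge_ports BRIDGE [SYSROOT]: print the member interfaces of BRIDGE.
# SYSROOT defaults to /sys/class/net; it is a parameter only so the
# function can be tried against a scratch directory. Hypothetical helper.
bridge_ports() (
    bridge=$1
    sysroot=${2:-/sys/class/net}
    if [ -d "$sysroot/$bridge/brif" ]; then
        # Linux bridge: one entry per port under brif/.
        ls "$sysroot/$bridge/brif"
    else
        # Fallback: look for interfaces whose "master" link points at BRIDGE.
        for link in "$sysroot"/*/master; do
            [ -L "$link" ] || continue
            if [ "$(basename "$(readlink "$link")")" = "$bridge" ]; then
                basename "$(dirname "$link")"
            fi
        done
    fi
)
```

On a host, `bridge_ports vmbr0` would print one port name per line.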

Thanks !

Ramalama · Jun 19, 2023

Hey everyone,

It's been over 2 years and sadly this is still an issue.
I have been using my own script for 2 years; however, I always had some trouble with ARP and multicast. Not big issues, but issues.

Let me explain:
I have a bridge for containers and VMs, but a VF for OPNsense; the VF simply has a lot more performance here.
Maybe a VF isn't important for some 1GbE networks, but here it's all 10GbE and I have 5 VLANs to route between.
OPNsense is the router for all networks here.

However, the cool thing is that I need only one 10GbE port for the server itself, because the vmbridge sits on the physical interface and a Virtual Function of that physical interface is assigned to OPNsense. Actually a nice solution, but I abandoned this setup because of the weird behaviour I always had with OPNsense.

So after digging and testing around I want to share some "things" I found.
Maybe that's interesting for one or another.

1. Avoid using bridge-vids 2-4094; instead use exactly the VLANs you need! Example: bridge-vids 20 23-28
-> The issue is that the fdb table simply gets flooded with all the VLANs you never need. Check for yourself with bridge fdb show
-> I don't know if it's actually a performance concern, but I think that a smaller fdb table is at least better.

2. /usr/bin/ip link set enp35s0f0 vf 0 mac d0:20:55:db:fb:20 vlan 0 trust on spoofchk off
-> 1. vlan 0 = vlan filtering off
-> 2. if you see "spoofed packets detected" in your logs:
-----> either disable spoofchk
-----> or set the proper vlan for your vm, example: ip link set enp35s0f0 vf 0 vlan 25
-----> or even set vlan 0, that will fix it too, I don't know why...
-----> I simply use "vlan 0" and "spoofchk off", that worked best for me. However, if you disable vlan filtering with "vlan 0" and you're running OPNsense, you need to disable hardware vlan filtering in OPNsense as well. So for OPNsense I recommend not setting the vlan option at all, just spoofchk off. Then hardware vlan filtering works. However, it would be nicer to know how to add trunk vlans to the VF interface.
---------> I tried ip link set enp35s0f0 vf 1 vlan 25 proto 802.1Q trunk 23-28, but that doesn't work. But you can pass through multiple VFs with individual vlans to OPNsense, instead of one VF and creating vlans inside OPNsense (with hw vlan filtering)
-> 3. Windows VMs (at least W11) require trust on; the Windows VM always changes the MAC address to something random, so you need the trust mode.

3. What I've switched to & issues I had:
-> 1. I was never able to get WSS working for my Samba shares if the Samba server is in a different VLAN than the Windows client. (Multicast)
-> 2. Jellyfin/Plex UPnP never worked between VLANs. Same as above (Multicast)
-> 3. OPNsense HA never worked reliably (ARP)
----- Those were actually the only issues I had. (At least I thought so before switching)
-> My X550 has 2x 10GbE ports, so 2 PFs.
-> I now use one PF directly for vmbr0 and pass the other PF through directly to OPNsense.
--> This has the downside that I need two 10GbE ports connected to the switch.
--> Using one PF for vmbr and creating as many VFs as you want on the second PF works perfectly well too!
-> That way everything starts to work, I even see my printers and gadgets and whatever else between VLANs.

4. A weird workaround, which works without needing to edit the fdb table:
-> Set the PF to auto: iface enp35s0f0 inet auto, don't attach vmbr to it!
--> add iface enp35s0f0v0 inet manual and use it for your vmbr!
-> For whatever reason, if you put the vmbr on the Virtual Function instead of the PF, everything can communicate with each other.
-> However, you need to unbind ixgbevf and attach vfio-pci to every function that you want to pass through. See below how to do that:

5. How to pass through only one of 2 identical devices, in case of 2 Physical Functions or multiple Virtual Functions:
-> 1. Check the bus of the VF or PF that you want to pass through with: lspci -nnk | grep -A4 Eth
-> In my case it's a PF with the bus ID 0000:23:00.1 and device ID "[8086:1563]"
-> So the service would look as follows:
-> /etc/systemd/system/sriov-Blacklist.service

Code:

[Unit]
Description=Script to Blacklist one X550 Port on boot
Before=network-pre.target
After=sysinit.target local-fs.target

[Service]
Type=oneshot
# Blacklisting Passthrough Port
ExecStart=/usr/bin/sriov-blacklist.sh

[Install]
WantedBy=multi-user.target

-> /usr/bin/sriov-blacklist.sh

Code:

#!/usr/bin/sh

/usr/bin/echo "0000:23:00.1" > "/sys/bus/pci/devices/0000:23:00.1/driver/unbind"
sleep 0.1
/usr/bin/echo "8086 1563" > "/sys/bus/pci/drivers/vfio-pci/new_id"

-> In case of VFs you can add as many "unbinds" as you want, just as multiple lines before the "sleep 0.1"
-> vfio-pci will automatically bind every unbound device via its device ID, so you don't need that line multiple times, because all your VFs have the same device ID anyway.

6. This whole Linux bridge fdb issue has nothing to do with Proxmox; this happens even with ESXi or any other Linux distribution. I don't actually know why this can't be fixed somehow.
It seems like enterprises never mix VFs and Linux bridges, they use either one or the other, dunno.
However, it's sad that we can't have a better workaround.
And lastly, this isn't even an Intel issue, it actually happens with every SR-IOV capable card (Mellanox etc.) too. So it actually needs some kind of kernel implementation.
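Regarding point 5: because `new_id` makes vfio-pci claim every unbound function with that device ID, an alternative is to select a single function by its PCI address through the kernel's per-device `driver_override` attribute. The sketch below is an assumption-laden illustration, not a replacement tested on this hardware: `bind_to_vfio` is a hypothetical helper and it expects the vfio-pci module to be loaded already.

```shell
#!/bin/sh
# bind_to_vfio BDF [SYSROOT]: hand one PCI function to vfio-pci by address.
# Uses the per-device driver_override attribute instead of new_id, so
# other functions with the same device ID are left alone.
# SYSROOT defaults to /sys and exists only to allow a dry run against a
# scratch directory. Hypothetical helper; assumes vfio-pci is loaded.
bind_to_vfio() (
    bdf=$1
    sysroot=${2:-/sys}
    dev="$sysroot/bus/pci/devices/$bdf"
    [ -d "$dev" ] || { echo "no such device: $bdf" >&2; exit 1; }

    # 1. Only vfio-pci may claim this function from now on.
    printf '%s\n' vfio-pci > "$dev/driver_override"
    # 2. Detach the current driver, if one is bound.
    if [ -e "$dev/driver/unbind" ]; then
        printf '%s\n' "$bdf" > "$dev/driver/unbind"
    fi
    # 3. Ask the PCI core to probe the function again.
    printf '%s\n' "$bdf" > "$sysroot/bus/pci/drivers_probe"
)
```

Run as root, `bind_to_vfio 0000:23:00.1` would take the place of the unbind and new_id lines for that one port.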

Cheers

kcjd · Sep 6, 2023

I ran into a further complication because I was using bonding. I updated the hookscript provided by murky51 to also consider the slaved interfaces of a bond. I went ahead and threw it in a repository so I don't lose it. If anyone else needs to modify or expand, feel free to add a PR.

https://github.com/jdlayman/pve-hookscript-sriov
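For illustration only (this is not code from the repository above), resolving a bridge port down to its physical interfaces can be sketched with the `bonding/slaves` file that the bonding driver exposes per bond. `physical_ports` is a hypothetical helper; it recurses, so a bond that is itself enslaved to another bond is followed as well.

```shell
#!/bin/sh
# physical_ports IFACE [SYSROOT]: print the non-bond interfaces behind
# IFACE, following bonds recursively. SYSROOT defaults to /sys/class/net
# and is a parameter only so the function can be tried on a scratch
# directory. Hypothetical helper; runs in a subshell to keep variables local.
physical_ports() (
    iface=$1
    sysroot=${2:-/sys/class/net}
    slaves_file="$sysroot/$iface/bonding/slaves"
    if [ -r "$slaves_file" ]; then
        # A bond: the file holds a space-separated list of members.
        for slave in $(cat "$slaves_file"); do
            physical_ports "$slave" "$sysroot"
        done
    else
        # Not a bond: treat it as the interface to program.
        printf '%s\n' "$iface"
    fi
)
```

Each printed interface would then be a candidate `dev` for `bridge fdb add`.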

Thanks,
JD

wrobelda · Feb 27, 2024

Well, interestingly, I have the exact same issue not only with LXC containers, but also with Windows and Darwin VMs! I was pulling my hair out until it struck me to try adding their MACs to the forwarding database.

Note this is with a Mellanox ConnectX-4 Lx.

Fazio8 · May 7, 2024

@kcjd Thank you, this fixes the issue!

Quasmo · Saturday at 23:24

I have a pull request in for kcjd's script. Unfortunately I am a masochist and have a failover bond backed up by another bond, and therefore need the script to look through multiple bonds to find the physical interface.

Also, the new script takes into account the difference in how the NVIDIA and Intel drivers handle the physical function identification.
