[SOLVED] communication issue between SRIOV VM VF and CT on PF bridge

ctr

New Member
Dec 21, 2019
I'm passing through a VF using SRIOV to a VM. This VM works fine, I can reach it from Proxmox (mgmt interface on vmbr0 of the PF) and other hosts on the network.
However, a CT attached to the very same vmbr0 cannot reach the VM, nor in turn anything else on the network. According to tcpdump, packets from the CT are getting to the VF and the VF is responding, but those response packets are then no longer visible on the PF and as a result never reach the CT.

Does that ring any bells?
 

Dominic

Proxmox Staff Member
Staff member
Mar 18, 2019
Have you been able to solve your problem already? If not, which version are you running (pveversion -v), and what does your /etc/network/interfaces look like?
 

ctr

New Member
Dec 21, 2019
I still have the issue, yes. I just tried again after some time (and a series of updates in the meantime).
Bash:
# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-6
pve-kernel-helper: 6.2-6
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-1
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-1
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1

Nothing special in the interfaces config:
Code:
# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto eno2
iface eno2 inet manual

auto eno1
iface eno1 inet manual

auto eno3
iface eno3 inet manual

iface eno4 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.12.12/24
    gateway 192.168.12.1
    bridge-ports eno2
    bridge-stp off
    bridge-fd 0
#LAN Bridge

auto vmbr1
iface vmbr1 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0
#unused

auto vmbr2
iface vmbr2 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0
    mtu 9000
#Firewall to Router Transit
 

ctr

New Member
Dec 21, 2019
Some more info:
Even the ARP reply is not getting through.

When I ping the default gateway (192.168.12.1) from the container (192.168.12.13), I can see:
Code:
11:33:21.000863 ARP, Request who-has 192.168.12.1 tell 192.168.12.13, length 28
11:33:21.000927 ARP, Reply 192.168.12.1 is-at de:ad:be:ef:21:00, length 28
on the gateway, but this ARP reply never arrives on the PF; the container just keeps asking over and over:
Code:
11:33:20.998019 ARP, Request who-has 192.168.12.1 tell 192.168.12.13, length 28
11:33:22.000162 ARP, Request who-has 192.168.12.1 tell 192.168.12.13, length 28
11:33:23.024155 ARP, Request who-has 192.168.12.1 tell 192.168.12.13, length 28
Consequently, inside the container:
Bash:
# arp -na
? (192.168.12.1) at <incomplete>  on eth0

The same works fine from the host, just not from within the container.
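
For reference, captures like the ones above can be taken on the PF with something along these lines (the gateway side is the same idea with its own interface name; the exact invocation is just what I would use here, not something special):
Bash:
# tcpdump -ni eno2 arp and host 192.168.12.13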

Maybe this combination of bridge, SR-IOV PF/VF, and veth doesn't play nicely, but I really want the SR-IOV VF in the VM to be the gateway.

Would it be possible (as a workaround or solution) to use SR-IOV for the container as well? LXD supports nictype=sriov, but I don't know whether Proxmox does...
 

ctr

New Member
Dec 21, 2019
This may not be specific to CTs. I just spun up the first VM that was supposed to go onto the bridge, and it faces the same communication issues as the container.
 

ctr

New Member
Dec 21, 2019
Maybe this is not as uncommon as I thought it would be; the Intel DPDK documentation even describes exactly this use case.

The problem I'm having is that VMs on the left side (attached via the VF driver) cannot talk to VMs on the right-hand side (attached via the bridge on the PF). Both can talk to the PF itself and to outside hosts.
[Image: single_port_nic.png, the single-port SR-IOV NIC diagram from the Intel DPDK documentation]
 

ctr

New Member
Dec 21, 2019
Answering myself again...

This seems to be a known issue (or maybe a feature?) and there is a workaround available.
I got the information from
https://bugzilla.redhat.com/show_bug.cgi?id=1067802
and
https://community.intel.com/t5/Ethernet-Products/82599-VF-to-Linux-host-bridge/td-p/351802

a simple:
Bash:
# bridge fdb add DE:AD:BE:EF:00:01 dev eno2
(with the MAC in there being the MAC address of the container or bridged VM, and the dev being the PF interface that is attached to the bridge)
enables the communication between the VF and the virtual NICs on the PF bridge.
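
Whether the entry actually landed in the PF's forwarding table can then be checked with something like:
Bash:
# bridge fdb show dev eno2 | grep -i de:ad:be:ef:00:01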

I'd love to see some automation around this; I will look into the possibility of creating a hook script for this scenario.
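
In case it helps anyone, here is a rough, untested sketch of what such a hookscript could look like; the PF name eno2, the file name and the use of the post-start phase are my assumptions. It would be stored as a snippet and attached with qm set <vmid> --hookscript local:snippets/fdb-hook.sh (or pct set for a container); Proxmox then calls it with the guest ID and the phase:
Bash:
#!/bin/bash
# Untested sketch of a Proxmox guest hookscript, called as: <script> <vmid> <phase>
vmid="$1"
phase="$2"
PF=eno2   # assumption: the PF that is enslaved to vmbr0

if [ "$phase" = "post-start" ]; then
    # this node's VM and CT configs are available under /etc/pve
    for conf in /etc/pve/qemu-server/"$vmid".conf /etc/pve/lxc/"$vmid".conf; do
        [ -f "$conf" ] || continue
        # pick the MAC out of every netX line and register it on the PF
        grep -E '^net[0-9]+:' "$conf" \
            | grep -Eo '([[:xdigit:]]{2}:){5}[[:xdigit:]]{2}' \
            | while read -r mac; do
                bridge fdb add "$mac" dev "$PF" 2>/dev/null || true
              done
    done
fi
exit 0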
 

Dominic

Proxmox Staff Member
Staff member
Mar 18, 2019
Thank you for posting the solution!

I'd love to see some automation around this; I will look into the possibility of creating a hook script for this scenario.
You can post an enhancement request at our Bugzilla. Forum threads might be forgotten after a while, but open Bugzilla requests are easy to find. If you script something yourself, it would be great if you could make it available to other users as well :)
 

kriss35

Member
Aug 5, 2015

Thank you so much, ctr!
This problem drove me crazy; you saved my week :)

I have now written a Bash script (extension renamed from sh to txt so it can be attached to this post) which runs every minute and checks whether all MAC addresses of the containers and VMs are already in the forwarding DB of the interface; if they are not, it adds them. The STDOUT can be redirected into a log file.
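
For example, a crontab line along these lines (the paths are just placeholders) runs it every minute and redirects the output into a log file:
Code:
* * * * * /usr/local/bin/mac_register.sh >> /var/log/mac_register.log 2>&1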

Again, a big thank you!
 

Attachments

  • mac_register.txt
    1.3 KB · Views: 23

kbumsik

New Member
Nov 1, 2020
Wow, thanks so much ctr and kriss35!

I use an ixgbe device (Intel X520-DA2) and had exactly the same problem. You guys saved my day :)

And thanks a lot for the script, it works out of the box for me!
 

Ramalama

Active Member
Dec 26, 2020
Bash:
#!/usr/bin/bash
#
# vf_add_maddr.sh Version 1.1
# Script is based on kriss35's version.
# Update by Rama: added the MAC address of the bridge itself, simplified,
# made it systemd-service (Restart=on-failure) compatible, and sped it up
# with a temp file (one readout of the FDB).
# Usage: execute directly without arguments, create a systemd service, or
# add it to crontab to run every x minutes.
#
CTCONFDIR=/etc/pve/nodes/proxmox/lxc            # CT configs of this node
VMCONFDIR=/etc/pve/nodes/proxmox/qemu-server    # VM configs of this node
IFBRIDGE=enp35s0f0                              # PF interface attached to the bridge
LBRIDGE=vmbr0                                   # the Linux bridge itself
TMP_FILE=/tmp/vf_add_maddr.tmp

C_RED='\e[0;31m'
C_GREEN='\e[0;32m'
C_NC='\e[0m'

if [ ! -d "$CTCONFDIR" ] || [ ! -d "$VMCONFDIR" ]; then
        # /etc/pve is not mounted yet; exit non-zero so a systemd service can restart us
        echo -e "${C_RED}ERROR: Not mounted, self restart in 5s!${C_NC}"
        exit 1
else
        # collect the MAC addresses of all bridged VM NICs and all CT NICs (lowercased)
        MAC_LIST_VMS=" $(cat ${VMCONFDIR}/*.conf | grep bridge | grep -Eo '([[:xdigit:]]{1,2}[:-]){5}[[:xdigit:]]{1,2}' | tr '[:upper:]' '[:lower:]') $(cat ${CTCONFDIR}/*.conf | grep hwaddr | grep -Eo '([[:xdigit:]]{1,2}[:-]){5}[[:xdigit:]]{1,2}' | tr '[:upper:]' '[:lower:]')"
        # also include the MAC address of the bridge itself
        MAC_ADD2LIST="$(cat /sys/class/net/$LBRIDGE/address)"
        MAC_LIST="$MAC_LIST_VMS $MAC_ADD2LIST"
        # read the PF's permanent FDB entries once into a temp file
        /usr/sbin/bridge fdb show | grep "${IFBRIDGE} self permanent" > $TMP_FILE

        for mactoregister in ${MAC_LIST}; do
                if ( grep -Fq $mactoregister $TMP_FILE ); then
                        echo -e "${C_GREEN}$mactoregister${C_NC} - Exists!"
                else
                        # not in the FDB yet -> register it on the PF
                        /usr/sbin/bridge fdb add $mactoregister dev ${IFBRIDGE}
                        echo -e "${C_RED}$mactoregister${C_NC} - Added!"
                fi
        done
        exit 0
fi

I've updated the script a bit:
+ The MAC address of the Linux bridge itself was missing. Probably that wasn't necessary earlier.
+ Sped the script up a lot, because it isn't necessary to read the bridge table inside the loop; it is now read once into a temp file.
+ If you use a systemd service or crontab, the script sometimes runs before the pve folders are mounted. There is an exit code for this now, so that a systemd service can restart the script. (You can use Type=simple & Restart=on-failure.)
+ Simplified: less echo output; there are colors now instead.

If there is enough interest, we could make a proper systemd service out of this. In theory the script could be modified to run as a long-lived service in a constant while loop (sleeping between runs) and react to filesystem events instead of time-based intervals (so a MAC address is added to the FDB instantly when a VM or an adapter is added to a VM), but this would additionally require incrond to be installed.
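
As a sketch of the simple time-based variant (the unit names, the script path and the interval are just placeholders I picked), a matching service plus timer pair could look like this:
Code:
# /etc/systemd/system/vf-add-maddr.service
[Unit]
Description=Add VM/CT MAC addresses to the PF forwarding DB
After=pve-cluster.service

[Service]
Type=simple
ExecStart=/usr/local/bin/vf_add_maddr.sh
Restart=on-failure
RestartSec=5

# /etc/systemd/system/vf-add-maddr.timer
[Unit]
Description=Periodically refresh the PF FDB entries

[Timer]
OnBootSec=1min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target

Enabled with systemctl enable --now vf-add-maddr.timer; the Restart=on-failure part then covers the case where /etc/pve is not mounted yet and the script exits with 1.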

However, in theory this is a bug, either on Intel's side or on the Linux side, because every other SR-IOV capable NIC doesn't need this workaround.
So it's probably better to report it at https://bugzilla.kernel.org/; whoever wants to do this, feel free xD

Cheers
 
