SDN broken after using pve-network-interface-pinning tool

Feb 6, 2025
After successfully using the PBS version of the NIC pinning tool on a couple of servers, I tried a PVE 8.4.x server tonight. It seemed fine for almost everything, except SDN was broken afterwards, showing an error state on the zone (ZoneVLAN). I found that /etc/network/interfaces.d/sdn still referenced the old NIC name, eno1:

Code:
iface vmbr0v198
        bridge_ports  eno1.198 pr_vnetC

...thus it fails to load. Manually changing "eno1" to the new name ("nic0" in this case) doesn't seem to help. If I Apply the SDN config the invalid entry is removed but not replaced:

Code:
iface vmbr0v198
        bridge_ports  pr_vnetC

(Edit: same if I remove the zone from this server and re-add it)

In this configuration the zone shows as "available"; however, it doesn't pass traffic to/from VMs on that node that use that VLAN. All other VMs, which don't use that VLAN, are fine.

What is necessary to reconnect the SDN zone on this node, aside from reverting the pinned NIC names?
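(For reference, any leftover references to the old name are easy to spot with a plain grep, nothing PVE-specific; "eno1" is the old name in my case:)

Code:
grep -rn eno1 /etc/network/interfaces /etc/network/interfaces.d/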

Of the two "sdn" files the docs mention the tool will update:
  • /etc/pve/sdn/controllers.cfg - size 0
  • /etc/pve/sdn/fabrics.cfg - does not exist


Side note: per that doc page, "It is recommended to assign a name starting with en or eth so that Proxmox VE recognizes the interface as a physical network device which can then be configured via the GUI"; however, the tool itself does not do this.
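
If one wanted to follow that recommendation anyway, a manual pin via a plain systemd .link file with an en-prefixed name should do it, as far as I understand the mechanism (the path, filename, MAC and name below are just placeholders):

Code:
# /etc/systemd/network/10-enwan0.link  (illustrative)
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=enwan0

# I believe the initramfs also needs refreshing before the reboot:
#   update-initramfs -u -k all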
 
I use pinning with SDN too, but that's on PVE 9.1, and the pinning was done before the SDN was ever configured.
 
Can you post the contents of the following files?

Code:
cat /etc/pve/sdn/zones.cfg
cat /etc/pve/sdn/vnets.cfg

cat /etc/network/interfaces
cat /etc/network/interfaces.d/sdn

+ your current network config:

Code:
ip a

Looks like you have a non-VLAN-aware bridge, from what I can tell. There's possibly a problem with detecting those when generating the SDN interfaces file; I'll try to reproduce it.
 
Hi @shanreich, thanks, see below. Notably, VLAN 198 had been working for 6 months before the network pinning tool was attempted.

Code:
/etc/pve/sdn/zones.cfg

vlan: ZoneVLAN
        bridge vmbr0
        ipam pve

Code:
/etc/pve/sdn/vnets.cfg

vnet: vnetC
        zone ZoneVLAN
        alias NameHere
        tag 198

Code:
/etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface nic0 inet manual

auto nic1
iface nic1 inet static
        address ****/24
        mtu 9000
#storage net

auto vmbr0
iface vmbr0 inet manual
        bridge-ports nic0
        bridge-stp off
        bridge-fd 0
#front net for VMs

auto vlan222
iface vlan222 inet static
        address ***/24
        gateway ***.1
        vlan-raw-device nic0
#management net

auto vlan1231
iface vlan1231 inet static
        address ***/24
        vlan-raw-device nic0
#corosync front net

auto vlan1232
iface vlan1232 inet static
        address ***/24
        vlan-raw-device nic1
#corosync storage net

source /etc/network/interfaces.d/*

Code:
/etc/network/interfaces.d/sdn

#version:7

auto ln_vnetC
iface ln_vnetC
        link-type veth
        veth-peer-name pr_vnetC

auto pr_vnetC
iface pr_vnetC
        link-type veth
        veth-peer-name ln_vnetC

auto vmbr0v198
iface vmbr0v198
        bridge_ports  pr_vnetC
        bridge_stp off
        bridge_fd 0

auto vnetC
iface vnetC
        bridge_ports ln_vnetC
        bridge_stp off
        bridge_fd 0
        alias NameHere

Code:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: nic0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
    link/ether 00:25:90:fa:5c:ca brd ff:ff:ff:ff:ff:ff
    altname enp3s0f0
    altname ens15f0
(...)


Also of note: as I mentioned, /etc/network/interfaces.d/sdn on the other nodes is different, with the tagged physical interface included:
Code:
#version:7

auto ln_vnetC
iface ln_vnetC
        link-type veth
        veth-peer-name pr_vnetC

auto pr_vnetC
iface pr_vnetC
        link-type veth
        veth-peer-name ln_vnetC

auto vmbr0v198
iface vmbr0v198
        bridge_ports  ens2f1.198 pr_vnetC
        bridge_stp off
        bridge_fd 0

auto vnetC
iface vnetC
        bridge_ports ln_vnetC
        bridge_stp off
        bridge_fd 0
        alias NameHere


Edit:
# pveversion
pve-manager/8.4.14/b502d23c55afcba1 (running kernel: 6.8.12-15-pve)
 
Cool, thanks for the review. Sounds related to my "side note" above about the docs saying to only use "en or eth". :) I would prefer to go forward with pinning before updating to 9.x.
 
As a workaround, the issue should be fixed if you use a vlan-aware bridge as the underlying bridge of the VLAN zone and reapply the SDN config - since there we do not rely on generating a config with the physical port.
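
Roughly like this, based on your posted config (a sketch; adjust bridge-vids to the VLAN IDs you actually need):

Code:
auto vmbr0
iface vmbr0 inet manual
        bridge-ports nic0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
#front net for VMs

Then reload the network config (ifreload -a, or Apply Configuration in the GUI) and re-apply the SDN config.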
 
After making vmbr0 VLAN-aware on that first node, I applied the network config, then applied the SDN config, and the problem VLAN worked on that node again. The /etc/network/interfaces.d/sdn file was simplified to:

Code:
#version:8

auto vnetC
iface vnetC
        bridge_ports vmbr0.198
        bridge_stp off
        bridge_fd 0
        alias NameHere

I did not reboot. I repeated that on each of the other nodes (VLAN-aware, SDN). After the last one, with everything working, I thought I'd proceed with the pinning.
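
(In case it helps anyone else, the result can be sanity-checked with standard iproute2 tools, nothing PVE-specific:)

Code:
# VLAN membership of the now VLAN-aware bridge and its ports
bridge vlan show
# the SDN vnet bridge should exist and be up
ip -d link show vnetC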

Since it is late and my reading comprehension was apparently impaired, on the first server I made the mistake of clicking the now-enabled Apply Configuration button, still visible on the network settings page I had open, instead of just rebooting as directed. That immediately disconnected the network on this server. So maybe PVE should prevent clicking that button after running the pinning tool, or show a popup telling the user to reboot instead.

Then I found a new problem: after I pinned and restarted each of the other nodes, the default route was missing until I applied the existing network config again. In that problem state, all local networking (cluster, Ceph) was working fine, so it was just the missing route. Not sure why this was not a problem on the first node, except that on the first one I pinned and then fixed the SDN, while on all the others I made the SDN change first and then pinned...? The default route on all servers is on a VLAN, used as the PVE management network.
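
For anyone hitting the same state, checking and recovering from a shell looks roughly like this (the gateway is redacted/placeholder; re-applying the existing config is what restored the route here, and as far as I know that amounts to an ifreload):

Code:
# check whether the default route survived the reboot
ip route show default
# re-apply the network configuration
ifreload -a
# or, as a stopgap, add the route by hand (gateway redacted as above)
ip route add default via ***.1 dev vlan222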
 
With the benefit of sleep, I moved the management VLAN's "raw device" from nic0 to vmbr0, which solved the missing-gateway issue. The change is apparently not necessary for local traffic; e.g. a corosync link VLAN is still functional on nic* on all servers, and Ceph and PVE GUI proxying were fine last night.

And that "no gateway" configuration issue/bug/error/whatever did affect all servers, once I rebooted the first one, which makes more sense.
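
For completeness, the management VLAN stanza now looks roughly like this; the only change from the config I posted above is the raw device:

Code:
auto vlan222
iface vlan222 inet static
        address ***/24
        gateway ***.1
        vlan-raw-device vmbr0
#management net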
 
Since it is late and my reading comprehension was apparently impaired, on the first server I made the mistake of clicking the now-enabled Apply Configuration button, still visible on the network settings page I had open, instead of just rebooting as directed. That immediately disconnected the network on this server. So maybe PVE should prevent clicking that button after running the pinning tool, or show a popup telling the user to reboot instead.
Yes, that will apply the newly-generated network configuration immediately - but not the pinning, leading to connectivity loss :/
Maybe we could touch a tempfile somewhere and check for its existence to catch this use-case.


With the benefit of sleep, I moved the management VLAN's "raw device" from nic0 to vmbr0, which solved the missing-gateway issue. The change is apparently not necessary for local traffic; e.g. a corosync link VLAN is still functional on nic* on all servers, and Ceph and PVE GUI proxying were fine last night.

And that "no gateway" configuration issue/bug/error/whatever did affect all servers, once I rebooted the first one, which makes more sense.
It's possible you ran into the following ifupdown2 bug then, where the GW sometimes doesn't come up because of dependency issues (see [1]). I didn't think of that particular issue, sorry :(

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=5406
 
ifupdown2 bug then, where the GW sometimes doesn't come up
Hmm, sounds very possible, though it had been fine for [edit: 7] months before the pinning. Maybe pinning changes the timing, like the bond speculation in the report? I didn't pull logs or anything since, like they report, reapplying the network config worked, meaning it's "valid." Thanks.
 
I think it might be related to the change to a VLAN-aware bridge, as that creates the VLAN on top of the bridge interface instead of the physical interface. You have to be careful: if this is the bug I linked (which I highly suspect), then you will run into those issues again on reboot.

You need to fix the ordering of your entries in the network configuration to avoid this issue altogether, or use one of the workarounds. You can find more information in the Bugzilla entry - but if you post your network configuration (/etc/network/interfaces as well as /etc/network/interfaces.d/sdn), then I can also take a look at it.
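
As an illustration only (my sketch, not necessarily the exact workaround from the report): one belt-and-braces option is to re-add the default route explicitly once the management VLAN is up, e.g. by adding this to its stanza:

Code:
        # inside the "iface vlan222 inet static" stanza; replace ***.1 with your gateway
        post-up ip route replace default via ***.1 || true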
 
Hmm, well, all nodes had no gateway on reboot until I moved the management VLAN from nic0 to vmbr0. I did not reboot many times, but after that change I saw 100% success with one reboot on each server. Worst case, we can still connect via the BMC/IPMI remote console. I will keep an eye on it and read that again; thank you for the warning.