SDN zone shows "pending" on peer nodes after node reboot (9.2.x) -- is this a bug?

kodamap · Friday at 09:17

Hi everyone,

I found an SDN status issue in PVE 9.2.x. After a node reboot or temporary network outage, the recovered node increments its local SDN configuration version, leaving all other healthy nodes on the old version (flagged as pending).

Confirmed on: 9.2.2, 9.2.3 (`libpve-network-perl` 1.6.6)
Not observed on: 9.1.1 (`libpve-network-perl` 1.2.3)

Summary

After a node reboot or NIC-down event (HA fencing test), only the recovered node gets the
`#version` header in `/etc/network/interfaces.d/sdn` incremented. The other nodes are left on the
old version and are immediately flagged as pending.

Running `Apply` from the GUI restores all nodes to available, but the issue reappears
on every subsequent reboot or NIC failure. **Reproducibility: 100%.**

There is no impact on VM connectivity — the issue is limited to the SDN status display.
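
For anyone who wants to script the re-apply instead of clicking the button: a minimal sketch, assuming that `pvesh set /cluster/sdn` triggers the same cluster-wide apply as the GUI `Apply` button and that it is run as root on a cluster node. The function name `apply_sdn` is only illustrative.

```shell
#!/bin/sh
# Sketch: apply pending SDN changes from the CLI instead of the GUI button.
# Assumption: run as root on a Proxmox VE cluster node where pvesh exists.
apply_sdn() {
    if ! command -v pvesh >/dev/null 2>&1; then
        echo "pvesh not found; run this on a Proxmox VE node" >&2
        return 127
    fi
    # Ask the cluster to apply the SDN configuration on all nodes.
    pvesh set /cluster/sdn
}
```

This does not prevent the version bump on the rejoining node; it only replaces the manual click.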

Environment

| | 9.1.1 (not affected) | 9.2.3 (affected) |
|---|---|---|
| `pve-manager` | 9.1.1 | 9.2.3 |
| `libpve-network-perl` | **1.2.3** | **1.6.6** |
| Cluster | 3-node | 3-node |
| SDN zone types | Simple (OVS VLAN) + VXLAN | Simple (OVS VLAN) + VXLAN |

Steps to reproduce (9.2.x)

1. Apply SDN from GUI — confirm all nodes show `available`.
2. Reboot any one node (or bring down all NICs to trigger HA fence).
3. Wait for the node to rejoin the cluster.
4. Check SDN status — the recovered node shows `available`, all others show `pending`.

Before reboot / NIC-down (all nodes in sync)

Bash:

# head -1 /etc/network/interfaces.d/sdn
pve01: #version:58
pve02: #version:58
pve03: #version:58

After pve03 reboot / NIC-down (version increments only on the recovered node)

Bash:

# head -1 /etc/network/interfaces.d/sdn
pve01: #version:58
pve02: #version:58
pve03: #version:59  ← only the recovered node is incremented

Resulting SDN status

Code:

pve01: pending   (still on version:58 — judged as too old)
pve02: pending   (still on version:58 — judged as too old)
pve03: available (updated to version:59)
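
The comparison above can be scripted. This is a rough sketch, assuming the first line of the generated file always has the form `#version:N` as in the listings, and treating every node below the highest version as the one that will show `pending`. The function names `sdn_version` and `lagging_nodes` are made up for this example.

```shell
#!/bin/sh
# Sketch: find nodes whose generated SDN file is behind the newest version.
# Assumption: the first line of the file looks like "#version:58".

# Print the version number from the first line of an SDN file.
# $1: path to the file (defaults to the path used in this post).
sdn_version() {
    head -n 1 "${1:-/etc/network/interfaces.d/sdn}" |
        sed -n 's/^#version:\([0-9][0-9]*\).*/\1/p'
}

# Read "node version" pairs on stdin and print every node whose
# version is lower than the highest version seen.
lagging_nodes() {
    awk '{ node[NR] = $1; ver[NR] = $2 + 0; if (ver[NR] > max) max = ver[NR] }
         END { for (i = 1; i <= NR; i++) if (ver[i] < max) print node[i] }'
}
```

With the versions from this post, `printf 'pve01 58\npve02 58\npve03 59\n' | lagging_nodes` prints `pve01` and `pve02`.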

Questions

Is there a way to prevent the rejoining node from advancing its local SDN version?
Is there any workaround other than manually clicking "Apply" in the GUI after every reboot?

shanreich · Friday at 10:11

Hmm, it seems like this is caused by the pve-sdn-commit one-shot service. Could you post your SDN configuration so we can see what triggers this?

Code:

cat /etc/pve/sdn/zones.cfg
cat /etc/pve/sdn/vnets.cfg
cat /etc/pve/sdn/subnets.cfg

kodamap · Friday at 10:34

Thanks for your reply. Here is my SDN configuration:

Bash:

root@pve02:~# cat /etc/pve/sdn/zones.cfg
vlan: VMNet
        bridge vmbr0
        ipam pve

vlan: VMNet2
        bridge vmbr1
        ipam pve

vlan: Service
        bridge vmbr0
        ipam pve

root@pve02:~# cat /etc/pve/sdn/vnets.cfg
vnet: vNet1227
        zone Service
        tag 1227

vnet: vNet1214
        zone Service
        tag 1214

vnet: vNet1240
        zone Service
        tag 1240

root@pve02:~# cat /etc/pve/sdn/subnets.cfg
(empty)

shanreich · Friday at 10:35

What do the underlying bridges for the SDN zones look like?

Code:

cat /etc/network/interfaces

kodamap · Friday at 10:49

Here is my network configuration as well:

Bash:

root@pve02:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto nic0
iface nic0 inet manual

auto nic6
iface nic6 inet static
        address 192.168.0.52/24

iface nic7 inet manual

auto nic2
iface nic2 inet manual

auto nic3
iface nic3 inet manual

auto nic4
iface nic4 inet manual

auto nic5
iface nic5 inet manual

auto nic1
iface nic1 inet manual

auto mng_port0
iface mng_port0 inet static
        address 10.1.214.52/24
        gateway 10.1.214.1
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        ovs_options tag=1214

auto bond0
iface bond0 inet manual
        ovs_bonds nic0 nic1
        ovs_type OVSBond
        ovs_bridge vmbr0
        ovs_options bond_mode=active-backup

auto bond1
iface bond1 inet manual
        bond-slaves nic4 nic5
        bond-miimon 100
        bond-mode active-backup
        bond-primary nic4

auto vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports bond0 mng_port0

auto vmbr1
iface vmbr1 inet static
        address 192.168.100.52/24
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0

source /etc/network/interfaces.d/*
root@pve02:~#

shanreich · Friday at 11:11

We currently run the apply unconditionally if we detect a non-vlan-aware bridge, see [1] for more in-depth reasoning. We should maybe revisit if we can improve the situation there and improve the change detection.

In the meantime, the problem should be avoidable by using a VLAN-aware bridge, which is recommended for bridges backing VLAN zones anyway.
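
To see which bridges on a node are already VLAN-aware, something like the following could be used. It is only a sketch: it assumes a Linux bridge is marked with the ifupdown2 option `bridge-vlan-aware yes` inside its `iface` stanza, and the function name `bridge_is_vlan_aware` is invented for this example. It says nothing about OVS bridges, which do not use that option.

```shell
#!/bin/sh
# Sketch: check whether a bridge stanza carries "bridge-vlan-aware yes".
# $1: interfaces file, $2: bridge name. Exit status 0 means VLAN-aware.
bridge_is_vlan_aware() {
    awk -v br="$2" '
        # Every "iface" line starts a new stanza; track whether it is ours.
        $1 == "iface" { in_stanza = ($2 == br) }
        in_stanza && $1 == "bridge-vlan-aware" && $2 == "yes" { found = 1 }
        END { exit !found }
    ' "$1"
}
```

For example, `bridge_is_vlan_aware /etc/network/interfaces vmbr1 && echo yes || echo no` would print `no` for the configuration posted above.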

You can simply set the bridges on each node to VLAN-aware, re-apply the network + SDN configuration, and then restart all VMs. If you don't want any service interruption then the procedure is a bit more tricky and needs to be done node by node instead:

1. Migrate all guests away from the node you want to update.
2. Set the underlying bridge (vmbr1) on the node to VLAN-aware.
3. Apply the network configuration for that node.
4. Apply the SDN configuration to re-generate the proper network configuration for that node.
5. Migrate the guests back and check if everything worked as expected.
6. Repeat the procedure for every host in the cluster.
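
The per-node procedure could be written down as a dry-run checklist like the one below. This is a sketch that only prints the steps and executes nothing. The function name `print_node_plan` and its arguments are hypothetical, and the commands shown (`qm migrate`, `ifreload -a`, `pvesh set /cluster/sdn`) are the usual CLI counterparts of these steps rather than something prescribed in this thread. `<vmid>` is a placeholder to fill in per guest.

```shell
#!/bin/sh
# Dry-run sketch: print the checklist for one node, do not execute anything.
# Hypothetical arguments: $1 = node to update, $2 = bridge, $3 = migration target.
print_node_plan() {
    node=$1
    bridge=$2
    target=$3
    cat <<EOF
[$node] 1. migrate guests away, e.g.: qm migrate <vmid> $target --online
[$node] 2. add 'bridge-vlan-aware yes' and 'bridge-vids 2-4094' to 'iface $bridge' in /etc/network/interfaces
[$node] 3. apply the network configuration: ifreload -a
[$node] 4. apply the SDN configuration: pvesh set /cluster/sdn
[$node] 5. migrate guests back, e.g.: qm migrate <vmid> $node --online
EOF
}
```

Calling `print_node_plan pve01 vmbr1 pve02` prints the five steps for pve01; repeat with the next node once the first one is verified.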

[1] https://git.proxmox.com/?p=pve-manager.git;a=commit;h=ea09550b4feb79e033b7041f4156003d6507c6ae

kodamap · Friday at 11:40

Thank you for the explanation.

Just to confirm my understanding — if we are using an OVS Bridge, it cannot be set to VLAN-aware, so there is no way to avoid this issue in an OVS-based setup, correct?
Also, I tested setting only the Linux bridge (vmbr1) to VLAN-aware, but the pending issue still occurred.

kodamap

New Member

shanreich

Proxmox Staff Member
