SDN EVPN zone stuck "pending" on a single node

hblandford

Member
Feb 25, 2024
On a 4-node cluster with an EVPN SDN zone, the zone status is permanently shown as pending on a single node, while all other nodes show available.
The root cause appears to be that pvestatd cannot parse the output of ifquery -a -c -o json, because on that one node ifquery aborts with:

error: main exception: cycle found involving iface dmz (indegree 1)

The SDN configuration and the SDN-related package versions (ifupdown2, libpve-network-perl, frr-pythontools) are identical across all nodes; the pveversion output below differs only in intel-microcode and in which older kernels are still installed. The only notable difference I can find between the failing node and the working ones is the physical NIC name (enp0s31f6 on the failing node vs eno1 / nic0 on the working ones). The data plane is completely unaffected, so this appears to be cosmetic, but the pending status and the log spam every 10 seconds are persistent and survive reboots.

Environment

  • Proxmox VE: 9.2.3
  • ifupdown2: 3.3.0-1+pmx12 (identical on all nodes)
  • FRRouting: 10.6.1
  • 4-node cluster, EVPN SDN zone with an external EVPN gateway (VyOS) as the L3 gateway

prox1 is a working node:
root@prox1:~# pveversion -v
proxmox-ve: 9.2.0 (running kernel: 7.0.6-2-pve)
pve-manager: 9.2.3 (running version: 9.2.3/d0fde103346cf89a)
proxmox-kernel-helper: 9.2.0
proxmox-kernel-7.0: 7.0.6-2
proxmox-kernel-7.0.6-2-pve-signed: 7.0.6-2
proxmox-kernel-6.17: 6.17.13-13
proxmox-kernel-6.17.13-13-pve-signed: 6.17.13-13
proxmox-kernel-6.17.13-2-pve-signed: 6.17.13-2
proxmox-kernel-6.17.2-1-pve-signed: 6.17.2-1
ceph: 20.2.1-pve1
ceph-fuse: 20.2.1-pve1
corosync: 3.1.10-pve2
criu: 4.1.1-1
frr-pythontools: 10.6.1-1+pve2
ifupdown2: 3.3.0-1+pmx12
intel-microcode: 3.20260227.1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.1
libproxmox-backup-qemu0: 2.0.2
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.1.1
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.1.6
libpve-cluster-perl: 9.1.6
libpve-common-perl: 9.1.13
libpve-guest-common-perl: 6.0.3
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.6.6
libpve-notify-perl: 9.1.6
libpve-rs-perl: 0.15.3
libpve-storage-perl: 9.1.5
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 7.0.0-2
lxcfs: 7.0.0-pve1
novnc-pve: 1.7.0-1
proxmox-backup-client: 4.2.1-1
proxmox-backup-file-restore: 4.2.1-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.3
proxmox-kernel-helper: 9.2.0
proxmox-mail-forward: 1.0.3
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.4
proxmox-widget-toolkit: 5.2.3
pve-cluster: 9.1.6
pve-container: 6.1.10
pve-docs: 9.2.2
pve-edk2-firmware: 4.2025.05-2
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.18-4
pve-ha-manager: 5.2.4
pve-i18n: 3.7.5
pve-qemu-kvm: 11.0.0-4
pve-xtermjs: 6.0.0-1
qemu-server: 9.1.16
smartmontools: 7.5-pve2
spiceterm: 3.4.2
swtpm: 0.8.0+pve3
vncterm: 1.9.2
zfsutils-linux: 2.4.2-pve1


prox5 is the failing node:
root@prox5:~# pveversion -v
proxmox-ve: 9.2.0 (running kernel: 7.0.6-2-pve)
pve-manager: 9.2.3 (running version: 9.2.3/d0fde103346cf89a)
proxmox-kernel-helper: 9.2.0
proxmox-kernel-7.0: 7.0.6-2
proxmox-kernel-7.0.6-2-pve-signed: 7.0.6-2
proxmox-kernel-6.17: 6.17.13-13
proxmox-kernel-6.17.13-13-pve-signed: 6.17.13-13
proxmox-kernel-6.17.13-2-pve-signed: 6.17.13-2
proxmox-kernel-6.14: 6.14.11-9
proxmox-kernel-6.14.11-9-pve-signed: 6.14.11-9
proxmox-kernel-6.8: 6.8.12-15
proxmox-kernel-6.8.12-15-pve-signed: 6.8.12-15
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph: 20.2.1-pve1
ceph-fuse: 20.2.1-pve1
corosync: 3.1.10-pve2
criu: 4.1.1-1
frr-pythontools: 10.6.1-1+pve2
ifupdown2: 3.3.0-1+pmx12
intel-microcode: 3.20240813.2
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.1
libproxmox-backup-qemu0: 2.0.2
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.1.1
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.1.6
libpve-cluster-perl: 9.1.6
libpve-common-perl: 9.1.13
libpve-guest-common-perl: 6.0.3
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.6.6
libpve-notify-perl: 9.1.6
libpve-rs-perl: 0.15.3
libpve-storage-perl: 9.1.5
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 7.0.0-2
lxcfs: 7.0.0-pve1
novnc-pve: 1.7.0-1
proxmox-backup-client: 4.2.1-1
proxmox-backup-file-restore: 4.2.1-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.3
proxmox-kernel-helper: 9.2.0
proxmox-mail-forward: 1.0.3
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.4
proxmox-widget-toolkit: 5.2.3
pve-cluster: 9.1.6
pve-container: 6.1.10
pve-docs: 9.2.2
pve-edk2-firmware: 4.2025.05-2
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.18-4
pve-ha-manager: 5.2.4
pve-i18n: 3.7.5
pve-qemu-kvm: 11.0.0-4
pve-xtermjs: 6.0.0-1
qemu-server: 9.1.16
smartmontools: 7.5-pve2
spiceterm: 3.4.2
swtpm: 0.8.0+pve3
vncterm: 1.9.2
zfsutils-linux: 2.4.2-pve1

Symptom


In Datacenter → SDN, the public zone shows pending on one node (here prox5) and available on all others.

pvestatd logs the following every poll cycle, on the affected node only:

pvestatd[...]: sdn status update error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Network/SDN/Zones.pm line 200.
pvestatd[...]: network status update error: malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Network/SDN/Zones.pm line 200.

Zones.pm line 200 is the ifquery_check path, which runs ifquery -a -c -o json.

The "(end of string)" at character offset 0 means the JSON parser was handed an empty string: ifquery aborted and wrote no JSON to stdout.
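To confirm that stdout really is empty and that the error text only goes to stderr, the two streams can be captured separately. This is a minimal sketch; run_split is a hypothetical helper name invented here, not part of Proxmox VE or ifupdown2.

```shell
# run_split CMD [ARGS...]: run a command with stdout and stderr captured
# separately, then report the exit code, the stdout size, and stderr.
# (run_split is a made-up helper name for this sketch.)
run_split() {
    out_file=$(mktemp) || return 1
    err_file=$(mktemp) || { rm -f "$out_file"; return 1; }
    "$@" >"$out_file" 2>"$err_file"
    rc=$?
    # Arithmetic expansion strips any padding some wc implementations add.
    printf 'exit=%d stdout_bytes=%d\n' "$rc" "$(( $(wc -c <"$out_file") ))"
    printf 'stderr: %s\n' "$(cat "$err_file")"
    rm -f "$out_file" "$err_file"
}

# Usage on the affected node:
#   run_split ifquery -a -c -o json
```

If stdout_bytes comes back as 0 on the affected node, pvestatd has nothing to parse, which would match the "character offset 0" message.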


The actual failure

Running the command manually on the affected node:


root@prox5:~# ifquery -a -c -o json
error: main exception: cycle found involving iface dmz (indegree 1)


On every other node the identical command returns valid JSON (truncated):

root@prox1:~# ifquery -a -c -o json | head
[
{
"name": "lo",
"addr_method": "loopback",
"addr_family": "inet",
"auto": true,
"config": {},
"config_status": {},
"status": "pass"
}
etc

Dependency list and VRF enslavement of dmz on the failing node (prox5, NIC = enp0s31f6):


root@prox5:~# ifquery --print-dependency=list dmz
lo : []
enp0s31f6 : []
vmbr0 : ['enp0s31f6']
vmbr21 : ['enp0s31f6.21']
vmbr20 : ['enp0s31f6.20']
vmbr12 : ['enp0s31f6.12']
vmbr10 : ['enp0s31f6.10']
vmbr11 : ['enp0s31f6.11']
vmbr14 : ['enp0s31f6.14']
dmz : ['vxlan_dmz']
vrf_public : ['dmz', 'vrfbr_public']
vrfbr_public : ['vrfvx_public']
vrfvx_public : []
vxlan_dmz : []
enp0s31f6.21 : ['enp0s31f6']
enp0s31f6.20 : ['enp0s31f6']
enp0s31f6.12 : ['enp0s31f6']
enp0s31f6.10 : ['enp0s31f6']
enp0s31f6.11 : ['enp0s31f6']
enp0s31f6.14 : ['enp0s31f6']


root@prox5:~# ip -d link show dmz | grep master
17: dmz: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vrf_public state UP mode DEFAULT group default qlen 1000

Working node (prox1, NIC = nic0), with the same dependency relationships for the SDN interfaces under a different NIC name:


root@prox1:~# ifquery --print-dependency=list dmz
lo : []
nic0 : []
wls2f3 : []
vmbr0 : ['nic0']
vmbr21 : ['nic0.21']
vmbr20 : ['nic0.20']
vmbr12 : ['nic0.12']
vmbr10 : ['nic0.10']
vmbr11 : ['nic0.11']
vmbr14 : ['nic0.14']
dmz : ['vxlan_dmz']
vrf_public : ['dmz', 'vrfbr_public']
vrfbr_public : ['vrfvx_public']
vrfvx_public : []
vxlan_dmz : []
nic0.21 : ['nic0']
nic0.20 : ['nic0']
nic0.12 : ['nic0']
nic0.10 : ['nic0']
nic0.11 : ['nic0']
nic0.14 : ['nic0']


root@prox1:~# ip -d link show dmz | grep master
18: dmz: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vrf_public state UP mode DEFAULT group default qlen 1000
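The "indegree" wording in the error suggests a topological sort over the interface dependency graph. As an independent check, the printed dependency list can be fed to tsort from coreutils, which also refuses cyclic input. This is a rough sketch: deps_to_pairs and has_no_cycle are hypothetical helper names, and the parsing assumes the "iface : ['dep', ...]" layout shown above.

```shell
# deps_to_pairs: read "iface : ['dep1', 'dep2']" lines on stdin and print
# "dep iface" pairs in the format tsort expects. Each iface is also printed
# paired with itself, which tsort treats as "this node exists".
deps_to_pairs() {
    awk -F' : ' -v q="'" '
        NF == 2 {
            name = $1
            list = $2
            gsub("[][" q " ]", "", list)   # drop brackets, quotes, spaces
            print name, name
            n = split(list, dep, ",")
            for (i = 1; i <= n; i++)
                if (dep[i] != "") print dep[i], name
        }'
}

# has_no_cycle: exit 0 if the dependency list on stdin is acyclic.
# tsort reports the members of any loop on stderr and exits non-zero.
has_no_cycle() {
    deps_to_pairs | tsort >/dev/null
}

# Usage on a node:
#   ifquery --print-dependency=list dmz | has_no_cycle && echo "no cycle"
```

Read by hand, the prox5 list as posted contains no loop (vxlan_dmz feeds dmz, dmz feeds vrf_public, and nothing points back), so this check would be expected to pass. That would suggest that whatever ifquery -a -c trips over is not visible in this list.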

Apart from an extra wls2f3 interface that is listed only on prox1, the only thing I can see that is different is the name of the NIC.
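To make that comparison mechanical rather than by eye, the NIC name can be replaced with a placeholder before diffing the two lists. This is a small sketch; normalize_nic is a hypothetical helper, and the ssh invocation in the usage comment assumes a host with key-based access to both nodes.

```shell
# normalize_nic NAME: replace every occurrence of the node's NIC name on
# stdin with the placeholder NIC, so lists from different nodes line up.
# Assumes NAME contains no characters that are special to sed.
normalize_nic() {
    sed "s/$1/NIC/g"
}

# Usage (bash, because of the process substitution):
#   diff <(ssh prox5 ifquery --print-dependency=list dmz | normalize_nic enp0s31f6) \
#        <(ssh prox1 ifquery --print-dependency=list dmz | normalize_nic nic0)
```

Any line that survives the diff is a real difference between the nodes rather than a naming artifact.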

Anyone have any suggestions? Thanks.