Node won't get rid of old SDN zone et al


Apr 3, 2024
I have five nodes in a cluster. I am using SDN.

The only zone I have now is "proxnet."


I can't seem to get one of the nodes to remove the VXLAN zone that has since been removed from all the other nodes.


I suspect part of the problem is that one of the VNETs associated with that zone has a VLAN ID set to 1.


It appears that whenever I click SDN > "Apply", this node attempts to apply the configuration from /etc/pve/sdn/*.conf which then causes the node to go offline due to VLAN 1. I can edit /etc/network/interfaces.d/sdn and change the VLAN to 2. Then I can reboot and get the node back online.
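
For anyone checking the same thing, here is a minimal sketch that scans an interfaces-style file for bridges whose port is a VLAN sub-interface with a given tag. It assumes the `parent.tag` port naming (for example `vmbr0.3`) seen in the generated file quoted later in this thread. The function names and the `badvn` stanza are hypothetical, made up for the example.

```python
import re


def vlan_subinterfaces(text):
    """Map each iface name to the (parent, vlan_id) pairs on its bridge ports line."""
    result = {}
    current = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "iface" and len(parts) >= 2:
            current = parts[1]
        elif parts[0] in ("bridge_ports", "bridge-ports") and current:
            for port in parts[1:]:
                # Assumption: a tagged sub-interface is written as <parent>.<tag>
                match = re.fullmatch(r"(.+)\.(\d+)", port)
                if match:
                    pair = (match.group(1), int(match.group(2)))
                    result.setdefault(current, []).append(pair)
    return result


def ifaces_using_vlan(text, vlan_id):
    """Return the sorted iface names that have a bridge port tagged with vlan_id."""
    found = vlan_subinterfaces(text)
    return sorted(
        name for name, ports in found.items()
        if any(tag == vlan_id for _, tag in ports)
    )


# Hypothetical sample: "badvn" is invented to show a VLAN 1 hit.
SAMPLE = """\
auto gamvn
iface gamvn
        bridge_ports vmbr0.3

auto badvn
iface badvn
        bridge_ports vmbr0.1
"""

if __name__ == "__main__":
    print(ifaces_using_vlan(SAMPLE, 1))
```

To check the real file, read /etc/network/interfaces.d/sdn into a string and pass it in place of `SAMPLE`.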

Even with quorum, trying to edit /etc/pve/sdn/zones.conf just freezes. I think this is because of the swap files below, indicating that something is already editing the files? I have attempted to identify the process that holds the lock, but nothing is coming up.
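
A minimal sketch for listing leftover editor swap files in a directory, assuming the Vim-style naming (`.zones.conf.swp`, `.swo`, and so on). The function name is hypothetical. Note that a swap file can be left behind by a crashed or disconnected editor session, so finding one does not by itself prove that a process still has the file open.

```python
from pathlib import Path


def find_swap_files(directory):
    """Return the sorted names of Vim-style swap files found in directory."""
    base = Path(directory)
    if not base.is_dir():
        return []
    # Matches hidden files such as .zones.conf.swp or .vnets.conf.swo
    return sorted(p.name for p in base.glob(".*.sw?"))


if __name__ == "__main__":
    # /etc/pve/sdn is the directory named in this post
    for name in find_swap_files("/etc/pve/sdn"):
        print(name)
```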


Bottom line: I need to know how to either force the config from one of the other nodes onto this one, edit these .conf files, or blow this node's SDN configuration away so that it pulls everything from the rest of the cluster again. I'm worn out, so I'm giving this a rest for the weekend. :(
Active SDN settings are (also) kept in /etc/network/interfaces.d/sdn.
On both the working and the "sick" node, could you cat/nano that file and see what the differences are? If there are none, also check the interfaces.d directory itself for any other files, and either change, move, or remove them.
Afterwards, in the node's network settings (not the SDN settings), change "something" (for example, just the name of a port) and then apply. This applies new settings for that node only (it will also reload the sdn file). Then see whether the node stays online and whether the stale network gets cleared out.
First, thank you for the help! It is much appreciated.

I just booted the node up and connected. Here are the contents of the active sdn file:

auto gamvn
iface gamvn
        bridge_ports vmbr0.3
        bridge_stp off
        bridge_fd 0

auto genvn
iface genvn
        bridge_ports vmbr0.2
        bridge_stp off
        bridge_fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

auto vxgamvn
iface vxgamvn
        bridge_ports vxlan_vxgamvn
        bridge_stp off
        bridge_fd 0
        mtu 1450

auto vxgenvn
iface vxgenvn
        bridge_ports vxlan_vxgenvn
        bridge_stp off
        bridge_fd 0
        mtu 1450

auto vxlan_vxgamvn
iface vxlan_vxgamvn
        vxlan-id 103
        mtu 1450

auto vxlan_vxgenvn
iface vxlan_vxgenvn
        vxlan-id 102
        mtu 1450

auto vxlan_vxwpvn
iface vxlan_vxwpvn
        vxlan-id 104
        mtu 1450

auto vxwpvn
iface vxwpvn
        bridge_ports vxlan_vxwpvn
        bridge_stp off
        bridge_fd 0
        mtu 1450

auto wpvn
iface wpvn
        bridge_ports vmbr0.4
        bridge_stp off
        bridge_fd 0

And here are the contents of the file from one of the other nodes:


auto gamvn
iface gamvn
        bridge_ports vmbr0.3
        bridge_stp off
        bridge_fd 0

auto genvn
iface genvn
        bridge_ports vmbr0.2
        bridge_stp off
        bridge_fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

auto wpvn
iface wpvn
        bridge_ports vmbr0.4
        bridge_stp off
        bridge_fd 0

There are no other files in the interfaces.d directory. When I click on System > Network for the problem node, this is what I get:


So what I did next was the following:
  1. ip link add brtemp type bridge
  2. ifreload -a
Unfortunately, while this interface remained, the sdn config also remained. No matter what I have tried, I cannot get rid of that old config.

Today (Sunday afternoon), I gave it one more try but found the entire cluster practically unusable. So I removed the node from the cluster and re-added it per Then I updated all packages on all nodes and rebooted them. The instability remained, and I could not get the nodes to communicate with the re-added node. That is, I could not get the SSH key copied over; it would just hang. So I took out my flame thrower and... just about. :) I'm recreating the cluster from scratch. This is the time to do it, as I still have my lab VMs on the old KVM host and can easily migrate them over again.
Given the version-numbering difference between the conflicting and the other nodes, it has been going on for a while.
That said, any more tips or advice will probably not be useful now that the cluster is being rebuilt.
Good luck with the cluster rebuild and the re-migration.
Thank you for the help! I concur that further investigation is not useful given that I'm rebuilding the cluster.

(Time was fine, @spirit. I did see that come up in some other forum posts and double-checked.)