Node won't get rid of old SDN zone et al

hazaqames477 (New Member, Apr 3, 2024)
I have five nodes in a cluster. I am using SDN.

The only zone I have now is "proxnet."

[screenshot]

I can't seem to get one of the nodes to remove the VXLAN zone that has since been removed from all the other nodes.

[screenshot]

I suspect part of the problem is that one of the VNETs associated with that zone has a VLAN ID set to 1.

[screenshot]

It appears that whenever I click SDN > "Apply", this node tries to apply the configuration from /etc/pve/sdn/*.conf, which then takes the node offline because of VLAN 1. I can work around it by editing /etc/network/interfaces.d/sdn, changing the VLAN to 2, and rebooting, which brings the node back online.
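The manual workaround (changing the VLAN from 1 to 2 in /etc/network/interfaces.d/sdn) can be sketched in Python. This is a minimal sketch, assuming the VLAN shows up as a `<parent>.<vid>` token on a `bridge_ports` line, like the `vmbr0.3` entries in the sdn file posted later in this thread; `replace_vlan_id` is a hypothetical name. It only prints the result, and the next SDN Apply will presumably regenerate the file from /etc/pve/sdn/*.conf anyway.

```python
import re

# A VLAN sub-interface token such as "vmbr0.3": parent name, a dot, numeric VLAN ID.
VLAN_PORT = re.compile(r"^(?P<parent>.+)\.(?P<vid>\d+)$")


def replace_vlan_id(text: str, old_vid: int, new_vid: int) -> str:
    """Rewrite '<parent>.<old_vid>' to '<parent>.<new_vid>' on bridge_ports lines.

    Every other line, and the indentation of rewritten lines, is kept as-is.
    """
    out = []
    for line in text.splitlines(keepends=True):
        content = line.rstrip("\r\n")
        ending = line[len(content):]
        parts = content.split()
        if not parts or parts[0] != "bridge_ports":
            out.append(line)
            continue
        indent = content[: len(content) - len(content.lstrip())]
        ports = []
        for port in parts[1:]:
            match = VLAN_PORT.match(port)
            if match and int(match.group("vid")) == old_vid:
                port = f"{match.group('parent')}.{new_vid}"
            ports.append(port)
        out.append(indent + " ".join(["bridge_ports", *ports]) + ending)
    return "".join(out)


if __name__ == "__main__":
    with open("/etc/network/interfaces.d/sdn") as f:
        original = f.read()
    # Review the output before writing anything back to the node.
    print(replace_vlan_id(original, 1, 2))
```

Matching on the whole token means `vmbr0.10` or `vmbr0.11` are left alone when only VLAN 1 is being replaced.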

Even with quorum, any attempt to edit /etc/pve/sdn/zones.conf just freezes. I suspect this is because of the swap files shown below, which would indicate that something else is already editing the files. I have tried to identify the process holding the lock, but nothing turns up.

[screenshot]
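To look for whatever is holding the SDN files, a rough equivalent of `lsof +D /etc/pve/sdn` is to resolve every `/proc/<pid>/fd` entry and to list editor swap files in the directory. This is a Linux-only sketch that assumes the swap files follow vim's `.<name>.swp` / `.swo` / ... naming; the function names are hypothetical, and it needs root to see other users' processes. A write that is stuck on the pmxcfs-backed /etc/pve may not show up as an open descriptor at all, so an empty result is not conclusive.

```python
import os
import re
from pathlib import Path

# vim names swap files ".<file>.swp", then ".swo", ".swn", ... if one already exists.
SWAP_RE = re.compile(r"^\..+\.sw[a-p]$")


def find_swap_files(directory: str) -> list[str]:
    """Return the vim-style swap files present in directory, sorted by name."""
    return sorted(name for name in os.listdir(directory) if SWAP_RE.match(name))


def pids_with_open_files(directory: str) -> set[int]:
    """Return PIDs that hold an open file descriptor inside directory.

    Linux-only: resolves every /proc/<pid>/fd/* symlink. Processes we may not
    inspect (when not root) or that exit during the scan are skipped.
    """
    target = os.path.realpath(directory)
    proc = Path("/proc")
    pids: set[int] = set()
    if not proc.is_dir():
        return pids
    for entry in proc.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            fds = list((entry / "fd").iterdir())
        except OSError:
            continue  # permission denied or process already gone
        for fd in fds:
            try:
                path = os.readlink(fd)
            except OSError:
                continue
            if path == target or path.startswith(target + os.sep):
                pids.add(int(entry.name))
                break
    return pids


if __name__ == "__main__":
    sdn_dir = "/etc/pve/sdn"
    print("swap files:", find_swap_files(sdn_dir))
    print("pids:", sorted(pids_with_open_files(sdn_dir)))
```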

Bottom line: I need a way to either force the config from one of the other nodes onto this one, edit these config files, or blow away this node's SDN configuration so that it pulls everything again from the rest of the cluster. I'm worn out, so I'm giving this a rest for the weekend. :(
 
Active SDN settings are (also) kept in /etc/network/interfaces.d/sdn.
On both the working and the "sick" node, could you cat (or open in nano) that file and see what the differences are? If there are none, also check whether the interfaces.d directory contains any other files, and change, move or remove them.
Afterwards, in the node's network settings (not the SDN settings), change "something" (for example, just the name of a port) and then click Apply. This applies new settings only for that node (and also reloads the sdn file), so you can see whether the node no longer goes offline and whether the stale network is cleared out.
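The file comparison can also be done mechanically: parse each sdn file into `iface` stanzas and report which ones exist on only one node or have different options. A minimal sketch, assuming the ifupdown layout seen in this thread (an `auto`/`iface` pair followed by indented option lines); `parse_stanzas` and `compare_sdn_files` are hypothetical names. Applied to the two files posted later in the thread, it would list `vxgamvn`, `vxgenvn`, `vxwpvn` and their three `vxlan_*` interfaces as present only on the sick node, with no changed stanzas.

```python
import sys


def parse_stanzas(text: str) -> dict[str, list[str]]:
    """Map each 'iface <name>' to its option lines (stripped, in file order)."""
    stanzas: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments such as "#version:34"
        parts = line.split()
        if parts[0] == "iface" and len(parts) >= 2:
            current = parts[1]
            stanzas[current] = []
        elif parts[0] in ("auto", "allow-hotplug", "source", "source-directory"):
            current = None  # a new top-level keyword ends the previous stanza
        elif current is not None:
            stanzas[current].append(line)
    return stanzas


def compare_sdn_files(healthy: str, sick: str) -> dict[str, list[str]]:
    """Report interfaces found on only one node and interfaces whose options differ."""
    h, s = parse_stanzas(healthy), parse_stanzas(sick)
    return {
        "only_on_sick": sorted(s.keys() - h.keys()),
        "only_on_healthy": sorted(h.keys() - s.keys()),
        "changed": sorted(n for n in h.keys() & s.keys() if h[n] != s[n]),
    }


if __name__ == "__main__":
    # Usage: python compare_sdn.py <healthy copy> <sick copy>
    with open(sys.argv[1]) as a, open(sys.argv[2]) as b:
        print(compare_sdn_files(a.read(), b.read()))
```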
 
First, thank you for the help! It is much appreciated.

I just booted the node up and connected. Here are the contents of the active sdn file:
Code:
#version:34

auto gamvn
iface gamvn
        bridge_ports vmbr0.3
        bridge_stp off
        bridge_fd 0

auto genvn
iface genvn
        bridge_ports vmbr0.2
        bridge_stp off
        bridge_fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

auto vxgamvn
iface vxgamvn
        bridge_ports vxlan_vxgamvn
        bridge_stp off
        bridge_fd 0
        mtu 1450

auto vxgenvn
iface vxgenvn
        bridge_ports vxlan_vxgenvn
        bridge_stp off
        bridge_fd 0
        mtu 1450

auto vxlan_vxgamvn
iface vxlan_vxgamvn
        vxlan-id 103
        vxlan_remoteip 192.168.1.91
        vxlan_remoteip 192.168.1.92
        vxlan_remoteip 192.168.1.93
        vxlan_remoteip 192.168.1.95
        mtu 1450

auto vxlan_vxgenvn
iface vxlan_vxgenvn
        vxlan-id 102
        vxlan_remoteip 192.168.1.91
        vxlan_remoteip 192.168.1.92
        vxlan_remoteip 192.168.1.93
        vxlan_remoteip 192.168.1.95
        mtu 1450

auto vxlan_vxwpvn
iface vxlan_vxwpvn
        vxlan-id 104
        vxlan_remoteip 192.168.1.91
        vxlan_remoteip 192.168.1.92
        vxlan_remoteip 192.168.1.93
        vxlan_remoteip 192.168.1.95
        mtu 1450

auto vxwpvn
iface vxwpvn
        bridge_ports vxlan_vxwpvn
        bridge_stp off
        bridge_fd 0
        mtu 1450

auto wpvn
iface wpvn
        bridge_ports vmbr0.4
        bridge_stp off
        bridge_fd 0

And here are the contents of the file from one of the other nodes:

Code:
#version:42

auto gamvn
iface gamvn
        bridge_ports vmbr0.3
        bridge_stp off
        bridge_fd 0

auto genvn
iface genvn
        bridge_ports vmbr0.2
        bridge_stp off
        bridge_fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

auto wpvn
iface wpvn
        bridge_ports vmbr0.4
        bridge_stp off
        bridge_fd 0

There are no other files in the interfaces.d directory. When I click on System > Network for the problem node, this is what I get:

[screenshot]

So what I did next was the following:
  1. ip link add brtemp type bridge
  2. ifreload -a
Unfortunately, although this interface stayed, the SDN config also remained. No matter what I try, I cannot get rid of that old config.
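Since `ifreload -a` applies every `auto` stanza it finds in the interfaces files, the stale VXLAN entries will keep coming back as long as they remain in /etc/network/interfaces.d/sdn. One hypothetical way to test that theory is to strip the stanzas that exist only on this node from a copy of the file before reloading. A minimal sketch; `drop_stanzas` is a hypothetical helper, and an SDN Apply from the cluster would presumably rewrite the file again.

```python
def drop_stanzas(text: str, names: set[str]) -> str:
    """Remove 'auto <name>' lines and whole 'iface <name>' blocks for names.

    Comments (including the '#version:N' header), other stanzas and the
    spacing between them are kept verbatim. An 'auto' line listing several
    interfaces is only removed if all of them are being dropped.
    """
    out = []
    skipping = False
    for line in text.splitlines(keepends=True):
        parts = line.split()
        top_level = bool(parts) and not line[0].isspace()
        if top_level:
            skipping = False  # any unindented line ends the block being dropped
            keyword, args = parts[0], parts[1:]
            if keyword == "auto" and args and set(args) <= names:
                continue
            if keyword == "iface" and args and args[0] in names:
                skipping = True
                continue
        elif skipping:
            continue  # indented option or blank line belonging to a dropped stanza
        out.append(line)
    return "".join(out)


if __name__ == "__main__":
    # Interfaces present only on the problem node in the files posted above.
    stale = {"vxgamvn", "vxgenvn", "vxwpvn",
             "vxlan_vxgamvn", "vxlan_vxgenvn", "vxlan_vxwpvn"}
    with open("/etc/network/interfaces.d/sdn") as f:
        # Inspect the result before saving it or running ifreload -a.
        print(drop_stanzas(f.read(), stale))
```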

Today (Sunday afternoon) I gave it one more try, but found the entire cluster practically unusable. So I removed the node from the cluster and re-added it per https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node. Then I updated all packages on all nodes and rebooted them. The instability remained, and I could not get the other nodes to communicate with the re-added node; copying the SSH key over would just hang. So I took out my flamethrower and... just about. :) I'm recreating the cluster from scratch. This is the time to do it, as I still have my lab VMs on the old KVM host and can easily migrate them over again.
 
Given the difference in version numbers between the conflicting node and the others, this has been going on for a while.
That said, further tips or advice are probably not useful anymore, since the cluster is being rebuilt.
Good luck with the cluster rebuild and the re-migration.
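The `#version:N` header at the top of each node's sdn file (34 on the problem node, 42 on the healthy one in this thread) makes such a gap easy to spot. A minimal sketch that reads the header from each node's copy and reports how far each node lags behind the newest; it assumes the header always has the exact `#version:<number>` form seen here, and the node names in the usage are hypothetical.

```python
import re
from typing import Optional

# The header line at the top of each node's sdn file, e.g. "#version:42".
VERSION_RE = re.compile(r"^#version:(\d+)\s*$", re.MULTILINE)


def sdn_version(text: str) -> Optional[int]:
    """Return the number from the '#version:N' header, or None if it is missing."""
    match = VERSION_RE.search(text)
    return int(match.group(1)) if match else None


def lagging_nodes(files: dict[str, str]) -> dict[str, int]:
    """Map each node whose header is below the newest one to how many versions it lags."""
    versions = {node: sdn_version(text) for node, text in files.items()}
    known = [v for v in versions.values() if v is not None]
    if not known:
        return {}
    newest = max(known)
    return {node: newest - v for node, v in versions.items()
            if v is not None and v < newest}


if __name__ == "__main__":
    # Hypothetical node names; the headers are the ones posted in this thread.
    print(lagging_nodes({"pve1": "#version:42\n", "pve-sick": "#version:34\n"}))
```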
 
Thank you for the help! I concur that further investigation is not useful given that I'm rebuilding the cluster.

(@spirit: time sync was fine. I did see that come up in some other forum posts and double-checked.)
 
