SDN Apply Remains Pending

proxy

Hello,

Existing cluster: v8.1.4
New node: v8.2.4

I'm having a strange issue with applying SDN (QinQ) across my cluster but specifically to a new node.

Currently the cluster has 8 nodes and I joined a 9th. Everything has been working perfectly, and applying SDN across the existing 8 has been working flawlessly for some time (and continues to work fine throughout the below). Upon joining the 9th node, SDN appears in the new node's Server View (the SDN zone is set to All); however, it fails to apply and remains pending, with a yellow exclamation mark and the following message for the new node:

"local sdn network configuration is too old, please reload"

The networking configuration and hardware are identical on all nodes, right down to the interface names, VLANs, etc. There are no issues with MTU, jumbo ping or SSH across the cluster or to the new node.
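For reference, the checks were along these lines (the target IP is a placeholder, and the 8972-byte payload assumes a 9000 MTU minus 20 bytes of IP and 8 bytes of ICMP header):

Code:
# Jumbo ping with fragmentation disallowed: 8972 + 28 header bytes = 9000
ping -M do -s 8972 -c 3 <other-node-ip>
# Plain SSH reachability check
ssh root@<other-node-ip> /bin/true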

"source /etc/network/interfaces.d/*" does indeed exist at the bottom of /etc/network/interfaces

"/etc/network/interfaces.d" has the "sdn" config which matches the rest of the clusters sdn files

"/etc/pve/sdn/" has "vnets.cfg" and "zones.cfg" present and matches.

This is where it gets strange:

At first, on the new node, the 'sdn' file was not present in /etc/network/interfaces.d despite applying multiple times. The error at that time was "local sdn configuration is not yet generated".

A minor change was applied to the node's Network configuration via the GUI and suddenly the SDN came up and the 'sdn' file was present, with matching versions. At that point all was good.

However, applying SDN again (even with no changes) results in the yellow exclamation mark only on this new node (all other nodes reload and show as available) and the "local sdn network configuration is too old, please reload" message. Reviewing the file shows its version number to be the previous revision, while all other nodes in the cluster have the newer version.
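In case it helps anyone, the versions can be compared across nodes with a quick loop like this (hostnames are placeholders; this assumes the version marker is the first line of the generated file, which is how it looks here):

Code:
# Print the first line (version marker) of the generated SDN file on every node
for n in node1 node2 node9; do
    echo -n "$n: "
    ssh root@"$n" head -n1 /etc/network/interfaces.d/sdn
done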

Repeating Apply Configuration under Network on the new node (by just editing a comment) results in an instant fix: the latest 'sdn' file version is present and SDN is healthy.

Any guidance would be much appreciated.
 
I seem to have resolved this issue by actually finding another issue. I was having a VNC connection issue, with this error in the log:

Host key verification failed.
TASK ERROR: Failed to run vncproxy.

This led me to https://forum.proxmox.com/threads/c...to-server-host-key-verification-failed.78957/

...and by running:

Code:
/usr/bin/ssh -e none -o 'HostKeyAlias=HOSTNAME' root@IP /bin/true

...I was prompted to accept the host key. VNC was fixed, and now applying SDN is working perfectly too.
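For anyone hitting the same thing: besides accepting the key manually as above, my understanding is that pvecm updatecerts refreshes the cluster certificates and the shared SSH known_hosts entries, so something like this should have the same effect (HOSTNAME and IP are placeholders, as in the command above):

Code:
# Refresh cluster certificates and the shared known_hosts entries
pvecm updatecerts
# Or accept the key manually from the node that runs the vncproxy/SDN task
/usr/bin/ssh -e none -o 'HostKeyAlias=HOSTNAME' root@IP /bin/true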

Still rather strange how this happened in the first place after successfully joining the cluster. I'm also keen to know how else this issue could have been found if it weren't for the vncproxy issue. I also still don't understand why Applying Network Config made things work (per the explanation above) while Apply SDN didn't.

Any thoughts/comments would be helpful! :)
 
Apply Network Config is local and basically reloads the configuration files already present on the node, and I BELIEVE it also does not look at anything cluster-related (like /etc/pve/sdn).
Apply SDN is a "cluster wide" action, which requires all nodes to be talking to each other and in sync.

During your tests, did you check the status of your cluster (pvecm status) to see whether all nodes were present, talking and in sync?
Because I'm thinking that maybe, for whatever reason, one of the nodes (new or old) didn't get the host key / SSH key synced properly, leaving the cluster running but not 100% in sync, which caused it to not (always) get the correct data shared around.
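For what it's worth, this is roughly what I mean (the hostname is a placeholder; as far as I know the cluster-wide known_hosts lives under /etc/pve/priv/):

Code:
# Quorum and membership as seen from any node
pvecm status
# Check whether the new node's host key made it into the shared known_hosts
grep node9 /etc/pve/priv/known_hosts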

Just my guesses, but since it's working now, finding the cause might be tricky. It could of course be caused by not all nodes running the same version (which is advised against, except while in the process of upgrading all nodes in said cluster) and there being some change between those 2 versions that messed things up (although a difference this small is USUALLY fine).
 
Apply SDN is a "cluster wide" action, which requires all nodes to be talking to each other and in sync.

Only this newly joined node was having the issue - Apply SDN was running through all the others without issue.

During your tests, did you check the status of your cluster (pvecm status) to see whether all nodes were present, talking and in sync?
Because I'm thinking that maybe, for whatever reason, one of the nodes (new or old) didn't get the host key / SSH key synced properly, leaving the cluster running but not 100% in sync, which caused it to not (always) get the correct data shared around.

Unfortunately I did not specifically check the status from the console since, in the UI, everything showed as good. It was also responding correctly, i.e. online/offline, across multiple reboots of the new node. I should mention that rebooting did not change the SDN state either.

Just my guesses, but since it's working now, finding the cause might be tricky. It could of course be caused by not all nodes running the same version (which is advised against, except while in the process of upgrading all nodes in said cluster) and there being some change between those 2 versions that messed things up (although a difference this small is USUALLY fine).

I'm aware of the version difference and have read in a few posts that so long as it's only a minor version difference, all should be good. With a major difference I could certainly understand it.

Where would the SDN logs be? Considering the status just remained 'pending' each time, there should at least be a way to see why...

The invalid/missing host key on a brand-new host that just joined the cluster is also strange.
 
I ran into the exact same issue again by adding a new 10th node. However, this time I tried to open the Shell from the host that the new node initially joined through. I had to accept the host key - and once again SDN was instantly healthy on the new node. Really odd.
 
Are you planning to add more nodes to get off the even number of nodes, or do you have a Q-Device ready to re-join, btw? (An even number of nodes is recommended against + a Q-Device should be UN-joined before adding/removing nodes.)
Also, before you ask: no, you can't see from the GUI whether a Q-Device is added, only from the shell, so you might want to double-check just in case, if you haven't fully set up this cluster yourself and know for certain it doesn't exist.
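If you want to check from the shell, something like this should show it (my understanding is that a configured Q-Device appears in the quorum output and is declared in corosync.conf under quorum -> device):

Code:
# A configured QDevice shows up in the quorum/membership information
pvecm status
# It should also be declared in the corosync config
grep -A3 'device' /etc/pve/corosync.conf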
 
Thank you for that. Yes, I will add an 11th now that I know of the quick SDN fix :)
 
Once you've run the fix once, do any future changes to the SDN work normally, btw?

Yes, after simply accepting the host key, either by opening a Shell (UI) on the new node or manually (console) as mentioned previously, it functions normally and there are no Apply issues.
 
