SDN suddenly stopped working on one node

gutleib

New Member
Oct 23, 2024
PVE Version 8.2.7

Hello
I have been running a 4-node PVE cluster for some time, and have used SDN (VXLAN) since even before its stable release. Everything worked reliably, but recently one node started having problems. The SDN spans the nodes via static IPs and has a few VNets. Three nodes communicate successfully, but the VMs on one node can't communicate with VMs on the other nodes. The PVE nodes themselves do communicate, and configs do apply, yet the VMs don't work until I migrate them off the problematic node. It also seems very strange because I can't recall any configuration changes in the cluster, including on this node.
How can I diagnose SDN issues?
Has anyone seen such behaviour?
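
For what it's worth, these are the checks I know of so far on the problematic node (vxlan_myvnet is a placeholder for whatever VXLAN interface PVE generated for the VNet):

    # Inspect the SDN config PVE generated on this node
    cat /etc/network/interfaces.d/sdn

    # Check the VXLAN interfaces and their configured peers
    ip -d link show type vxlan
    bridge fdb show dev vxlan_myvnet | grep 00:00:00:00:00:00

    # Watch for encapsulated traffic on the underlay while pinging from a VM
    tcpdump -ni any udp port 4789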
 
I just hit the same issue. A newly added node doesn't seem to be working properly with SDN. VMs on it can't reach the VMs using the SDN networks on other nodes, but those do work between the older nodes. I've checked all the config files and everything seems correct.

Still looking for a solution or some pointers on how to troubleshoot.
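
One comparison that might be telling: each node should have an all-zero-MAC FDB entry on the VNet's VXLAN device for every remote peer, so a missing peer shows up immediately. A sketch of what to compare between an old node and the new one (replace vxlan_myvnet with the device name from /etc/network/interfaces.d/sdn):

    # List the static flood entries (one per remote VXLAN peer)
    bridge fdb show dev vxlan_myvnet | grep 00:00:00:00:00:00

    # After fixing anything, re-apply the network config
    ifreload -a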

Additional detail / Update:

I just noticed that I have two sets of nodes whose VMs communicate over SDN within each set, but not across the sets. Very weird.
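
Since the split is between sets of nodes rather than individual VMs, the underlay path between the sets is suspect. A quick way to test it is a DF-bit ping between the nodes' underlay addresses at the sizes the tunnel actually needs (addresses here are placeholders):

    # VXLAN adds ~50 bytes, so 1450-byte guest frames need 1500 end to end;
    # -s gives the payload size, and 1472 + 28 bytes of headers = 1500
    ping -M do -s 1472 192.0.2.11

    # If the underlay is meant to carry jumbo frames, test that too
    ping -M do -s 8972 192.0.2.11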
 
Apparently I have the same issue. I made a single node and an SDN, then added two new nodes and they just don't work with SDN... A _pita_ because I've set up multipathed storage, GFS2, Corosync, Pacemaker... oof.
 
Please provide your SDN configs: /etc/pve/sdn/*.cfg
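
For reference, a VXLAN zone and VNet entry in those files typically look like this (the zone/VNet names, peer addresses, and VNI tag here are only examples):

    # /etc/pve/sdn/zones.cfg
    vxlan: myzone
            peers 192.0.2.10,192.0.2.11,192.0.2.12
            mtu 1450

    # /etc/pve/sdn/vnets.cfg
    vnet: myvnet
            zone myzone
            tag 100000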
 
Hey @spirit, I was just reading this thread https://forum.proxmox.com/threads/sdn-broken-after-underlying-network-change.133628/ and said to myself, "those weird bond, VLAN, and bridge names that you find human-friendly could be the cause of your entire week of trying things"... but I just don't remember now whether in one of my attempts I used "enp0s0f0", "bond0", and "vmbr0"; if I did, it certainly didn't work.

I'm going to sleep right now (it's about 1 AM here), but tomorrow I will post that as-is and then tweak those interface names a bit.

Thank you for caring.
 
My case was solved by using a different VLAN + bridge with standard naming (vlanNN + vmbrN) and a bigger MTU, specifying MTU values for every iface and setting MTU 9000 on the physical and bond0 ifaces. Everything "normal" uses 1500, while SDN rides an iface group with MTU 2000, which leaves room for 1500 inside the SDN (VXLAN encapsulation adds roughly 50 bytes, so the underlay needs at least the VNet MTU plus 50). Everyone happy ever after.
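
Roughly, the MTU layering looks like this in /etc/network/interfaces (interface names and addresses are placeholders for my actual setup):

    auto enp0s0f0
    iface enp0s0f0 inet manual
            mtu 9000

    auto bond0
    iface bond0 inet manual
            bond-slaves enp0s0f0 enp0s0f1
            bond-mode 802.3ad
            mtu 9000

    # "normal" traffic stays at 1500
    auto vmbr0
    iface vmbr0 inet static
            address 192.0.2.10/24
            bridge-ports bond0
            mtu 1500

    # the VXLAN underlay rides this VLAN at 2000, leaving headroom
    # for 1500 inside the SDN plus ~50 bytes of VXLAN overhead
    auto vlan40
    iface vlan40 inet static
            address 198.51.100.10/24
            vlan-raw-device bond0
            mtu 2000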
Thanks a lot.