[SOLVED] Applying SDN config breaks VM connectivity until VM reboot [8.4.1]

scyto

I am messing with SDN; every time I apply a configuration it breaks VM connectivity until the VM is rebooted.
The VMs' virtual adapters are bound to vmbr0 - nothing special, no VLANs etc.

This happens even when i fully delete all the SDN config items and hit apply (i.e. it is not the SDN config itself)

The only thing i will note as special about my machine is:
1. i compiled the patched version of ifupdown2 that works with IPv6
2. my vmbr0 and lo interfaces have IPv6
3. in SDN i am only playing with IPv4 configs (i saw that IPv6 configs don't work, which is reasonable given the shipping version of ifupdown2)

The only errors I see in the logs are as follows.
The last entry is not an issue, as frr.service does restart instead of reloading. Stopping and starting the service myself has no effect on the broken VM connectivity; only rebooting the VMs seems to help.

Code:
vrf_cephEVPN : warning: vrf_cephEVPN: post-up cmd 'ip route del vrf vrf_cephEVPN unreachable default metric 4278198272' failed: returned 2 (RTNETLINK answers: No such process
frr reload command fail. Restarting frr. at /usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm line 643.
TASK OK

I have not tried `systemctl restart networking.service`. The service has errors, but these don't seem to correlate strictly with the VM issues.
Should the SDN apply be restarting this service (and isn't for some reason)?

Code:
[root@pve1 11:16:39]$ systemctl status networking.service
● networking.service - Network initialization
     Loaded: loaded (/lib/systemd/system/networking.service; enabled; preset: enabled)
     Active: active (exited) since Sun 2025-04-20 18:49:13 PDT; 2 days ago
       Docs: man:interfaces(5)
             man:ifup(8)
             man:ifdown(8)
   Main PID: 794 (code=exited, status=0/SUCCESS)
        CPU: 631ms

Apr 22 13:22:38 pve1 systemd[1]: networking.service: Cannot add dependency job, ignoring: Unit networking.service failed to load properly, please adjust/correct and reload service manager: Device or resource busy
Apr 22 13:22:39 pve1 systemd[1]: networking.service: Cannot add dependency job, ignoring: Unit networking.service failed to load properly, please adjust/correct and reload service manager: Device or resource busy
Apr 22 13:22:40 pve1 systemd[1]: networking.service: Cannot add dependency job, ignoring: Unit networking.service failed to load properly, please adjust/correct and reload service manager: Device or resource busy
Apr 22 13:23:32 pve1 systemd[1]: networking.service: Cannot add dependency job, ignoring: Unit networking.service failed to load properly, please adjust/correct and reload service manager: Device or resource busy
Apr 22 13:25:31 pve1 systemd[1]: networking.service: Cannot add dependency job, ignoring: Unit networking.service failed to load properly, please adjust/correct and reload service manager: Device or resource busy
Apr 22 13:25:59 pve1 systemd[1]: networking.service: Cannot add dependency job, ignoring: Unit networking.service failed to load properly, please adjust/correct and reload service manager: Device or resource busy
Apr 22 13:34:23 pve1 systemd[1]: networking.service: Cannot add dependency job, ignoring: Unit networking.service failed to load properly, please adjust/correct and reload service manager: Device or resource busy
Apr 22 14:12:06 pve1 systemd[1]: networking.service: Cannot add dependency job, ignoring: Unit networking.service failed to load properly, please adjust/correct and reload service manager: Device or resource busy
Apr 22 14:25:13 pve1 systemd[1]: networking.service: Cannot add dependency job, ignoring: Unit networking.service failed to load properly, please adjust/correct and reload service manager: Device or resource busy
Apr 22 14:25:17 pve1 systemd[1]: networking.service: Cannot add dependency job, ignoring: Unit networking.service failed to load properly, please adjust/correct and reload service manager: Device or resource busy
 
i decided to do what it said and reloaded the service - it causes the same issue with the VMs

Code:
Apr 23 11:34:28 pve1 systemd[1]: Reloading networking.service - Network initialization...
Apr 23 11:34:28 pve1 networking[2095151]: networking: Reloading network interfaces configuration
Apr 23 11:34:34 pve1 networking[2095153]: warning: vrf_cephEVPN: post-up cmd 'ip route del vrf vrf_cephEVPN unreachable default metric 4278198272' failed: returned 2 (RTNETLINK answers: No such process
Apr 23 11:34:34 pve1 networking[2095153]: )
Apr 23 11:34:34 pve1 /usr/sbin/ifreload[2095153]: warning: vrf_cephEVPN: post-up cmd 'ip route del vrf vrf_cephEVPN unreachable default metric 4278198272' failed: returned 2 (RTNETLINK answers: No such process
                                                  )
Apr 23 11:34:34 pve1 systemd[1]: Reloaded networking.service - Network initialization.

as soon as this executed I lost the VM connectivity

restarting networking.service did not fix the VM connectivity; rebooting the VMs did
 
removing the IPv6 address from vmbr0 and clicking apply in the node network portion of the UI caused the same issue
 
Quote:
1. i compiled the patched version of ifupdown2 that works with IPv6

Does this occur with the version from our repository as well? The error message looks like ifupdown failing to reload the network configuration (possibly related to the patch).

Can you post the output of the following command?

Code:
ifreload -avd

Can you post the contents of the following files?

Code:
cat /etc/network/interfaces
cat /etc/network/interfaces.d/*
cat /etc/frr/frr.conf
cat /etc/pve/sdn/.running-config

Can you also attach a journal from the host?

Code:
journalctl --since '1 day ago'
 
Quote:
Does this occur with the version from our repository as well? The error message looks like ifupdown failing to reload the network configuration (possibly related to the patch).

if you mean the shipping version "ifupdown2/stable,now 3.2.0-1+pmx11": i just tested it and it is fine.
so yes, this issue applies only to the patched version at wido/ifupdown2 at tunnelip6
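
For anyone wanting to confirm which build is actually installed, here is a rough sketch. It assumes, going only by the version string quoted above, that the repository build carries a `+pmx` suffix; the helper name `is_repo_build` is made up for illustration. A self-built package may or may not differ in its version string, so treat the result as a hint rather than proof.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a version string carries the "+pmx"
# suffix seen in the repository build quoted above (3.2.0-1+pmx11).
is_repo_build() {
    case "$1" in
        *+pmx*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Query the installed ifupdown2 version; this stays empty when dpkg-query
# or the package is not available.
installed="$(dpkg-query -W -f='${Version}' ifupdown2 2>/dev/null || true)"

if [ -z "$installed" ]; then
    echo "ifupdown2 version could not be determined"
elif is_repo_build "$installed"; then
    echo "ifupdown2 $installed looks like the repository build"
else
    echo "ifupdown2 $installed does not look like the repository build"
fi
```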

interestingly, with the one from the repo the SDN apply no longer reports errors when i have an IPv6 address on vmbr0 (which it did before). Did something change with the package in the last week, or is this a natural outcome of me switching the SDN all to IPv4?

interestingly i still get these two errors, but my VMs continue to work so these errors are unrelated to the issue in this thread
Code:
vrf_cephEVPN : warning: vrf_cephEVPN: post-up cmd 'ip route del vrf vrf_cephEVPN unreachable default metric 4278198272' failed: returned 2 (RTNETLINK answers: No such process
frr reload command fail. Restarting frr. at /usr/share/perl5/PVE/Network/SDN/Controllers/EvpnPlugin.pm line 643.

tl;dr the VM connectivity issue was caused by that patched upstream version of ifupdown2 - i will wait until y'all resolve the PR issues with the upstream maintainer ;-)
 
not sure it's related, but frr doesn't support IPv6 for the underlay with EVPN

https://github.com/FRRouting/frr/issues/5885
for the VM connectivity issue on host <> VM it isn't. Also, the underlay has both IPv6 and IPv4 openfabric routing, so we get frr paths, but it may be causing other issues. I have yet to get a VM talking to the host IPs that are purely on the thunderbolt mesh, though i see others have done it. I can ping the addresses, there is just no apparent UDP / TCP connectivity - that's the next thing for me to dig into.

it might cause this, i guess: vrf_cephEVPN : warning: vrf_cephEVPN: post-up cmd 'ip route del vrf vrf_cephEVPN unreachable default metric 4278198272' failed: returned 2 (RTNETLINK answers: No such process). That vrf is not present, so i assume that's why the del is failing. I still don't have a good mental model of what a vrf is, what a vxrf is, and what all these weird 'interfaces' i now see in ip a do. This is me learning :-)
 
Quote:
interestingly, with the one from the repo the SDN apply no longer reports errors when i have an IPv6 address on vmbr0 (which it did before). Did something change with the package in the last week, or is this a natural outcome of me switching the SDN all to IPv4?

That's most likely the case, yes. There never was any issue with configuring IPv6 on normal Linux bridges, the problems are related to IPv6 in conjunction with VXLAN / EVPN zones. SDN does create its own ifupdown2 configuration and on applying either the host configuration or the SDN configuration, both configuration files get reloaded - so they affect each other.
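
To see why one reload touches both configurations, it can help to check that the main interfaces file pulls in the directory holding the SDN-generated file. This is only a minimal sketch: it assumes the usual layout in which /etc/network/interfaces contains a `source /etc/network/interfaces.d/*` line, and the helper name `includes_interfaces_d` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given interfaces file contains a
# "source" or "source-directory" line pointing at /etc/network/interfaces.d.
includes_interfaces_d() {
    [ -r "$1" ] || return 1
    grep -Eq '^[[:space:]]*source(-directory)?[[:space:]]+/etc/network/interfaces\.d(/\*)?[[:space:]]*$' "$1"
}

main_cfg="/etc/network/interfaces"
if includes_interfaces_d "$main_cfg"; then
    echo "$main_cfg includes interfaces.d, so one reload covers both configurations"
else
    echo "$main_cfg does not include interfaces.d (or is not readable)"
fi
```

If the include is present, an `ifreload -a` triggered from either the node network UI or the SDN apply reads the host file and the generated file together, which matches the behaviour described above.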


Quote:
that vrf is not present, so i assume that's why the del is failing. I still don't have a good mental model of what a vrf is, what a vxrf is, and what all these weird 'interfaces' i now see in ip a do. This is me learning

vrf_<name> is a virtual interface representing a VRF. There are some issues with generating the configuration, particularly when deleting all controllers. I've already sent patches that should address this issue, but they haven't been merged yet.
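
As background on the warning itself: RTNETLINK answering `No such process` to a route delete generally means that no matching route was found. A rough way to check by hand whether the route from the failing post-up command exists is sketched below. The VRF name and metric are taken from the log above; the helper name `has_unreachable_default` is hypothetical, and this is a diagnostic sketch, not what SDN itself runs.

```shell
#!/bin/sh
# Hypothetical helper: reads `ip route show` output on stdin and succeeds
# if it contains an "unreachable default" route with the given metric.
has_unreachable_default() {
    grep -Eq "^unreachable default([[:space:]].*)?[[:space:]]+metric[[:space:]]+$1([[:space:]]|\$)"
}

vrf="vrf_cephEVPN"      # VRF name taken from the warning above
metric="4278198272"     # metric taken from the failing post-up command

# Only query the kernel when iproute2 is available. Errors (for example a
# missing VRF) are discarded and simply yield empty output.
if command -v ip >/dev/null 2>&1; then
    if ip route show vrf "$vrf" 2>/dev/null | has_unreachable_default "$metric"; then
        echo "route present: the 'ip route del' from the log would likely succeed"
    else
        echo "route or VRF missing: the 'ip route del' would fail as in the log"
    fi
fi
```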


Quote:
tl;dr the VM connectivity issue was caused by that patched upstream version of ifupdown2 - i will wait until y'all resolve the PR issues with the upstream maintainer ;-)

While I understand that all this might be a frustrating experience, we're aware that the current situation is not optimal and are currently working on a patched ifupdown2 version, as well as on integrating OpenFabric (and other routing protocols) natively into the SDN stack. I cannot give any guarantees of course, but I'm hopeful that this will be introduced sometime not too far away. You can check the progress on the mailing list [1] [2].



[1] https://lore.proxmox.com/all/20250423104556.644234-1-c.heiss@proxmox.com/
[2] https://lore.proxmox.com/all/20250404162908.563060-1-g.goller@proxmox.com/
 
Quote:
While I understand that all this might be a frustrating experience

not frustrating - just confusing. Had the UI stopped me from entering IPv6 addresses, i could have avoided the octet errors generated by ifupdown2 and would have stopped there on the IPv6 path, as i would have realised it wasn't supported. I don't mind testing things, it's open source! i hope you don't mind me asking questions :-)