PVE 8.4 to 9 - networking.service never starts

Hello! I attempted an in-place upgrade from PVE 8.4 to PVE 9 and have run into a strange issue where the networking service never manages to start during boot. As far as I can tell, no one else has reported this issue going from 8 to 9 so far. I have a three-system cluster and followed the guide on the wiki step by step (and simultaneously) for two of them, leaving the third alone until I knew the process worked. Both upgraded systems show the same symptom: the job to start networking.service hangs indefinitely.
[Screenshot: boot console hung on the start job for networking.service]
These are non-production no-subscription systems, so I can blow them up and start over if needed, but I'd like to get these working before I consider moving any other systems I manage from 8 to 9.
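
For anyone wanting to inspect the hang before changing anything, the unit's journal (viewed from another TTY, or once you get a shell after boot) should show where ifupdown2 is getting stuck; a sketch using standard systemd tooling:

Code:
systemctl status networking.service
journalctl -b -u networking.service
# re-apply the network config verbosely once you have a shell (ifreload is from ifupdown2)
ifreload -a -v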

I did find instances of a similar issue in the past for people who upgraded from 7 to 8, but the cause there was old NTP packages that had been deprecated in PVE 8 and needed to be removed. I checked the system I left on 8.4 for the packages and script that caused that issue, but they are not present, so I can't imagine them being on the other two systems or being the cause here. I set up this cluster last month, and all three systems were fresh installs of 8.4 using the latest ISO available on the site at the time.
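
(For anyone who wants to rule out that old culprit themselves, something like this should surface any leftover NTP packages and hook scripts; the exact package names varied between threads, so treat this as a sketch:)

Code:
dpkg -l | grep -i ntp
ls -l /etc/network/if-up.d/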

I tried rebooting one of the two failing systems a few times, but the issue persists.

I have full access to these servers and can take any troubleshooting steps necessary - I just honestly don't know what I should be checking for.

It might also be worth noting that I tried going into the Rescue Boot option from the PVE 9 ISO and got an error stating it couldn't find my boot zpool or detect the boot disk automatically. I'm not sure why.

[Screenshot: Rescue Boot error, boot zpool / boot disk not found]
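
In case it's useful for working around that part: when the rescue option can't locate the pool, the usual fallback is a manual import from the ISO's debug shell; a sketch, assuming the installer's default pool name rpool:

Code:
# from the PVE ISO debug shell
zpool import                  # list pools the live system can see
zpool import -f -R /mnt rpool # import the boot pool under /mnt
zpool export rpool            # export again before rebooting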


Any help or guidance is very much appreciated!!
 
I am also experiencing this issue on a ProLiant Gen8 server using 802.3ad link aggregation, after following the 8-to-9 upgrade guide. I can also see that the physical link is up for both members of the bonded interface. I'll try disabling aggregation on my switch and report back.

[Screenshot: boot console output]
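
For what it's worth, the kernel's view of the bond can be checked from a shell even while the start job hangs; the LACP/member state lives in procfs:

Code:
cat /proc/net/bonding/bond0
ip -br link show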
 
To both:

What does your /etc/network/interfaces look like? Do you have any post-up scripts in there? Can you post the contents of the file?
Any scripts in /etc/network/if-up.d ?
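
A quick way to surface any hook directives across the main file and everything it sources:

Code:
grep -rnE 'pre-up|post-up' /etc/network/interfaces /etc/network/interfaces.d/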
 
I think you're on to something. I have an OpenFabric mesh network for my Ceph cluster, and it does define a post-up command. However, I don't think there is anything nonstandard about my bond0 interface definition.

Code:
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto eno2
iface eno2 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.62/24
        gateway 192.168.1.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        post-up /usr/bin/systemctl restart frr.service

source /etc/network/interfaces.d/*

auto ens1
iface ens1 inet static
        mtu 9000

auto ens1d1
iface ens1d1 inet static
        mtu 9000

Code:
# ls /etc/network/if-up.d
bridgevlan  bridgevlanport  chrony  ethtool  mtu  postfix

Nothing that wasn't placed there by the system.
 
Code:
post-up /usr/bin/systemctl restart frr.service

This line is most likely the culprit, because of changes to the frr service. Removing this line should fix it (although I cannot say if your OpenFabric configuration will work after that; I'll have to take a closer look on Monday and reproduce it).
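
Assuming the mechanism is a circular wait (the post-up hook runs inside the networking.service start job, while systemctl restart blocks on an frr job that is itself ordered after networking), a non-blocking restart would be one way to keep the hook without deadlocking; an untested sketch, not official guidance:

Code:
post-up /usr/bin/systemctl --no-block restart frr.service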
 
Thank you for your help @shanreich! After deleting the post-up line as well as
Code:
auto ens1
iface ens1 inet static
        mtu 9000

auto ens1d1
iface ens1d1 inet static
        mtu 9000
I was able to boot the system, and the fabric seems to work normally. Ironically enough, one of the reasons I'm playing around with PVE 9 is the new SDN features. :)
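
If the jumbo MTU on those ports is still needed and the fabric config doesn't set it for you, re-adding the stanzas as inet manual (no static address for ifupdown2 to wait on) might be the cleaner variant; untested sketch:

Code:
auto ens1
iface ens1 inet manual
        mtu 9000

auto ens1d1
iface ens1d1 inet manual
        mtu 9000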