[SOLVED] PVE loses network connection after update from 8.x to 9.x

max1337

I recently updated my two PVE hosts, and one of the servers did not come back onto the network. The update itself went smoothly and I followed the upgrade guide step by step.
I followed this thread for a while, but in the end it turned out to be a different problem from mine.

It looks like a botched update since
cat /etc/debian_version returns 13.5 and pveversion returns pve-manager/9.2.2/b9984c6d90a4bd80 (running kernel: 6.8.12-25-pve)
but according to daanw the kernel should be version 6.14.x, 6.17.x or 7.0.x.

dpkg -l | grep proxmox-kernel showed that kernel 7.0 was available, so I installed it with apt install proxmox-kernel-7.0. The installation hit a hiccup when it asked whether I wanted to replace /etc/apt/sources.list.d/pve-enterprise.sources with the maintainer's version, which I accepted. It then got stuck, and the only way out was to reboot the machine with Ctrl+Alt+Del. Afterwards the system booted with kernel 7.0 and I tried to clean up the mess with dpkg --configure -a.
I am at the point where ip a
Code:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
3: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether c8:**:**:**:**:** brd ff:ff:ff:ff:ff:ff
    inet 10.241.1.113/24 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::ca7f:54ff:fe00:c838/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever
4: tap113i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc fq_codel master fwbr113i0 state UNKNOWN group default qlen 1000
    link/ether fe:56:a1:28:f5:e0 brd ff:ff:ff:ff:ff:ff
5: fwbr113i0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 02:d4:ac:40:df:21 brd ff:ff:ff:ff:ff:ff
6: fwpr113p0@fwln113i0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
    link/ether d6:95:4c:d7:1d:9b brd ff:ff:ff:ff:ff:ff
7: fwln113i0@fwpr113p0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master fwbr113i0 state UP group default qlen 1000
    link/ether 02:d4:ac:40:df:21 brd ff:ff:ff:ff:ff:ff

gives me hope that we have a valid IP, but nothing network-related works: no pings, no apt update, no web interface.

What puzzles me is that I no longer see my physical interface (named eno1) in this output.

Here is /etc/network/interfaces for reference
Code:
auto lo
iface lo inet loopback

iface eno1 inet manual
iface enxa0369f381708 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.241.1.113/24
    gateway 10.0.10.8
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

# source /etc/network/interfaces.d/*

Any ideas?

Thanks,
Max
 
To confirm: all packages are now fully installed and you are booting a 7.0.x kernel?

Check the output of journalctl -b of a problematic boot for errors.
Specifically look for (network) devices failing to initialize properly (udev / kernel modules) or failing services.
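As a rough sketch of that check, the journal can be narrowed to NIC driver and link-state messages. The helper names `nic_events` and `link_down_count` are made up for illustration, and the pattern assumes a Realtek NIC handled by r8169 or r8125; adjust it for other drivers.

```shell
#!/bin/sh
# Keep only log lines that mention the Realtek drivers or link/rename events.
# Reads log text on stdin so it also works on a saved journal dump.
nic_events() {
    grep -Ei 'r8169|r8125|link is (up|down)|renamed from|unregistering'
}

# Count how many times a link went down in the given log text.
link_down_count() {
    grep -c 'Link is Down'
}

# Typical use on the affected host (kernel messages of the current boot):
#   journalctl -b -k --no-pager | nic_events
#   journalctl -b -k --no-pager | link_down_count
```

A count above zero shortly after boot, together with an "(unregistering)" line, would suggest the interface vanished rather than merely losing carrier.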
 
See what you did:
Code:
address 10.241.1.113/24
gateway 10.0.10.8
Looks weird but actually works in our ...interesting... network setup here. The second server does not mind this configuration and reaches everything necessary via this gateway.

I found something rather interesting:
Both servers ran in a cluster before the update (and I intend to keep it that way), with a Proxmox Backup Server as the third node so that quorum works out. I just saw that the misbehaving server did connect to the cluster for a short while: there are entries from this node in the cluster log. Unfortunately it drops out of the cluster shortly afterwards and is then no longer reachable over the network.

I attached the output of journalctl -b
 

Code:
Jun 01 09:32:01 pfvm-testmaster kernel: r8169 0000:09:00.0: can't disable ASPM; OS doesn't have ASPM control
Jun 01 09:32:01 pfvm-testmaster kernel: r8169 0000:09:00.0 eth0: RTL8125B, c8:**:**:**:**:**, XID 641, IRQ 159
Jun 01 09:32:01 pfvm-testmaster kernel: r8169 0000:09:00.0 eth0: jumbo features [frames: 16362 bytes, tx checksumming: ko]
Jun 01 09:32:01 pfvm-testmaster kernel: r8169 0000:09:00.0 eno1: renamed from eth0
Jun 01 09:32:03 pfvm-testmaster kernel: r8169 0000:09:00.0 eno1: entered allmulticast mode
Jun 01 09:32:03 pfvm-testmaster kernel: r8169 0000:09:00.0 eno1: entered promiscuous mode
Jun 01 09:32:03 pfvm-testmaster kernel: Realtek Internal NBASE-T PHY r8169-0-900:00: attached PHY driver (mii_bus:phy_addr=r8169-0-900:00, irq=MAC)
Jun 01 09:32:03 pfvm-testmaster kernel: r8169 0000:09:00.0 eno1: Link is Down
Jun 01 09:32:06 pfvm-testmaster kernel: r8169 0000:09:00.0 eno1: Link is Up - 1Gbps/Full - flow control off
Jun 01 09:32:13 pfvm-testmaster kernel: r8169 0000:09:00.0 eno1: Link is Down
Jun 01 09:32:13 pfvm-testmaster kernel: r8169 0000:09:00.0 eno1 (unregistering): left allmulticast mode
Jun 01 09:32:13 pfvm-testmaster kernel: r8169 0000:09:00.0 eno1 (unregistering): left promiscuous mode

Corosync probably fails later on because the network dropped.

See for example:
https://forum.proxmox.com/threads/single-port-on-nic-randomly-disconnecting.183888/#post-854837
https://forum.proxmox.com/threads/custom-iso-9-1-1-image-with-r8168-dkms-baked-in.182380/

You could try the r8125-dkms kernel module and make sure you have firmware-realtek installed.
 
I think all packages are fully installed. apt upgrade does not list any packages that are incompletely installed or need to be updated.
 
This network setup cannot work: the gateway 10.0.10.8 is outside the 10.241.1.113/24 subnet, so the host does not know how to reach it.
Please post your routing table.
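To illustrate the concern, here is a minimal sketch that checks whether a gateway address falls inside an interface's subnet. The function names `ip_to_int` and `in_subnet` are hypothetical helpers, not part of any tool; the routing table itself would be reported with `ip route show`.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if address $1 lies inside the network given in CIDR form as $2.
in_subnet() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    prefix=${2#*/}
    if [ "$prefix" -eq 0 ]; then
        mask=0
    else
        mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    fi
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}

# Values taken from the /etc/network/interfaces posted above.
if in_subnet 10.0.10.8 10.241.1.113/24; then
    echo "gateway is inside the bridge subnet"
else
    echo "gateway is outside the bridge subnet"
fi

# On the host itself:
#   ip route show            # full routing table
#   ip route get 10.0.10.8   # which route, if any, the kernel would use
```

For the posted configuration the check takes the "outside" branch, which means the default route only works if something else (for example an on-link route) makes 10.0.10.8 reachable through vmbr0.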
 
Unfortunately I cannot install r8168-dkms via apt: it is unable to locate the package. I added the non-free and non-free-firmware components to my sources, but that did not help either.
I have a spare PCIe slot and an Intel i350 that I could put in that machine, which should solve the issue. Still, I do not understand why the NIC worked with the older kernel but does not with 7.x. Was support for this chip dropped? I never had an issue before.

Max

BTW: I plugged in a USB network adapter in the hope of getting internet access for apt so I could install/update everything. And what do you know: it's an RTL8125 that works flawlessly :D
 
r8168-dkms and r8125-dkms are in the Debian non-free repositories:
https://packages.debian.org/trixie/r8168-dkms
https://packages.debian.org/trixie/r8125-dkms
firmware-realtek is in Debian non-free-firmware:
https://packages.debian.org/trixie/firmware-realtek
Obviously you also need proxmox-kernel-headers-7.0 if you go the dkms way.
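If apt still cannot locate these packages, one thing worth checking is whether the component actually appears in the active sources file. The sketch below assumes deb822-style `.sources` files; the helper name `has_component` and the path `/etc/apt/sources.list.d/debian.sources` are assumptions, and an upgraded system may keep its entries elsewhere.

```shell
#!/bin/sh
# Succeed if the deb822 sources file $1 lists component $2
# on any "Components:" line (whole-word match, so "non-free"
# does not match "non-free-firmware").
has_component() {
    grep -E '^Components:' "$1" | tr ' \t' '\n\n' | grep -qxF "$2"
}

# Typical use on the host (path is an assumption, see above):
#   has_component /etc/apt/sources.list.d/debian.sources non-free \
#       || echo "non-free is not enabled in this file"
#   apt update
#   apt-cache policy r8125-dkms firmware-realtek
```

`apt-cache policy` shows a candidate version only once the repository that carries the package has been picked up by `apt update`.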

Stability can be hit and miss with these Realtek chipsets. I had one system with an RTL8152 2.5GbE USB Ethernet adapter that always booted fine but dropped out randomly on plain Debian with just about any kernel or dkms module, while two of the same adapters work flawlessly in Proxmox machines and have survived many kernel upgrades. Being fed up with it, I switched to an RTL8125B NIC that works just fine with the default r8169 kernel module in Debian, while you are reporting instability with it in Proxmox.

Anyway, good to hear you have your system/cluster going again.
 