No network after upgrade attempt to 6.5.11-4-pve

RJT

New Member
Aug 17, 2023
3
0
1
I have a 3 node cluster in the homelab, two went fine upgrading to 6.5.11-4. These are all headless. The third went offline, but still had power - I had to power down and move it to a monitor/keyboard. Getting access to it, when it booted up, it's still on 6.2.16-14 . Appeared to be working okay, but the interfaces were down, and no vmbr0 exists. I see several others having issues, but they mainly seem to be realtek. I have intel embedded nics. Restarting networking throws no errors, but just doesn't do anything. The only IP-4 address that gets loaded is the loopback (127.0.0.1). I can force the links up (enp86s0 and enp87s0), but no IP address loads. I tried loading to older kernels 6.2.16-12-pve and 6.2.16-3-pve, but same behavior. After ip link set enp86s0 up , I get notification it's up and igc driver loaded, linked at 2500 Mbps full duplex, flow control RX/TX. And IPV6: ADDRCONF( NETDEV_CHANGE) enp86s0: link becomes ready. But no IPv4 at all. I can't ping anything other than 127.0.0.1. Everything else is 'network is unreachable'

It's a little difficult to get log info since it's got no net, but was hoping someone had run across this before. I'd rather not have to pull it from the cluster/zfs pool and start over. I migrated guests off except one, that was getting an error when I tried to migrate, so I shut it down prior to upgrade. I've got backups of the guests if worse comes to worse.

The error when I tried to migrate the guest prior to upgrade - "Can't use an undefined value as an ARRAY reference at /usr/share/perl5/PVE/QemuServer.pm line 2685. TASK ERROR: migration aborted" . I'd been working on resource mapping and SRV-IO passthrough of the iGPU, plus the host I was migrating to was already successfully upgraded - so I figured the error probably had something to do with my inconsistent state between my hosts and just shut down the guest to upgrade the host I'm having issues with (HomeLab2). Network interface names did not change.

I have two minisforum HomeLab nodes, that are similar, but not identical. One is 12th gen Intel, and one is 13th gen. HomeLab1 (12th gen, upgraded fine) and HomeLab2 (13th gen, the afflicted). The third is a SOC mainly for quorum and light duty.

systemctl status networking.service shows enabled/enabled, exited with status=0 success, and looks like my good host (HomeLab1), with the exception of

Nov 26 11:14:40 HomeLab1 systemd[1]: Starting networking.service - Network initialization...
Nov 26 11:14:40 HomeLab1 networking[917]: networking: Configuring network interfaces (**this statement is missing on HomeLab2)
Nov 26 11:14:41 HomeLab1 systemd[1]: Finished networking.service - Network initialization.


/etc/network/interfaces is pretty simple.

----start-----
auto lo
iface lo inet loopback

iface enp86s0 inet manual

auto enp87s0 inet static
address 192.168.200.2/28
#ZFSMigration Network

auto vmbr0
iface vmbr0 inet static
address 192.168.0.52/23
gateway 192.168.0.1
bridge-ports enp86s0
bridge-stp off
bridge-fd 0

iface wlp88s0 manual

-----end-----

I did notice that no matter which kernel I loaded testing, and behavior was same (links down at boot, can force them up, but no IPv4) - that across all 3 kernels, the driver info for the intel nics was the same 'ethtool -i enps86s0'
driver: igc
version: 6.2.16-14-pve
 
Okay... after a few (read: many) hours clubbing away at this. The network issue was probably from the ungraceful failed upgrade. I was able to get networking up manually forcing IPs, and then manually did updates/upgrades, and it hung / rebooted probably 6-7 times during the process. I had to explicitly install install pve-headers-6.5.11-4-pve, and reconfigure locales / and gen locale-gen en_US.UTF-8. Why it would reboot instantly sometimes after the update started vs go through eventually, I've no idea. After the initial update attempt and it rebooted in the middle of it, the network config started working and worked after every spontaneous reboot thereafter, so at least there was that. It's now on 6.5.11-4-pve with the rest of the cluster, zpool status shows no errors anymore, and I was able to re-establish replication jobs (I had pruned them all as I was about to pull it from the cluster). Backups of the VMs and CTs completed. So hopefully we're passed the rough patch and into calmer waters. I also pulled out all the resource mappings and SRV-IO / IOMMU changes I had made just to keep it more vanilla for now until it's logs some stability.
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!