Hello,
I'm posting this so that people upgrading their clusters can benefit from our experience. I'll summarize what I did to get a stable setup, then go through the details of how I got to this conclusion.
Pre-upgrade Environment:
Proxmox VE: 7.4
Kernel: 5.15
Open vSwitch Setup: Bond (LACP), Bridge, Tagged VLAN for Management Network
Summary: Once you upgrade, make sure to boot Proxmox VE 8.0 with Kernel 5.15. Problems arise with Kernel 6.2. I'm not entirely sure what the cause is, but I'm hoping to find out in this thread.
Helpful commands:
# The command below pins the kernel that is booted by default
> proxmox-boot-tool kernel pin 5.15.107-2-pve
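For anyone who wants to double-check the pin, here is a minimal sketch of the workflow. The function names pin_kernel and running_kernel_matches are my own (hypothetical), and the version string is the one from this post, so adjust it to whatever `proxmox-boot-tool kernel list` shows on your node.

```shell
# Version from this post; replace with the exact version installed on your node.
PINNED_KERNEL="5.15.107-2-pve"

# Pin the given kernel, then print what proxmox-boot-tool reports afterwards.
# (Hypothetical wrapper; it only calls the real "kernel pin" and "kernel list".)
pin_kernel() {
    proxmox-boot-tool kernel pin "$1" && proxmox-boot-tool kernel list
}

# Succeed only if the running kernel equals the expected version.
# $1: expected version
# $2: running version (optional, defaults to the output of `uname -r`)
running_kernel_matches() {
    expected="$1"
    running="${2:-$(uname -r)}"
    [ "$running" = "$expected" ]
}
```

Typical use would be `pin_kernel "$PINNED_KERNEL"`, a reboot, and then `running_kernel_matches "$PINNED_KERNEL" || echo "unexpected kernel: $(uname -r)"`. The pin can be removed later with `proxmox-boot-tool kernel unpin`.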
-----------
Phase 1: First Boot After the Upgrade
Environment at this point:
Proxmox VE: 8.0
Kernel: 6.2
interface0: ens9f0 with altname enp179s0f0
interface1: ens9f1 with altname enp179s0f1
Upon boot, dmesg showed the two physical interfaces going in and out of promiscuous mode, and the network was not coming up.
Otherwise everything upgraded smoothly, and the Ceph network was working properly, so I was able to reach the Proxmox environment on this node through it. Lucky.
I saw that the physical interfaces were not being set to UP at all.
After reconfiguring /etc/network/interfaces multiple times, I figured out that Open vSwitch was not recognizing the altnames (enp179s0f0 and enp179s0f1) as interface names, so my bond was not getting set up properly.
I changed all references to enp179s0f0 and enp179s0f1 back to the original names, ens9f0 and ens9f1. After rebooting the node, everything came up.
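To see which name is the primary one and which is only an altname on the running kernel, something like the sketch below can help. It assumes the usual `ip link show` layout, where each interface starts with an "N: name:" line and altnames follow on indented "altname ..." lines. The function name primary_and_altnames is made up for this example.

```shell
# Read `ip link show` output on stdin and print "primary altname" pairs,
# one pair per line. Interfaces without an altname print nothing.
primary_and_altnames() {
    awk '
        # Header line such as "2: ens9f0: <BROADCAST,...>": remember the name.
        /^[0-9]+: / {
            name = $2
            sub(/:$/, "", name)   # drop the trailing colon
            sub(/@.*/, "", name)  # drop "@parent" on VLAN-style names
        }
        # Indented "altname enp179s0f0" line: pair it with the last header.
        $1 == "altname" { print name, $2 }
    '
}
```

Run it as `ip link show | primary_and_altnames`. Based on what I saw, the first column is the name to put into the Open vSwitch configuration for the kernel that is currently booted.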
-----------
Phase 2: VM Live Migration Problems
Once this node was up, I started live migrating existing VMs from node 2 to node 1. After the live migration, I got the same errors as in this thread (https://forum.proxmox.com/threads/l...to-old-cpu-only-using-latest-pm-7-3-4.122901/). Basically, after migration, the VM hangs and consumes 100% CPU.
I'm not sure if it's because node 2 was running 5.15 and the new node was on 6.2. That might be the reason, but I have not looked into it yet. Information on this would be good.
I forced a restart of the VM that hung; I didn't have a choice. Since the suggestion was to stay on the 5.15 kernel, I used proxmox-boot-tool to set it as the default, so I don't have to remember to select it whenever a node needs a restart.
Once booted into 5.15, I was greeted by the networking problem again, this time because Open vSwitch wouldn't recognize the name ens9f0. I had to modify /etc/network/interfaces to use the enp179s0f0 name format to get it working.
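Since the usable name changed with the kernel, a quick check after each kernel switch can catch this before the bond silently fails. The sketch below is only an illustration: check_ovs_bond_members is a hypothetical helper, it assumes the bond members are listed on `ovs_bonds` lines in /etc/network/interfaces, and it assumes that only primary interface names (not altnames) have an entry under /sys/class/net.

```shell
# Report bond members named in an interfaces file that do not exist as
# primary interface names on the running kernel.
# $1: interfaces file (default /etc/network/interfaces)
# $2: directory with one entry per interface (default /sys/class/net)
# Returns 0 if every member exists, 1 if at least one is missing.
check_ovs_bond_members() {
    conf="${1:-/etc/network/interfaces}"
    netdir="${2:-/sys/class/net}"
    # Collect every word that follows "ovs_bonds" on a config line.
    members=$(awk '$1 == "ovs_bonds" { for (i = 2; i <= NF; i++) print $i }' "$conf")
    missing=0
    for name in $members; do
        if [ ! -e "$netdir/$name" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_ovs_bond_members` with no arguments checks the live system. Any "missing:" line points at a name that would need to be swapped in /etc/network/interfaces for the kernel that is currently running.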
-----------
Conclusion Points:
* Open vSwitch on kernel 6.2 won't recognize enp179s0f0 in the interface configuration; the ens9f0 name has to be used instead.
* Open vSwitch on kernel 5.15 won't recognize ens9f0 in the interface configuration; the enp179s0f0 name has to be used instead.
* Live migration to a node on 6.2 doesn't seem to work, but that may be because the VMs were running on a node with a different kernel version. I have not tried live migration from 6.2 to 6.2 yet. Please enlighten me on this.
I hope this saves someone from pulling their hair out.
PS: If I've made any wrong assumptions or conclusions here, please do correct me. Thanks.