Proxmox 8 Upgrade Warning for Open vSwitch Cluster Network

Hello,

I'm just posting this so people have some guidance, based on our experience, when upgrading their clusters. I'll summarize what I did to get things stable, and then go through the details of how I got to this conclusion.

Pre-upgrade Environment:
Proxmox VE: 7.4
Kernel: 5.15
Open vSwitch Setup: Bond (LACP), Bridge, Tagged VLAN for Management Network

Summary: Once you upgrade, make sure to boot Proxmox VE 8.0 with kernel 5.15. Problems arise with kernel 6.2. I'm not entirely sure what the problem is, but I'm hoping to find out in this thread. :)

Helpful commands:
# Command below pins the default kernel that GRUB boots
> proxmox-boot-tool kernel pin 5.15.107-2-pve
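# Command below lists the kernels proxmox-boot-tool knows about, useful to confirm the pin took effect (double-check availability on your proxmox-boot-tool version)
> proxmox-boot-tool kernel list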

-----------

Phase 1 : After boot on Upgrade
Environment at this point:
Proxmox VE: 8.0
Kernel: 6.2
interface0: ens9f0 with altname enp179s0f0
interface1: ens9f1 with altname enp179s0f1
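
As a side note (these are generally useful commands, not something specific to our setup, and the output will differ per system), the primary name and altname of each NIC can be checked before touching any config:

# Compact overview of all links and their state
> ip -br link show
# Full details for one NIC; the "altname" line shows the alternative name
> ip link show ens9f0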

Upon boot, dmesg showed the two physical interfaces going in and out of promiscuous mode, and the network was not coming up.

Everything upgraded smoothly, and the Ceph network was working properly, so I was able to access the Proxmox environment on this node through it. Lucky.
I saw that the physical interfaces were not being set to UP at all.

After reconfiguring /etc/network/interfaces multiple times, I figured out that Open vSwitch was not recognizing the altnames (enp179s0f0 and enp179s0f1) as interface names, so my bond was not getting set up properly.

I updated all references to enp179s0f0 and enp179s0f1 to the original names ens9f0 and ens9f1, rebooted the node, and everything came up.
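
For context, here is a rough sketch of the kind of /etc/network/interfaces stanzas that had to be touched. The bond/bridge/port names, VLAN tag, and addresses below are placeholders, not our exact config:

# /etc/network/interfaces (illustrative excerpt)
auto ens9f0
iface ens9f0 inet manual

auto ens9f1
iface ens9f1 inet manual

auto bond0
iface bond0 inet manual
    ovs_type OVSBond
    ovs_bridge vmbr0
    ovs_bonds ens9f0 ens9f1
    ovs_options bond_mode=balance-tcp lacp=active

auto vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond0 mgmt

# Tagged internal port carrying the management IP
auto mgmt
iface mgmt inet static
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=100
    address 192.0.2.11/24
    gateway 192.0.2.1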

Conclusion: On kernel 6.2, Open vSwitch would not accept the interface altnames, so /etc/network/interfaces has to reference the kernel's primary interface names (ens9f0/ens9f1).

-----------

Phase 2: VM Live Migration Problems

Once this node was up, I started live migrating existing VMs from node 2 to node 1. After the live migration, I got the same errors as in this thread (https://forum.proxmox.com/threads/l...to-old-cpu-only-using-latest-pm-7-3-4.122901/). Basically, after migration, the VM hangs and consumes 100% CPU.

I'm not sure if it's because node 2 was running 5.15 and the new node was on 6.2. That might be the reason, but I have not dug into it yet. Information on this would be appreciated.

I forced a restart on the VM that hung; I didn't have a choice. Since the suggestion was to stay on the 5.15 kernel, I used proxmox-boot-tool to set it as the default, so I don't have to remember it in the future in case nodes need restarts.
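
For later, once we are ready to move everything to 6.2, the pin can be removed again; as far as I can tell the matching command is:

# Remove the pin so the newest installed kernel boots by default again
> proxmox-boot-tool kernel unpin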

Once booted back into 5.15, I was greeted with the networking problem again, because Open vSwitch now wouldn't recognize the name ens9f0. I had to modify /etc/network/interfaces to use the enp179s0f0 name format to get it working.

-----------

Conclusion Points:
* Open vSwitch on kernel 6.2 won't recognize enp179s0f0 in the interface configuration; you need to use the ens9f0 name instead (see the verification commands after this list).
* Open vSwitch on kernel 5.15 won't recognize the ens9f0 name in the interface configuration; you need to use enp179s0f0 instead.
* Live migration onto 6.2 doesn't seem to work, but that may be because the VMs were coming from a node on a different kernel version. I have not tried live migration from 6.2 to 6.2 yet. Enlighten me on this, please.
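
Quick checks I would suggest after editing the interface names and reloading the network (the bond name bond0 is just an assumption here, use whatever your OVS bond is called):

# Confirm the bridge, bond, and member ports are present in Open vSwitch
> ovs-vsctl show
# Check LACP/member status of the bond
> ovs-appctl bond/show bond0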



I hope this helps keep someone from pulling their hair out. :)

PS: If I've made some wrong assumptions or conclusions here, please do correct me. Thanks.
 
Hi,
Phase 2: VM Live Migration Problems

Once this node was up, I started live migrating existing VMs from node 2 to node 1. After the live migration, I got the same errors as in this thread (https://forum.proxmox.com/threads/l...to-old-cpu-only-using-latest-pm-7-3-4.122901/). Basically, after migration, the VM hangs and consumes 100% CPU.

I'm not sure if it's because node 2 was running 5.15 and the new node was on 6.2. That might be the reason, but I have not dug into it yet. Information on this would be appreciated.
Yes, unfortunately, with certain CPU models, migrating between 5.15 and newer kernels can cause issues. The upgrade guide mentions that it should be tested first: https://pve.proxmox.com/wiki/Upgrade_from_7_to_8#VM_Live-Migration_with_different_host_CPUs
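
If you want to test it first with a non-critical VM, something along these lines can be used (VMID 100 and the node name are just placeholders):

# Live-migrate a test VM to the already-upgraded node and check whether the guest keeps running afterwards
> qm migrate 100 node1 --online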

For completeness, what CPU models do your hosts have?
 
Hi,

Yes, unfortunately, with certain CPU models, migrating between 5.15 and newer kernels can cause issues. The upgrade guide mentions that it should be tested first: https://pve.proxmox.com/wiki/Upgrade_from_7_to_8#VM_Live-Migration_with_different_host_CPUs

For completeness, what CPU models do your hosts have?
All nodes use Intel Xeon Gold 6210U. The notice seems to only highlight that it's caused by different CPUs. In our case, it's the same CPU, but one node is on the newer 6.2 kernel while the other node was on 5.15.
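
(For anyone following along and wanting to check their own hosts, the CPU model can be read with, for example:)

> lscpu | grep 'Model name'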

Currently, we are on 5.15 and have not tested on 6.2 yet.

Would migration between nodes that are both on 6.2 be OK? If that's the case, does that mean we need to shut down all VMs and restart all nodes to boot into the newer 6.2 kernel?

I'd also like information on whether Open vSwitch not accepting interface altnames on kernel 6.2 is expected behaviour or just a regression.

Thank you.
 
All nodes use Intel Xeon Gold 6210U. The notice seems to only highlight that it's caused by different CPUs. In our case, it's the same CPU, but one node is on the newer 6.2 kernel while the other node was on 5.15.
IIRC, something in the FPU handling regressed, making 5.15 and other/newer kernels incompatible for live migration with certain CPU models. There were multiple issues involving 5.15 and live migration; some should already be addressed, but this one couldn't be fixed without breaking live migration between 5.15 and 5.15-with-the-fix, so the fix was not made.

Would migration between nodes that are both on 6.2 be OK? If that's the case, does that mean we need to shut down all VMs and restart all nodes to boot into the newer 6.2 kernel?
Yes, that should work. Haven't heard any reports of issues with that yet.
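
A rough per-node sequence could look like the following (VMIDs are placeholders; since live migration from 5.15 to 6.2 is the problematic direction here, the VMs still on the node being rebooted would need to be shut down or migrated offline first):

# On the node being moved to kernel 6.2:
> proxmox-boot-tool kernel unpin
> qm shutdown 100
# (repeat the shutdown for each VM still on this node)
> reboot
# After the node is back up, confirm the running kernel:
> uname -r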

I'd also like information on whether Open vSwitch not accepting interface altnames on kernel 6.2 is expected behaviour or just a regression.
I'm not an expert in that area, but from what I know, the kernel can detect some aspects of certain devices better, which leads to interface name changes. I don't think they'll change again, and I don't think they will change back either.
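
If you are curious how udev derives the different names for a NIC, you can inspect the naming properties like this (interface name taken from your post; the exact output depends on the hardware):

> udevadm test-builtin net_id /sys/class/net/ens9f0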
 
Thank you for the responses.
 
