Opt-in Linux 6.2 Kernel for Proxmox VE 7.x available

Looks like the latest 6.2 kernel still has some major performance issues in our environment, just like 6.1.

We are seeing a load increase of 30-50% on pretty much all VMs running on hosts with the 6.2.x kernel.

If we go back to 5.15.x, all is well.

Check out the screenshot: our load has essentially doubled on the 6.1 and 6.2 kernels.

(Screenshot attached.)

Wanted to report back. Did some more testing.

Here is what I pinned down:

- Slowness only happens on VMs which are migrated from hosts running older packages.
- A VM is fine if it's freshly started on a host running the newer 6.2.x kernel.
- I was in the process of upgrading packages and moving over to 6.2.x when I hit this bug.
- I use live migration heavily for upgrading.

I will need to do some testing to see if this happens on the 5.15.x kernel now that I can easily reproduce it.

For now I am going to plan on stopping/starting all VMs once I get all front ends on the same packages and the 6.2.x kernel.
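For reference, reproducing it is just an ordinary live migration; a minimal sketch with the standard CLI (VMID and node name are examples):

Code:
# live-migrate a running VM to a node already on the newer kernel
qm migrate 100 node2 --online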
 
Thanks for your feedback. While it'd naturally be better if there were no impact at all, having this resolved after a clean VM start is far better than nothing. What CPU is in use here?
 

I just found out that it's actually happening in all situations with live migration.

I agree, start/stop is great, but we have a cluster with almost 600 VMs and depend on live migration heavily for uptime.

We have a mix of 2nd and 3rd Gen Intel Xeons. I can reproduce the issue going between the same CPU model and between different CPUs.

Using "7za b" within the VM as a quick and easy CPU test, we are seeing 30-70% performance hits.

I have to test on some other clusters to make sure I am not going crazy.
 
Hi,

I still have the corosync problem with kernel 6.2 on my EPYC v3 cluster (8 nodes). After 2-3 days, I get constant corosync node joins/leaves.

No errors in kernel.log.

Stopping/starting corosync/pve-cluster everywhere does not fix it; it needs a reboot of nodes. (Generally, rebooting one specific node in the cluster (always a different one) fixes the problem.)
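(For clarity, the stop/start I mean is just the standard services, something like:)

Code:
# stop the cluster filesystem first, then corosync; start them in reverse order
systemctl stop pve-cluster
systemctl stop corosync
systemctl start corosync
systemctl start pve-cluster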


I'll try to play with preempt or enabling amd_pstate.

The cluster is working fine with kernel 5.15.
 
Is this an EPYC-only cluster? I have 2 EPYCs in a heterogeneous cluster, and they work fine with 6.2.
Yes, full EPYC with 2 sockets per server:

Code:
vendor_id : AuthenticAMD
cpu family : 25
model : 1
model name : AMD EPYC 7543 32-Core Processor
stepping : 1
microcode : 0xa0011ce

The cluster is currently empty; no VMs are running.
 

I am running all AMD EPYC processors on 14 nodes. No issues as far as I can tell.

CPU(s): 64 x AMD EPYC 7502P 32-Core Processor (1 Socket)
Kernel Version: Linux 6.2.11-2-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.11-2 (2023-05-10T09:13Z)

Also, they're all running on Dell PowerEdge R7515 servers with 512GB of RAM.
 
Thanks. I'm running Lenovo SR635.

I also have some GRUB tuning like processor.max_cstate=1; I wonder if it couldn't be a thermal problem or something like that (but I don't have the problem with 5.15).

What is your microcode version? (/proc/cpuinfo)

I'm sure I have this problem on 6.0, 6.1, and 6.2. I'll try 5.19 to see.
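(Quickest way to read it:)

Code:
grep -m1 microcode /proc/cpuinfo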
 
Code:
Processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7502P 32-Core Processor
stepping : 0
microcode : 0x8301055
cpu MHz : 2495.270
cache size : 512 KB
physical id : 0
siblings : 64
core id : 0
cpu cores : 32
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb
bogomips : 4990.54
TLB size : 3072 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
 
I'm still fighting performance issues.

I have found stop/start only works for a few days.

These are very active CentOS 7 Apache servers, but performance is night and day on the 5.15 kernels.
 
CentOS 7 Apache servers
What kernel is running inside the guest? Is that CentOS release still on the original Linux 3.10 one?

Can you reproduce this with a somewhat modern guest OS too?

FWIW, if we don't find any specific fix for such a regression: Proxmox VE 7 will be maintained until 2024-07 and CentOS 7 will go fully EOL on 2024-06-30 (ref), so you could keep that setup on the working Proxmox VE 7 and its 5.15-based kernel until it reaches EOL, as long as that then works for whatever guest OS your replacement plan considers afterwards.
 
New to this. Is it possible to downgrade the kernel?

My setup is finally working OK. What's the best method to back up in case I need to restore?
 
New to this. Is it possible to downgrade the kernel?
Yes. You can either:
  • manually select a specific kernel in the boot menu to switch back once
  • use the proxmox-boot-tool kernel pin OLD-KERNEL command to pin the older one permanently while keeping the newer one installed (see the example below)
  • simply remove the newer kernel again (apt remove pve-kernel-6.2*); as Proxmox VE 7 still has a package dependency on the 5.15-based one, it will still be installed and then used again.
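For example (the version string is only an illustration; take a real one from the list command):

Code:
# show the kernels proxmox-boot-tool knows about
proxmox-boot-tool kernel list
# pin an older kernel permanently (example version)
proxmox-boot-tool kernel pin 5.15.108-1-pve
# remove the pin again later
proxmox-boot-tool kernel unpin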

What's the best method to back up in case I need to restore?
For backing up virtual guests (containers and/or virtual machines), see the docs: https://pve.proxmox.com/pve-docs/chapter-vzdump.html

You can use any filesystem-based storage for classic vzdump archive backups, or set up a Proxmox Backup Server, which comes with many more features and deduplicated storage.
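A minimal vzdump example (VMID and storage name are placeholders):

Code:
# snapshot-mode backup of VM 100 to a configured storage
vzdump 100 --storage local --mode snapshot --compress zstd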
 
I upgraded to 6.2, but the VM couldn't start due to a misclassified IOMMU group. A snapshot of my IOMMU groups:

Code:
Group 11:       [1022:1484] [R] 00:08.1  PCI bridge                               Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
Group 12:       [1022:790b]     00:14.0  SMBus                                    FCH SMBus Controller
                [1022:790e]     00:14.3  ISA bridge                               FCH LPC Bridge
Group 13:       [1022:1440]     00:18.0  Host bridge                              Matisse Device 24: Function 0
                [1022:1441]     00:18.1  Host bridge                              Matisse Device 24: Function 1
                [1022:1442]     00:18.2  Host bridge                              Matisse Device 24: Function 2
                [1022:1443]     00:18.3  Host bridge                              Matisse Device 24: Function 3
                [1022:1444]     00:18.4  Host bridge                              Matisse Device 24: Function 4
                [1022:1445]     00:18.5  Host bridge                              Matisse Device 24: Function 5
                [1022:1446]     00:18.6  Host bridge                              Matisse Device 24: Function 6
                [1022:1447]     00:18.7  Host bridge                              Matisse Device 24: Function 7
Group 14:       [15b7:5011] [R] 01:00.0  Non-Volatile memory controller           WD Black SN850
Group 15:       [1022:43ee] [R] 02:00.0  USB controller                           Device 43ee
Group 16:       [1022:43eb]     02:00.1  SATA controller                          Device 43eb
Group 17:       [1022:43e9]     02:00.2  PCI bridge                               Device 43e9
Group 18:       [1022:43ea] [R] 03:00.0  PCI bridge                               Device 43ea
Group 19:       [1022:43ea]     03:07.0  PCI bridge                               Device 43ea
Group 20:       [1022:43ea]     03:09.0  PCI bridge                               Device 43ea
Group 21:       [1002:1478] [R] 04:00.0  PCI bridge                               Navi 10 XL Upstream Port of PCI Express Switch
Group 22:       [1002:1479] [R] 05:00.0  PCI bridge                               Navi 10 XL Downstream Port of PCI Express Switch

I got this error:

Code:
kvm: vfio: Cannot reset device 0000:02:00.0, depends on group 16 which is not owned.

I'm trying to pass through the USB controller, as it's more stable than passing through individual USB ports. This works on 5.15 without issues.
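For comparing the grouping across kernels, the groups can be dumped with a plain sysfs walk (only pciutils needed); a sketch:

Code:
# print every IOMMU group and the devices it contains
for g in /sys/kernel/iommu_groups/*; do
    echo "Group ${g##*/}:"
    for d in "$g"/devices/*; do
        echo "  $(lspci -nns "${d##*/}")"
    done
done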
 
OK, some news: in the end it was not kernel related or EPYC CPU related.

I have noticed that on my test cluster, if too many UDP packets are received for a few seconds (I'm using VXLAN), all UDP traffic seems to become laggy and corosync hangs. Then the other corosync nodes do retransmits, which increases the UDP load even more -> snowball effect.

(I'm using Mellanox ConnectX-4; not sure if it's a kernel buffer bug or a NIC driver/firmware bug, but I don't have any logs and I don't see any CPU load. It looks like contention/locking, and only for UDP.)

Not sure why it happens more with 6.2 on real traffic, but I can reproduce it with iperf on 5.15 too.
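(For reference, the kind of UDP flood I mean can be generated with iperf; a sketch with iperf3, where the address and rate are just examples:)

Code:
# on one node: start a receiver
iperf3 -s
# on another node: sustained UDP traffic toward the cluster network
iperf3 -c 10.0.0.2 -u -b 5G -t 60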

So I have switched corosync to SCTP, and not a single retransmit has occurred since, even under UDP flood/bench.
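The switch itself is a single option in the totem section of /etc/pve/corosync.conf; a sketch (cluster name and numbers are examples, and config_version must be incremented when editing):

Code:
totem {
  cluster_name: mycluster
  config_version: 16
  version: 2
  transport: knet
  knet_transport: sctp
}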
 
Are you by any chance using an 11th gen Intel CPU? I get a similar error on my i5-1145G7 when passing through the 02:00 Intel i226 NIC.
https://forum.proxmox.com/threads/p...ork-while-linux-kernel-updated-to-6-x.122034/
 
I'm on an AMD 3900X with an MSI B550-A Pro motherboard. I'm trying to pass through the USB controller, but this kind of prevents me from upgrading to the new kernel version.
GPU passthrough works fine on the new kernel; somehow the USB controller doesn't. I guess I need to wait a while and see.
 
I see there is a newer kernel in the beta, pve-kernel-6.2.16-3-pve. Will that be coming to the opt-in program?

I finally got some crash files after 4 or 5 crashes with the 6.2.11-2-pve kernel. There is also a dump file of about 37.5 GB; if needed, I can post that too.

Thanks

This may apply to my old Opteron CPU:
pve-manager (8.0.0~9) bookworm; urgency=medium
* ui: create VM: use new x86-64-v2-AES as new default vCPU type. The x86-64-v2-aes is compatible with Intel Westmere, launched in 2010, and the Opteron 6200-series "Interlagos", launched in 2011. This model provides a few important extra features over the qemu64/kvm64 model (which are basically v1 minus the -vme,-cx16 CPU flags), like SSE4.1 and additionally also AES; while not supported by all v2 models, it is by all recent ones, improving performance of many computing operations drastically.
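If that new default ever bites on such old CPUs, the vCPU type can be set per VM; a sketch (VMID 100 is a placeholder):

Code:
# select the new default explicitly
qm set 100 --cpu x86-64-v2-AES
# or fall back to the older, more conservative model on CPUs lacking the v2 flags
qm set 100 --cpu kvm64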
 

Attachments

  • dmesg.txt (149.2 KB)
PCI passthrough of some USB controllers got broken after 6.2.6.

I'm passing through two of my motherboard's USB eXtensible Host Controllers (same IOMMU group) to a Win 11 VM, and I noticed one of them failed to initialize in Windows after a kernel update.

I tried a fresh Windows installation to rule out any Windows config issue. Same problem.
I tried an Ubuntu VM where it also used to work. Same problem.

It's broken on 6.2.9-1, 6.2.11-1 and 6.2.11-2.
It works correctly on 6.2.6-1 and 5.19.17-2.

So this seems to be caused by some change made after 6.2.6 and up to (and including) 6.2.9.
I don't see the issue anymore in 6.2.16-3 + Proxmox 8.
 
After upgrading Proxmox VE to 8.0.3, based on Debian 12 and kernel 6.2.16-3-pve, I don't see the Intel AX210 BT adapter in the Proxmox device list anymore, so I can't add it to my HA VM. There is also nothing related to the Intel AX210 BT adapter in the Proxmox startup log; even the error entries that were present before are gone now. Any ideas how to resolve the issue? I'd much appreciate your help!
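For anyone debugging the same thing, the first checks would presumably be (assuming the AX210's Bluetooth function enumerates over USB, as is typical for that card):

Code:
# does the adapter enumerate at all?
lsusb | grep -i -e intel -e bluetooth
# any firmware/driver messages for it?
dmesg | grep -i -e ax210 -e bluetooth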
 
