PVE 8 Upgrade: Kernel 6.2.16-*-pve causing consistent instability not present on 5.15.*-pve

We are seeing this issue on PVE 7 with the 6.x kernel, notably on Ryzen CPUs and with LXC workloads, so KSM is not a factor.

Testing the 6.x kernel with `mitigations=off` before reverting to the 5.x kernel.

Any progress towards a longer term fix?

Have you tried:
Code:
echo 0 > /proc/sys/kernel/numa_balancing
?
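For anyone wanting to test this, a minimal sketch of checking the current value and making the change survive a reboot via a sysctl drop-in (the drop-in filename below is just an example; run as root):

```shell
# Show the current setting: 1 = automatic NUMA balancing enabled, 0 = disabled
cat /proc/sys/kernel/numa_balancing

# Disable it at runtime (takes effect immediately, lost on reboot)
echo 0 > /proc/sys/kernel/numa_balancing

# Persist across reboots with a sysctl drop-in (example filename)
echo 'kernel.numa_balancing = 0' > /etc/sysctl.d/99-disable-numa-balancing.conf
sysctl --system
```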
 
I tried disabling numa_balancing, and so far I haven't seen any lock-ups. I'll leave it like this for a while with real-world workloads and see what happens.

Code:
echo 0 > /proc/sys/kernel/numa_balancing

@Whatever Could you explain why this must be set, and what the root cause of the problem is? I read the wiki about numa_balancing, but I don't see why it would cause issues on kernel 6.x when it was fine on 5.x.
 
Thanks for trying this. If you see any lockups after disabling NUMA balancing, please let us know. I forgot to link the other thread's post regarding NUMA balancing [1] earlier, thanks to @Whatever for providing it.

As you are using both containers and VMs, it seems unlikely that you're seeing the same issue as in [1]. Still, can you double-check whether KSM is really inactive, i.e. is /sys/kernel/mm/ksm/pages_shared reporting 0? Could you also post the output of lscpu?
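For reference, the two checks requested above can be run on the host like this:

```shell
# KSM is inactive when no pages are currently shared (expect 0)
cat /sys/kernel/mm/ksm/pages_shared

# CPU model, topology and NUMA layout
lscpu
```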

[1] https://forum.proxmox.com/threads/p...ows-server-2019-vms.130727/page-7#post-601617
 