1) No crashes with vanilla 6.8.7, 6.8.8 or 6.8.9. Same hardware, same VM/qcow2 files (Debian 12 with libvirtd).
2) I wrote about the 6.8.x scheduler issues in one of my first posts. Linus himself reverted a patch:
https://www.phoronix.com/news/Linux-6.8-Sched-Regression
I didn't follow the full history of this particular change in the kernel, so you may be right about it. That is academic, though, as the newer kernels fix the issue.
Hi @Der Harry,
I have started a similar thread:
https://forum.proxmox.com/threads/kernel-6-8-4-2-causes-random-server-freezing.146327
My 12 servers are "freezing": no logs, no segfault, no memory leak, no amdgpu, no IOMMU issue, etc.
Yes, we have Ceph and some NVMe drives for the Ceph OSDs. This is the only thing we have in common with others reporting similar issues.
I also have a couple of LXC containers.
I'm currently fine running a pinned 6.5 kernel, but this cannot work forever...
I need to know what is causing the problem and when there will be an updated kernel (and maybe other fixes) from the Proxmox team.
I have spent one week of my time going through vanilla kernel commits, Ubuntu patches and the PVE kernel.
I saw a lot of work done in the Ceph area of the kernel in the 6.8 release candidates and later, but maybe it's not Ceph related. I have also tried to focus on AMD EPYC.
Because of lack of time I didn't find the exact commit/patch that causes the regression (you have to wait roughly 6 to 24 hours for a freeze to happen).
(The week before, I chased the problem through BIOS settings, kernel parameters, VM and Ceph configuration, and possible hardware issues,
without any success. Only pinning the kernel to an older version helped.)
And I can afford to have only one server in the cluster "infected" with the 6.8 kernel.
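To put the waiting time in perspective, here is a rough sketch of how long a serial bisect on a single test server could take. A bisect over N commits needs roughly log2(N) test boots, and each boot has to run for the full waiting period before it can be called "good". The commit count below is a made-up placeholder for illustration, not a measured number for the 6.5 to 6.8 range, and the function name is hypothetical.

```python
import math


def bisect_duration_hours(n_commits, hours_per_test):
    """Return (steps, total_hours) for a serial bisect over n_commits.

    Assumes roughly ceil(log2(n_commits)) test boots, one at a time,
    each observed for hours_per_test before it is judged.
    """
    steps = math.ceil(math.log2(n_commits)) if n_commits > 1 else 0
    return steps, steps * hours_per_test


# Hypothetical commit count, for illustration only.
N_COMMITS = 10_000

# 6 h and 24 h are the lower and upper waiting times mentioned above.
for hours in (6, 24):
    steps, total = bisect_duration_hours(N_COMMITS, hours)
    print(f"{steps} steps x {hours} h = {total} h (about {total / 24:.1f} days)")
```

Even under these optimistic assumptions the upper bound is on the order of weeks, and a run that does not freeze within the waiting period is still not a guaranteed "good" result.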
1. Are we experiencing the same problem?
2. Do you know what is causing the problem?
3. Can you find out whether a fixed PVE 6.8 kernel has been released?
4. I can offer my time and one server for testing different kernels.
PS: You can write personal messages in German.