Kernel 6.8.4-2 causes random server freezing

We are having the same issue. We have a 5-node Dell R660 cluster with Ceph 8.2.2. This is a new cluster, and our hosts are freezing up, with nothing in the logs and nothing on the connected monitors. The only way to get them back online is to hard reboot the servers. We were planning on updating to 6.8.4-3 this weekend to see if it would resolve the issue.

antonin.chadima and Lephisto, did you have any more hosts freeze up after you downgraded back to 6.5?
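In the meantime, here is a rough sketch of what we plan to do to capture more data before the next freeze (assuming standard Debian/systemd tooling on PVE; nothing Proxmox-specific):

Code:
# keep the journal across the hard reboot so the previous boot can be inspected
mkdir -p /var/log/journal
systemctl restart systemd-journald

# after the next freeze and forced reboot, check the tail of the previous boot
journalctl -b -1 -p warning -e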

 
Any word from the Proxmox Team on this? I am stuck somehow...
I've had a ticket open with the Proxmox Team since 5/1/2024, and they act like they've never seen this issue before. All they did was provide some Dell BIOS articles and suggest updating 4 hosts to 6.8.4-3 and 1 host to 6.5 to see which one would freeze first. This is not very enterprise-like support, and I'm extremely disappointed in Proxmox support.
 
No, 6.5 doesn't have the freeze issue.
 
No freezes, 6.5 is OK.

And I didn't have time for more experiments yet...
 
There are other threads [1] with crash dumps that narrow the issue down to blk_flush_complete_seq, which in turn calls blk_flush_restore_request. This function has seen recent activity [2], namely a fix for a NULL pointer dereference. As far as I can see, that fix has only just been scheduled for 6.10-rc1. So could someone please backport it to the pve-kernel and run tests?

[1] https://forum.proxmox.com/threads/r...l-running-vms-unresponsive.145981/post-665847
[2] https://lore.kernel.org/all/20240501110907.96950-9-dlemoal@kernel.org/
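For anyone who wants to try this ahead of an official build, a rough sketch of how the patch could be dropped into a local pve-kernel build (assuming the pve-kernel packaging from git.proxmox.com; the patch file name, numbering, and build invocation are illustrative, and the usual kernel build prerequisites apply):

Code:
# fetch the kernel packaging; Proxmox applies its patches from patches/kernel/
git clone git://git.proxmox.com/git/pve-kernel.git
cd pve-kernel

# pull the upstream fix from lore and drop it in as an extra patch
curl -o patches/kernel/9999-blk-flush-null-deref.patch \
    https://lore.kernel.org/all/20240501110907.96950-9-dlemoal@kernel.org/raw

# build the .deb packages and install the resulting kernel on a test node
make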
 
As already written: https://forum.proxmox.com/threads/r...l-running-vms-unresponsive.145981/post-668282
 
I've been experiencing this issue as well, just had my 4th hang in the past two weeks:
  • Proxmox 8.2.2 on kernel 6.8.4-3-pve (latest at time of writing)
  • Ceph 18.2.2 on NVMe (Kioxia KCM5DRUG3T84)
  • AMD CPU (Ryzen 7900) with 64GB (non-ECC) RAM
  • 3 node HA cluster
  • Intel 82599 10GE NIC (10Gtek)
The cluster has been running stable since early Feb, which admittedly is not a lot of time, but it's been solid up until now.

What's noteworthy is that only one of my nodes is hanging, and this particular node is unique in that I am using a VM with PCI passthrough of the host's SATA controller. I am using cputype=host with most of my VMs as well, but that applies to all VMs on all three hosts, whereas the node that has hung so far (now 4 times) is the only one doing host PCI passthrough.
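For context, the relevant parts of that VM's config look roughly like this (an excerpt from /etc/pve/qemu-server/<vmid>.conf; the PCI address is illustrative):

Code:
# /etc/pve/qemu-server/<vmid>.conf (excerpt)
cpu: host
hostpci0: 0000:05:00.0,pcie=1
machine: q35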

Is anyone else hitting this problem using hostpci devices in guests?

Edit: somehow missed the last page of comments. If the fix is only present in Linus' 6.10-rc branches, I wonder if Proxmox could cherry-pick it into the pve kernel?
 
We had the same issue where random servers froze and became completely unresponsive. The only thing that fixed it for us was to revert to kernel 6.5.13-5 using the following command: proxmox-boot-tool kernel pin 6.5.13-5-pve
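For anyone who hasn't pinned a kernel before, the full sequence is roughly as follows (assuming proxmox-boot-tool manages your boot entries and the 6.5.13-5 kernel is still installed on the host):

Code:
# show installed kernels and the current pin state
proxmox-boot-tool kernel list

# pin the known-good kernel and reboot into it
proxmox-boot-tool kernel pin 6.5.13-5-pve
reboot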
 
We have the same problem after upgrading: random crashes.
We are very surprised by the progress and by the reactions of the staff in this thread.
Even Thomas-Krenn has a page about this, so it does not seem to be a small problem:
https://www.thomas-krenn.com/de/wiki/Known_Issues_Proxmox_VE_8.2#Kernel_Freezes

What is the plan to mitigate this? Why has the version not been pulled, even though these problems have apparently been known for weeks?
At least from the paid subscribers repo.

Are we missing some info?
 
I have updated both my test server and my main server to the latest versions, 8.2.4 and the 6.8.8-1 kernel, and so far I haven't noticed any crashing or other issues.
 
Hi @Kevo

How long has it been since you upgraded the servers to 6.8.8-1?
I'm having issues with a 5-server cluster on version 6.8.4-2-pve.


The servers freeze randomly (about one server per week), without any related logs.

I am thinking about upgrading to the latest kernel version or downgrading to 6.5.13-5.

Thanks.
 
About a week. My Intel Mac mini is already updated to 6.8.8-2 with QEMU 9, also with no issues so far. My Ryzen machine is still on 6.8.8-1 with no issues. I plan on updating it to 6.8.8-2 with QEMU 9 in another day or two as long as I don't see any problems on the mini.

I definitely had issues on both machines on 6.8.4. No problems with 6.5.13-5, so I would for sure either pin 6.5 for a while until you're comfortable with the upgrade, or go ahead and upgrade to 6.8.8 and see if that works for you.
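If you do decide to move off the pinned 6.5 kernel later, the steps are roughly as follows (assuming the standard PVE repos; do one node at a time and make sure it comes back cleanly):

Code:
# remove the 6.5 pin, pull current packages, and reboot into the new kernel
proxmox-boot-tool kernel unpin
apt update && apt full-upgrade
reboot

# after the reboot, confirm which kernel is running
uname -r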
 
Hello All,

Piggybacking on this thread because I seem to be having an issue somewhere between this one and:
https://forum.proxmox.com/threads/r...-ssh-and-all-running-vms-unresponsive.145981/

Still having issues on the 6.8.8 kernel. The problem presented itself when passing through a Coral TPU PCIe device to a VM.

I can log into the host via SSH during the boot process, but once the VM with the Coral passed through starts up, the connection drops and even the display on the VGA connection stops working. The web UI is also unresponsive.

Pinning the kernel to 6.5.13-5 did not seem to fix the issue in my case. Removing all of my PCIe devices one by one did not seem to fix it either.
I think I need to either stop that VM from starting, or remove the Coral PCIe reference from it, to get the host to start up. Does anyone have any insight into how I can accomplish either of these tasks, or any other relevant information on trying to fix this issue?

Thanks.
I think this is not related to what's discussed in this thread.
 
I figured it could be related, but I suppose it is not. Disregard my comment above. I am sorry for wasting everyone's time with my post.
 
No worries -- as your issue indeed seems to be unrelated to the issue discussed in this thread, feel free to open a new thread for it.
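That said, to keep that VM from starting automatically, or to strip the Coral device out of its config so the host can come up without it, something along these lines should work (a rough sketch; the VM ID 100 and the hostpci0 slot are illustrative):

Code:
# keep the VM from starting at boot
qm set 100 --onboot 0

# or remove the passthrough entry from the VM config entirely
qm set 100 --delete hostpci0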
 
