Kernel 6.8.4-2 causes random server freezing

We are having the same issue. We have a 5-node Dell R660 cluster with Ceph 8.2.2. This is a new cluster, and our hosts are freezing up, with nothing in the logs and nothing on the connected monitors. The only way to get them back online is to hard reboot the servers. We were planning on updating to 6.8.4-3 this weekend to see if it would resolve the issue.

antonin.chadima and Lephisto, did you have any more hosts freeze up after you downgraded back to 6.5?
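In the meantime, here is a rough sketch of what we plan to do to capture more data before the next freeze (assuming standard Debian/systemd tooling on PVE; nothing Proxmox-specific):

Code:
# keep the journal across the hard reboot so the previous boot can be inspected
mkdir -p /var/log/journal
systemctl restart systemd-journald

# after the next freeze and forced reboot, check the tail of the previous boot
journalctl -b -1 -p warning -e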

 
Any word from the Proxmox Team on this? I am stuck somehow...
I've had a ticket open with the Proxmox Team since 5/1/2024, and they act like they've never seen this issue before. All they did was provide some Dell BIOS articles and suggest updating 4 hosts to 6.8.4-3 and 1 host to 6.5 to see which one would freeze first. This is not very enterprise-like support, and I'm extremely disappointed in Proxmox support.
 
No, 6.5 doesn't have the freeze issue.
 
No freezes, 6.5 is OK.

And I didn't have time for more experiments yet...
 
There are other threads [1] with crash dumps that narrow the issue down to blk_flush_complete_seq, which in turn calls blk_flush_restore_request. This function has seen recent activity [2], namely a fix for a NULL pointer dereference. As far as I can see, that fix has only just been scheduled for 6.10-rc1. So could someone please backport it to the pve-kernel and run tests?

[1] https://forum.proxmox.com/threads/r...l-running-vms-unresponsive.145981/post-665847
[2] https://lore.kernel.org/all/20240501110907.96950-9-dlemoal@kernel.org/
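For anyone who wants to try this ahead of an official build, a rough sketch of how the patch could be dropped into a local pve-kernel build (assuming the pve-kernel packaging from git.proxmox.com; the patch file name, numbering, and build invocation are illustrative, and the usual kernel build prerequisites apply):

Code:
# fetch the kernel packaging; Proxmox applies its patches from patches/kernel/
git clone git://git.proxmox.com/git/pve-kernel.git
cd pve-kernel

# pull the upstream fix from lore and drop it in as an extra patch
curl -o patches/kernel/9999-blk-flush-null-deref.patch \
    https://lore.kernel.org/all/20240501110907.96950-9-dlemoal@kernel.org/raw

# build the .deb packages and install the resulting kernel on a test node
make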
 
As already written: https://forum.proxmox.com/threads/r...l-running-vms-unresponsive.145981/post-668282
 
I've been experiencing this issue as well, just had my 4th hang in the past two weeks:
  • Proxmox 8.2.2 on kernel 6.8.4-3-pve (latest at time of writing)
  • Ceph 18.2.2 on NVMe (Kioxia KCM5DRUG3T84)
  • AMD CPU (Ryzen 7900) with 64GB (non-ECC) RAM
  • 3 node HA cluster
  • Intel 82599 10GE NIC (10Gtek)
The cluster has been running stable since early Feb, which admittedly is not a lot of time, but it's been solid up until now.

What's noteworthy is that only one of my nodes is hanging, and this particular node is unique in that I am using a VM with PCI passthrough of the host's SATA controller. I am using cputype=host with most of my VMs as well, but that applies to all VMs on all three hosts, whereas the node that has hung so far (now 4 times) is the only one doing host PCI passthrough.
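For context, the relevant parts of that VM's config look roughly like this (an excerpt from /etc/pve/qemu-server/<vmid>.conf; the PCI address is illustrative):

Code:
# /etc/pve/qemu-server/<vmid>.conf (excerpt)
cpu: host
hostpci0: 0000:05:00.0,pcie=1
machine: q35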

Is anyone else hitting this problem using hostpci devices in guests?

Edit: somehow missed the last page of comments. If the fix is only present in Linus' 6.10-rc branches, I wonder if Proxmox could cherry-pick it into the pve kernel?
 
We had the same issue where random servers froze and became completely unresponsive. The only thing that fixed it for us was to revert to kernel 6.5.13-5 using the following command: proxmox-boot-tool kernel pin 6.5.13-5-pve
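For anyone who hasn't pinned a kernel before, the full sequence is roughly as follows (assuming proxmox-boot-tool manages your boot entries and the 6.5.13-5 kernel is still installed on the host):

Code:
# show installed kernels and the current pin state
proxmox-boot-tool kernel list

# pin the known-good kernel and reboot into it
proxmox-boot-tool kernel pin 6.5.13-5-pve
reboot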
 
We have the same problem after upgrading: random crashes.
We are very surprised by the progress and by the reactions of the staff in this thread.
Even Thomas-Krenn has a page about this, so it does not seem to be a small problem:
https://www.thomas-krenn.com/de/wiki/Known_Issues_Proxmox_VE_8.2#Kernel_Freezes

What is the plan to mitigate this? Why has the version not been pulled, even though these problems have apparently been known for weeks?
At least from the paid subscribers repo.

Are we missing some info?
 
I have updated both my test server and my main server to the latest versions, 8.2.4 and the 6.8.8-1 kernel, and so far I haven't noticed any crashing or other issues.
 
Hi @Kevo

How long has it been since you upgraded the servers to 6.8.8-1?
I'm having issues with a 5-server cluster on version 6.8.4-2-pve.


The servers freeze randomly (about one server per week), without any related logs.

I am thinking about upgrading to the latest kernel version or downgrading to 6.5.13-5.

Thanks.
 
About a week. My Intel Mac mini is already updated to 6.8.8-2 with QEMU 9, also with no issues so far. My Ryzen machine is still on 6.8.8-1 with no issues. I plan on updating it to 6.8.8-2 with QEMU 9 in another day or two as long as I don't see any problems on the mini.

I definitely had issues on both machines on 6.8.4. No problems with 6.5.13-5, so I would for sure either pin 6.5 for a while until you're comfortable with the upgrade, or go ahead and upgrade to 6.8.8 and see if that works for you.
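If you do decide to move off the pinned 6.5 kernel later, the steps are roughly as follows (assuming the standard PVE repos; do one node at a time and make sure it comes back cleanly):

Code:
# remove the 6.5 pin, pull current packages, and reboot into the new kernel
proxmox-boot-tool kernel unpin
apt update && apt full-upgrade
reboot

# after the reboot, confirm which kernel is running
uname -r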
 
Hello All,

Piggybacking on this thread because I seem to be having an issue somewhere between this one and:
https://forum.proxmox.com/threads/r...-ssh-and-all-running-vms-unresponsive.145981/

Still having issues on the 6.8.8 kernel. The problem presented itself when passing through a Coral TPU PCIe device to a VM.

I can log into the host via SSH during the boot process, but once the VM with the Coral passed through starts up, the connection drops and even the display on the VGA connection stops working. The web UI is also unresponsive.

Pinning the kernel to 6.5.13-5 did not seem to fix the issue in my case. Removing all of my PCIe devices one by one did not seem to fix it either.
I think I need to either stop that VM from starting, or remove the Coral PCIe reference from it, to get the host to start up. Does anyone have any insight into how I can accomplish either of these tasks, or any other relevant information on trying to fix this issue?

Thanks.
I think this is not related to what's discussed in this thread.
 
I figured it could be related, but I suppose it is not. Disregard my comment above. I am sorry for wasting everyone's time with my post.
 
No worries -- as your issue indeed seems to be unrelated to the issue discussed in this thread, feel free to open a new thread for it.
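That said, to keep that VM from starting automatically, or to strip the Coral device out of its config so the host can come up without it, something along these lines should work (a rough sketch; the VM ID 100 and the hostpci0 slot are illustrative):

Code:
# keep the VM from starting at boot
qm set 100 --onboot 0

# or remove the passthrough entry from the VM config entirely
qm set 100 --delete hostpci0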
 
