watchdog detected hard lockup on cpu XX

kez

Member
Mar 26, 2023
Hi,

Can anyone help? One of the four nodes in an HPE Apollo r2600 Gen10 system won't boot the PVE ISO and fails with the error:

watchdog detected hard lockup on CPU XX

Where XX is a number that changes each time I try.

All nodes have the same hardware and BIOS version (the latest). I can install other Linux distributions, such as Alma9, from an ISO on the affected node. I'm not sure how to troubleshoot this.
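Since the CPU number changes between attempts, one starting point is to tally which CPUs are named across several captured boot logs. The following is only a minimal sketch: it assumes the console output has been saved to a text file (for example via a serial console or iLO log capture), and `tally_lockups` is a hypothetical helper name, not part of any tool.

```python
import re
from collections import Counter

# Matches lines such as "watchdog detected hard lockup on CPU 12".
# The match is case-insensitive because the capitalisation varies between reports.
LOCKUP_RE = re.compile(r"detected hard lockup on cpu\s+(\d+)", re.IGNORECASE)


def tally_lockups(lines):
    """Count hard-lockup reports per CPU number in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Feeding it the lines of a saved log (for example `tally_lockups(open("console.log"))`, where `console.log` is a placeholder file name) shows whether the reports cluster on a few cores or are spread across all of them.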

Thanks!
 
I know it's been six months since the original post, but was there ever a solution to this?

I have five hosts, all Dell OptiPlex and Precision machines with Core i5 and i7 CPUs and the latest BIOS versions, and they all updated from 9.1.1 to 9.1.4 without any issue.

But my Dell Precision with a Xeon and DDR RAM hangs with this same message, even after removing it from the cluster, wiping it, and doing a clean install of 9.1.1, then updating. It appears that something in the 9.1.4 update is causing this. It was running fine up until this update, and I ran MemTest against the RAM to rule that out, and it passed with zero issues. Additionally, the clean install of 9.1.1 was running fine, which is how I was able to run the update to 9.1.4.

Thanks!
 
My solution (HPE ProLiant DL360 Gen10) was to set the power profile in the BIOS to "High Performance Mode". Apparently, CPU stalls occur when a core has to be woken from a deep sleep state. I have had no problems for the past 4 weeks.
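To see which idle states the running kernel actually exposes before or after changing the BIOS profile, the cpuidle entries in sysfs can be read. This is a rough sketch, assuming the standard Linux layout under `/sys/devices/system/cpu/cpuN/cpuidle/stateX/`; `list_idle_states` is a hypothetical helper name, and whether the listing changes with the BIOS power profile on a given machine is not something confirmed in this thread.

```python
from pathlib import Path


def list_idle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return (name, exit_latency_us, disabled) for each cpuidle state of one CPU."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    states = []
    if not base.is_dir():
        # No cpuidle directory: idle states are not exposed for this CPU.
        return states
    # Sort numerically so that state10 comes after state9.
    state_dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        latency = int((state_dir / "latency").read_text())
        disable_file = state_dir / "disable"
        # Treat a missing "disable" file as "not disabled" (assumption).
        disabled = disable_file.exists() and disable_file.read_text().strip() != "0"
        states.append((name, latency, disabled))
    return states
```

Deep states show up with larger exit latencies; an empty list or only shallow states would be consistent with deep sleep being switched off, though that interpretation is an assumption here.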
 
Same issue here with my Dell Precision 5820 with a Xeon 2133.
Everything was working correctly with Proxmox 9.1.1 and kernel 6.17.2-1-pve.
After upgrading to the latest Proxmox 9.1.4 and kernel 6.17.4-1-pve, I encountered the error "watchdog detected hard lockup on cpu". I had to roll back to the previous kernel, 6.17.2-1-pve, to get it working again.
All my other PVE nodes (Dell OptiPlex machines and my dual Xeon E5) work correctly after upgrading to Proxmox 9.1.4 and the latest kernel, 6.17.4-1-pve.
 
Hey everyone, just wanted to follow-up with the solution I found for my situation.

I ran MemTest86 and Prime95 each for 24 hours with zero errors, so I was certain there weren't any hardware issues. I also made sure the BIOS was the latest version, reset it to factory defaults, and tweaked every power-saving option I could find for performance and disabled any kind of sleeping or hibernation - still got the same results. So, I wiped the drive, reinstalled Proxmox, and upgraded the kernel from 6.17.4-1 to 6.17.4-2 and pinned it there, and all has been running smoothly now.
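For anyone scripting the same workaround across several nodes, the check and the pin step could look roughly like the sketch below. It assumes the kernel release strings follow the `-pve` naming seen earlier in the thread (so 6.17.4-2 is written as `6.17.4-2-pve`, which is a guess) and relies on the `proxmox-boot-tool kernel pin` subcommand; `REPORTED_BAD`, `is_reported_bad`, and `pin_command` are hypothetical names, and the "bad" list reflects only the reports in this thread.

```python
import platform

# Kernel releases reported as locking up in this thread (on specific hardware only).
REPORTED_BAD = {"6.17.4-1-pve"}


def is_reported_bad(release=None):
    """Return True if the given (or currently running) kernel release is in REPORTED_BAD."""
    if release is None:
        release = platform.release()  # same string as `uname -r`
    return release in REPORTED_BAD


def pin_command(version):
    """Build the argument list that pins `version` as the default boot kernel."""
    return ["proxmox-boot-tool", "kernel", "pin", version]
```

The returned list can be passed to `subprocess.run(..., check=True)` as root. `proxmox-boot-tool kernel list` shows the installed versions to choose from, and `proxmox-boot-tool kernel unpin` removes the pin once a newer kernel has been tested.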

So, the next question, if anyone here has an answer - when will it be safe to upgrade the kernel again? 6.17.5? or v6.18? Or whatever comes with the next version of Proxmox, either 9.2 or 10.0? It's kind of hard to test when it's the only machine out of five that had this issue, and once this happened and it hung at the locked-up CPU message, I had no way (that I could find) to get to a command prompt, which is why I just wiped the disk and started over.