[SOLVED] watchdog: watchdog detected hard LOCKUP on cpu all. How to solve it?

Hi!

I have been seeing these messages since PVE v3 and have never solved the problem. I only use HPE ML/DL servers.
I have never experienced any stability problems, so I simply ignore the messages.
 
But when this hard lockup event happens, the machine generally freezes and stops responding, including dropping SSH connections to the PVE host.
 
For a long time I suspected heat issues, since the device is passively cooled, but removing the cover for better air circulation did not improve anything.

I started the device again, keeping an eye on the log. When I started a Windows VM installation, a blue screen occurred, and the log showed: "CPU locked"
 

Disable NUMA / NUMA balancing and KSM.

Code:
$> apt-get install sysfsutils

/etc/sysfs.d/ksm.conf
----------------------------------------
# --> /sys/kernel/mm/ksm/run = 0
kernel/mm/ksm/run = 0
----------------------------------------

/etc/sysctl.conf
----------------------------------------
kernel.numa_balancing = 0
----------------------------------------
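
To confirm that the two settings above actually took effect, a small read-only check like the following might help. This is only a sketch: the `check_zero` helper and the `KSM_RUN` / `NUMA_BAL` override variables are made up for illustration, and it assumes the standard kernel paths `/sys/kernel/mm/ksm/run` and `/proc/sys/kernel/numa_balancing` exist on your host.

```shell
#!/bin/sh
# Read-only check: report whether KSM and automatic NUMA balancing are off.
# KSM_RUN and NUMA_BAL are hypothetical override variables (not part of any
# tool); by default they point at the standard kernel paths.
KSM_RUN="${KSM_RUN:-/sys/kernel/mm/ksm/run}"
NUMA_BAL="${NUMA_BAL:-/proc/sys/kernel/numa_balancing}"

# check_zero LABEL FILE
# Returns 0 if FILE contains "0", 1 if it contains anything else,
# and 2 if FILE cannot be read (e.g. feature not built into the kernel).
check_zero() {
    if [ ! -r "$2" ]; then
        echo "$1: not available ($2)"
        return 2
    fi
    val=$(cat "$2")
    if [ "$val" = "0" ]; then
        echo "$1: disabled"
        return 0
    fi
    echo "$1: still enabled (value $val)"
    return 1
}

check_zero "KSM" "$KSM_RUN"
check_zero "NUMA balancing" "$NUMA_BAL"
```

Note that a change in /etc/sysctl.conf only becomes active after a reboot or after running `sysctl -p` as root, so the check may still report "still enabled" right after editing the file.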


More info:
https://forum.proxmox.com/threads/p...cpu-issue-with-windows-server-2019-vms.130727