It took me weeks to find this thread while trying to figure out this same issue: a VM freezes while using 100% of one CPU core. The Proxmox hosts continue to operate without noticeable issues.
I have a three-host Proxmox cluster of identical Protectli VP2420 systems with the Intel Celeron J6412 CPU, running Proxmox 8.0.4 and the latest kernel as of today. All of these hosts have had VMs go into the frozen state discussed in this thread, so this does not appear to be specific to the N5105 CPU class. Side note: I have another cluster of Protectli systems running Intel Core i7-10810U CPUs with the same versions of Proxmox and Ubuntu VMs. I used the same scripts to spin up a k0s cluster on those VMs, and none of them have frozen.
There are two k0s clusters on these J6412 CPU systems where I have been testing different methods of installation, each with 3 controller VMs and 3 worker VMs spread across the three Proxmox hosts all running on a fully patched Ubuntu 22.04. I continue to see the controller VMs freeze in the same way described here, where the VM shows 100% of a single core in use. When I give them 2 cores, the Proxmox UI will show a steady 50% CPU use. When given 4 cores, it will show a steady 25% CPU use. At that point you cannot ping or SSH into the VM and the serial console (the VMs are built using a Cloud-Init image) is unresponsive or dead. I have not had the worker nodes (same Ubuntu and k0s versions) freeze. I spent far too much time thinking this was a k0s issue before I found this thread and had a completely different VM go into this state.
I attempted to migrate the VM to another host, since others have reported that this allowed the VM to recover, but it did not work.
I also attempted a hibernate (suspend to disk) and resume, which also did not get the VM to recover.
I also attempted setting the scaling_governor to powersave, which did not help.
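For anyone who wants to try the same thing, here is a minimal sketch of changing the governor through sysfs. The set_governor helper and its optional second argument (an alternate sysfs root) are only my illustration, not something from this thread. A change made this way does not survive a reboot.

```shell
#!/bin/sh
# Hypothetical helper: write a governor name to every core's scaling_governor file.
# $1 = governor name (e.g. powersave)
# $2 = CPU sysfs root (assumption: defaults to the standard Linux path)
set_governor() {
    gov="$1"
    root="${2:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue   # glob matched nothing, e.g. no cpufreq driver loaded
        echo "$gov" > "$f"
    done
}

# Typical use as root on the Proxmox host:
#   cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # show current values
#   set_governor powersave
```

The available governor names depend on the cpufreq driver in use, so check scaling_available_governors in the same directory before writing a value.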
My last attempt is installing the latest microcode as described by others. In this case it is 0x17.
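A quick way to confirm which microcode revision a host is actually running is to read it from /proc/cpuinfo. The microcode_rev helper below is a hypothetical convenience function of mine, and the install commands in the comments assume a Debian-based Proxmox host with the non-free-firmware repository component enabled, so treat them as a sketch and not exact steps.

```shell
#!/bin/sh
# Hypothetical helper: print the microcode revision of the first CPU listed.
# $1 = cpuinfo file (defaults to /proc/cpuinfo)
microcode_rev() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Typical use on the host:
#   microcode_rev                      # e.g. 0x16 before the update
#
# Assumed install steps (Debian package name; needs non-free-firmware enabled):
#   apt update && apt install intel-microcode
#   reboot
#   microcode_rev                      # re-check the revision after the reboot
```

Running the check on every host in the cluster after the reboot makes it easy to spot one that is still on the old revision.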
Code:
root@gadget1:~# pveversion
pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-19-pve)
root@gadget1:~# uname -a
Linux gadget1 6.2.16-19-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-19 (2023-10-24T12:07Z) x86_64 GNU/Linux
A snippet of CPU info for the first core only, as the other three cores are the same.
Code:
root@gadget3:~# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 150
model name : Intel(R) Celeron(R) J6412 @ 2.00GHz
stepping : 1
microcode : 0x16
cpu MHz : 2000.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
flags : *REDACTED for space*
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs srbds mmio_stale_data
bogomips : 3993.60
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
UPDATE (2023 Nov. 15):
Previously one or more of the VMs on this cluster would usually freeze within 24 hours. Since installing microcode 0x17 on these systems, I have not had any VM freeze. This appears to also be the fix for this class of CPU.