VM freezes irregularly

Hi

Due to the recent reports of success with kernel 6.2 and microcode 0x24000024, I decided to give it a try. System: Intel N5105, MB: CW-N6000 (from Topton; the MB is obviously made by Changwang), 4x i225 B3, 2x8GB RAM, 1x NVMe SSD from Western Digital.

Before I was running the kernel 6.1 with the following cmdline and 0x24000023 microcode:

root@pve:~# cat /proc/cmdline
initrd=\EFI\proxmox\6.1.10-1-pve\initrd.img-6.1.10-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet intel_idle.max_cstate=1 intel_iommu=on iommu=pt mitigations=off i915.enable_guc=2 initcall_blacklist=sysfb_init nvme_core.default_ps_max_latency_us=14900

This ran without crashes or host freezes for ~43 days.

I then removed intel_idle.max_cstate=1 from the cmdline and booted kernel 6.2 with the newer microcode. After ~6h30 to 7h I had a full host freeze that required a power cycle.

root@pve:~# last | grep reboot
reboot system boot 6.2.6-1-pve Fri Mar 31 10:27 still running <- current boot, with ~2 days uptime, kernel 6.2 0x24..24 w/ max_cstate=1
reboot system boot 6.1.10-1-pve Fri Mar 31 09:45 - 10:27 (00:42) <- after power cycle booted older kernel (6.2 was only pinned for next boot)
reboot system boot 6.2.6-1-pve Thu Mar 30 23:52 - 10:27 (10:35) <- host freeze after ~6h30/7h, 6.2 0x24..24 w/o max_cstate parameter
reboot system boot 6.1.10-1-pve Wed Feb 15 18:27 - 23:52 (43+04:25)

reboot system boot 5.19.17-1-pve Wed Feb 15 18:23 - 18:26 (00:03)
reboot system boot 6.1.10-1-pve Wed Feb 8 22:47 - 18:12 (6+19:24)
reboot system boot 5.19.17-1-pve Mon Jan 30 13:48 - 22:47 (9+08:59)
reboot system boot 5.19.17-1-pve Sat Nov 26 20:14 - 13:45 (64+17:30)

root@pve:~# uptime
15:17:44 up 2 days, 4:50, 2 users, load average: 0.02, 0.03, 0.04

At least in my case, kernel 6.2 + microcode 0x24..24 is not a fix. The only thing that keeps the host stable is setting intel_idle.max_cstate=1 to limit C-states to C1.
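
For anyone comparing setups, here is a small sketch for checking whether the C-state limit is actually present on a kernel command line. The helper name cmdline_param is made up for illustration; it only looks for a key=value token and reads /proc/cmdline when no string is passed in.

```shell
#!/bin/sh
# cmdline_param KEY [CMDLINE]
# Print the value of KEY=value from a kernel command line.
# Returns 1 if KEY= is not present. CMDLINE defaults to /proc/cmdline.
# (Hypothetical helper, not part of Proxmox or any package.)
cmdline_param() {
    key=$1
    # Leading space so that the very first token can match " KEY=" too.
    line=" ${2-$(cat /proc/cmdline)}"
    rest=${line#*" $key="}
    [ "$rest" != "$line" ] || return 1   # pattern did not match: KEY= absent
    printf '%s\n' "${rest%% *}"          # keep only the value up to the next space
}

# Example:
#   cmdline_param intel_idle.max_cstate || echo "no C-state limit set"
```

Matching on " KEY=" with the leading space keeps iommu from matching inside intel_iommu.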
 
Here is mine:
Code:
root@prox:~# cat /proc/cmdline
initrd=\EFI\proxmox\6.1.14-1-pve\initrd.img-6.1.14-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet intel_iommu=on iommu=pt

I see you're disabling CPU mitigations, loading GPU firmware for HEVC support, likely using GPU passthrough, and changing NVMe APST. Any one of these could be an issue with a newer kernel.

I'd try running 6.1.10 with the 0x24000024 microcode and without intel_idle.max_cstate=1. That way you're testing one thing: whether the new microcode fixes the CPU idle bug with KVM/QEMU VMs.

If that is stable then you can experiment with different kernel parameters and versions.

Could it be that your NVMe drive is misbehaving and the 6.2 kernel is more sensitive to it? Can you update its firmware? I use a Samsung 980 (non-pro) SSD and until I updated its firmware it was throwing an occasional SMART warning about overheating when it clearly was not. After the firmware update it has been rock solid.

There are many such reports that WD NVMe drives cause Linux to hang:
https://esc.sh/blog/nvme-ssd-and-linux-freezing/

I'd try replacing it with a Samsung 980. They're $40 on Amazon for 500GB. Just make sure to upgrade its firmware if it's not running the latest.

I have not had the host crash in any of my testing. There is something else going on with your setup.
 
Again ALL OK !!!!! :)

N5105, 16gb kingston, ssd 980 500gb + hdd 1.75tb samsung
root@pve:~# uptime
21:45:24 up 20 days, 7:15,
root@pve:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.1.15-1-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on
root@pve:~# pveversion
pve-manager/7.4-3/9002ab8a (running kernel: 6.1.15-1-pve)

microcode 024
all vms with host cpu
cstate enabled in bios, default was disabled

4 vms:
- pfsense with 2 nics in passthrough
- xigmanas (media for plex, space for video recording from IP cameras, documents, etc.)
- homeassistant
- win 7
3 lxc:
- omada controller on ubuntu 20.04
- plex server on ubuntu 21.10
- nginx proxy manager on debian


Before this setting I never reached 5 to 10 days of uptime: VMs rebooted, or pfsense crashed (1-3 times in 6 months).
My traffic on the pfsense LAN for recording the IP cameras is now about 2.5 TB; before, it never got past 500 GB.
 
Attachments: proxmox.jpg, prox.jpg, opnsense.jpg

30 days, no errors... I am gonna update proxmox and opnsense and conclude the test.

For me, this issue is "partially" resolved. Again, it would be nice if manufacturers like CWWK would release a BIOS update, but that looks unlikely.

I don't think I need to post a guide on how to update the microcode, since Debian has already proposed it as a stable update (stable-p-u: 3.20230214.1~deb11u1). Just adding non-free and doing an apt install will probably be possible either next week or the week after.
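
Once the package lands in stable, the "add non-free and install" step could look roughly like this on a Debian 11 based PVE 7 host. This is only a sketch: the add_non_free helper is made up for this post, it assumes the classic one-line "deb ..." format in /etc/apt/sources.list, and it only touches lines that already carry the main component so the Proxmox repo lines are left alone. Check the result against your own file before overwriting anything.

```shell
#!/bin/sh
# add_non_free: read sources.list lines on stdin and append the "non-free"
# component to every active deb/deb-src line that has "main" but not yet
# "non-free". Comments and all other lines pass through unchanged.
# (Hypothetical helper for illustration.)
add_non_free() {
    awk '
        /^deb(-src)?[ \t]/ && /[ \t]main([ \t]|$)/ && !/[ \t]non-free([ \t]|$)/ {
            print $0 " non-free"
            next
        }
        { print }
    '
}

# Example (as root), keeping a backup of the original file:
#   cp /etc/apt/sources.list /etc/apt/sources.list.bak
#   add_non_free < /etc/apt/sources.list.bak > /etc/apt/sources.list
#   apt update && apt install intel-microcode
```

The new microcode is loaded early at boot, so a reboot is needed after the install.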
 
@s0x
Using mitigations=off while you are trying out an updated microcode might not work, as it might turn off fixes from the update. I am not sure.

As for the host freezing or crashing: did you check the CPU temperature? I recommend adding a fan on top of the device to see if you still have the issue.
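
A quick way to look at temperatures without installing anything, as a sketch: the kernel exposes thermal zones under /sys/class/thermal, with readings in millidegrees Celsius. Which zone maps to the CPU package varies by board, so this made-up max_temp_c helper just reports the hottest zone it can read.

```shell
#!/bin/sh
# max_temp_c [BASE_DIR]
# Print the highest thermal zone reading in whole degrees Celsius.
# BASE_DIR defaults to /sys/class/thermal; each thermal_zone*/temp file
# holds one reading in millidegrees Celsius.
# Returns 1 if no readable zone is found. (Hypothetical helper.)
max_temp_c() {
    base=${1:-/sys/class/thermal}
    max=
    for f in "$base"/thermal_zone*/temp; do
        [ -r "$f" ] || continue
        t=$(cat "$f" 2>/dev/null) || continue
        case $t in ''|*[!0-9-]*) continue ;; esac   # skip non-numeric readings
        if [ -z "$max" ] || [ "$t" -gt "$max" ]; then
            max=$t
        fi
    done
    [ -n "$max" ] || return 1
    echo $(( max / 1000 ))
}

# Example: log the hottest zone once a minute while the VMs are under load
#   while sleep 60; do echo "$(date) $(max_temp_c) C"; done
```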
 
I just registered to thank all of you for helping with this problem. I was struggling with my setup:
5 Topton N6005 boxes running Proxmox and Ceph, with 5 Kubernetes VMs that started crashing as I added containers to them.
Since I installed Intel microcode 24, it seems to be rock solid.
What I don't understand is why the Proxmox team doesn't distribute the microcode. I am pretty sure many clusters have big security issues because of that.
 
These units are not the kind of hardware PVE is intended to run on. Although it does pretty well. So I would not expect Proxmox to include microcode directly. The microcode should better be supplied via a BIOS update by the manufacturer.
 
The priority order for microcode updates is:

1) BIOS (AMD Agesa / Intel uCode)
2) OS (Windows / Linux / BSD)

Since Proxmox leaves OS stuff to Debian, it shouldn't be included in Proxmox. Again, the rule is: leave OS stuff to the OS distro.
 
I'm calling the issue fixed. pfSense ran without issue for a month with x24 microcode. I had to reboot Proxmox to update to 7.4.

Screenshot 2023-04-06 at 9.44.00 PM.png
 
It looks like a new microcode version has been added to the debian repository?

Code:
dpkg -l intel-microcode
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name            Version              Architecture Description
+++-===============-====================-============-===========================================
ii  intel-microcode 3.20230214.1~deb10u1 amd64        Processor microcode firmware for Intel CPUs
 
It's currently in the stable-proposed-updates repo, not yet in the stable repo. It'll likely come out with Debian 11.7 at the end of the month.
 
Thanks for all your work, guys! Fingers crossed that microcode 0x24000024 does the trick for me. Like everyone said, 0x24000023 improved things marginally, but I still had occasional freezes. Just updated to 0x24000024. (edit: N5105 btw)
 
You can download latest intel-microcode from Debian unstable repo. https://packages.debian.org/sid/amd64/intel-microcode/download
 
Looks like Changwang (CWWK) is starting to update their BIOS for the JasperLake miniPCs

Please note this is for the N5105 v3 v4 v5 (4 LAN Ports).
Hopefully, the others will follow.

(Translated from the Chinese release notes)
Version: FMI04 N5105
Date: 2023-04-18
Supported motherboards: N5105 V3 V4 V5

Updated to the microcode released by Intel on February 14, 2023.
Fixed the 11th-generation virtualization crash problem.
 
I have a HUNSN RJ03 that uses a Changwang v5 board. Has anyone tried to update their BIOS with their ISO? Any issues?
 
Given the lack of protection against flashing the wrong BIOS, I find it somewhat risky to update the BIOS on these boards. The only benefit of flashing the BIOS is if you run an OS that doesn't perform microcode updates. There is no difference between the updated intel-microcode package and the updated BIOS other than the time at which the updated microcode loads.

Also I'm happy to report I haven't had a single VM freeze since the updated microcode.
 
I updated my BIOS and ripped out the intel-microcode package. It reset my BIOS settings which I had to reconfigure. Otherwise seems to be OK.

I'd rather not depend on the OS to update the microcode if the hardware can be flashed with the latest.

Code:
grep 'stepping\|model\|microcode' /proc/cpuinfo

model           : 156
model name      : Intel(R) Celeron(R) N5105 @ 2.00GHz
stepping        : 0
microcode       : 0x24000024
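
The same check can be scripted, for example to run after every boot. A rough sketch follows: the helper name microcode_ok is invented here, and the 0x24000024 threshold is simply the revision people in this thread report as stable on the N5105, not an official Intel statement.

```shell
#!/bin/sh
# microcode_ok MIN_REV [CPUINFO_FILE]
# Succeed if every "microcode" line in CPUINFO_FILE (default /proc/cpuinfo)
# reports a revision >= MIN_REV. Both are hex strings such as 0x24000024.
# Returns 1 if any core is below MIN_REV, 2 if no microcode line is found.
# (Hypothetical helper for illustration.)
microcode_ok() {
    min=$1
    file=${2:-/proc/cpuinfo}
    # One line per logical CPU; collapse identical revisions.
    revs=$(awk -F': *' '/^microcode/ { print $2 }' "$file" | sort -u)
    [ -n "$revs" ] || return 2
    for rev in $revs; do
        # Shell arithmetic understands the 0x prefix.
        [ $(( $rev )) -ge $(( $min )) ] || return 1
    done
    return 0
}

# Example:
#   microcode_ok 0x24000024 && echo "microcode is new enough"
```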
 
