VM freezes irregularly

What are your thermals? Does your CPU frequency still scale up and down? Doesn't this completely disable CPU power management?

What is the output of these two commands:
Code:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
cat /sys/devices/system/cpu/cpuidle/current_driver

I'm not sure about thermals, but they probably aren't the best. I'm using the 4 x 2.5G Topton systems that look like large heatsinks/car amps. These are fanless, but generally do better with a fan. I'd be happy to provide this info if there's an easy way to get this data.
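
For anyone else wondering how to get thermal data without installing anything, here is a minimal sketch that reads the kernel's thermal zones from sysfs. The function name `print_thermal_zones` is made up for illustration, and which zones exist depends on the board, so some systems may print nothing. The `sensors` command from the lm-sensors package usually gives more detail (per-core temperatures).

```shell
#!/bin/sh
# Print every thermal zone under a sysfs root as "type: NN.N C".
# The kernel reports these temperatures in millidegrees Celsius.
# The root directory is a parameter (hypothetical convenience) so the
# function can also be pointed at a copy of the tree.
print_thermal_zones() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        millideg=$(cat "$zone/temp" 2>/dev/null) || continue
        type=$(cat "$zone/type" 2>/dev/null || echo unknown)
        awk -v t="$type" -v m="$millideg" \
            'BEGIN { printf "%s: %.1f C\n", t, m / 1000 }'
    done
}

# Usage on the Proxmox host:
#   print_thermal_zones
```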

Here's the other info you asked about:

Code:
root@pve1:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
intel_pstate
root@pve1:~# cat /sys/devices/system/cpu/cpuidle/current_driver
acpi_idle
 
Tested with:
- pve-manager/7.3-6/723bb6ec (running kernel: 6.1.10-1-pve)
- microcode 0x24000023
- intel_pstate=disable in grub

pfsense rebooted within 1 day. Just installed 0x24000023 ... test restart !!!
 
@R1CH how's it going for you with your microcode 0x24000024 (date = 2022-09-02) update test?

On my side, it resolves the crashing and hanging of my VMs in Proxmox. For the OPNsense VM, the random reconnects with the gateway have also disappeared, which shows that there is definitely a stability improvement. I don't need to bother with P-state or C-state limiting.

I have contacted CWWK and requested that they update the Intel microcode in a BIOS release. I don't know if they will do it, but at least I tried.
For those who gave up on this issue, I would recommend either going bare metal or using ESXi 8 instead. Otherwise, update the microcode to 0x24000024 (date = 2022-09-02).

@zUpm4n don't bother with the 0x24000023, try R1CH's update to 0x24000024 instead. While the 0x24000023 microcode somewhat helps in certain cases, it doesn't fully resolve the issues. I have not tried or checked R1CH's package, but my own update to the 0x24000024 does indeed work.
 
Thank you for the support. I'm testing 0x24000024 with pstate disabled. C-states are enabled in the BIOS (the default was disabled) and in grub, so my temperatures are lower and the CPU no longer sits at turbo all the time. I'm now at 1d 13h of uptime. I have sometimes reached 5 days before, so I hope this is the solution. PS: I'm using R1CH's package.
 
You can remove the pstate and cstate thing for the kernel. I am not using it.

I am not sure if it is required, but you might need to do a "update-initramfs -c -k all" after replacing the microcode.
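
One way to confirm which revision is actually running after the reboot is to read it back from /proc/cpuinfo. This is only a sketch; the function names `microcode_revisions` and `microcode_is` are hypothetical helpers, not part of any package mentioned in this thread.

```shell
#!/bin/sh
# Print the distinct microcode revisions found in a cpuinfo-style file.
# Defaults to /proc/cpuinfo; the path is a parameter so saved output
# can be checked as well.
microcode_revisions() {
    awk -F': *' '/^microcode/ { print $2 }' "${1:-/proc/cpuinfo}" | sort -u
}

# Succeed only if every core reports the expected revision.
microcode_is() {
    expected="$1"
    [ "$(microcode_revisions "${2:-/proc/cpuinfo}")" = "$expected" ]
}

# Usage on the host:
#   microcode_is 0x24000024 && echo "running 0x24000024"
```

If the old revision still shows up after a reboot, that may be a hint that the initramfs was not regenerated with the new microcode file.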
 
OK, I don't know whether to restart again with your suggestion of removing them from grub. Is C-state enabled in your BIOS?
 
My BIOS is default, nothing changed, except for Last state upon power loss.

My /etc/default/grub has "GRUB_CMDLINE_LINUX_DEFAULT="quiet initcall_blacklist=sysfb_init intel_iommu=on iommu=pt""
Since I use PCIe Passthrough for the LAN ports and GPU passthrough.
I use XFS, with UEFI. Governor is Performance.

So basically, other than PCIe Passthrough, GPU Passthrough which requires Kernel 6.1, and Microcode, everything else is pretty much default.

VM CPU is host, ballooning off, and PCIe Passthrough for LAN ports 1 to 4 (My unit is a 6 port N6005).
 
OK. In my case the only BIOS change was enabling C-states, which were disabled by default. With them disabled the cores always ran at 2.8 GHz (N5105), so now I have CPU frequency scaling.

grub now is:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
After your post I removed all the cstate options.

PCIe passthrough in my case is for 2 ports (WAN/LAN for pfSense).
I don't use XFS; with 16 GB of RAM I preferred the default installation. UEFI as well, and the same governor. I don't use GPU passthrough, but I have kernel 6.1.10 and the microcode was just updated. My VMs also use the host CPU type, and ballooning is unchecked.
Now waiting to see whether the new microcode resolves the issue.
 
Hi

On my end, still on kernel 5.19.17-1-pve, with 32 days uptime, two VMs, OPNsense ( 3 (1 sockets, 3 cores) [host,flags=-pcid;-spec-ctrl;-ssbd;+aes] [cpuunits=2048] ) with VirtIO NICs and HomeAssistant, and two LXC containers (PiHole and TP-Link Omada Controller, based on Ubuntu 22.04).

root@pve:~# last reboot | head -n 1
reboot system boot 5.19.17-1-pve Sat Nov 26 20:14 still running
root@pve:~# uptime
11:48:25 up 32 days, 15:33, 1 user, load average: 0.39, 0.32, 0.29
root@pve:~# uname -a
Linux pve 5.19.17-1-pve #1 SMP PREEMPT_DYNAMIC PVE 5.19.17-1 (Mon, 14 Nov 2022 20:25:12 x86_64 GNU/Linux

Host is a Topton N5105 (CW-6000) with i225 B3 NICs, BIOS date 29/09/2022, 2x8GB RAM, 1x NVMe SSD WD SN530. Extra Noctua 40mm fan 12v (NF-A4x10 PWM) as exhaust is inaudible (as intake the noise would be noticeable).

But I've applied several options to the kernel cmdline, see below.

Kernel cmdline options:

intel_idle.max_cstate=1 (limit intel_idle to C1, disabling deeper C-states such as C3)
intel_iommu=on iommu=pt (Enable iommu, since at the beginning I was going to use passthrough NICs to the OPNsense VM, but ended up using Virtio NICs, while testing for the crashes, and kept them)
mitigations=off (Self explanatory)
i915.enable_guc=2 ( Enable low-power H264 encoding, https://01.org/linuxgraphics/downloads/firmware , https://jellyfin.org/docs/general/administration/hardware-acceleration/#intel-gen9-and-gen11-igpus )
initcall_blacklist=sysfb_init ( GPU passthrough , https://wiki.tozo.info/books/server/page/proxmox-gpu-passthrough )
nvme_core.default_ps_max_latency_us=14900 ( https://esc.sh/blog/nvme-ssd-and-linux-freezing/ )
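
To confirm that options like the ones listed above actually reached the running kernel, a small check against /proc/cmdline can help. This is a minimal sketch; the function name `cmdline_has` is made up for illustration. On Proxmox the options are normally set in /etc/default/grub (followed by `update-grub`) on GRUB installs, or in /etc/kernel/cmdline (followed by `proxmox-boot-tool refresh`) on systemd-boot installs such as ZFS on UEFI.

```shell
#!/bin/sh
# Succeed if the given option appears as a whole word on the kernel
# command line. Defaults to /proc/cmdline; a file path can be passed
# to check saved output instead.
cmdline_has() {
    option="$1"
    file="${2:-/proc/cmdline}"
    tr ' ' '\n' < "$file" | grep -Fxq -- "$option"
}

# Usage after a reboot:
#   cmdline_has intel_idle.max_cstate=1 && echo "C-state limit is active"
```

Matching whole words avoids a false positive when one option is a prefix of another.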

Also, due to i2c-6 NAK errors ( [Sat Nov 26 20:14:37 2022] i2c i2c-6: sendbytes: NAK bailout. ) related to the iGPU, I connected a dummy HDMI dongle after confirming that with a monitor plugged in the errors stopped and so did the system crashes, but by then I had already applied the other kernel parameters.

Didn't test if those were related to the enabling of i915 GuC/HuC or not.

And due to errors related to the NVMe SSD (WD SN530 M.2 2242) I've applied the nvme_core.default_ps_max_latency_us parameter as well.

[Tue Nov 29 11:46:52 2022] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[Tue Nov 29 11:46:52 2022] nvme 0000:01:00.0: device [15b7:5008] error status/mask=00000001/0000e000
[Tue Nov 29 11:46:52 2022] nvme 0000:01:00.0: [ 0] RxErr
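
To judge whether a parameter change reduces these messages, it may help to count the corrected PCIe errors per device before and after. The following is a rough sketch that parses dmesg-style lines like the ones above; the function name `count_corrected_pcie_errors` is hypothetical, and the parsing assumes the device address is the field right before "PCIe".

```shell
#!/bin/sh
# Read kernel log lines on stdin and print "<pci-address> <count>" for
# every device that logged a corrected PCIe bus error.
count_corrected_pcie_errors() {
    awk '/PCIe Bus Error: severity=Corrected/ {
        for (i = 2; i <= NF; i++) {
            if ($i == "PCIe") {
                dev = $(i - 1)          # e.g. "0000:01:00.0:"
                sub(/:$/, "", dev)      # drop the trailing colon
                n[dev]++
                break
            }
        }
    }
    END { for (d in n) print d, n[d] }'
}

# Usage on the host:
#   dmesg -T | count_corrected_pcie_errors
```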

- edit -
Also updated the intel microcode:

root@pve:~# dmesg -T | grep microcode
[Sat Nov 26 20:14:32 2022] microcode: microcode updated early to revision 0x24000023, date = 2022-02-19
[Sat Nov 26 20:14:32 2022] SRBDS: Vulnerable: No microcode
[Sat Nov 26 20:14:33 2022] microcode: sig=0x906c0, pf=0x1, revision=0x24000023
[Sat Nov 26 20:14:33 2022] microcode: Microcode Update Driver: v2.2.

- edit -

Hopefully someone else finds this information helpful.

Still running stable, now with kernel 6.1.

Code:
root@pve:~# uname -r
6.1.10-1-pve
root@pve:~# uptime  
17:35:49 up 16 days, 23:08,  2 users,  load average: 0.09, 0.13, 0.12
root@pve:~# cat /proc/cmdline
initrd=\EFI\proxmox\6.1.10-1-pve\initrd.img-6.1.10-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet intel_idle.max_cstate=1 intel_iommu=on iommu=pt mitigations=off i915.enable_guc=2 initcall_blacklist=sysfb_init nvme_core.default_ps_max_latency_us=14900
root@pve:~# dmesg -T | grep microcode
[Wed Feb 15 18:26:58 2023] microcode: microcode updated early to revision 0x24000023, date = 2022-02-19
[Wed Feb 15 18:26:58 2023] SRBDS: Vulnerable: No microcode
[Wed Feb 15 18:26:59 2023] microcode: sig=0x906c0, pf=0x1, revision=0x24000023
[Wed Feb 15 18:26:59 2023] microcode: Microcode Update Driver: v2.2.
 
Here's an update to my situation. I was getting a freeze every 4 days with just a debian 11 VM running a few containers on my oDroid H3+ N6005 CPU.

1. I installed the microcode package currently available in the debian distros

Once I did this, I got random freezes and never went past a day without this VM halting

2. Thanks to the script provided by @Dobbler , I was able to at least recover within a few minutes.

3. Once the grub pstate setting was mentioned, I put that in place on the VM host and, knock on wood, it's been 2.5 days thus far without an issue.

Here are my settings just in case it helps anyone else.

Bash:
root@drill:~# cat /etc/default/grub|grep GRUB_CMDLINE_LINUX_DEFAULT
GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=disable quiet"

root@drill:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
acpi-cpufreq

root@drill:~# cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle

root@drill:~#  cat /proc/cpuinfo|grep 'microcode\|model name'
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000023
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000023
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000023
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000023

root@drill:~# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +40.0°C  (high = +105.0°C, crit = +105.0°C)
Core 0:        +34.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:        +34.0°C  (high = +105.0°C, crit = +105.0°C)
Core 2:        +34.0°C  (high = +105.0°C, crit = +105.0°C)
Core 3:        +34.0°C  (high = +105.0°C, crit = +105.0°C)
 
A few of us are testing the 0x24000024 microcode. I am having a good experience with it, I am not sure about the others.
You are on the 0x24000023 microcode.

You disabled your intel_pstate, I did not. I did not change the governor and am using the default performance.
I have not experienced any rebooting or freezing of the VM yet, but I will let it run for 10/20/30 days before posting screenshots.

Bash:
# cat /etc/default/grub|grep GRUB_CMDLINE_LINUX_DEFAULT
GRUB_CMDLINE_LINUX_DEFAULT="quiet initcall_blacklist=sysfb_init intel_iommu=on iommu=pt"

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
intel_pstate

# cat /sys/devices/system/cpu/cpuidle/current_driver
acpi_idle

# cat /proc/cpuinfo|grep 'microcode\|model name'
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000024
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000024
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000024
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000024

# sensors
acpitz-acpi-0
Adapter: ACPI interface
temp1:        +27.8°C  (crit = +119.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +41.0°C  (high = +105.0°C, crit = +105.0°C)
Core 0:        +40.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:        +40.0°C  (high = +105.0°C, crit = +105.0°C)
Core 2:        +40.0°C  (high = +105.0°C, crit = +105.0°C)
Core 3:        +40.0°C  (high = +105.0°C, crit = +105.0°C)

# cat /proc/cpuinfo | grep MHz
cpu MHz         : 3299.935
cpu MHz         : 3300.000
cpu MHz         : 3300.000
cpu MHz         : 3299.969
 
Your cpuidle driver is acpi_idle instead of intel_idle. Did you disable it or did the microcode update do it?

Your temps are high for idle and so is your frequency. It's possible your CPU doesn't enter C-state idle anymore.
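
A quick way to see whether the CPU is actually entering idle states is to look at the per-state counters the cpuidle subsystem exposes in sysfs. This is a minimal sketch; the function name `print_cstate_usage` is made up, and the set of states listed depends on the active cpuidle driver. `usage` is the number of times the state was entered and `time` is the total residency in microseconds.

```shell
#!/bin/sh
# Print name, entry count and residency for each idle state of one CPU.
# Defaults to cpu0; another cpuidle directory can be passed instead.
print_cstate_usage() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$dir"/state*; do
        [ -r "$state/name" ] || continue
        printf '%s entries=%s time_us=%s\n' \
            "$(cat "$state/name")" \
            "$(cat "$state/usage")" \
            "$(cat "$state/time")"
    done
}

# Usage on the host (run twice a few seconds apart and compare):
#   print_cstate_usage
```

If the counters for the deeper states do not move between two runs on an otherwise idle host, the CPU is probably not leaving the shallow states.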
 
I guess I'll share too. Recently updated to 0x24000024 after my last VM crash. Too early to tell so far.

Bash:
# cat /etc/default/grub|grep GRUB_CMDLINE_LINUX_DEFAULT
GRUB_CMDLINE_LINUX_DEFAULT="quiet"

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
intel_pstate

# cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle

# cat /proc/cpuinfo|grep 'microcode\|model name'
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000024
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000024
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000024
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000024

# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +35.0°C  (high = +105.0°C, crit = +105.0°C)
Core 0:        +34.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:        +34.0°C  (high = +105.0°C, crit = +105.0°C)
Core 2:        +34.0°C  (high = +105.0°C, crit = +105.0°C)
Core 3:        +34.0°C  (high = +105.0°C, crit = +105.0°C)

# cat /proc/cpuinfo | grep MHz
cpu MHz         : 2000.000
cpu MHz         : 2000.000
cpu MHz         : 2473.630
cpu MHz         : 2000.000
 
Hmm... I didn't do anything to the idle driver, which is weird. I can't get into the BIOS to check now since it is still running stable, so I really don't know if it is the BIOS or the microcode. But sdjaime is on the same microcode and his is intel_idle, so I will keep note of it.
The CPU temperature is high because activity is high. That is actually quite low already; usually it is around 60°C.

Maybe it's because my governor is on performance and the CPU is running at its 3.3 GHz turbo. Base is supposed to be 2 GHz.

@sdjaime did you change your governor settings?
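
For comparing governor settings between hosts, the active governor and current frequency of each core can be read from sysfs. This is a sketch only; `print_cpufreq` is a hypothetical helper name. `scaling_cur_freq` is reported in kHz, so it is divided by 1000 here.

```shell
#!/bin/sh
# Print "cpuN <governor> <MHz> MHz" for each core that exposes cpufreq.
# Defaults to the live sysfs tree; another root can be passed instead.
print_cpufreq() {
    root="${1:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        [ -r "$dir/scaling_governor" ] || continue
        cpu=$(basename "$(dirname "$dir")")
        khz=$(cat "$dir/scaling_cur_freq")
        echo "$cpu $(cat "$dir/scaling_governor") $((khz / 1000)) MHz"
    done
}

# Usage on the host:
#   print_cpufreq
```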
 
@AdriftAtlas

The contractor replacing my ceiling lights tripped the breaker.
So I figured I might as well check the BIOS, and you are right, CSTATE was disabled.
So now, BIOS is default except for Upon Power Lost = Last State, CSTATE=enable.

Now it is:
Bash:
# cat /etc/default/grub|grep GRUB_CMDLINE_LINUX_DEFAULT
GRUB_CMDLINE_LINUX_DEFAULT="quiet initcall_blacklist=sysfb_init intel_iommu=on iommu=pt"

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
intel_pstate

# cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle

# cat /proc/cpuinfo|grep 'microcode\|model name'
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000024
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000024
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000024
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
microcode       : 0x24000024

# sensors
acpitz-acpi-0
Adapter: ACPI interface
temp1:        +27.8°C  (crit = +119.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +40.0°C  (high = +105.0°C, crit = +105.0°C)
Core 0:        +38.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:        +38.0°C  (high = +105.0°C, crit = +105.0°C)
Core 2:        +38.0°C  (high = +105.0°C, crit = +105.0°C)
Core 3:        +38.0°C  (high = +105.0°C, crit = +105.0°C)

# cat /proc/cpuinfo | grep MHz
cpu MHz         : 2936.286
cpu MHz         : 3026.681
cpu MHz         : 3025.192
cpu MHz         : 3139.646
At this rate, I will never get to 10 days, and it isn't because of VM freezing or rebooting... :p
 
I've been wanting to update from pve-qemu-kvm 7.1.0-4 to 7.2.0-5 and the microcode from 0x24000023 to 0x24000024. My pfSense 23.01 uptime is at 12 Days 20 Hours since I last changed things. I kind of wonder if its newer FreeBSD kernel makes a difference so I am waiting till it croaks.

Why not put it on a UPS though? I run mine on a mini UPS that outputs 12VDC. It's rated up to 75W so it runs my 12V switch and APs too. I stacked four battery packs for a total of ~308Wh so the runtime should approach a day. My ONT (integrated voice) is on a separate unit with two battery packs.
https://goprecisiongroup.com/product/li-75-micro-ups-12v-or-48v-75w-indoor-pp75l/

You can get them for $50 or less on eBay assuming you're in US, search "PP75L"
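
As a rough sanity check of the runtime estimate, capacity in Wh divided by average load in W gives hours. The sketch below ignores conversion losses, so the result is an upper bound. The 13 W load is an assumed example figure, not a measurement from this setup, and `ups_runtime_hours` is a made-up helper name.

```shell
#!/bin/sh
# Estimate runtime in hours from battery capacity (Wh) and average load (W).
# Conversion losses are ignored, so treat the result as an upper bound.
ups_runtime_hours() {
    awk -v wh="$1" -v w="$2" 'BEGIN { printf "%.1f\n", wh / w }'
}

# Usage:
#   ups_runtime_hours 308 13   # ~308 Wh pack at an assumed 13 W load
#   ups_runtime_hours 308 75   # same pack at the unit's full 75 W rating
```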
 