VM freezes irregularly

Installation came with 5.15, I upgraded immediately to 5.19. A few days later I went for 6.1:
Code:
reboot   system boot  6.1.2-1-pve      Fri Jan 20 11:23   still running
reboot   system boot  6.1.0-1-pve      Thu Dec 15 14:14 - 10:21 (22+20:07)
reboot   system boot  5.19.17-1-pve    Sat Dec 10 10:43 - 08:02  (21:18)
reboot   system boot  5.15.30-2-pve    Sat Dec 10 08:33 - 09:05  (00:32)
All my VMs were migrated from other nodes. Cold, not live. Unfortunately I have several different CPUs in my homelab and live-migration is not stable under all circumstances.

Just to verify I am on the correct node:
Code:
~# dmidecode | grep ODROID
        Product Name: ODROID-H3
Which BIOS version are you running on the H3 board? Are the BIOS settings at default, or have you changed anything in particular? Thanks.
 
As far as I remember I changed nothing in the Bios. It came with:

Code:
~# dmidecode | grep -A3 BIOS\ Inf
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 5.19
        Release Date: 08/23/2022

Best regards
 
My Odroid H3 also works fine.
BIOS 1.11
Kernel 5.19
No special BIOS settings. No kernel settings.

With kernel 6.1 my Debian/DietPi VM hung. I have now switched back to kernel 5.19 and replaced Debian/DietPi with Alpine Linux, and everything works. Other VMs are HomeAssistant OS and RHEL.
 
Same here. Running OPNSense and Unraid guests, as well as a few LXC containers

* 5.15 – frequent crashes and freezes
* 5.19 – no freezes at all
* 6.1 – crashes and freezes, but not as frequent
 
Hi

On my end, still on kernel 5.19.17-1-pve, with 32 days uptime, two VMs, OPNsense ( 3 (1 sockets, 3 cores) [host,flags=-pcid;-spec-ctrl;-ssbd;+aes] [cpuunits=2048] ) with VirtIO NICs and HomeAssistant, and two LXC containers (PiHole and TP-Link Omada Controller, based on Ubuntu 22.04).

root@pve:~# last reboot | head -n 1
reboot system boot 5.19.17-1-pve Sat Nov 26 20:14 still running
root@pve:~# uptime
11:48:25 up 32 days, 15:33, 1 user, load average: 0.39, 0.32, 0.29
root@pve:~# uname -a
Linux pve 5.19.17-1-pve #1 SMP PREEMPT_DYNAMIC PVE 5.19.17-1 (Mon, 14 Nov 2022 20:25:12 x86_64 GNU/Linux

Host is a Topton N5105 (CW-6000) with i225 B3 NICs, BIOS date 29/09/2022, 2x8GB RAM, 1x NVMe SSD WD SN530. Extra Noctua 40mm fan 12v (NF-A4x10 PWM) as exhaust is inaudible (as intake the noise would be noticeable).

But I've applied several options to the kernel cmdline, see below.

Kernel cmdline options:

intel_idle.max_cstate=1 (limits intel_idle to C1, i.e. disables deeper C-states such as C3)
intel_iommu=on iommu=pt (enables the IOMMU; at the beginning I was going to pass NICs through to the OPNsense VM, but I ended up using VirtIO NICs while testing for the crashes and kept them)
mitigations=off (Self explanatory)
i915.enable_guc=2 ( Enable low-power H264 encoding, https://01.org/linuxgraphics/downloads/firmware , https://jellyfin.org/docs/general/administration/hardware-acceleration/#intel-gen9-and-gen11-igpus )
initcall_blacklist=sysfb_init ( GPU passthrough , https://wiki.tozo.info/books/server/page/proxmox-gpu-passthrough )
nvme_core.default_ps_max_latency_us=14900 ( https://esc.sh/blog/nvme-ssd-and-linux-freezing/ )
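
Whether options like these are actually active on the running kernel can be checked against /proc/cmdline. A minimal sketch; the helper name has_cmdline_param and its optional file argument are made up for illustration and are not part of any tool:

```shell
#!/bin/sh
# has_cmdline_param PARAM [FILE]
# Hypothetical helper: succeeds if PARAM appears as a whole word on the
# kernel command line. FILE defaults to /proc/cmdline; the optional
# argument only exists so the check can be tried on a sample file.
has_cmdline_param() {
    param=$1
    cmdline_file=${2:-/proc/cmdline}
    # Pad both ends with spaces so only complete parameters match.
    case " $(cat "$cmdline_file") " in
        *" $param "*) return 0 ;;
    esac
    return 1
}
```

For example, `has_cmdline_param intel_idle.max_cstate=1 && echo active` prints "active" only if the option made it onto the booted kernel's command line.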

Also, due to i2c-6 NAK errors ( [Sat Nov 26 20:14:37 2022] i2c i2c-6: sendbytes: NAK bailout. ) related to the iGPU, I've connected a dummy HDMI dongle after confirming that with a monitor plugged in the errors stopped and so did the system crashes. By then, however, I had already applied the other kernel parameters.

Didn't test if those were related to the enabling of i915 GuC/HuC or not.

And due to errors related to the NVMe SSD (WD SN530 M.2 2242) I've applied the nvme_core.default_ps_max_latency_us parameter as well.

[Tue Nov 29 11:46:52 2022] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[Tue Nov 29 11:46:52 2022] nvme 0000:01:00.0: device [15b7:5008] error status/mask=00000001/0000e000
[Tue Nov 29 11:46:52 2022] nvme 0000:01:00.0: [ 0] RxErr
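
To see how often such corrected errors show up, the kernel log can be filtered and counted. A small sketch, assuming the messages look like the ones above; the function name count_corrected_pcie_errors is invented for this example:

```shell
#!/bin/sh
# count_corrected_pcie_errors
# Hypothetical helper: reads kernel log text on stdin and prints the number
# of lines that report a corrected PCIe bus error.
count_corrected_pcie_errors() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status at 0 in that case.
    grep -c 'PCIe Bus Error: severity=Corrected' || true
}
```

Typical use would be `dmesg -T | count_corrected_pcie_errors`, run before and after changing nvme_core.default_ps_max_latency_us to compare.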

- edit -
Also updated the intel microcode:

root@pve:~# dmesg -T | grep microcode
[Sat Nov 26 20:14:32 2022] microcode: microcode updated early to revision 0x24000023, date = 2022-02-19
[Sat Nov 26 20:14:32 2022] SRBDS: Vulnerable: No microcode
[Sat Nov 26 20:14:33 2022] microcode: sig=0x906c0, pf=0x1, revision=0x24000023
[Sat Nov 26 20:14:33 2022] microcode: Microcode Update Driver: v2.2.

- edit -

Hopefully someone else finds this information helpful.
I still have problems: some crashes with the 6.1 kernel. I am now back on 5.19 and trying some settings. It is a nightmare.
To those using the "intel_idle.max_cstate=1" flag in the kernel boot options: have you also disabled C-states in the BIOS?
 
I have them enabled in BIOS, and also enhanced c-states (iirc). To control the maximum c-state I have that kernel opt.
 
I also had irregular freezes and crashes with stock Proxmox 7 Kernel.
I'm running one OPNsense VM and one Debian LXC container on a budget N5105 barebone from aliexpress with no special BIOS and Kernel settings.

Switched to Proxmox 6.0.15 Edge Kernel 10 days ago and my OPNsense VM is now running without any problems!

 
Sounds kind of weird to me. I have now put my machine into full production with kernel 5.19; my OPNsense is at 30+ days and my Debian at 20+ days without issues. So this seems to be the fail-safe kernel, but on the other hand it is not a long-term option.

I am hesitating to update to a higher kernel. If 6.1 still crashes in some cases, I'm curious what the difference is here.
 
I use two NUC11 with N5105 BIOS (ATJSLCPX.0038.2022.1114.1802) and one NUC 11 N6005 same BIOS Version

PVE Manager Version
pve-manager/7.3-4/d69b70d4

PVE Kernel Version
Linux 5.19.17-2-pve #1 SMP PREEMPT_DYNAMIC PVE 5.19.17-2 (Sat, 28 Jan 2023 16:40:25

Several Linux VMs freeze sporadically. One core of these VMs is shown at full load inside the VM. Only a reset or a stop and start of the VM helps.
I have seen this on several different Linux VMs (kernel 5.10.x/5.15.x/5.19.x/6.0.x) and on different hosts, but never with a Windows VM.
Other VMs on the same host keep working at that time.
There are no log entries or kernel dumps inside the frozen VMs. The VMs simply freeze.

I think there is a problem with KVM/vt-d with the N5105 CPUs.
On the N6005 I didn't see such freezes until now.

I ran the NUC11 N5105 bare metal with the same Linux distributions as in the VMs, and there were no problems.
 
Currently I have the following miniPC boxes in testing with the following processors.
N6005 (6x i226) CW-NW11v2
N5105 (4x i226) CW-FMI01v5
J4125 (4x i226) J4125-4L-i226

J4125 works flawlessly. So this is the last I will be mentioning it on this post.

N5105 and N6005 both suffer from frequent VM (OPNsense) reboots or crashes.
My BIOS settings for all the miniPCs are default.

While people on the Proxmox forum suggest disabling C-states in the BIOS, mine are already disabled in the BIOS by default.
Turbo Boost by default in BIOS for N6005 is 3300MHz and N5105 is 2900MHz.
Both N6005/N5105 base frequency is 2000MHz.

In the Proxmox shell, check the frequencies with the command
Code:
watch "lscpu | grep MHz"

N6005 will be constantly at 3300MHz
N5105 will be constantly at 2900MHz

This should not be the "correct" state, as these are boost frequencies and should only be reached for short periods of time.

Solution 1:
Go to the BIOS and disable Max Turbo Boost.
Both N6005/N5105 will then stay constantly at their 2000MHz base frequency instead of their boost frequencies.
The CPUs will never use their Turbo Boost frequencies.

Solution 2:
This method is not recommended by the people at Proxmox, who prefer to keep the CPU frequency at its maximum at all times (something about stability). I am currently testing this method.

Change governor to "ondemand" instead of "performance"

In Proxmox shell use the cli command
Code:
echo "ondemand" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

You can watch the CPU frequency at 2-second intervals with the command
Code:
watch "lscpu | grep MHz"
or
Code:
cat /proc/cpuinfo |grep "cpu MH"
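
For a one-line summary instead of one line per core, the "cpu MHz" entries can be aggregated. A rough sketch; the function name freq_summary and the optional file argument are assumptions made for this example:

```shell
#!/bin/sh
# freq_summary [FILE]
# Hypothetical helper: summarizes the "cpu MHz" lines of /proc/cpuinfo
# (or of FILE, if given) as core count, minimum, maximum and average.
freq_summary() {
    cpuinfo=${1:-/proc/cpuinfo}
    awk -F': *' '
        /^cpu MHz/ {
            v = $2 + 0                      # force numeric value
            n++; sum += v
            if (n == 1 || v < min) min = v
            if (n == 1 || v > max) max = v
        }
        END {
            if (n) printf "cores=%d min=%.0f max=%.0f avg=%.0f\n", n, min, max, sum / n
        }
    ' "$cpuinfo"
}
```

If every core reports the boost frequency, min, max and avg all come out at the same value, which is the situation described above.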

To set a cron job that changes the governor to "ondemand" at every reboot:
Code:
crontab -e
If this is your first time editing your crontab, choose your editor and add
Code:
@reboot echo "ondemand" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

I chose ondemand instead of powersave because powersave seems to stick at 800MHz and does not increase under load.
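
To confirm that the governor change took effect on every core, the sysfs files can be read back. A minimal sketch; show_governors and its optional base-directory argument are hypothetical and only there so the loop can be tried on a test directory:

```shell
#!/bin/sh
# show_governors [BASE]
# Hypothetical helper: prints "cpuN governor" for each core that exposes a
# cpufreq scaling_governor file. BASE defaults to /sys/devices/system/cpu.
show_governors() {
    base=${1:-/sys/devices/system/cpu}
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue      # skip if the glob matched nothing
        cpu=${f#"$base"/}            # strip the base directory
        cpu=${cpu%%/*}               # keep only the cpuN component
        printf '%s %s\n' "$cpu" "$(cat "$f")"
    done
}
```

Which governors the active cpufreq driver offers can be read from the scaling_available_governors file in the same directory, where the driver provides it; whether "ondemand" is listed depends on the driver in use.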

Again, I am just starting to test this, and so far I have had no issues. I will continue to monitor this and post if there are any further crashes. I am on the 5.15 kernel. I will try the 6.0.15 kernel if there is a crash.
 
Update: 4 Odroid H3 SBCs (Intel N5105) running a k3s cluster in 6 VMs on kernel 5.19.17-1. After 3 days of uptime since I configured "intel_idle.max_cstate=1" in the kernel boot options, one of the VMs rebooted yesterday. There are no logs about a crash, just the reboot. It is a nightmare... Any ideas? Maybe I could update the kernel to 5.19.17-2 (I installed the 6.1 kernel last week with no success)? Does anyone know whether I also have to add "processor.max_cstate=1" to the kernel boot options?
Has anyone tried running Windows in a VM? Maybe the problem is also related to the kernel inside the VM? All my guest VMs are running 5.15 kernel versions.
 
The "ondemand" governor method (Solution 2) does not work!
After stress testing connection for 2 hours, the VM (OPNsense) crashed.

Currently testing with the following combination:

1. Opt-in Kernel 6.1
Code:
apt update
apt install pve-kernel-6.1
2. Add the following to the kernel command line in grub (I am using XFS with UEFI), then run update-grub
Code:
intel_idle.max_cstate=1 processor.max_cstate=1
Code:
update-grub
3. Update Intel-Microcode to 3.20220510.1
Edit /etc/apt/sources.list and add non-free to these lines:
Code:
deb http://ftp.debian.org/debian bullseye main contrib non-free
deb http://ftp.debian.org/debian bullseye-updates main contrib non-free
deb http://security.debian.org bullseye-security main contrib non-free
Then update with:
Code:
apt update
apt install intel-microcode
Then remove the non-free from sources.list because you don't need it anymore, and reboot.
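
Step 2 can also be scripted. The following is only a sketch under assumptions: it presumes the options belong in the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub, relies on GNU sed's -i option, and the helper name add_grub_param is invented here. Try it on a copy of the file first.

```shell
#!/bin/sh
# add_grub_param PARAM [FILE]
# Hypothetical helper: appends PARAM to GRUB_CMDLINE_LINUX_DEFAULT in FILE
# (default: /etc/default/grub) unless it is already present.
# Assumes the value is enclosed in double quotes on a single line.
add_grub_param() {
    param=$1
    file=${2:-/etc/default/grub}
    line=$(grep -m1 '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file") || return 1
    value=${line#*=\"}     # drop everything up to the opening quote
    value=${value%\"}      # drop the closing quote
    case " $value " in
        *" $param "*) return 0 ;;   # already set, nothing to do
    esac
    # Insert the parameter just before the closing quote (GNU sed -i).
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $param\"|" "$file"
}
```

After something like `add_grub_param intel_idle.max_cstate=1` and `add_grub_param processor.max_cstate=1`, update-grub and a reboot are still needed, as in the manual steps.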

Currently the VM (OPNsense) has been running for 19 hours without any crashes, and during that time I have been stressing it with high internet activity.

I do not know which change is doing the trick; it could be any one of them or the combination of all 3. I will revisit this if the VM doesn't crash after a week with a clean install.
 
1 day 21 hours later, seems to be working fine so far.
 

I also had issues with VMs freezing every few days. After moving from Debian 11 to AlmaLinux 9.1, the problems seem to be solved for me without any further investigation into kernels, C-states etc.
21 days of uptime so far without any problems. Running on stock Proxmox VE 5.15.83-1-pve on a BNUC11ATKC40002 with the latest BIOS from December and on Intel NVMe local storage for both Proxmox and my VMs.
So everyone who does not depend on a Debian derivative should have a look at the Red Hat camp.
 
AlmaLinux 9.1 is based on RHEL 9.1. RHEL 9.1 release notes say:
Red Hat Enterprise Linux 9.1 is distributed with the kernel version 5.14.0-162

Source:
https://access.redhat.com/documenta...le/9.1_release_notes/index#enhancement_kernel

I've seen somewhere that these issues began with Kernel 5.15. It's possible that downgrading the VM's kernel makes it more stable.

Also the crashing and freezing is not exclusive to Linux. FreeBSD is also affected as pfSense and OPNsense are based on it and also experience issues.
 
But Debian 11 has an even older kernel with 5.10 and is still unstable. So don't take my hint as a clarification of the cause, but simply as a possible solution for desperate users. For one or the other it could be the solution.
 
My OPNsense VM is still running, no crashing or random reboots yet. I am gonna wait till next weekend, and if the VM still doesn't crash or hang, I am going to revert back to Kernel 5.15, and remove the CSTATE disable from grub.

Personally, I think the motherboard BIOS has an outdated Processor Microcode.

The Intel-Microcode update can be built into the Kernel, so those who "updated" by switching Linux distros wouldn't even know.

It would be nice if they posted the output of
Code:
cat /proc/cpuinfo | grep micro
and stated which processor they are using and whether their VMs crash or randomly reboot.
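
Both pieces of information can be pulled from /proc/cpuinfo in one go. A small sketch; the function name cpu_report and its optional file argument are made up for illustration:

```shell
#!/bin/sh
# cpu_report [FILE]
# Hypothetical helper: prints the CPU model name and microcode revision
# found in /proc/cpuinfo (or in FILE, if given). With several cores the
# values of the last listed core are printed.
cpu_report() {
    cpuinfo=${1:-/proc/cpuinfo}
    awk -F': *' '
        /^model name/ { model = $2 }
        /^microcode/  { rev = $2 }
        END { printf "processor: %s\nmicrocode: %s\n", model, rev }
    ' "$cpuinfo"
}
```

Running `cpu_report` on the host gives two lines that can be pasted into a reply together with a note on whether the VMs crash or reboot.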

 
microcode : 0x24000023

Processor: N6005
Proxmox 7.3-6
Kernel Version: Linux 6.1.6-1-pve #1 SMP PREEMPT_DYNAMIC PVE 6.1.6-1 (2023-01-28T00:00Z)
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt intel_idle.max_cstate=1 processor.max_cstate=1"

Almost 3 days, VM (OPNsense) has not crashed or randomly rebooted ***YET***!
 