VM freezes irregularly

Has anyone attempted to play with idle mode in the *guest VM*:

https://bugzilla.kernel.org/show_bug.cgi?id=196683
Describes a similar issue on Ryzen CPUs that crash when left to idle. Apparently, setting the idle=halt kernel parameter helped.

https://www.ibm.com/support/pages/l...ed-extensible-firmware-interface-uefi-servers
Also, it appears that the intel_idle module loads and messes with C-states regardless of UEFI settings. I have no clue what it does if it loads in a VM guest. To disable the module completely, the following kernel parameter is suggested:
intel_idle.max_cstate=0
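
For a typical Debian/Ubuntu guest the parameter would go on the kernel command line via GRUB; a minimal sketch, assuming the guest boots with GRUB (paths may differ):
[CODE=bash]
# append the parameter to the existing line in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=0"
nano /etc/default/grub
update-grub          # Debian/Ubuntu wrapper around grub-mkconfig
reboot

# after the reboot, check which cpuidle driver is active;
# with the parameter in place this should no longer report intel_idle
cat /sys/devices/system/cpu/cpuidle/current_driver
[/CODE]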

https://docs.kernel.org/admin-guide...el-command-line-options-and-module-parameters
According to this, setting idle=halt or intel_idle.max_cstate=0 will cause intel_idle initialization to fail.

https://lists.freebsd.org/pipermail/freebsd-current/2018-June/069799.html
A similar issue was reported for FreeBSD. Setting the following helped:
sysctl machdep.idle_mwait=0
sysctl machdep.idle=hlt

I have run the FreeBSD commands in the pfSense shell and hope it helps. So far I'm at 3 days and 9 hours of uptime.
 
Code:
kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address    = 0x1008e
fault code        = supervisor write data, page not present
instruction pointer    = 0x20:0xffffffff80da2d71
stack pointer            = 0x28:0xfffffe0025782b00
frame pointer            = 0x28:0xfffffe0025782b60
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = resume, IOPL = 0
current process        = 11 (idle: cpu0)
trap number        = 12
panic: page fault
cpuid = 0
time = 1672654637
KDB: enter: panic

db:0:kdb.enter.default>  bt

Tracing pid 11 tid 100003 td 0xfffff8000520d000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00257828c0
vpanic() at vpanic+0x194/frame 0xfffffe0025782910
panic() at panic+0x43/frame 0xfffffe0025782970
trap_fatal() at trap_fatal+0x38f/frame 0xfffffe00257829d0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0025782a30
calltrap() at calltrap+0x8/frame 0xfffffe0025782a30
--- trap 0xc, rip = 0xffffffff80da2d71, rsp = 0xfffffe0025782b00, rbp = 0xfffffe0025782b60 ---
callout_process() at callout_process+0x1b1/frame 0xfffffe0025782b60
handleevents() at handleevents+0x188/frame 0xfffffe0025782ba0
cpu_activeclock() at cpu_activeclock+0x70/frame 0xfffffe0025782bd0
cpu_idle() at cpu_idle+0xa8/frame 0xfffffe0025782bf0
sched_idletd() at sched_idletd+0x326/frame 0xfffffe0025782cb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0025782cf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0025782cf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

db:0:kdb.enter.default>  alltrace

Tracing command sleep pid 35878 tid 100632 td 0xfffff80057237740
sched_switch() at sched_switch+0x606/frame 0xfffffe003671b9c0
mi_switch() at mi_switch+0xdb/frame 0xfffffe003671b9f0
sleepq_catch_signals() at sleepq_catch_signals+0x3f3/frame 0xfffffe003671ba40
sleepq_timedwait_sig() at sleepq_timedwait_sig+0x14/frame 0xfffffe003671ba80
_sleep() at _sleep+0x1c6/frame 0xfffffe003671bb00
kern_clock_nanosleep() at kern_clock_nanosleep+0x1c1/frame 0xfffffe003671bb80
sys_nanosleep() at sys_nanosleep+0x3b/frame 0xfffffe003671bbc0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe003671bcf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe003671bcf0
--- syscall (240, FreeBSD ELF64, sys_nanosleep), rip = 0x80038c9fa, rsp = 0x7fffffffec18, rbp = 0x7fffffffec60 ---

Tracing command sh pid 15762 tid 100600 td 0xfffff80016b8e000
sched_switch() at sched_switch+0x606/frame 0xfffffe00366cb970
mi_switch() at mi_switch+0xdb/frame 0xfffffe00366cb9a0
sleepq_catch_signals() at sleepq_catch_signals+0x3f3/frame 0xfffffe00366cb9f0
sleepq_wait_sig() at sleepq_wait_sig+0xf/frame 0xfffffe00366cba20
_sleep() at _sleep+0x1f1/frame 0xfffffe00366cbaa0
pipe_read() at pipe_read+0x3fe/frame 0xfffffe00366cbb10
dofileread() at dofileread+0x95/frame 0xfffffe00366cbb50
sys_read() at sys_read+0xc0/frame 0xfffffe00366cbbc0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00366cbcf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00366cbcf0
--- syscall (3, FreeBSD ELF64, sys_read), rip = 0x80044f03a, rsp = 0x7fffffffe3d8, rbp = 0x7fffffffe900 ---

Tracing command sh pid 15703 tid 100633 td 0xfffff80057237000
sched_switch() at sched_switch+0x606/frame 0xfffffe0036720800
mi_switch() at mi_switch+0xdb/frame 0xfffffe0036720830
sleepq_catch_signals() at sleepq_catch_signals+0x3f3/frame 0xfffffe0036720880
sleepq_wait_sig() at sleepq_wait_sig+0xf/frame 0xfffffe00367208b0
_sleep() at _sleep+0x1f1/frame 0xfffffe0036720930
kern_wait6() at kern_wait6+0x59e/frame 0xfffffe00367209c0
sys_wait4() at sys_wait4+0x7d/frame 0xfffffe0036720bc0
amd64_sy


Looking solely at the kernel error message (the first part at least), this actually looks quite similar: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=234296 - that was an LLE lookup race bug. I don't think anything can be concluded from that, though.

I will be receiving my unit tomorrow and will try the different things as well. Reading through the pages, I think we really need to post this to bugzilla.kernel.org and try to get some attention, with logs and a kernel dump for a kernel expert to investigate. Following this could be particularly interesting, as it may point to the origin of the error: https://www.linuxjournal.com/content/oops-debugging-kernel-panics-0. From there, all the debugging would start.
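
If someone wants to attach a proper crash dump from the host to such a report, a rough sketch using the standard Debian kdump tooling (package names and paths assume a Debian-based host such as PVE; I have not verified this on these boxes):
[CODE=bash]
apt install kdump-tools        # crash-dump capture infrastructure
# reserve RAM for the capture kernel, e.g. add crashkernel=256M to the
# host kernel command line, reboot, then confirm kdump is armed:
kdump-config show
# after the next panic the vmcore should land under /var/crash/ for analysis
[/CODE]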

Another side note: kernel 6.2-rc3 was released recently. There have been quite a few changes to drivers and KVM, and I would like to try that version as well.
 
I've had the same experience on two servers, an ML380 G9 and an ML160 G10. All worked fine until I upgraded from 7.2 to 7.3 and kernel 5.19; then the VMs started to randomly freeze with 100% CPU usage. Now I've reverted to kernel 5.15 and will report back if this fixes the freezes.
Reverting to kernel 5.15 did not fix the problem. Maybe it's related to QEMU and not to the kernel?
 
Quick update on bare metal:

Linux kernel version v6.1.3 and boot parameter intel_idle.max_cstate=1.

I have 25 computers running and have seen 1 totally lock up. This time, no CPU lockups were indicated in the journalctl logs. Instead, the logs just stopped completely.
 
So updating the microcode did not fix it; pfSense hung after 3 days. Now trying the 5.19 kernel...
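
For anyone following along: on PVE 7.x the newer kernels are opt-in meta-packages, so switching is just an apt install away (package names as of early 2023; double-check against the repository):
[CODE=bash]
apt update
apt install pve-kernel-5.19    # or pve-kernel-6.1 for the 6.1 series
reboot
uname -r                       # confirm the running kernel afterwards
[/CODE]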

My family is not happy that pfSense is crashing. I am seriously considering buying another SSD and rebuilding everything from scratch with ESXi 8. Apparently the I-226V driver is now bundled by default. Other people have reported that ESXi 7 is stable on these machines.

As much as I like the open-source nature of Proxmox, I've used vCenter at work for years, and other than a few bungled updates last year it has been rock solid.

QEMU/KVM and/or the Linux kernel seem to have compatibility issues with Jasper Lake CPUs. This is not an issue specific to Proxmox, as the Unraid community is reporting the same.
I don't like to say this, but I'm trying ESXi 7, and after installing Docker in Ubuntu 22.04.1 it's freezing too... I'm so sad...
 
I don't like to say this, but I'm trying ESXi 7, and after installing Docker in Ubuntu 22.04.1 it's freezing too... I'm so sad...

My Intel NUC with N5105 has been running XCP-ng alpha 8.3 since its release, and I can report that the freezing issues have - so far - disappeared.
I am running 3 Ubuntu 22.04 VMs with Docker, and everything is surprisingly stable for an alpha. Quite happy so far.
My older gen 8 and gen 6 NUCs are on Proxmox and they are working fine.
 
I don't like to say this, but I'm trying ESXi 7, and after installing Docker in Ubuntu 22.04.1 it's freezing too... I'm so sad...
Well, your guest is still running a Linux kernel...? :/ I am seriously thinking about returning that unit of mine...

Mine has now been running fine for 7 days, but there's a constant fear of it stopping when I don't notice early enough or am not within reach.
 
My Intel NUC with N5105 has been running XCP-ng alpha 8.3 since its release, and I can report that the freezing issues have - so far - disappeared.
I am running 3 Ubuntu 22.04 VMs with Docker, and everything is surprisingly stable for an alpha. Quite happy so far.
My older gen 8 and gen 6 NUCs are on Proxmox and they are working fine.
Is there a reason you're running an alpha build? Is XCP-NG 8.2 LTS not stable for you?
How do you like it compared to Proxmox (and ESXi if you've used that too)?
 
Did that crash occur only once? Or did your time between failures go back to 8-24 hours after the first crash?

It somewhat scares me that even the "reference design" has this error... Otherwise I would have thought it was a BIOS bug in the cheap third-party units, but this way it feels like it's narrowed down to the CPU or to the software implementation in QEMU, since bare metal seems to work flawlessly, doesn't it?
Only once; it has been running OK since then, but who knows.

I set the governor to powersave but have not touched C-states yet, as that needs a reboot.

The thing that bothers me is that it has only ever happened to the same VM, even though I have many more, including some with the same OS.
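
As a side note, both the governor and the deeper C-states can usually be changed at runtime via sysfs, without a reboot. A sketch only - state numbering varies per CPU (check .../cpuidle/state*/name) and none of this survives a reboot:
[CODE=bash]
# set the powersave governor on every core
for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
    echo powersave > "$g"
done

# disable the deeper idle states (state2 and up); state0/state1 are usually POLL and C1
for s in /sys/devices/system/cpu/cpu[0-9]*/cpuidle/state[2-9]/disable; do
    echo 1 > "$s"
done
[/CODE]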
 
Is there a reason you're running an alpha build? Is XCP-NG 8.2 LTS not stable for you?
How do you like it compared to Proxmox (and ESXi if you've used that too)?
Alpha 8.3 introduced support for the Intel NIC and the video card of the NUC 10 and 11; the current stable release does not support them.

I like the Proxmox interface more, but as far as features and stability go, Proxmox and XCP-ng are about the same. I used to use ESXi, but the free version is too limited and licenses are pretty expensive for a small business or home lab.

A good comparison is available here: https://forums.lawrencesystems.com/t/proxmox-vs-xcp-ng/7148
 
Good afternoon. Maybe someone has a working solution for passing the iGPU through to a VM via Proxmox on the N5105 (Beelink U59; VT-d and VT-x are enabled in the BIOS, but the System Agent reports VT-d as Unsupported).
[CODE=bash]root@Beelink:~# dmesg | grep -e DMAR -e IOMMU
[ 0.000000] Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA
[ 0.009952] ACPI: DMAR 0x00000000726D4000 000088 (v02 INTEL EDK2 00000002 01000013)
[ 0.009989] ACPI: Reserving DMAR table memory at [mem 0x726d4000-0x726d4087]
[ 0.041841] DMAR: IOMMU enabled
[ 0.115027] DMAR: Host address width 39
[ 0.115028] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[ 0.115036] DMAR: dmar0: reg_base_addr fed90000 ver 4:0 cap 1c0000c40660462 ecap 49e2ff0505e
[ 0.115039] DMAR: DRHD base: 0x000000fed91000 flags: 0x1
[ 0.115044] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap d2008c40660462 ecap f050da
[ 0.115047] DMAR: RMRR base: 0x00000079800000 end: 0x0000007dbfffff
[ 0.115050] DMAR-IR: IOAPIC id 2 under DRHD base 0xfed91000 IOMMU 1
[ 0.115052] DMAR-IR: HPET id 0 under DRHD base 0xfed91000
[ 0.115053] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 0.116763] DMAR-IR: Enabled IRQ remapping in x2apic mode
[ 0.281559] pci 0000:00:02.0: DMAR: Skip IOMMU disabling for graphics
[ 0.361751] DMAR: No ATSR found
[ 0.361752] DMAR: No SATC found
[ 0.361754] DMAR: IOMMU feature fl1gp_support inconsistent
[ 0.361755] DMAR: IOMMU feature pgsel_inv inconsistent
[ 0.361756] DMAR: IOMMU feature nwfs inconsistent
[ 0.361757] DMAR: IOMMU feature pds inconsistent
[ 0.361757] DMAR: IOMMU feature eafs inconsistent
[ 0.361758] DMAR: IOMMU feature prs inconsistent
[ 0.361759] DMAR: IOMMU feature nest inconsistent
[ 0.361759] DMAR: IOMMU feature mts inconsistent
[ 0.361760] DMAR: IOMMU feature sc_support inconsistent
[ 0.361761] DMAR: IOMMU feature dev_iotlb_support inconsistent
[ 0.361762] DMAR: dmar0: Using Queued invalidation
[ 0.361765] DMAR: dmar1: Using Queued invalidation
[ 0.362213] DMAR: Intel(R) Virtualization Technology for Directed I/O
[/CODE]
 
Good afternoon. Maybe someone has a working solution for passing the iGPU through to a VM via Proxmox on the N5105 (Beelink U59; VT-d and VT-x are enabled in the BIOS, but the System Agent reports VT-d as Unsupported). [...]

https://pve.proxmox.com/wiki/Pci_passthrough#GPU_Passthrough
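
Roughly, the basic steps from that wiki page boil down to the sketch below. This is only a summary: VM ID 100 is a placeholder, 00:02.0 is the iGPU address from your dmesg, and for full iGPU passthrough you usually also need to blacklist the host i915 driver as described in the wiki.
[CODE=bash]
# 1. host kernel command line: enable the IOMMU
#    intel_iommu=on iommu=pt

# 2. load the VFIO modules at boot
cat >> /etc/modules <<'EOF'
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
EOF
update-initramfs -u -k all
reboot

# 3. attach the iGPU to a VM (VM ID 100 as an example)
qm set 100 -hostpci0 0000:00:02.0
[/CODE]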
 
Thanks, I just followed that guide plus https://3os.org/infrastructure/prox...mox-configuration-for-gvt-g-split-passthrough, but alas, the status is "IOMMU enabled" and yet the video card is still not passed through.
Why did you not follow the official guide? The guide you used is for "iGPU split passthrough"; maybe start with the basic guide and also note all the exclusions. Once you have basic GPU passthrough working, look into more advanced fine-tuning via the guide you mentioned.


There are two ways to use iGPU passthrough to a VM. The first way is to use the Full iGPU Passthrough to the VM. The second way is to use the iGPU GVT-g technology, which allows us to split the iGPU into two parts. We will be covering the Split iGPU Passthrough. If you want to use the Full iGPU Passthrough you can find the guide here.
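
Before going further down the split/GVT-g route, it may be worth checking whether the iGPU even exposes mediated device types; a quick check (00:02.0 being the address from the dmesg above):
[CODE=bash]
# if this directory exists and is non-empty, split (mdev) passthrough is possible;
# if it is missing, only full passthrough of the iGPU is an option
ls /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/
[/CODE]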
 
Has anyone attempted to play with idle mode in the *guest VM*? [...]
sysctl machdep.idle_mwait=0
sysctl machdep.idle=hlt
I have run the FreeBSD commands in the pfSense shell and hope it helps. So far I'm at 3 days and 9 hours of uptime.

My uptime is 7 days and 8 hours on my pfSense VM. I am running the stock PVE kernel and the newest microcode. It was crashing with this config until I ran the two sysctl commands mentioned above. I'll let it run for a few more days and try to put them into system tunables so they activate on reboot.
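
For reference, on plain FreeBSD those two settings would go into /etc/sysctl.conf to survive a reboot; on pfSense the System Tunables page is the supported way to do the same, but the file-based version would look roughly like this:
[CODE=bash]
# make the idle settings persistent across reboots
cat >> /etc/sysctl.conf <<'EOF'
machdep.idle_mwait=0
machdep.idle=hlt
EOF

# verify after the next boot
sysctl machdep.idle machdep.idle_mwait
[/CODE]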

The issue is likely in the Linux kernel, QEMU, and/or KVM. Presumably the VM guest makes a CPU power-management call of some sort that is not properly virtualized, and that results in a VM panic.
 
I have 6 days of uptime.
N5105, ODROID H3
Kernel 5.19
no microcode
no special BIOS or kernel settings

VMs:
DietPi
Home Assistant
Red Hat 9
Alpine 3.17
 
My uptime is 7 days and 8 hours on my pfSense VM. I am running the stock PVE kernel and the newest microcode. It was crashing with this config until I ran the two sysctl commands mentioned above. I'll let it run for a few more days and try to put them into system tunables so they activate on reboot.

The issue is likely in the Linux kernel, QEMU, and/or KVM. Presumably the VM guest makes a CPU power-management call of some sort that is not properly virtualized, and that results in a VM panic.
What confuses me is that this problem occurs randomly rather than predictably.

Some guest OSes panic, some don't. Even if they are the same OS, their power management "should" behave the same. Plus there's the fact that some people seem to have problems on bare metal too.

Fingers crossed, I'm at 9 days now, but still investigating and watching if something new pops up...
 
I've got a Topton 'NAS board' with an N6005 and 3 VMs running on it – OPNsense, Unraid and Ubuntu Server.

The unit came with C-States and ACPI disabled, and with the kernel 5.15 my Unraid VM kept crashing with a kernel panic.

After reading through this thread and some others, I've installed the intel-microcode package and the latest 6.1 kernel. I've also enabled C-States and ACPI in the BIOS.
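
For anyone wanting to reproduce the microcode part, a rough sketch (assuming a PVE 7 / Debian bullseye host, where intel-microcode lives in the non-free component):
[CODE=bash]
# make sure a non-free component is enabled in the APT sources, then:
apt update
apt install intel-microcode
reboot

# confirm that an early microcode update was applied
dmesg | grep -i microcode
grep -m1 microcode /proc/cpuinfo
[/CODE]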

That helped. So far the VMs have been running fine today (with the kernel 5.15 the Unraid VM would crash after 4-5 hours of running), with one exception.

The OPNsense VM, which hadn't crashed up until that point, spat out a bunch of scary-looking text in the console and rebooted, all by itself. I've executed the commands suggested for a pfSense guest; let's see if that happens again:
sysctl machdep.idle_mwait=0
sysctl machdep.idle=hlt

Update: the hypervisor force rebooted by itself at midnight, with nothing in the logs that would indicate the reason for a reboot. After that, the Unraid VM triggered a parity check and eventually crashed, just like it used to on 5.15.
I just downgraded the kernel to 5.19, let's see if it helps.
 
Has anyone attempted to play with idle mode in the *guest VM*? [...]
sysctl machdep.idle_mwait=0
sysctl machdep.idle=hlt
I have run the FreeBSD commands in the pfSense shell and hope it helps. So far I'm at 3 days and 9 hours of uptime.

After 11 days and 12 hours, pfSense hung with one core stuck at 100%. The settings have helped, but there is likely a deeper issue with KVM that gets tripped eventually.

Just installed the latest 6.1.2 kernel tonight in hopes of not crashing at all:
Linux 6.1.2-1-pve #1 SMP PREEMPT_DYNAMIC PVE 6.1.2-1 (2023-01-10T00:00Z)

I'm noticing that with the powersave governor on 6.1 the CPU spends more time around 800 MHz now than 2000 MHz. That might be a good sign or a very bad sign, considering the VM guest panics seem to be CPU power-management related. CPU thermals are about the same, though.
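
For anyone wanting to watch the same thing on their box, these readouts should do it (sketch; cpupower comes from the linux-cpupower package on Debian/PVE):
[CODE=bash]
# per-core governor and current frequency (sysfs values are in kHz)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

# live view of core frequencies and idle-state residency
watch -n1 'grep "cpu MHz" /proc/cpuinfo'
cpupower monitor
[/CODE]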
 