VM freezes irregularly

The new kernel 6.2 with some KVM improvements will be released on Sunday.

https://www.phoronix.com/news/Linux-6.2-KVM

I hope that will solve the problem.

NAS motherboard N5105 (4x i226)
Proxmox 7.3-6
Kernel Linux 6.1.2-1-pve #1 SMP PREEMPT_DYNAMIC PVE 6.1.2-1
and C-state 1 activated in the BIOS.

Have 4 machines running Debian 11 and one running IPFire.
Last week the IPFire VM crashed; the other machines have been running stably for weeks.

Code:
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: KVM internal error. Suberror: 1
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: extra data[0]: 0x0000000000000001
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: extra data[1]: 0xe8ff6afcca010f0f
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: extra data[2]: 0x48c4894800000535
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: extra data[3]: 0x000000000000001e
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: extra data[4]: 0x0000000000000010
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: extra data[5]: 0x0000000000000000
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: extra data[6]: 0x0000000000000000
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: extra data[7]: 0x0000000000000000
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: emulation failure
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: RAX=ffffffffb067c560 RBX=ffffffffb1215940 RCX=0000000000000000 RDX=0000000000000000
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=ffffffffb1203e68
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: R12=0000000000000000 R13=0000000000000000 R14=ffff96abffff7fc0 R15=ffffffffb1215118
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: RIP=ffffffffb0800df0 RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: ES =0000 0000000000000000 ffffffff 00c00000
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: DS =0000 0000000000000000 ffffffff 00c00000
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: FS =0000 0000000000000000 ffffffff 00c00000
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: GS =0000 ffff96abf9c00000 ffffffff 00c00000
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: LDT=0000 0000000000000000 ffffffff 00c00000
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: TR =0040 fffffe0000003000 00004087 00008b00 DPL=0 TSS64-busy
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: GDT=     fffffe0000001000 0000007f
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: IDT=     fffffe0000000000 00000fff
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: CR0=80050033 CR2=000071a442e13008 CR3=0000000116138000 CR4=00350ef0
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: DR6=00000000ffff0ff0 DR7=0000000000000400
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: EFER=0000000000000d01
Feb 09 23:33:17 127.0.0.1 QEMU[1777]: Code=89 c4 48 89 e7 e8 5a 1a e7 ff e9 65 06 00 00 0f 1f 44 00 00 <0f> 01 ca fc 6a ff e8 35 05 00 00 48 89 c4 48 89 e7 e8 0a 19 e7 ff e9 45 06 00 00 0f 1f 44
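For anyone who wants to check whether their own host has logged the same failure, here is a minimal sketch that pulls the KVM internal error events out of journal-style text. The function name `kvm_internal_errors` is made up for illustration, and the field positions assume the syslog-style line layout shown in the log above.

```shell
# List KVM internal errors found in journal-style text on stdin.
# Prints the timestamp and the suberror number of each event.
kvm_internal_errors() {
    awk '/KVM internal error/ {
        # Fields 1-3 are the syslog timestamp, e.g. "Feb 09 23:33:17";
        # the last field is the number after "Suberror:".
        print $1, $2, $3, "suberror=" $NF
    }'
}
```

Feeding it the journal of the previous boot, for example `journalctl -b -1 --no-pager | kvm_internal_errors`, should give one line per event.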
 
pve-kernel-6.1.10-1-pve crashes my Windows Server 2022 VM.
The VM crashes about 5 minutes after boot with a volmgr error.

The last stable kernel is 6.1.2-1-pve
 
6 days, and still no crashing... ***YET***
This actually looks promising.
 

Attachments: Proxmox - 16FEB2023.jpg, Proxmox 2 - 16FEB2023.jpg, OPNsense - 16FEB2023.jpg

Just as another "me too": since installation of:
Code:
reboot   system boot  6.1.10-1-pve     Fri Feb 10 16:20 - 15:05  (22:44)
on my low-end
Code:
~# dmidecode | grep Product\ Name
        Product Name: ODROID-H3
I've had three occurrences of this: all VMs keep running and respond normally. The system load rises linearly at a constant rate; today it was over 6000 :) while the CPU is nearly idle:
Code:
# w                                                                     
 12:09:24 up 4 days,  2:47,  1 user,  load average: 6094.80, 6072.72, 6013.51
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT     
root     pts/0    10.1.52.5        Sun09    1.00s  0.34s  0.27s w

The last time, the reason was "sensors" running in several hundred instances, so this is probably not directly PVE-related. Today I didn't check.

This box had been running since mid-December without any issues, with 6.1.0-1-pve ... 6.1.6-1-pve and with more VMs than today (in a nearly idle state).
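Linux counts tasks in uninterruptible sleep (state D) towards the load average as well as runnable ones, which is one way the load can climb while the CPU stays idle. Below is a rough sketch for spotting a command that has piled up hundreds of instances, as "sensors" did here. The helper names are hypothetical, and both functions only read text from stdin.

```shell
# Count identical lines (command names) on stdin, most frequent first.
# Intended input: one command name per line, e.g. from `ps -eo comm=`.
count_by_command() {
    awk '{ n[$0]++ } END { for (c in n) print n[c], c }' | sort -k1,1nr -k2,2
}

# Same count, restricted to tasks whose state starts with D
# (uninterruptible sleep). Intended input: `ps -eo stat=,comm=`.
count_blocked() {
    awk '$1 ~ /^D/ {
        # Drop the state column, keep the command name.
        sub(/^[ \t]*[^ \t]+[ \t]+/, "")
        print
    }' | count_by_command
}
```

Typical use would be `ps -eo comm= | count_by_command | head` for the overall picture and `ps -eo stat=,comm= | count_blocked` to see what is stuck in state D.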
 
This helped me (as member LiFE1688 wrote, thank you very much :)

Update intel-microcode:
Edit /etc/apt/sources.list and add the non-free component, like this:

deb http://ftp.debian.org/debian bullseye main contrib non-free
deb http://ftp.debian.org/debian bullseye-updates main contrib non-free
deb http://security.debian.org bullseye-security main contrib non-free

apt update
apt install intel-microcode

After the update, remove non-free from the sources.list file again and reboot.
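To confirm after the reboot that the new microcode is actually in use, the revision reported by the kernel can be read back. This is a minimal sketch: `microcode_revisions` is a hypothetical helper name, and it assumes the x86 /proc/cpuinfo layout with one `microcode` line per logical CPU.

```shell
# Print each distinct microcode revision found in a cpuinfo-style file,
# prefixed by the number of logical CPUs that report it.
microcode_revisions() {
    # $1: file to read; defaults to /proc/cpuinfo
    awk -F': *' '/^microcode[ \t]*:/ { n[$2]++ }
        END { for (r in n) print n[r], r }' "${1:-/proc/cpuinfo}"
}
```

Run without arguments it reads /proc/cpuinfo. Comparing the value before and after installing the package shows whether anything changed; other posts in this thread report revision 0x24000023 on N5105 and N6005 boxes after the update.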
 
Had pfSense Plus 22.05 panic on me after 15 days running the 6.1.6-1-pve kernel. Though this time the panic is different...

Code:
Fatal trap 1: privileged instruction fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer    = 0x20:0xffffffff80dd92a0
stack pointer            = 0x28:0xfffffe00257e9158
frame pointer            = 0x28:0xfffffe00257e9260
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = resume, IOPL = 0
current process        = 0 (if_io_tqg_0)
trap number        = 1
panic: privileged instruction fault
cpuid = 0
time = 1676886739
KDB: enter: panic

Code:
db:0:kdb.enter.default>  bt
Tracing pid 0 tid 100022 td 0xfffff80005235740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00257e8f70
vpanic() at vpanic+0x194/frame 0xfffffe00257e8fc0
panic() at panic+0x43/frame 0xfffffe00257e9020
trap_fatal() at trap_fatal+0x38f/frame 0xfffffe00257e9080
calltrap() at calltrap+0x8/frame 0xfffffe00257e9080
--- trap 0x1, rip = 0xffffffff80dd92a0, rsp = 0xfffffe00257e9158, rbp = 0xfffffe00257e9260 ---
printf() at printf/frame 0xfffffe00257e9260
calltrap() at calltrap+0x8/frame 0xfffffe00257e9260
--- trap 0x1, rip = 0xffffffff80d93be1, rsp = 0xfffffe00257e9338, rbp = 0xfffffe00257e9350 ---
wakeup_any() at wakeup_any+0x1/frame 0xfffffe00257e9350
iflib_fast_intr_rxtx() at iflib_fast_intr_rxtx+0x88/frame 0xfffffe00257e93b0
intr_event_handle() at intr_event_handle+0x92/frame 0xfffffe00257e9400
intr_execute_handlers() at intr_execute_handlers+0x52/frame 0xfffffe00257e9430
Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00257e9430
--- interrupt, rip = 0xffffffff80e7e390, rsp = 0xfffffe00257e9500, rbp = 0xfffffe00257e9530 ---
hfsc_dequeue() at hfsc_dequeue+0x150/frame 0xfffffe00257e9530
tbr_dequeue() at tbr_dequeue+0xd1/frame 0xfffffe00257e9580
iflib_altq_if_start() at iflib_altq_if_start+0x9b/frame 0xfffffe00257e95b0
iflib_altq_if_transmit() at iflib_altq_if_transmit+0x103/frame 0xfffffe00257e95e0
ether_output_frame() at ether_output_frame+0xb4/frame 0xfffffe00257e9610
ether_output() at ether_output+0x60e/frame 0xfffffe00257e96a0
ip_output() at ip_output+0x1507/frame 0xfffffe00257e97f0
ip_forward() at ip_forward+0x3aa/frame 0xfffffe00257e98c0
ip_input() at ip_input+0x854/frame 0xfffffe00257e9970
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00257e99c0
ether_demux() at ether_demux+0x16a/frame 0xfffffe00257e99f0
ether_nh_input() at ether_nh_input+0x33b/frame 0xfffffe00257e9a50
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00257e9aa0
ether_input() at ether_input+0x89/frame 0xfffffe00257e9b00
iflib_rxeof() at iflib_rxeof+0xaa6/frame 0xfffffe00257e9be0
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe00257e9c20
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x121/frame 0xfffffe00257e9c80
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xd2/frame 0xfffffe00257e9cb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00257e9cf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00257e9cf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
 
Thanks @kubatko for your kind words.

If possible, may I request that you answer whether you did anything else other than installing the Intel-Microcode for the processor?

Did you upgrade your Kernel?
Did you limit / disable Processor CSTATE?
 
Similar to @kubatko 's experience, @LiFE1688 's suggestion to update the microcode seems to have helped my OPNsense VM not crash--going on 2 days strong now. Thanks for sharing your fix! Obviously I will update should my situation change.

Intel N6005 processor.
Only performed the microcode update, revision 0x24000023, date = 2022-02-19
No kernel upgrade.
No change to processor CSTATE from default.
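The revision and date quoted above can also be pulled out of the kernel log mechanically. This is a small sketch, assuming the log line carries the fields in the same `revision 0x..., date = YYYY-MM-DD` form; the wording around them may differ between kernel versions, and `microcode_update_info` is a made-up name.

```shell
# Extract "revision date" pairs from kernel log text on stdin.
# Lines without a "revision 0x..., date = ..." part are ignored.
microcode_update_info() {
    sed -n 's/.*revision \(0x[0-9a-fA-F]*\), date = \([0-9-]*\).*/\1 \2/p'
}
```

A possible invocation is `dmesg | grep -i microcode | microcode_update_info`; reading the kernel log may require root.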
 
I have been closely following this discussion since January, so let me throw in my two cents.
Beelink U59 N5105, 16 GB RAM, 512 GB SSD, PVE 7.3-3
VMs:
-RouterOS x86 6.49.6
-Android 10 x86
-Two Debian 11 with raspap, USB and WIFI passthrough there
and a dozen LXC containers.
All of the above have worked flawlessly for 11 days after:
-update microcode to 0x24000023 (intel-microcode_3.20221108.1_amd64.deb)
-update kernel to 5.19.17
No change of C-states or VM settings, only RAM ballooning is off.
So thanks a lot to @magingale for his post #348 and to all other participants too. Let's see what happens next.
 
Two identical nodes from AliExpress: https://www.aliexpress.com/item/100....order_list.order_list_main.76.21ef1802KYWU4a
Both show the same behavior: virtual machines freeze randomly.
One VM with pfSense and eight with Debian 11 (Zabbix, Graylog, SQL, www ...).
Because the system was randomly freezing (100% processor), each machine only has 1 processor and enough RAM. I still have the kernel version Linux 5.15.85-1-pve #1 SMP PVE 5.15.85-1 (2023-02-01T00:00Z). After upgrading the microcode, it has been stable for four days so far. I'll keep watching...
 
So, yesterday, I decided that the testing I did was stable enough.

N6005 + Proxmox v7.3-6
Updated Intel Microcode (non-free)

So, I decided to change stuff up to see if the updated intel-microcode does indeed help.
Updated Proxmox, which updated the Kernel to 6.1.10
Updated OPNsense to 23.1.1

and, the VM froze after a few hours.

Found a file /etc/modprobe.d/intel-microcode-blacklist.conf
Code:
# The microcode module attempts to apply a microcode update when
# it autoloads.  This is not always safe, so we block it by default.

blacklist microcode

So, I deleted the file, and rebooted, and it is stable again.
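Before deleting anything, it may be worth checking which files actually carry such an entry. This is a minimal sketch: `find_microcode_blacklist` is a hypothetical helper name, and it only matches active (uncommented) lines.

```shell
# List active "blacklist microcode" lines under a modprobe.d directory.
# Output format is grep's usual path:line-number:content.
find_microcode_blacklist() {
    # $1: directory to scan; defaults to /etc/modprobe.d
    grep -rnE '^[[:space:]]*blacklist[[:space:]]+microcode([[:space:]]|$)' \
        "${1:-/etc/modprobe.d}"
}
```

No output (and a non-zero exit status from grep) means no active entry was found in that directory.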
 
Please report in 14-21 days if it is still stable without reboots. One day is hardly proof of stability.

Good find on the intel-microcode-blacklist.conf btw!
 
Of course, I will continue to monitor it over longer durations; previously it ran for more than 14 days.
However, I would like to see it withstand future updates and different scenarios as well (different kernel versions), hence updating Proxmox as well as OPNsense and the other containers and VMs.

Currently, I have PCIe Passthrough for VM, GPU passthrough for LXC (Tested) and VM (Not Tested).

Pretty much whatever I can throw at it. So 24 hours without crashing is quite a significant feat; usually it crashes within an hour or two.

IMO, this VM crashing issue is probably a CPU erratum that the microcode patches. A long-term solution, however, would be to get the board manufacturers to release a BIOS update, so that a software microcode patch that has to be reapplied on every boot is no longer required.

 
After 5 days, I ended up getting a full freeze on my OPNsense VM in the middle of a work Zoom. My only change was the microcode--everything else (including the kernel) was stock. Looks like updating the microcode is not the only piece to the stability puzzle. Not sure where I should go next... maybe the edge kernel.
 
That was not good / nice.

The only reason why I use the opt-in Kernel 6.1 is for the iGPU support for GPU passthrough. It might have other fixes, but I am unsure.
Check /etc/modprobe.d/ to see if you have the intel-microcode-blacklist.conf file as well. Either comment out the blacklist or delete the file and see if that helps.

More people "testing" will provide more info, since everyone's usage and setup is different.
My setup is performing the same as before I decided to mess with it, so I have high hopes for it.
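For the "comment out" variant, here is a cautious sketch that prints the edited file to stdout instead of changing it in place. `comment_out_microcode_blacklist` is a hypothetical helper name, and the pattern only touches an active `blacklist microcode` line.

```shell
# Print a modprobe.d file with any active "blacklist microcode" line
# commented out. Writes to stdout only; the original file is untouched.
comment_out_microcode_blacklist() {
    # $1: path of the file to read
    sed -E 's/^([[:space:]]*)(blacklist[[:space:]]+microcode([[:space:]]|$))/\1# \2/' "$1"
}
```

Redirecting the output to a temporary file first makes it easy to review the result before copying it over the original as root.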

 
Was the microcode loaded or not before?

What was the output of:
Code:
cat /proc/cpuinfo

This is my output without modifying the blacklist showing the latest microcode from the package:
Code:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 156
model name      : Intel(R) Celeron(R) N5105 @ 2.00GHz
stepping        : 0
microcode       : 0x24000023
cpu MHz         : 788.958
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 27
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms rdt_a rdseed smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req umip waitpkg gfni rdpid movdiri movdir64b md_clear flush_l1d arch_capabilities
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs ept_mode_based_exec tsc_scaling usr_wait_pause
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs srbds mmio_stale_data
bogomips        : 3993.60
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

There is newer microcode out for Jasper Lake but unfortunately it's not available for Debian.

https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/releases
 
The file appeared after I messed with it. It was not there on the first try, which lasted 14 days; that setup would probably have kept running fine, but I decided it was good enough and made some updates and changes.

I got everything I want to run on the box working, plus some, so I will probably go dark for a month unless it crashes before the 25th of next month.
 
If you want to compare cat /proc/cpuinfo
Code:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 156
model name      : Intel(R) Pentium(R) Silver N6005 @ 2.00GHz
stepping        : 0
microcode       : 0x24000023
cpu MHz         : 3299.987
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 27
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms rdt_a rdseed smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req umip waitpkg gfni rdpid movdiri movdir64b md_clear flush_l1d arch_capabilities
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs ept_mode_based_exec tsc_scaling usr_wait_pause
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs srbds mmio_stale_data
bogomips        : 3993.60
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
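When comparing two saved /proc/cpuinfo dumps, the `cpu MHz` line changes from read to read and only adds noise. This is a small sketch that filters it out before a diff; the function name and the file names in the usage note are placeholders.

```shell
# Print a cpuinfo dump without the "cpu MHz" lines, which vary between
# reads and would otherwise show up as differences.
cpuinfo_without_mhz() {
    # $1: path of a saved cpuinfo dump
    grep -vE '^cpu MHz[[:space:]]*:' "$1"
}
```

For example, with two dumps saved as a.txt and b.txt: `cpuinfo_without_mhz a.txt > a.stable; cpuinfo_without_mhz b.txt > b.stable; diff a.stable b.stable`. Any remaining differences in the `microcode` or `flags` lines are then easy to see.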
 
