VM freezes irregularly

Interesting, that is the one which freezes.
What I also noticed: it is the only VM which does not shut down when I hit the shutdown button in the UI!
QEMU guest tools are installed.
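
Since the UI shutdown relies on the guest agent, it may be worth verifying the agent is wired up on both ends. A quick sketch (VM ID 100 is just an example):
Code:
# On the host: the agent option must be enabled in the VM config
qm config 100 | grep agent

# Ping the guest agent; if this errors, the UI shutdown falls back to ACPI
qm agent 100 ping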
 
Hi everyone, I bought the same box a few days ago and I have the same freezes. Can you help me? My current kernel is 5.15.74-1-pve.
 
My Topton unit (N5105) seems stable for now (at least for 5 days). I recently updated the kernel to 5.19 and the microcode to 221108, but those things didn't help. In my configuration I've now set:
- GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on intel_idle.max_cstate=1 mitigations=off"
- Updated CPU type and flags for the OPNsense VM [kvm64,flags=-pcid;-spec-ctrl;-ssbd;-ibpb;-virt-ssbd;-amd-ssbd;-amd-no-ssb;+aes] (as seen in a post)
- Memory ballooning off for all VMs

I made those changes in the OPNsense VM to increase reliability, but that wasn't the VM that froze most often: my Ubuntu Server VM, with Docker installed and more than 10 containers running, froze once a day. That said, I'm sure that the thing that makes this setup more reliable is the intel_idle.max_cstate=1 flag. Now I'll let it run for approximately another 5 days and then try removing the C-state flag to make sure.
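
For anyone wanting to try the same thing, a rough sketch of applying the C-state flag on a GRUB-booted Proxmox host (hosts booting via systemd-boot, e.g. ZFS on UEFI, use /etc/kernel/cmdline and proxmox-boot-tool refresh instead):
Code:
# Append intel_idle.max_cstate=1 to GRUB_CMDLINE_LINUX_DEFAULT in:
nano /etc/default/grub

# Regenerate the boot config, then reboot
update-grub
reboot

# After the reboot, confirm the flag reached the kernel
cat /proc/cmdline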
 
Read this thread carefully.
Which box do you have and which CPU?
Which VMs freeze?
I have a Topton with the Celeron N5105. I have just one VM with pfSense, and it crashed 2 times in 2 days with a kernel panic message. I just upgraded Proxmox with pve-kernel-6.1, is that right? Do I need to upgrade the BIOS too? To be precise, pfSense crashes and restarts itself while Proxmox seems to have no problems.
This is the message I can read in the pfSense crash report:
Code:
spin lock 0xffffffff836e0f00 (smp rendezvous) held by 0xfffff80006c77740 (tid 100645) too long
panic: spin lock held too long
cpuid = 1
time = 1673039279
KDB: enter: panic

If you need, I can also attach logs.
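
If it helps correlate things with the host side, a sketch for pulling the Proxmox kernel log around the crash window (the filter terms are just suggestions):
Code:
# Kernel messages from the last two days, filtered for KVM/MCE hints
journalctl -k --since "-2 days" | grep -iE "kvm|mce"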
 
Things seem to be running well since upgrading to the 6.1 kernel.

Have a CWWK N5105 running Proxmox (6.1 kernel) and an OPNsense VM. Currently at over 2 weeks uptime. No microcode loaded and BIOS settings are basically default (didn't change C states or anything like that). OPNsense is using Linux bridges, no passthrough. PowerD disabled in OPNsense.

Out of the 4 NICs, I use them for:
1) WAN - To Cable Modem
2) Proxmox Management
3&4) LAGG to my main switch - LACP Layer 2+3

Really just bringing this up since my current setup doesn't have any tweaks or changes other than the 6.1 kernel. I wonder if some of those having major issues should try a more vanilla setup? I guess the hard part here is that the different mini-PCs seem to have different BIOSes, which may cause some of the issues.

EDIT: I should mention that the "downtime" two weeks ago was really just running a few updates and not a crash.

Linux pxcw 6.1.0-1-pve

Code:
root@pxcw:~# cat /proc/cpuinfo | grep micro
microcode       : 0x1d
microcode       : 0x1d
microcode       : 0x1d
microcode       : 0x1d
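
For comparison with the boxes where newer microcode helped: on a Debian-based Proxmox host the update is usually pulled in via the intel-microcode package (a sketch; requires the non-free component in your APT sources):
Code:
# After adding 'non-free' to the Debian entries in /etc/apt/sources.list:
apt update
apt install intel-microcode

# The new revision is applied early on the next boot; verify with:
dmesg | grep microcode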


> Things seem to be running well since upgrading to the 6.1 kernel. […]

  • Are C-States and Enhanced C-States enabled in the BIOS?
  • Is ASPM set to Auto for all PCIe ports in the BIOS?
  • What CPU governor are you using in Proxmox?
  • What do your thermals look like in Proxmox?

Install the thermal sensor package on Proxmox:
apt install lm-sensors
Run it:
watch sensors

Check the CPU governor:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Set the CPU governor to powersave until the next reboot:
echo "powersave" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Set it back to performance:
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Set it automatically at reboot (add this line via crontab -e):
@reboot echo "powersave" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Watch the CPU frequency; it should drop to 800 MHz or so in powersave:
watch "lscpu | grep MHz"
 
> Things seem to be running well since upgrading to the 6.1 kernel. […]

This would be great news. Having the same processor, I have so far been able to mitigate it, but a VM still freezes randomly - like every 3 weeks.

How do you install kernel 6.1? I was on the edge (fabian) kernel, but 6.1 is not available there yet.

Thanks
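
For reference, on Proxmox VE 7.x the 6.1 kernel ended up as an opt-in package in the regular repositories, so installing it should look roughly like this (package naming may differ depending on when you read this):
Code:
apt update
apt install pve-kernel-6.1
reboot

# Confirm after the reboot
uname -r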
 
A few answers below. Don't have access to the BIOS at the moment.

  • Are C-States and Enhanced C-States enabled in the BIOS?

Don't know - I don't expect to touch the BIOS anytime soon, but I'll look next time
  • Is ASPM set to Auto for all PCIe ports in the BIOS? Don't know
  • What CPU governor are you using in Proxmox?
  • What do your thermals look like in Proxmox? Good - see below

Install the thermal sensor package on Proxmox:
apt install lm-sensors
Run it:
watch sensors
Code:
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +33.0°C  (high = +105.0°C, crit = +105.0°C)
Core 0:        +33.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:        +33.0°C  (high = +105.0°C, crit = +105.0°C)
Core 2:        +33.0°C  (high = +105.0°C, crit = +105.0°C)
Core 3:        +33.0°C  (high = +105.0°C, crit = +105.0°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +40.0°C  (crit = +119.0°C)

nvme-pci-0100
Adapter: PCI adapter
Composite:    +32.9°C  (low  =  -0.1°C, high = +69.8°C)
                       (crit = +84.8°C)
ERROR: Can't get value of subfeature temp2_min: I/O error
ERROR: Can't get value of subfeature temp2_max: I/O error
Sensor 1:     +43.9°C  (low  =  +0.0°C, high =  +0.0°C)

Check the CPU governor:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Code:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance


 
You have two Ubuntu VMs of the same version, except one is stable and one is not? Do they by any chance have different power management settings? Are they running with the same virtual CPU and flags? How are they different? Is the one that's stable always under constant load?

I have a suspicion that the VM guests attempt to idle the CPU in a way that it doesn't support when virtualized. The two backtraces from my pfSense both mention idling the CPU.
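
If that suspicion is right, one low-risk experiment inside a FreeBSD-based guest (pfSense/OPNsense) is forcing a different idle method; machdep.idle is a standard FreeBSD sysctl, though whether it helps here is untested:
Code:
# Show the current idle method and the available ones
sysctl machdep.idle machdep.idle_available

# Switch away from mwait-based idling for this boot
sysctl machdep.idle=hlt

# Persist across reboots
echo 'machdep.idle=hlt' >> /etc/sysctl.conf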

Interesting, the VM which is crashing does have a higher load than the one that does not crash, which is less loaded with tasks.

I have the CPU set to "host"; I don't know how to check the flags?
 
Kernel 6.1 made my HomeAssistant VM unstable and it rebooted after a couple of hours. I am now back at 5.19.
 
> Interesting, the VM which is crashing does have a higher load […] I have the CPU set to "host"; I don't know how to check the flags?

It's definitely some sort of power management bug in the kernel and/or CPU.

If you have it set to host then it will pass through all of the flags of the host CPU. The reason I asked is because some people have said that setting it to kvm64 and disabling all mitigation flags helps. That tends to drastically decrease performance and increase load, which is maybe why it helps.
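
A quick way to check what the guest actually received, in case it helps (VM ID 100 is just an example):
Code:
# On the Proxmox host: CPU type and flag overrides for the VM
qm config 100 | grep -i cpu

# Inside a Linux guest: the flags the virtual CPU exposes
lscpu | grep -i flags

# Inside a FreeBSD guest: CPU features are logged at boot
dmesg | grep -i features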
 
> A few answers below. Don't have access to the BIOS at the moment. […] (temperatures and governor output quoted from the post above)
Those temperatures look good, which suggests that you may have C-states enabled. Without C-states mine idles at 40°C or so; with C-states I get an idle of 31-35°C.
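
If you want harder evidence than temperatures, C-state residency can be measured directly. A sketch with turbostat, which on Debian/Proxmox ships in the linux-cpupower package:
Code:
apt install linux-cpupower

# Sample counters for 10 seconds; the CPU%c1/CPU%c6 columns show
# how much time the cores actually spend in each C-state
turbostat sleep 10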
 
With the C1 state limit set via kernel argument, I get the temperatures below; the higher ones were while running a speed test.

Code:
root@pve:~# while(true); do cat /proc/cpuinfo | grep 'cpu MHz'; sensors | grep Core; echo; sleep 5; done
cpu MHz : 2544.287
cpu MHz : 2650.960
cpu MHz : 2464.522
cpu MHz : 2668.844
Core 0: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +35.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2000.000
cpu MHz : 991.122
cpu MHz : 1770.288
cpu MHz : 1885.248
Core 0: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +36.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 1441.799
cpu MHz : 2000.000
cpu MHz : 2069.034
cpu MHz : 2000.000
Core 0: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +34.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2397.635
cpu MHz : 2446.423
cpu MHz : 2403.496
cpu MHz : 2554.709
Core 0: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +35.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2546.931
cpu MHz : 2799.992
cpu MHz : 2799.991
cpu MHz : 2799.985
Core 0: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +43.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2791.067
cpu MHz : 2790.872
cpu MHz : 2790.299
cpu MHz : 2793.320
Core 0: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +43.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2784.890
cpu MHz : 2789.184
cpu MHz : 2785.929
cpu MHz : 2774.717
Core 0: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +44.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +44.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2800.000
cpu MHz : 2800.000
cpu MHz : 2800.000
cpu MHz : 2800.000
Core 0: +45.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +45.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +45.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +45.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2614.696
cpu MHz : 2720.019
cpu MHz : 2564.087
cpu MHz : 2558.963
Core 0: +38.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +38.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +38.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +38.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2743.360
cpu MHz : 2673.087
cpu MHz : 2269.640
cpu MHz : 2387.826
Core 0: +39.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +39.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +39.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +39.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 1247.746
cpu MHz : 635.949
cpu MHz : 1831.650
cpu MHz : 2169.053
Core 0: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +36.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2589.830
cpu MHz : 2520.912
cpu MHz : 2577.229
cpu MHz : 2491.158
Core 0: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +36.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 550.108
cpu MHz : 1085.714
cpu MHz : 2000.000
cpu MHz : 1537.398
Core 0: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +35.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2654.562
cpu MHz : 2476.600
cpu MHz : 2167.342
cpu MHz : 2573.204
Core 0: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +35.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2000.000
cpu MHz : 2429.084
cpu MHz : 2000.000
cpu MHz : 1228.404
Core 0: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +35.0°C (high = +105.0°C, crit = +105.0°C)

^C
 
Same issue here: I had VMs crashing almost daily on a fresh Proxmox installation, and they had been running for almost a month after updating to kernel 5.19 and the microcode.

Last night a VM crashed again (always the same one, Debian 11 with Docker)... I'll be trying the C-state fix.

Intel NUC with N5105 CPU.
 
> Same issue here […] I'll be trying the C-state fix. Intel NUC with N5105 CPU.
Did that crash occur only once? Or did your time between failures go back to 8-24 hours after the first crash?

It somewhat scares me that even the "reference design" has this error... Otherwise I would have thought it's a BIOS error from cheap third-party producers, but this way it feels narrowed down to the CPU or to QEMU's software implementation, because bare metal seems to work flawlessly, doesn't it?
 
> bare metal seems to work flawlessly, doesn't it?
It doesn't in my case; I posted my experiences a couple of days ago in this thread. And there are others - it seems to come down to the processor (at least N5105/N6005) in combination with the 5.1x kernel and C-states. I ran into issues both running bare metal (a couple of times a week) and later with the same disk in a VM under Proxmox (daily).

Since updating pve to the 6.1 kernel 2 days ago it has been running fine (governor set to powersave both before and after the update).
The only thing I didn't thoroughly check is how the C-states are set up in my BIOS, as I need to disconnect my pve box and move it to a screen.
 
