VM freezes irregularly

Interesting, that is the one which freezes.
What I also noticed: it is the only VM which does not shut down when I hit the shutdown button in the UI!
QEMU guest tools are installed.
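
Since the UI shutdown relies on the guest agent, it may be worth verifying the agent is wired up on both ends. A quick sketch (VM ID 100 is just an example):
Code:
# On the host: the agent option must be enabled in the VM config
qm config 100 | grep agent

# Ping the guest agent; if this errors, the UI shutdown falls back to ACPI
qm agent 100 ping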
 
Hi everyone, I bought the same box a few days ago and I have the same freezes. Can you help me? My current kernel is 5.15.74-1-pve.
 
My Topton unit (N5105) seems stable for now (at least for 5 days). I recently updated the kernel to 5.19 and the microcode to 221108, but those things didn't help. In my configuration I've now set:
- GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on intel_idle.max_cstate=1 mitigations=off"
- Updated CPU type and flags for the OPNsense VM [kvm64,flags=-pcid;-spec-ctrl;-ssbd;-ibpb;-virt-ssbd;-amd-ssbd;-amd-no-ssb;+aes] (as seen in a post)
- Memory ballooning off for all VMs

I made those changes in the OPNsense VM to increase reliability, but that wasn't the VM that froze most often: my Ubuntu Server VM, with Docker installed and more than 10 containers running, froze once a day. That said, I'm sure that the thing that makes this setup more reliable is the intel_idle.max_cstate=1 flag. Now I'll let it run for approximately another 5 days and then try removing the C-state flag to make sure.
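
For anyone wanting to try the same thing, a rough sketch of applying the C-state flag on a GRUB-booted Proxmox host (hosts booting via systemd-boot, e.g. ZFS on UEFI, use /etc/kernel/cmdline and proxmox-boot-tool refresh instead):
Code:
# Append intel_idle.max_cstate=1 to GRUB_CMDLINE_LINUX_DEFAULT in:
nano /etc/default/grub

# Regenerate the boot config, then reboot
update-grub
reboot

# After the reboot, confirm the flag reached the kernel
cat /proc/cmdline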
 
Read this thread carefully.
Which box do you have and which CPU?
Which VMs freeze?
I have a Topton with the Celeron N5105. I have just one VM with pfSense, and it crashed 2 times in 2 days with a kernel panic message. I just upgraded Proxmox with pve-kernel-6.1, is that right? Do I need to upgrade the BIOS too? To be precise, pfSense crashes and restarts itself while Proxmox seems to have no problems.
This is the message I can read in the pfSense crash report:
Code:
spin lock 0xffffffff836e0f00 (smp rendezvous) held by 0xfffff80006c77740 (tid 100645) too long
panic: spin lock held too long
cpuid = 1
time = 1673039279
KDB: enter: panic

If you need, I can also attach logs.
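
If it helps correlate things with the host side, a sketch for pulling the Proxmox kernel log around the crash window (the filter terms are just suggestions):
Code:
# Kernel messages from the last two days, filtered for KVM/MCE hints
journalctl -k --since "-2 days" | grep -iE "kvm|mce"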
 
Things seem to be running well since upgrading to the 6.1 kernel.

Have a CWWK N5105 running Proxmox (6.1 kernel) and an OPNsense VM. Currently at over 2 weeks uptime. No microcode loaded and BIOS settings are basically default (didn't change C states or anything like that). OPNsense is using Linux bridges, no passthrough. PowerD disabled in OPNsense.

Out of the 4 NICs, I use them for:
1) WAN - To Cable Modem
2) Proxmox Management
3&4) LAGG to my main switch - LACP Layer 2+3

Really just bringing this up since my current setup doesn't have any tweaks or changes other than the 6.1 kernel. I wonder if some of those having major issues should try a more vanilla setup? I guess the hard part here is that the different mini-PCs seem to have different BIOSes, which may cause some of the issues.

EDIT: I should mention that the "downtime" two weeks ago was really just running a few updates and not a crash.

Linux pxcw 6.1.0-1-pve

Code:
root@pxcw:~# cat /proc/cpuinfo | grep micro
microcode       : 0x1d
microcode       : 0x1d
microcode       : 0x1d
microcode       : 0x1d
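
For comparison with the boxes where newer microcode helped: on a Debian-based Proxmox host the update is usually pulled in via the intel-microcode package (a sketch; requires the non-free component in your APT sources):
Code:
# After adding 'non-free' to the Debian entries in /etc/apt/sources.list:
apt update
apt install intel-microcode

# The new revision is applied early on the next boot; verify with:
dmesg | grep microcode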


> Things seem to be running well since upgrading to the 6.1 kernel. […]

  • Are C-States and Enhanced C-States enabled in the BIOS?
  • Is ASPM set to Auto for all PCIe ports in the BIOS?
  • What CPU governor are you using in Proxmox?
  • What do your thermals look like in Proxmox?

Install the thermal sensor package on Proxmox:
apt install lm-sensors
Run it:
watch sensors

Check the CPU governor:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Set the CPU governor to powersave until the next reboot:
echo "powersave" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Set it back to performance:
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Set it automatically at reboot (add this line via crontab -e):
@reboot echo "powersave" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Watch the CPU frequency; it should drop to 800 MHz or so in powersave:
watch "lscpu | grep MHz"
 
> Things seem to be running well since upgrading to the 6.1 kernel. […]

This would be great news. Having the same processor, I have so far been able to mitigate it, but a VM still freezes randomly - like every 3 weeks.

How do you install kernel 6.1? I was on the edge (fabian) kernel, but 6.1 is not available there yet.

Thanks
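
For reference, on Proxmox VE 7.x the 6.1 kernel ended up as an opt-in package in the regular repositories, so installing it should look roughly like this (package naming may differ depending on when you read this):
Code:
apt update
apt install pve-kernel-6.1
reboot

# Confirm after the reboot
uname -r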
 
A few answers below. Don't have access to the BIOS at the moment.

  • Are C-States and Enhanced C-States enabled in the BIOS?

Don't know - I don't expect to touch the BIOS anytime soon, but I'll look next time
  • Is ASPM set to Auto for all PCIe ports in the BIOS? Don't know
  • What CPU governor are you using in Proxmox?
  • What do your thermals look like in Proxmox? Good - see below

Install the thermal sensor package on Proxmox:
apt install lm-sensors
Run it:
watch sensors
Code:
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +33.0°C  (high = +105.0°C, crit = +105.0°C)
Core 0:        +33.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:        +33.0°C  (high = +105.0°C, crit = +105.0°C)
Core 2:        +33.0°C  (high = +105.0°C, crit = +105.0°C)
Core 3:        +33.0°C  (high = +105.0°C, crit = +105.0°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +40.0°C  (crit = +119.0°C)

nvme-pci-0100
Adapter: PCI adapter
Composite:    +32.9°C  (low  =  -0.1°C, high = +69.8°C)
                       (crit = +84.8°C)
ERROR: Can't get value of subfeature temp2_min: I/O error
ERROR: Can't get value of subfeature temp2_max: I/O error
Sensor 1:     +43.9°C  (low  =  +0.0°C, high =  +0.0°C)

Check the CPU governor:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Code:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance


 
You have two Ubuntu VMs of the same version, except one is stable and one is not? Do they by any chance have different power management settings? Are they running with the same virtual CPU and flags? How are they different? Is the one that's stable always under constant load?

I have a suspicion that the VM guests attempt to idle the CPU in a way that it doesn't support when virtualized. The two backtraces from my pfSense both mention idling the CPU.
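
If that suspicion is right, one low-risk experiment inside a FreeBSD-based guest (pfSense/OPNsense) is forcing a different idle method; machdep.idle is a standard FreeBSD sysctl, though whether it helps here is untested:
Code:
# Show the current idle method and the available ones
sysctl machdep.idle machdep.idle_available

# Switch away from mwait-based idling for this boot
sysctl machdep.idle=hlt

# Persist across reboots
echo 'machdep.idle=hlt' >> /etc/sysctl.conf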

Interesting, the VM which is crashing does have a higher load than the one that does not crash, which is less loaded with tasks.

I have the CPU set to "host"; I don't know how to check the flags?
 
Kernel 6.1 made my HomeAssistant VM unstable and it rebooted after a couple of hours. I am now back at 5.19.
 
> Interesting, the VM which is crashing does have a higher load […] I have the CPU set to "host"; I don't know how to check the flags?

It's definitely some sort of power management bug in the kernel and/or CPU.

If you have it set to host then it will pass through all of the flags of the host CPU. The reason I asked is because some people have said that setting it to kvm64 and disabling all mitigation flags helps. That tends to drastically decrease performance and increase load, which is maybe why it helps.
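
A quick way to check what the guest actually received, in case it helps (VM ID 100 is just an example):
Code:
# On the Proxmox host: CPU type and flag overrides for the VM
qm config 100 | grep -i cpu

# Inside a Linux guest: the flags the virtual CPU exposes
lscpu | grep -i flags

# Inside a FreeBSD guest: CPU features are logged at boot
dmesg | grep -i features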
 
> A few answers below. Don't have access to the BIOS at the moment. […] (temperatures and governor output quoted from the post above)
Those temperatures look good, which suggests that you may have C-states enabled. Without C-states mine idles at 40°C or so; with C-states I get an idle of 31-35°C.
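
If you want harder evidence than temperatures, C-state residency can be measured directly. A sketch with turbostat, which on Debian/Proxmox ships in the linux-cpupower package:
Code:
apt install linux-cpupower

# Sample counters for 10 seconds; the CPU%c1/CPU%c6 columns show
# how much time the cores actually spend in each C-state
turbostat sleep 10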
 
With the C1 state limit set via kernel argument, I get the temperatures below; the higher ones were while running a speed test.

Code:
root@pve:~# while(true); do cat /proc/cpuinfo | grep 'cpu MHz'; sensors | grep Core; echo; sleep 5; done
cpu MHz : 2544.287
cpu MHz : 2650.960
cpu MHz : 2464.522
cpu MHz : 2668.844
Core 0: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +35.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2000.000
cpu MHz : 991.122
cpu MHz : 1770.288
cpu MHz : 1885.248
Core 0: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +36.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 1441.799
cpu MHz : 2000.000
cpu MHz : 2069.034
cpu MHz : 2000.000
Core 0: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +34.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2397.635
cpu MHz : 2446.423
cpu MHz : 2403.496
cpu MHz : 2554.709
Core 0: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +35.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2546.931
cpu MHz : 2799.992
cpu MHz : 2799.991
cpu MHz : 2799.985
Core 0: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +43.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2791.067
cpu MHz : 2790.872
cpu MHz : 2790.299
cpu MHz : 2793.320
Core 0: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +43.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2784.890
cpu MHz : 2789.184
cpu MHz : 2785.929
cpu MHz : 2774.717
Core 0: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +43.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +44.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +44.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2800.000
cpu MHz : 2800.000
cpu MHz : 2800.000
cpu MHz : 2800.000
Core 0: +45.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +45.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +45.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +45.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2614.696
cpu MHz : 2720.019
cpu MHz : 2564.087
cpu MHz : 2558.963
Core 0: +38.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +38.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +38.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +38.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2743.360
cpu MHz : 2673.087
cpu MHz : 2269.640
cpu MHz : 2387.826
Core 0: +39.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +39.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +39.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +39.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 1247.746
cpu MHz : 635.949
cpu MHz : 1831.650
cpu MHz : 2169.053
Core 0: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +36.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2589.830
cpu MHz : 2520.912
cpu MHz : 2577.229
cpu MHz : 2491.158
Core 0: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +36.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +36.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 550.108
cpu MHz : 1085.714
cpu MHz : 2000.000
cpu MHz : 1537.398
Core 0: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +35.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2654.562
cpu MHz : 2476.600
cpu MHz : 2167.342
cpu MHz : 2573.204
Core 0: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +35.0°C (high = +105.0°C, crit = +105.0°C)

cpu MHz : 2000.000
cpu MHz : 2429.084
cpu MHz : 2000.000
cpu MHz : 1228.404
Core 0: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 2: +35.0°C (high = +105.0°C, crit = +105.0°C)
Core 3: +35.0°C (high = +105.0°C, crit = +105.0°C)

^C
 
Same issue here: I had VMs crashing almost daily on a fresh Proxmox installation, and they had been running for almost a month after updating to kernel 5.19 and the microcode.

Last night a VM crashed again (always the same one, Debian 11 with Docker)... I'll be trying the C-state fix.

Intel NUC with N5105 CPU.
 
> Same issue here […] I'll be trying the C-state fix. Intel NUC with N5105 CPU.
Did that crash occur only once? Or did your time between failures go back to 8-24 hours after the first crash?

It somewhat scares me that even the "reference design" has this error... Otherwise I would have thought it's a BIOS error from cheap third-party producers, but this way it feels narrowed down to the CPU or to QEMU's software implementation, because bare metal seems to work flawlessly, doesn't it?
 
> bare metal seems to work flawlessly, doesn't it?
It doesn't in my case; I posted my experiences a couple of days ago in this thread. And there are others - it seems to come down to the processor (at least N5105/N6005) in combination with the 5.1x kernel and C-states. I ran into issues both running bare metal (a couple of times a week) and later with the same disk in a VM under Proxmox (daily).

Since updating pve to the 6.1 kernel 2 days ago it has been running fine (governor set to powersave both before and after the update).
The only thing I didn't thoroughly check is how the C-states are set up in my BIOS, as I need to disconnect my pve box and move it to a screen.
 
