VM freezes irregularly

Which microcode is loaded?
Code:
||/ Name            Version              Architecture Description
+++-===============-====================-============-===========================================
ii  intel-microcode 3.20220510.1~deb11u1 amd64        Processor microcode firmware for Intel CPUs

OPNsense VM Uptime: 10d and counting... :cool:
 
So if you re-enable C-states in the kernel, then your VMs will crash? Have you tried it after all of your other changes?

So far I installed the intel-microcode package from the non-free Debian repository and will see what happens. Kernel is stock for now.
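For anyone following along, this is roughly how the intel-microcode package is installed on a Debian 11 host (a sketch; the exact sources.list line varies per install):

```shell
# Sketch for Debian 11 "bullseye": add the "non-free" component to the
# bullseye line in /etc/apt/sources.list (exact line varies per install), then:
apt update
apt install intel-microcode

# After a reboot, confirm the loaded microcode revision:
dmesg | grep -i microcode
```

The microcode is loaded early during the next boot, so a reboot is required before the new revision shows up.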

I really don't want to disable C-states, as it will increase power usage and heat. This is what I'm getting at idle with no fans, Enhanced C-States enabled, ASPM enabled, and the powersave governor:

Code:
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +35.0°C  (high = +105.0°C, crit = +105.0°C)
Core 0:        +31.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:        +31.0°C  (high = +105.0°C, crit = +105.0°C)
Core 2:        +31.0°C  (high = +105.0°C, crit = +105.0°C)
Core 3:        +31.0°C  (high = +105.0°C, crit = +105.0°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +39.0°C  (crit = +119.0°C)

nvme-pci-0100
Adapter: PCI adapter
Composite:    +35.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +35.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +36.9°C  (low  = -273.1°C, high = +65261.8°C)
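For reference, the powersave governor mentioned above can be checked and set through the standard cpufreq sysfs interface (a sketch; the setting does not persist across reboots, so a systemd unit or similar is needed for that):

```shell
# Check the current scaling governor on each core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Set "powersave" on all cores (root required); not persistent
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo powersave > "$g"
done
```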

Currently WFH, and I've already sold the other network equipment, so this box is currently in "production". Once things calm down, yes, I plan to revert some of the tweaks to find out which one eventually made the system stable.

FYI, I had complete host freezes, not only VM crashes. At the beginning, while configuring and testing the system (July 2022), I had VM-only crashes (OPNsense), and a fix I found was using the Westmere CPU profile (+aes flag). With that, the VM stopped crashing, but the host freezes kept occurring.
 
Hi

On my end, still on kernel 5.19.17-1-pve with 32 days of uptime, running two VMs, OPNsense ( 3 (1 sockets, 3 cores) [host,flags=-pcid;-spec-ctrl;-ssbd;+aes] [cpuunits=2048] ) with VirtIO NICs and Home Assistant, plus two LXC containers (Pi-hole and TP-Link Omada Controller, based on Ubuntu 22.04).

root@pve:~# last reboot | head -n 1
reboot system boot 5.19.17-1-pve Sat Nov 26 20:14 still running
root@pve:~# uptime
11:48:25 up 32 days, 15:33, 1 user, load average: 0.39, 0.32, 0.29
root@pve:~# uname -a
Linux pve 5.19.17-1-pve #1 SMP PREEMPT_DYNAMIC PVE 5.19.17-1 (Mon, 14 Nov 2022 20:25:12 x86_64 GNU/Linux

The host is a Topton N5105 (CW-6000) with i225 B3 NICs, BIOS dated 29/09/2022, 2x8GB RAM, and a WD SN530 NVMe SSD. An extra Noctua 40mm 12V fan (NF-A4x10 PWM) as exhaust is inaudible (as intake the noise would be noticeable).

But I've applied several options to the kernel cmdline; see below.

Kernel cmdline options:

intel_idle.max_cstate=1 (disables C-states deeper than C1, such as C3)
intel_iommu=on iommu=pt (enables the IOMMU; at the beginning I was going to pass the NICs through to the OPNsense VM, but ended up using VirtIO NICs while testing for the crashes, and kept these options)
mitigations=off (self-explanatory)
i915.enable_guc=2 (enables low-power H.264 encoding; https://01.org/linuxgraphics/downloads/firmware , https://jellyfin.org/docs/general/administration/hardware-acceleration/#intel-gen9-and-gen11-igpus )
initcall_blacklist=sysfb_init (GPU passthrough; https://wiki.tozo.info/books/server/page/proxmox-gpu-passthrough )
nvme_core.default_ps_max_latency_us=14900 ( https://esc.sh/blog/nvme-ssd-and-linux-freezing/ )
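On a stock Proxmox install booting via GRUB, options like these are applied roughly as follows (a sketch; hosts booted via proxmox-boot-tool, e.g. with ZFS root, use /etc/kernel/cmdline and `proxmox-boot-tool refresh` instead):

```shell
# Sketch: append the options to GRUB_CMDLINE_LINUX_DEFAULT in
# /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=1 intel_iommu=on iommu=pt"
# then regenerate the GRUB config and reboot:
update-grub
reboot

# After the reboot, verify the running cmdline:
cat /proc/cmdline
```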

Also, due to i2c-6 NAK errors ( [Sat Nov 26 20:14:37 2022] i2c i2c-6: sendbytes: NAK bailout. ) related to the iGPU, I've connected a dummy HDMI dongle after confirming that with a monitor plugged in the errors stopped, and so did the system crashes; but by then I had already applied the other kernel parameters.

I didn't test whether those errors were related to enabling the i915 GuC/HuC or not.

And due to errors related to the NVMe SSD (WD SN530 M.2 2242), I've applied the nvme_core.default_ps_max_latency_us parameter as well:

[Tue Nov 29 11:46:52 2022] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[Tue Nov 29 11:46:52 2022] nvme 0000:01:00.0: device [15b7:5008] error status/mask=00000001/0000e000
[Tue Nov 29 11:46:52 2022] nvme 0000:01:00.0: [ 0] RxErr
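For anyone checking the same on their box, the active latency cap and the drive's power states can be inspected like this (a sketch; `nvme` comes from the nvme-cli package, and the device name /dev/nvme0 is an assumption):

```shell
# Verify the APST latency cap is in effect after reboot:
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

# Optionally inspect the drive's power state descriptors and the
# current APST feature (requires the nvme-cli package):
nvme id-ctrl /dev/nvme0 | grep -E '^ps '
nvme get-feature /dev/nvme0 -f 0x0c -H
```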

- edit -
Also updated the Intel microcode:

root@pve:~# dmesg -T | grep microcode
[Sat Nov 26 20:14:32 2022] microcode: microcode updated early to revision 0x24000023, date = 2022-02-19
[Sat Nov 26 20:14:32 2022] SRBDS: Vulnerable: No microcode
[Sat Nov 26 20:14:33 2022] microcode: sig=0x906c0, pf=0x1, revision=0x24000023
[Sat Nov 26 20:14:33 2022] microcode: Microcode Update Driver: v2.2.

- edit -

Hopefully someone else finds this information helpful.
I followed this guide and my VMs have been stable for about a week now with kernel 5.19.17-1. Today I experimented with removing intel_idle.max_cstate=1 and had a crash within 3 hours. I strongly believe C-states are the issue, at least on my setup.
I'm running an N6005, with Domoticz and pfSense as VMs.
Thank you s0x for the guide. Let's hope C-states are usable again in the near future with a newer kernel and/or microcode.
 
It's too early to say for sure, but I'm also running 6.1.0-1-pve with intel_idle.max_cstate=1 and no crashes in almost a week. Unfortunately, power consumption is up by about 2 watts and CPU temps by approx. +5°C.

I purchased directly from Changwang so will ask them about this too and see if they have any insight.
 
It's too early to say for sure, but I'm also running 6.1.0-1-pve with intel_idle.max_cstate=1 and no crashes in almost a week. Unfortunately, power consumption is up by about 2 watts and CPU temps by approx. +5°C.

I purchased directly from Changwang so will ask them about this too and see if they have any insight.
If it works, why not just leave it alone? 2 watts is only a few $/year
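For a quick sanity check of that estimate, assuming a hypothetical $0.20/kWh electricity rate (actual rates vary):

```shell
# 2 W of extra draw, 24/7, at an assumed $0.20/kWh
awk 'BEGIN {
    kwh = 2 * 24 * 365 / 1000;         # yearly energy in kWh
    printf "%.1f kWh/yr, ~$%.2f/yr\n", kwh, kwh * 0.20
}'
# prints: 17.5 kWh/yr, ~$3.50/yr
```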
 
Oh, I absolutely plan to leave it like this. I just hope it gets fixed eventually, since a platform that's unstable out of the box is not great for anyone.
 
So updating the microcode did not fix it; pfSense hung after 3 days. Now trying the 5.19 kernel...

My family is not happy that pfSense is crashing. I am seriously considering buying another SSD and rebuilding everything from scratch with ESXi 8; apparently the I226-V driver is now bundled by default. Other people have reported that ESXi 7 is stable on these machines.

As much as I like the open-source nature of Proxmox, I've used vCenter at work for years, and other than a few bungled updates last year it has been rock solid.

QEMU/KVM and/or the Linux kernel seem to have compatibility issues with Jasper Lake CPUs. This is not an issue specific to Proxmox, as the Unraid community is reporting the same.
 
Hi all, I wanted to chip in here, as I found this thread very interesting.
My old 4th-gen i5 (the laptop serving as my home server, running Ubuntu 22.04 bare metal) died a month or so ago. Until then it was running all fine, no freezes whatsoever (unless I was trying to disable the Nvidia dGPU).

I bought myself an Odroid H3, featuring the N5105. First I simply added the disk and, not so simply, finally got it to boot. So far so good; however, I'd get random freezes (the box actually powered down altogether), and I suspected the power supply of being of poor quality. This happened in waves: maybe 2-3 times during a couple of days, then a week or so all good.

I installed Proxmox yesterday, dd'd the old disk to my VM, and am now running the old data & OS install inside the VM. It has already happened twice that the VM just froze while Proxmox itself was functioning well.
More info: I am not running pfSense. It happened once while I was watching a movie on Plex and another time in the middle of the night, so roughly 10 hours in between. I do run a Docker stack containing transmission-openvpn, Plex, Home Assistant, Z2M, etc. Quite a stack, but not intensive; it always ran great on my 8GB 4th-gen i5.

tl;dr: to me it seems related not to Proxmox but to Ubuntu & the N5105.
 
Just wanted to add on to this thread: I am having CPU soft/hard lockup issues with Arch Linux on an Intel NUC11ATKC4 running Linux kernel 5.15.83. Out of 20 machines running, it seems 1 of them will lock up per week.

Currently trying 6.1.3 and the intel_idle.max_cstate=1 boot parameter to see if the lockups disappear.

Just wanted to share, as I believe it definitely is a kernel issue: I do not have Docker or any VM software running on these NUCs, just Linux and my own software.
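Whether the intel_idle.max_cstate=1 parameter actually took effect can be verified through the standard cpuidle sysfs interface (a sketch; paths are the same on Arch, Debian, and Proxmox):

```shell
# Confirm the boot parameter was picked up
cat /proc/cmdline
cat /sys/module/intel_idle/parameters/max_cstate

# List the idle states the kernel actually exposes; with
# intel_idle.max_cstate=1 only POLL and C1-class states should remain
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
```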
 
1. It is not limited to Proxmox
2. It has nothing to do with I225/I226
3. It has nothing to do with power supply

I started running into this issue about 3 months ago when I was running a Debian Bullseye VM on an Intel NUC11ATKPE with a Debian Bullseye host. The NUC doesn't have an Intel NIC at all, and Proxmox wasn't installed at all.

After a couple of days I migrated the services from the VM to the host and the problem was gone.

Two weeks ago I received my CWWK mini-ITX board with four I226 NICs. I installed PVE with kernel 5.19 there, and a VM again, and the problem came back: random freezes, not as frequent, but there. Disabling advanced C-states didn't help either.
 
Two weeks ago I received my CWWK mini-ITX board with four I226 NICs. I installed PVE with kernel 5.19 there, and a VM again, and the problem came back: random freezes, not as frequent, but there. Disabling advanced C-states didn't help either.
Which CPU?
 
HSIPC with N5105 and Jasper Lake here...

It ran fine for approx. 5 days with OPNsense and a Debian Docker host VM; on New Year's the crashes started...

Right now I have been running stable for 4 days with:

Kernel 5.19
New microcode
Governor on powersave

Additionally, I changed the VMs' CPU type to KVM, but since it seems to be a problem on the host system, I doubt this is necessary...
 
So the 5.19 kernel resulted in a crash of pfSense in roughly 20 hours instead of 3 days. Removed it and went back to 5.15. Same exact backtrace. I wonder if there is a way to turn off C-states in FreeBSD/pfSense while leaving them on in Proxmox?

Code:
db:0:kdb.enter.default>  bt
Tracing pid 11 tid 100004 td 0xfffff80005208740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe0000d33500
vpanic() at vpanic+0x194/frame 0xfffffe0000d33550
panic() at panic+0x43/frame 0xfffffe0000d335b0
trap_fatal() at trap_fatal+0x38f/frame 0xfffffe0000d33610
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0000d33670
calltrap() at calltrap+0x8/frame 0xfffffe0000d33670
--- trap 0xc, rip = 0xffffffff81358a70, rsp = 0xfffffe0000d33740, rbp = 0xfffffe0025787b70 ---
Xprot_pti() at Xprot_pti+0x90/frame 0xfffffe0025787b70
acpi_cpu_idle() at acpi_cpu_idle+0x2ef/frame 0xfffffe0025787bb0
cpu_idle_acpi() at cpu_idle_acpi+0x3e/frame 0xfffffe0025787bd0
cpu_idle() at cpu_idle+0x9f/frame 0xfffffe0025787bf0
sched_idletd() at sched_idletd+0x326/frame 0xfffffe0025787cb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0025787cf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0025787cf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
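Regarding the question of turning off C-states inside FreeBSD/pfSense only: FreeBSD caps its idle depth via sysctl. A sketch, run inside the guest (the tunables are standard FreeBSD ACPI knobs; how much this helps on a given pfSense build is untested):

```shell
# Inside the pfSense/FreeBSD guest: show supported and current C-states
sysctl dev.cpu.0.cx_supported hw.acpi.cpu.cx_lowest

# Cap the deepest allowed C-state at C1 for this boot...
sysctl hw.acpi.cpu.cx_lowest=C1

# ...and persist the setting across reboots
echo 'hw.acpi.cpu.cx_lowest=C1' >> /etc/sysctl.conf
```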
 
Where do you get these backtraces from?
Running on an Odroid H3 with an N5105 and also a frozen VM; in my case it's a Debian/DietPi VM. RHEL and Home Assistant seem to work.
Kernel 5.19.
Installed the microcode yesterday, and it seems to make things worse.
 
I can't find a configuration that's stable for a week or more. I've used kernels 5.19 and 6.1, old and new microcode, and with and without C-states.

Never-crashing VMs:
-homeassistant VM (with USB device passthrough)
-Windows 10 VM
-Ubuntu 22.04.01 jumpserver
-DSMR power meter readings (with USB device passthrough)

Crashing VMs:
-Ubuntu 22.04.01 with Docker CE, almost every day with a completely crashed Ubuntu
-pfSense 22.05, every 4 days

I moved all the VMs to my older NUC i5-6260U, which had been running these VMs for months without a glitch. I'm thinking about moving to a native pfSense installation, on a separate NVMe disk so I can move back to the current disk if this topic finds a solution.

But the charm of this little box was that it could run all the VMs I wanted ;( I'm clueless
 
Where do you get these backtraces from?
Running on an Odroid H3 with an N5105 and also a frozen VM; in my case it's a Debian/DietPi VM. RHEL and Home Assistant seem to work.
Kernel 5.19.
Installed the microcode yesterday, and it seems to make things worse.
When pfSense (based on FreeBSD) has a kernel panic, it creates a crash dump that is downloadable from its web interface.
 
I can't find a configuration that's stable for a week or more. I've used kernels 5.19 and 6.1, old and new microcode, and with and without C-states.

[ ... ]

But the charm of this little box was that it could run all the VMs I wanted ;( I'm clueless

What makes me curious and scared alike is the question of what all the crashing VMs have in common, because I'm in the same boat and wanted to consolidate all VMs on one host...
I've seen a lot of kernel traps in the *Sense VMs now, but they don't really seem to be "the same", which makes it harder for me to understand what the underlying problem could be.

Plus: my Docker host never crashed, did not even have a hiccup. Temperatures were fine even before the governor switch (was ~45°C, now ~36°C).

I'm not sure if I should return my box or hope there will be a fix, but in case it's a silicon problem... yeah, well, f*** me.

Can anybody confirm this also happens on Protectli boxes with Jasper Lake?
 
I can't find a configuration that's stable for a week or more. I've used kernels 5.19 and 6.1, old and new microcode, and with and without C-states.

Never-crashing VMs:
-homeassistant VM (with USB device passthrough)
-Windows 10 VM
-Ubuntu 22.04.01 jumpserver
-DSMR power meter readings (with USB device passthrough)

Crashing VMs:
-Ubuntu 22.04.01 with Docker CE, almost every day with a completely crashed Ubuntu
-pfSense 22.05, every 4 days

I moved all the VMs to my older NUC i5-6260U, which had been running these VMs for months without a glitch. I'm thinking about moving to a native pfSense installation, on a separate NVMe disk so I can move back to the current disk if this topic finds a solution.

But the charm of this little box was that it could run all the VMs I wanted ;( I'm clueless

You have two Ubuntu VMs of the same version, except one is stable and one is not? Do they by any chance have different power-management settings? Are they running with the same virtual CPU and flags? How are they different? Is the one that's stable always under constant load?

I have a suspicion that the VM guests attempt to idle the CPU in a way that isn't supported when virtualized. The two backtraces from my pfSense both mention idling the CPU.
 
