VM freezes irregularly

gyrex · Sep 21, 2022

I've got an uptime of 24 days on my Ubuntu VM running Docker, and the same on my pfSense/FreeBSD VM. I'm still on the 5.19 edge kernel and haven't switched over to the official Proxmox 5.19 kernel yet.

root@pve:~# uname -a
Linux pve 5.19.4-edge #1 SMP PREEMPT_DYNAMIC PVE Edge 5.19.4-1 (2022-08-25) x86_64 GNU/Linux

john@docker:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
john@docker:~$ uptime
00:48:29 up 24 days, 4:34, 1 user, load average: 0.18, 0.27, 0.12

pfSense:

[2.6.0-RELEASE][root@router.local]/root: uptime
12:53AM up 24 days, 4:16, 2 users, load averages: 0.71, 0.45, 0.33

gseeley · Sep 21, 2022

Just another data point for this thread.

I have a Supermicro X10SDV-4C-TLN2F with a Xeon D-1521 CPU running the latest Proxmox and one Windows 10 VM that I use for ~8 hours a day over Remote Desktop, so nothing taxing. It has a Quadro P620 GPU passed through via PCIe, along with an onboard USB bus for the keyboard, mouse, and a USB sound device.

On kernel 5.13.x I had no issues. When Proxmox updated to 5.15.x I had random hangs of the VM, sometimes multiple times in a day and I was never able to get a full 8 hours without a hang. Pinned 5.13.x and it was stable again. Found this thread, switched to 5.19 (Proxmox opt in) and it's been stable for several days now.

Marsupilani · Sep 22, 2022

With the 5.15.x VE kernels I have the same problem with freezing VMs running Debian 11.5, Rocky 8/9 and Fedora 36, and even LXC containers. All are used as Docker hosts and freeze every 1-3 days.
A Windows 10 21H1 VM on the same hosts has no problems.
The VEs run on 2 NUC11s with N5105, 1TB NVMe and 24/32GB RAM.
At the time of a freeze there is no high load on the VMs and no high memory usage in the VMs. I had top and docker stats open via SSH on a VM, and at the time of the freeze I saw no overload: free memory and buffers, nearly unused swap and a lot of idle CPU time. The iowait is about 2-5% and the load is between 2 and 3 on a 3-core VM. The VM logs do not show any errors around the time of the freeze. The VMs simply stop working, and after a reset I see the normal boot messages and file system repair processes.
I already limited Docker's CPU and memory usage to n*CPU-1 and 66% RAM, but the problem remains the same.
The VE hosts are still usable, but the VMs can only be reset. A reboot command sent to the VM times out. The freezes happened to different VMs on the same host at different times. I didn't find any hints in the PVE logs...
So to me it looks like a problem with qemu/kvm and/or the kernel of the VE.
The NUCs are not under high load and run in an unheated environment, and the crashes mostly happen at night (~10°C), so temperature should not be a problem... Today I updated the VE to 5.19. I hope this solves the problems...

Marsupilani · Sep 23, 2022

This did not solve the problem for me. The VM froze again, with no overprovisioning on the host. The host is still usable... it seems Proxmox VE is, once again, not usable for Linux guest VMs on N5105 CPUs at the moment, even with the 5.19 kernel.

rRobbie · Sep 23, 2022

rRobbie said:
Brief update on my NUC11ATKC4, 3 days running the edge kernel with no issue. I now switched to the pve-kernel-5.19.

I am testing under mild load (it's only a NUC) Fedora and Ubuntu VMs.

stress-ng --cpu 0 -l 50 --io 2 --vm 2 --vm-bytes 20% -t 0

Update in a few days.

Brief update on my NUC11ATKC4 after 7 days running the 5.19.7-1-pve kernel under mild load, no issues to report, no VMs freezing.

I now stopped the testing environment and I will be migrating some "production" (well, it's a home-lab) VMs.

ToniCipriani · Sep 26, 2022

Just hit 10 days with no freezes on mine.

Leghk · Sep 27, 2022

Still running into stability issues on a N5105:

Edge kernel:
Linux proxmox-4-fw1 5.19.8-edge #1 SMP PREEMPT_DYNAMIC PVE Edge 5.19.8-1 (2022-09-08) x86_64 GNU/Linux

Latest microcode:
[ 0.000000] microcode: microcode updated early to revision 0x24000023, date = 2022-02-19

Linux VM made it about 7 days, FreeBSD VM made it about 3.5.

Did somebody say 5.19.7-edge was stable? Should I jump to that and see if that helps?

If it helps, my FreeBSD VM is pfSense and is supposed to be my firewall. However, both NICs going to that VM are set to link_down=1 in Proxmox, so there should be literally no traffic flowing in or out, although technically I'm sure it's sending some packets out into the ether. I didn't catch why pfSense rebooted; the logs don't include anything useful and /var/crash is empty.

catcraft · Sep 27, 2022

Hi guys,

I also want to jump in here. My first post - hello everyone.

My Topton box with N6005/4x i226-V NICs ( "Model A" https://a.aliexpress.com/_EIFAnSJ ) "was" also affected by random freezes of my OPNsense VM, although Proxmox itself was running without issues.

I tried the 5.19.7-1-pve kernel and updated the microcode to revision=0x2400001f. So far no more freezes. It has been running for about 3 days now. I tried everything to crash OPNsense, but without luck.

VM Settings:

Raindeer · Sep 29, 2022

Hi Proxmox community,

I also experience these random freezes multiple times per day on a new Dell Optiplex 3000 (12th Gen Intel(R) Core(TM) i5-12500T). I have tried different kernels and BIOS settings, but it doesn't help. I also updated to the newest BIOS (1.5.2) and disabled power states, WLAN, Bluetooth etc., but Proxmox still crashes with or without VMs running on it.
It was stable in June and July. I was travelling in August, and when I came back I did an update/upgrade and it started to crash.
When it freezes or crashes, it doesn't even respond to ping, and I have to do a hard reset.

Version
pve-manager/7.2-11/b76d3178 (running kernel: 5.15.35-3-pve)

root@prox:~# dmesg | grep microcode
[ 1.066361] microcode: sig=0x90675, pf=0x1, revision=0x1e
[ 1.066628] microcode: Microcode Update Driver: v2.2.

Kernels: (5.15 and 5.19 freezes same way)
pve-kernel-5.15.30-2-pve/stable,now 5.15.30-3 amd64 [installed]
pve-kernel-5.15.35-3-pve/stable,now 5.15.35-6 amd64 [installed,auto-removable]
pve-kernel-5.15.39-1-pve/stable,now 5.15.39-1 amd64 [installed,auto-removable]
pve-kernel-5.15.53-1-pve/stable,now 5.15.53-1 amd64 [installed,auto-removable]
pve-kernel-5.15.60-1-pve/stable,now 5.15.60-1 amd64 [installed,automatic]
pve-kernel-5.15/stable,now 7.2-11 all [installed]
pve-kernel-5.19.7-1-pve/stable,now 5.19.7-1 amd64 [installed,automatic]
pve-kernel-5.19/stable,now 7.2-11 all [installed]
pve-kernel-helper/stable,now 7.2-12 all [installed]

lspci:
00:00.0 Host bridge: Intel Corporation Device 4650 (rev 05)
00:02.0 VGA compatible controller: Intel Corporation Device 4690 (rev 0c)
00:04.0 Signal processing controller: Intel Corporation Device 461d (rev 05)
00:08.0 System peripheral: Intel Corporation Device 464f (rev 05)
00:14.0 USB controller: Intel Corporation Device 7ae0 (rev 11)
00:14.2 RAM memory: Intel Corporation Device 7aa7 (rev 11)
00:16.0 Communication controller: Intel Corporation Device 7ae8 (rev 11)
00:17.0 SATA controller: Intel Corporation Device 7ae2 (rev 11)
00:1a.0 PCI bridge: Intel Corporation Device 7ac8 (rev 11)
00:1c.0 PCI bridge: Intel Corporation Device 7aba (rev 11)
00:1f.0 ISA bridge: Intel Corporation Device 7a86 (rev 11)
00:1f.3 Audio device: Intel Corporation Device 7ad0 (rev 11)
00:1f.4 SMBus: Intel Corporation Device 7aa3 (rev 11)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Device 7aa4 (rev 11)
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/980PRO
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 1b)

I also get screen flickering with error on login console: [drm] *ERROR* CPU pipe A FIFO underrun: transcoder

Interesting things from logs (not sure if they are related)
pnp 00:04: disabling [mem 0xc0000000-0xcfffffff] because it overlaps 0000:00:02.0 BAR 9 [mem 0x00000000-0xdfffffff 64bit pref]
hpet_acpi_add: no address or irqs in _CRS
secureboot: Secure boot could not be determined (mode 0)
ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
pnp 00:04: disabling [mem 0xc0000000-0xcfffffff] because it overlaps 0000:00:02.0 BAR 9 [mem 0x00000000-0xdfffffff 64bit pref]
Sep 28 11:31:01 prox kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
Sep 28 11:31:01 prox kernel: device-mapper: uevent: version 1.0.3
Sep 28 11:31:01 prox kernel: device-mapper: ioctl: 4.47.0-ioctl (2022-07-28) initialised: dm-devel@redhat.com
Sep 28 11:31:01 prox kernel: platform eisa.0: Probing EISA bus 0
Sep 28 11:31:01 prox kernel: platform eisa.0: EISA: Cannot allocate resource for mainboard
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 1
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 2
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 3
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 4
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 5
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 6
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 7
Sep 28 11:31:01 prox kernel: platform eisa.0: Cannot allocate resource for EISA slot 8
Sep 28 11:31:01 prox kernel: acpi PNP0C14:01: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Sep 28 11:31:01 prox kernel: wmi_bus wmi_bus-PNP0C14:02: WQBC data block query control method not found
Sep 28 11:31:01 prox kernel: acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Sep 28 11:31:01 prox kernel: ahci 0000:00:17.0: version 3.0
Sep 28 11:31:01 prox kernel: ahci 0000:00:17.0: AHCI 0001.0301 32 slots 4 ports 6 Gbps 0x50 impl SATA mode
Sep 28 11:31:01 prox kernel: ahci 0000:00:17.0: flags: 64bit ncq sntf pm clo only pio slum part ems deso sadm sds
Sep 28 11:31:01 prox kernel: r8169 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control
Sep 28 11:31:01 prox kernel: spl: loading out-of-tree module taints kernel.
Sep 28 11:31:01 prox kernel: znvpair: module license 'CDDL' taints kernel.
Sep 28 11:31:01 prox kernel: Disabling lock debugging due to kernel taint
Sep 28 11:31:02 prox kernel: cfg80211: Loading compiled-in X.509 certificates for regulatory database
Sep 28 11:31:02 prox kernel: cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
Sep 28 11:31:02 prox kernel: platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
Sep 28 11:31:02 prox kernel: cfg80211: failed to load regulatory.db
Sep 28 11:31:02 prox kernel: Creating 1 MTD partitions on "0000:00:1f.5":
Sep 28 11:31:02 prox kernel: 0x000000000000-0x000003000000 : "BIOS"
Sep 28 11:31:02 prox kernel: mtd: partition "BIOS" extends beyond the end of device "0000:00:1f.5" -- size truncated to 0x1000000
Sep 28 11:31:02 prox kernel: bluetooth hci0: Direct firmware load for mediatek/BT_RAM_CODE_MT7961_1_2_hdr.bin failed with error -2
Sep 28 11:31:02 prox kernel: Bluetooth: hci0: Failed to load firmware file (-2)
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: GuC firmware i915/tgl_guc_70.1.1.bin: fetch failed with error -2
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: GuC firmware i915/tgl_guc_70.1.1.bin: fetch failed with error -2
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: [drm] GuC firmware(s) can be downloaded from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: [drm] GuC firmware i915/tgl_guc_70.1.1.bin version 0.0
Sep 28 11:31:02 prox kernel: i915 0000:00:02.0: [drm] GuC is uninitialized
Sep 28 11:31:02 prox kernel: mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1ba
Sep 28 11:31:04 prox kernel: kauditd_printk_skb: 4 callbacks suppressed
Sep 28 11:31:04 prox kernel: audit: type=1400 audit(1664382664.161:15): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/lxc-start" pid=978 comm="apparmor_parser"

Let me know if there is more information I could provide to help solve this problem. BTW, is it possible to try the older 5.13 kernel with Proxmox 7.2? If so, how?

thanks

fabian · Sep 29, 2022

that sounds like a different issue, this thread is about VMs freezing on specific CPUs.. please open a new one!

McKajVah · Oct 1, 2022

I had problems with Ubuntu VM on a N5105. Changed to a Slitaz Linux distro and no longer any problems. Uptime of 1 month. I'm also running a Mikrotik RouterOS as a VM. No problems with that one either.

ToniCipriani · Oct 3, 2022

fabian said:
that sounds like a different issue, this thread is about VMs freezing on specific CPUs.. please open a new one!

Maybe a mod can edit the title of the OP? Title doesn't seem to indicate that we're only talking about Jasper Lake chips.

gyrex · Oct 5, 2022

ToniCipriani said:
Maybe a mod can edit the title of the OP? Title doesn't seem to indicate that we're only talking about Jasper Lake chips.

Agreed.

Also, we had a power outage overnight so my uptime was reset but my uptime prior to the reset was 37 days. I'm now on the official 5.15 kernel - we'll see how that goes.

oJ15K9y · Oct 5, 2022

McKajVah said:
I'm also running a Mikrotik RouterOS as a VM. No problems with that one either.

Which Version of CHR are you using inside your VM and what host kernel?
I'm having the exact opposite of your experience, my RouterOS CHR VM (Version 7.5) is constantly crashing after about 16-48 hours of uptime.
Already tried running Proxmox Edge Kernel 5.19.12 (which had the same instability as 5.15.x) and 5.19.7 (better stability, but still crashing at least once every two days).

Raindeer · Oct 5, 2022

I'm not sure if this helps you with this issue, but I finally found a solution for my random crashes on new Dell Optiplex hardware (with an Intel CPU).
My Proxmox 7.2 crashed multiple times per day with the 5.15 and 5.19 kernels. It didn't do so on 5.13.

While changing several BIOS settings, I noticed that when I disabled C-State control (the feature that enables the CPU's ability to enter and exit low power states) under the performance settings, it stopped crashing on 5.19, and probably also on 5.15.
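
For anyone who wants to see which idle states the kernel exposes before and after changing that BIOS option, here is a minimal sketch that reads the Linux cpuidle sysfs interface. The function name and the optional directory argument are made up for illustration, and whether the BIOS toggle actually changes this list depends on the platform.

```shell
# Hypothetical helper (not from the thread): list the idle states the kernel
# exposes for one CPU, with their "disable" flag.
# Optional argument: cpuidle directory (defaults to cpu0).
list_cstates() {
  base="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
  for dir in "$base"/state*; do
    # Skip when the glob matched nothing (e.g. cpuidle not available in a VM).
    [ -r "$dir/name" ] || continue
    name=$(cat "$dir/name")
    disabled=$(cat "$dir/disable" 2>/dev/null || echo "?")
    printf '%s disabled=%s\n' "$name" "$disabled"
  done
}

# Usage on the Proxmox host:
#   list_cstates
```

If the function prints nothing, the cpuidle directory is missing or empty. That alone does not say whether C-states are enabled in the BIOS.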

TDW0kxvche9 · Oct 7, 2022

Leghk said:
Been lurking on this thread for some time, running into the same issues as other folks on both FreeBSD and Linux guests. I'm on an N5105 with 4 x 2.5GbE I225-V from HUNSN. Thanks to all who preceded me to get us this far!

I upgraded to:
Linux xxxx 5.19.8-edge #1 SMP PREEMPT_DYNAMIC PVE Edge 5.19.8-1 (2022-09-08) x86_64 GNU/Linux

And was stable for about 3 days (a record!) running a FreeBSD VM (pfSense), then got the dreaded freebsd kernel panic:
Sep 15 14:21:40 kernel Fatal trap 12: page fault while in kernel mode
Sep 15 14:21:40 kernel cpuid = 0; apic id = 00
Sep 15 14:21:40 kernel fault virtual address = 0x20
Sep 15 14:21:40 kernel fault code = supervisor write data, page not present
Sep 15 14:21:40 kernel instruction pointer = 0x20:0xffffffff80bac0e3

But I was not running the latest microcode at the time. I just upgraded to:
[ 0.000000] microcode: microcode updated early to revision 0x24000023, date = 2022-02-19

So for anybody joining this thread late, the edge kernel alone doesn't appear to fix it.

I'll see if the microcode + edge kernel gets me the stability everybody else has been experiencing, been 12 hours so far.

FWIW, I had been trying to track this issue down before I found this thread. The box seemed quite stable before I joined it to a cluster. However perhaps that was just unrelated... But it made it a week just running pfSense in a VM. However since then, I rarely could make it >24 hours w/o a crash. However after finding this thread I'm going to stop chasing that red herring...

Hello Leghk,

I probably have the same appliance from HUNSN (N5105, 4x 2.5GbE I225) and the same problems with OPNsense VM crashes, also with the edge kernels.
Could you give me some advice on how you updated the microcode?

Thanks in advance,
Bastian

catcraft · Oct 7, 2022

TDW0kxvche9 said:
Hello Leghk,

I probably have the same appliance from HUNSN (N5105, 4x 2.5GbE I225) and the same problems with OPNsense VM crashes, also with the edge kernels.
Could you give me some advice on how you updated the microcode?

Thanks in advance,
Bastian

https://forum.proxmox.com/threads/what-is-correct-way-to-install-intel-microcode.75664/post-337060

dmesg | grep -i microcode

Latest Microcode Information:
https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/blob/main/releasenote.md
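
To compare the running revision against those release notes, the grep above can be narrowed to just the revision value. This is only a sketch: the function name is invented here, and it assumes the kernel log uses one of the two line formats posted in this thread.

```shell
# Hypothetical helper (not from the thread): print the microcode revision
# found in kernel log lines read from stdin.
# Handles both formats seen in this thread:
#   "microcode updated early to revision 0x24000023, date = 2022-02-19"
#   "microcode: sig=0x90675, pf=0x1, revision=0x1e"
microcode_revision() {
  sed -n 's/.*microcode.*revision[ =]\{1,\}\(0x[0-9a-fA-F]\{1,\}\).*/\1/p' | tail -n 1
}

# Usage on a live host (dmesg may require root):
#   dmesg | microcode_revision
```

If several matching lines are present, only the last one is printed.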

Regards,
Damian

fgerardi · Oct 8, 2022

catcraft said:
https://forum.proxmox.com/threads/what-is-correct-way-to-install-intel-microcode.75664/post-337060

dmesg | grep -i microcode

Latest Microcode Information:
https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/blob/main/releasenote.md

Regards,
Damian

the intel-microcode package is included in the official non-free repository. So it is not mandatory to enable backports. Also, by enabling backports you may potentially bring other unsolicited (and untested) upgraded packages to a system that is not designed to work as a standard debian system. So, at the very least, one should verify if a more recent version of intel-microcode is present in backports: if that is the case then you can enable backports, install the intel-microcode and finally disable it. At the moment the non-free standard repository contains the latest version of the debian package.

Just to report my current situation, I got a stable system for two weeks and counting with this combination:

- I opted for the official 5.19.7-1-pve kernel
- installed intel microcode from debian non-free repository
- memory ballooning disabled for the OPNsense VM, where I have PCI passthrough enabled for the NICs. Please note that I have another VM with memory ballooning enabled (but no PCI passthrough)

Best regards,
Fabrizio

Neobin · Oct 8, 2022

catcraft said:
https://forum.proxmox.com/threads/what-is-correct-way-to-install-intel-microcode.75664/post-337060

dmesg | grep -i microcode

Latest Microcode Information:
https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/blob/main/releasenote.md

Regards,
Damian

Be aware that the linked post refers to Debian Buster and therefore PVE 6. PVE 7 is Debian Bullseye.
General info about the microcode package on Debian can be found here: [1]

fgerardi said:
Also, by enabling backports you may potentially bring other unsolicited (and untested) upgraded packages to a system that is not designed to work as a standard debian system.

By default, the backports repositories have a priority of only 100 [2], while the stable repositories have a priority of 500. So unless you specify otherwise, no packages should be pulled in from the backports repositories.

[1] https://wiki.debian.org/Microcode
[2] https://backports.debian.org/Instructions/#index2h2
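
One way to check this on your own host is to look at the "Candidate:" line of `apt-cache policy`, which names the version apt would install by default. The sketch below only extracts that line. The function name is invented for illustration, and the backports command in the comments assumes `bullseye-backports` is already enabled in your APT sources.

```shell
# Hypothetical helper (not from the thread): read `apt-cache policy <pkg>`
# output on stdin and print the version apt would install by default.
candidate_version() {
  awk '$1 == "Candidate:" { print $2 }'
}

# Usage on a PVE 7 / Debian Bullseye host:
#   apt-cache policy intel-microcode | candidate_version
# To deliberately take the backports build instead (assumes
# bullseye-backports is enabled in the APT sources):
#   apt install -t bullseye-backports intel-microcode
```

With the default priorities, the candidate should be the stable version even when a newer one is listed from backports in the version table.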

R1CH · Oct 8, 2022

I think I'm hitting this issue as well, again with a Chinese mini PC (ChangWang N6005) with my own RAM and SSD. Latest BIOS from the vendor is installed. I also tried the microcode update from backports earlier this week which doesn't seem to have helped. Running kernel PVE 5.15.60-1. My guest VM is OpenWrt and one crash so far looks like this:

Code:

[22919.227360] WARNING: stack recursion on stack type 5
[22919.227387] int3: 0000 [#1] SMP NOPTI
[22919.227387] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.138 #0
[22919.227387] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[22919.227388] RIP: 0010:0xffffffff81a2d901
[22919.227388] Code: e8 84 36 00 00 48 8b 45 d0 65 48 2b 04 25 28 00 00 00 0f 85 c3 01 00 00 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc <cc> fb 66 0f 1f 44 00 00 e9 56 fe ff ff 48 8d b5 68 ff ff ff 4c 89
[22919.227389] RSP: 0018:fffffe0000009a30 EFLAGS: 00000092
[22919.227390] RAX: 0000000000000002 RBX: ffffffff82413580 RCX: fffffe0000009950
[22919.227390] RDX: 0000000000000008 RSI: 0000000000000000 RDI: fffffe0000009ab8
[22919.227391] RBP: 00000081c01060b9 R08: 0000000000000002 R09: 0000000000000001
[22919.227391] R10: 0000000000000000 R11: 0000000000000000 R12: fffffe0000009ab8
[22919.227391] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[22919.227392] FS:  0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[22919.227392] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22919.227393] CR2: 0000000000000008 CR3: 0000000122706000 CR4: 0000000000350ef0
[22919.227393] Call Trace:

I'll give some of the suggestions in this thread a try and see how it goes.
