VM freezes irregularly

Just for the record.

Hardware:
Mini PC with an N5105 and 4 x 2.5GbE Intel I226 Ethernet ports, from AliExpress.

Software:
pve-kernel-5.15.60-2-pve: OPNsense VM hangs every 1-2 days (KVM internal error).
pve-kernel-5.19.7-2-pve: OPNsense VM has run for over a week now without problems. Note that I also installed and enabled the microcode update, so I'm not sure whether that also affected stability.

2 NICs are passed through to the VM.
 
An update to my testing... so far the following did NOT help on my N6005:

Updated microcode (3.20220207.1~bpo11+1)
Machine type Q35
Disabling memory ballooning

Now trying 5.19 kernel.
 
Upgraded to kernel 5.19.16-edge. The QEMU VM lasted 3.5 days before another kernel panic. This is way better than the few hours I got on the 5.15 kernel.
 
My 10 cents:
1) The N5105 is unstable on Hyper-V too :D
It is just much more stable than on Linux. So far my node01 has gone down just once on Hyper-V, with exactly the same symptoms: 100% CPU load after a kernel panic.

2) Disabling turbo mode and limiting the CPU frequency to 1500MHz in the BIOS did not help.

4) Enabling CPU emulation as suggested here (https://forum.opnsense.org/index.php?topic=30230.15) didn't help, though I had an unbelievable uptime of about 20 hours on Linux with the "host-model" CPU.
The kvm64 CPU died within an hour.

5) Kernel 5.19.7 on Ubuntu 22.04 does not change anything in terms of stability for me either.
 
Just an update.

I'm still running the 5.19.7-1-pve kernel with updated microcode for the N5105.

Never had any issues with LXC machines.

Qemu-KVM machines:
Mikrotik RouterOS v6.48.6: 26 days uptime. No issues.
Mikrotik RouterOS v7.6rc1: 19 days uptime. No issues.
Antix Linux, kernel 5.10.57: 19 days uptime. No issues.

Seems the problems are solved, for me at least. RouterOS v7 would crash within days and Antix within hours on the Proxmox 5.15 kernel.
 
How "busy" are your KVM machines? I.e., do they handle internet traffic as a router/gateway (so constant processing), or not as much?
 
The Mikrotik RouterOS machines are handling a 500/500 fibre connection.
The Antix machine is only doing light stuff, but nonetheless it used to crash/lockup before I did the kernel and microcode update.
 
No luck with 5.19.7-2-pve for me, VM crash / reboot after ~ 3 days. I did notice the bullseye-backports microcode package did not actually have the latest microcode for the N6005. I installed it from unstable and I am now at 0x24000023, date = 2022-02-19. The previous revision was 0x2400001f, date = 2021-08-09. I really hope this helps as I am running out of ideas :(.
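
To confirm which microcode revision the host is actually running after installing a package, the revision can be read from /proc/cpuinfo. Below is a minimal Python sketch, assuming an x86 Linux host where each core reports a `microcode : 0x...` line; the function name is only illustrative.

```python
import re
from pathlib import Path


def microcode_revisions(cpuinfo_text: str) -> set[str]:
    """Return the distinct microcode revisions found in /proc/cpuinfo text."""
    return set(
        re.findall(
            r"^microcode\s*:\s*(0x[0-9a-fA-F]+)",
            cpuinfo_text,
            flags=re.MULTILINE,
        )
    )


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        # More than one entry would mean the cores report different revisions.
        print(sorted(microcode_revisions(cpuinfo.read_text())))
```

A new revision only shows up here once it has been loaded, which for a package install normally means after a reboot.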
 
You can try ESXi 8.0; it's stable for me.
 
N5105.
5.15.64-1: PBS VM (with the same latest kernel) and Debian VM (cloud kernel 5.10.0-19) have been up for 7 days, still OK.
These 2 VMs were up for 5 days with the previous 5.15.60-2, until I upgraded the kernel and rebooted the host.
With 5.15.60-1 the above VMs would hang within 1 day.
 
I'm now at the point where Proxmox seems to be the issue here... unfortunately I think the Proxmox devs are not investigating these issues.
At least for me, VMs on ESXi are even more unstable than Proxmox VMs. On Proxmox my OPNsense VM is rock solid and only my RouterOS VM keeps crashing regularly; on ESXi neither VM managed even a day of uptime.

So far I was able to somewhat stabilize the situation on Proxmox by disabling C-States (which isn't a long-term solution due to electricity cost ...) and using PVE kernel 5.19.7-1. I also disabled the watchdog in RouterOS itself, but I'm not sure whether this did anything.
It's nowhere near perfect, but my RouterOS VM managed to achieve 14 days of uptime without crashing, then crashed again. At least there are no more daily crashes, it seems.
 
I've been pulling my hair out for the past 2 months with my pfSense VM either hard crashing or temporarily losing access to the network card (the latter self-corrects in a couple of minutes). I also have a second VM, running Windows, that has crashed, just not as frequently.

I just found this discussion.

I flattened and rebuilt the pfSense install this morning. I re-installed it with UEFI firmware, the q35 machine type and a bridged network (previously it was SeaBIOS and the default machine type; I switched to see if it mattered). pfSense is set up very generically, almost entirely from the setup wizard, and it just crashed again after 4 hours. My box is a HUNSN RJ03 with the N5105 CPU and 32 GB RAM (now 16 GB, as I pulled one module as part of debugging).

After crashing, the CPU typically pegs at 100%. It looks like it might have gone through a reboot after crashing and then had a kernel panic. This is the end of the logs:

kernel trap 12 with interrupts disabled
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0xffffffff8135a848
fault code = supervisor write data, protection violation
instruction pointer = 0x20:0xffffffff80da98bd
stack pointer = 0x28:0xfffffe0025782980
frame pointer = 0x28:0xfffffe00257829e0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 11 (idle: cpu0)
trap number = 12
kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0x1
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff8135a840
stack pointer = 0x28:0xfffffe0000c83c00
frame pointer = 0x28:0xfffffe00005fa4f0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 12 (irq260: virtio_pci1)
trap number = 12
timeout stopping cpus
panic: page fault
cpuid = 0
time = 1667430428
KDB: enter: panic

Is this most likely the instability everyone is talking about? I am fully updated with the latest non-subscription install.

And is there a way to write a script that detects when the pfSense VM crashes, kills the VM and restarts it?

Thanks!
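
On the script question, here is a minimal Python sketch of one way to do it on the PVE host, under some assumptions: the guest answers ICMP pings on a known address, and a hung guest stops answering. The VM ID, the IP address and the thresholds are hypothetical placeholders. It uses `qm stop` and `qm start` from the Proxmox CLI, and `qm stop` is a hard stop, not a clean shutdown.

```python
import subprocess
import sys
import time

VMID = "100"              # hypothetical VM ID of the pfSense guest
GUEST_IP = "192.168.1.1"  # hypothetical address the guest answers pings on
MAX_FAILURES = 3          # consecutive missed pings before restarting
INTERVAL_S = 30           # seconds between checks
BOOT_GRACE_S = 120        # time given to the guest to boot after a restart


def guest_responds(ip: str) -> bool:
    """Send one ICMP echo with a 2 s timeout; True if the guest answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def next_failure_count(failures: int, responded: bool) -> int:
    """Reset the counter on a reply, otherwise count one more miss."""
    return 0 if responded else failures + 1


def should_restart(failures: int, max_failures: int = MAX_FAILURES) -> bool:
    """Restart only after several consecutive misses, not a single lost ping."""
    return failures >= max_failures


def restart_vm(vmid: str) -> None:
    """Hard-stop the VM, then start it again."""
    subprocess.run(["qm", "stop", vmid], check=False)
    subprocess.run(["qm", "start", vmid], check=True)


def main() -> None:
    failures = 0
    while True:
        failures = next_failure_count(failures, guest_responds(GUEST_IP))
        if should_restart(failures):
            restart_vm(VMID)
            failures = 0
            time.sleep(BOOT_GRACE_S)
        time.sleep(INTERVAL_S)


if __name__ == "__main__" and "--run" in sys.argv:
    main()
```

Run it as root on the host with `--run`, for example from a systemd service. Requiring several consecutive misses is meant to avoid restarting the VM for the temporary network-card dropouts described above, which recover on their own.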
 
What kernel are you running on the PVE host? Since upgrading to the 5.19.x kernel, my VMs (Ubuntu and pfSense) have had uptimes of 30+ days, marred only by a power outage.
 
Here are my versions:

proxmox-ve: 7.2-1 (running kernel: 5.15.64-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-5.15: 7.2-13
pve-kernel-helper: 7.2-13
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-3
libpve-guest-common-perl: 4.1-4
libpve-http-server-perl: 4.1-4
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-3
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-6
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

How do I upgrade to 5.19? I have been doing all the updates via the GUI (which are currently all up to date)
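
For checking which kernel the host is actually running, before and after any upgrade, here is a small Python sketch that parses a release string such as "5.15.64-1-pve" and compares it against 5.19. As far as I know the 5.19 kernel for PVE 7.2 was an opt-in package (pve-kernel-5.19, installed with apt) and not part of the normal GUI updates, but treat that package name as an assumption and check the official announcement first.

```python
import platform
import re


def kernel_version(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string like '5.15.64-1-pve'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_at_least(release: str, minimum: tuple[int, int] = (5, 19)) -> bool:
    """True if the release is at or above the given (major, minor) version."""
    return kernel_version(release) >= minimum


if __name__ == "__main__":
    running = platform.release()
    try:
        print(running, "is 5.19 or newer:", is_at_least(running))
    except ValueError as err:
        print(err)
```

A newly installed kernel only becomes the running one after the host has been rebooted into it.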
 
I thought 5.15.64-1 was stable enough (the PBS VM has been up for 13 days with some backup jobs), so I created a new Debian VM. While pulling Docker images, a kernel panic got me. o_O

[Attached screenshot: 1667888768771.png]
 
Hey guys, my first post here. I have been suffering from the same issue. My machine:

CPU(s): 4 x Intel(R) Celeron(R) N5095A @ 2.00GHz (1 Socket)
Kernel Version: Linux 5.15.64-1-pve #1 SMP PVE 5.15.64-1 (Thu, 13 Oct 2022 10:30:34 +0200)
PVE Manager Version: pve-manager/7.2-11/b76d3178

The symptoms are similar to what you reported. After some days of experiencing this issue I found this post about installing a watchdog: https://forum.proxmox.com/threads/hardware-watchdog-at-a-per-vm-level.104051/. So if someone is experiencing this issue, at least you can set this up and the VM will reboot itself.
This is a screenshot of what happened to my VM. I have also seen other errors, like segmentation faults at kernel level.
Any additional suggestions to avoid the problem?
Thanks
EDIT: apart from that, does changing the kernel to https://github.com/fabianishere/pve-edge-kernel improve the situation?

[Attached screenshot: 1667953643375.png]
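
For anyone wanting to try the per-VM watchdog mentioned above, here is a Python sketch that only builds the `qm set` command line for it. It assumes the `watchdog` option documented for Proxmox VM configs (`model=...,action=...`); the VM ID 100 is a hypothetical placeholder, and the guest still has to run a watchdog daemon that feeds the device, otherwise nothing will trigger.

```python
import shlex

# Values listed for the VM config "watchdog" option; verify against your PVE docs.
ALLOWED_MODELS = {"i6300esb", "ib700"}
ALLOWED_ACTIONS = {
    "reset", "shutdown", "poweroff", "pause", "debug", "inject-nmi", "none",
}


def watchdog_command(
    vmid: int, model: str = "i6300esb", action: str = "reset"
) -> list[str]:
    """Build the `qm set` argument list that adds a virtual watchdog to a VM."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown watchdog model: {model}")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown watchdog action: {action}")
    return ["qm", "set", str(vmid), "--watchdog", f"model={model},action={action}"]


if __name__ == "__main__":
    # Prints the command for review instead of running it.
    print(shlex.join(watchdog_command(100)))
```

The command is printed instead of executed, so it can be checked and then run by hand on the host.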
 
I tried PVE Edge 6.0.6-1 with the latest firmware/microcode. It's not better: a VM with Docker crashed within a day. But the pfSense VM has not crashed yet...
 
