VM freezes irregularly

After 11 days and 12 hours pfSense hung with one core stuck at 100%. The settings have helped but there is likely a deeper issue with KVM that gets tripped eventually.

Just installed the latest 6.1.2 kernel tonight in hopes of not crashing at all:
Linux 6.1.2-1-pve #1 SMP PREEMPT_DYNAMIC PVE 6.1.2-1 (2023-01-10T00:00Z)

I'm noticing that the CPU now spends more time around 800 MHz than at 2000 MHz with the powersave governor on 6.1. That might be a good sign or a very bad sign, considering the VM guest panics seem to be related to CPU power management. CPU thermals are about the same though.

OPNsense at 16 days now...

Kernel 5.19
CPU type for VM is kvm64
Microcode updated
Governor: powersave

Did not touch C-states or anything else... Thermals dropped by 10 degrees C with powersave activated.

I don't dare change the CPU type back to host or upgrade the kernel to 6.1, although I can't really think of a reason why 6.1 shouldn't work anymore...
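
For anyone wanting to reproduce the governor setting: a minimal sketch, assuming the host exposes the standard Linux cpufreq sysfs files. The helper names `set_governor` and `show_governors` are made up for illustration, and the optional directory argument only exists so the functions can be tried against a scratch directory. The change needs root and does not survive a reboot.

```shell
# Sketch only: set_governor/show_governors are hypothetical helper names.
# The default path is the standard Linux cpufreq sysfs location.

set_governor() {
    gov="$1"
    root="${2:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        printf '%s\n' "$gov" > "$f"
    done
}

show_governors() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}

# Usage (as root on the Proxmox host):
#   set_governor powersave
#   show_governors
```

Running `show_governors` after a reboot is a quick way to see whether the setting stuck or needs to be reapplied.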
 
Did you add or remove any CPU flags for kvm64? Like disabling mitigations and enabling AES-NI? What BIOS are you emulating?

This is beyond frustrating. My pfSense VM was running for 11 days and then hung, not even a panic. The Omada controller running in an Ubuntu 20 LXC didn't even flinch.

I wonder if it makes a difference that OPNsense is running FreeBSD 13.1-RELEASE and pfSense is running FreeBSD 12.3-STABLE.
 
CPU: kvm64 / 1 Socket 2 Cores / +aes

BIOS: SeaBIOS

Machine: i440fx

NIC: Passthrough 2 Interfaces




My Debian Docker host never froze, but I still hesitated to put it into production, and this might be a reason... so I will take it into semi-production for further testing tomorrow or so.
The Debian host is configured with CPU type host (passthrough) and SeaBIOS.


Edit: The idling around 800 MHz is similar to my config.
 
Code:
[1899419.241761] invalid opcode: 0000 [#1] SMP NOPTI
[1899419.243050] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.10.146 #0
[1899419.244437] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[1899419.246558] RIP: 0010:0xffffffff81a30c10
[1899419.247678] Code: 4c 89 e2 48 83 ef 08 48 c7 c6 10 7f 05 81 65 ff 05 89 af 5e 7e e8 90 04 1d 00 65 ff 0d 7d af 5e 7e eb af cc cc cc cc cc cc cc <55> 48 89 e5 41 57 41 56 41 55 49 89 fd 41 54 49 89 f4 53 48 83 ec
[1899419.251296] RSP: 0018:ffffc900000f0c40 EFLAGS: 00010097
[1899419.252663] RAX: 0000000081c01060 RBX: 0000000000000000 RCX: ffffffff81c01060
[1899419.254267] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc900000f0c48
[1899419.255873] RBP: ffffc900000f0c49 R08: 0000000000000000 R09: 0000000000000000
[1899419.257465] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[1899419.259028] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[1899419.260574] FS:  0000000000000000(0000) GS:ffff88817bd00000(0000) knlGS:0000000000000000
[1899419.262247] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1899419.263648] CR2: ffffc8ffc7cf033c CR3: 0000000102a78000 CR4: 0000000000350ee0

Kernel 6.1 with intel_idle.max_cstate=1 lasted about 3 weeks, then the OpenWrt guest VM crashed during a 400 Mbps download.
 
Kernel 5.19 with intel_idle.max_cstate=1 processor.max_cstate=1 mitigations=off and the governor in powersave mode still froze my Docker VM.

I'm kinda pissed about this, how could they advertise virtualization features on hardware that clearly cannot work reliably even for a few days?
I've set up a Proxmox watchdog for that specific VM (which is always the one that crashes), let's see...
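
The kernel parameters listed above only matter if they actually land on the running kernel's command line, so it can be worth checking. A small sketch, assuming the usual `/proc/cmdline`; `has_kernel_param` and `check_params` are hypothetical helper names, and the file argument exists only so the check can be run against a sample file. On a GRUB-booted Proxmox host the parameters normally go into `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` followed by `update-grub`; hosts booting via systemd-boot use `/etc/kernel/cmdline` and `proxmox-boot-tool refresh` instead.

```shell
# Sketch only: helper names are made up for illustration.

has_kernel_param() {
    param="$1"
    file="${2:-/proc/cmdline}"
    # Split the command line into one token per line and look for an exact match.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}

check_params() {
    file="${1:-/proc/cmdline}"
    # The three parameters mentioned in this thread.
    for p in intel_idle.max_cstate=1 processor.max_cstate=1 mitigations=off; do
        if has_kernel_param "$p" "$file"; then
            echo "set:     $p"
        else
            echo "missing: $p"
        fi
    done
}

# Usage on the Proxmox host after rebooting:
#   check_params
```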
 
* NUC N5105, 8 GB RAM, running Proxmox Virtual Environment 7.x, latest and updated *
* Running 1 VM -> Proxmox Backup Server 7.x, latest and updated, 4 CPUs, 4 GB *

- Symptoms:
VM freezes at irregular intervals and stops responding entirely (only the VM; the PVE hypervisor stays normal)

- Actions taken:
Kernel 5.19 + microcode + *disabled memory ballooning in the VM config*

- Result:
12 days running smoothly, no freezes so far and counting
Just sharing this experience and the "solution" that is working "FOR ME UNTIL NOW" for this annoying problem with the NUC N5105; maybe it helps someone.
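
Ballooning can be disabled in the GUI (VM -> Hardware -> Memory) or from the host shell with `qm set <vmid> --balloon 0`. A minimal sketch of the latter; `disable_ballooning` and the `QM` override are hypothetical names for illustration, and 100 is a placeholder VM ID. The change may only apply after the VM has been fully stopped and started again.

```shell
# Sketch only: disable_ballooning and the QM variable are made-up names.

disable_ballooning() {
    vmid="$1"
    # QM can be overridden (e.g. QM="echo qm") to print the command
    # instead of running it; by default the real qm tool is called.
    ${QM:-qm} set "$vmid" --balloon 0
}

# Usage on the Proxmox host (100 is a placeholder VM ID):
#   disable_ballooning 100
```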
 
I took the exact same actions and my pfSense VM froze on me after 11 days. Now I'm trying the 6.1.2 kernel. I had 7 days of VM uptime on it but had to reboot Proxmox to perform an SSD firmware update.
 
Hello,
Same problem here. I have 4 Odroid H3 boards with the Intel N5105, with Proxmox installed on each board and a total of 6 VMs (k3s cluster), and I get random reboots on the VMs. I installed the 6.1.2-1-pve kernel last Thursday (upgrading from 5.15) and it looked OK with 3+ days of uptime, but last night one VM of the cluster rebooted.
My current settings:
- Proxmox host with 6.1.2-1-pve kernel
- VM memory balloon=0
- VM processor: host (I just changed this VM to KVM64+aes to check)
- BIOS: SeaBIOS
- Machine: q35 (previously i440fx)
- Powersave enabled

The VMs run Ubuntu 22.04 with kernel 5.15.0-1026-kvm. Any ideas?
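
Since everyone in this thread is comparing the same handful of VM settings, here is a sketch for pulling them out of a VM's config file. Assumptions: Proxmox VE keeps VM configs as `key: value` lines in `/etc/pve/qemu-server/<vmid>.conf`, with snapshot sections starting at the first `[name]` line; `vm_settings` is a hypothetical helper name; keys that are absent (the VM uses the default) are printed as `(default)`.

```shell
# Sketch only: vm_settings is a made-up helper name.

vm_settings() {
    conf="$1"
    for key in cpu balloon bios machine; do
        # Only read the current config: stop at the first "[snapshot]" section.
        val=$(awk -v k="$key" '
            /^\[/ { exit }
            index($0, k ": ") == 1 { print substr($0, length(k) + 3); exit }
        ' "$conf")
        printf '%s: %s\n' "$key" "${val:-(default)}"
    done
}

# Usage on the Proxmox host (100 is a placeholder VM ID):
#   vm_settings /etc/pve/qemu-server/100.conf
```

Posting that output alongside the host kernel version would make it easier to compare setups that crash with setups that do not.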
 
pfSense spontaneously rebooted on me after a few days, now on the 6.1.2 kernel. No kernel panic trace either. Proxmox doesn't even realize the VM restarted. So the new kernel doesn't fix the issue.
 
Still going strong on 5.19 with OPNsense.

Sounds like they fixed it in an intermediate version and brought it back with 6.x. Do you use CPU type host or kvm64?

Maybe you are even more unlucky and it is in fact fixed, but you got a faulty CPU...
 
With the 5.15 kernel I was using KVM64, but with the 6.x version I changed to CPU type host. Now I have gone back to KVM64+AES, only on the VM that had the last reboot.
 
I use Host as it appears to be more performant than KVM64. Although maybe sacrificing a bit of performance for stability is worth it.

I also have a theory that it may be due to the kernel running in the VM. pfSense Plus 23.01 should come out soon, based on FreeBSD 14-CURRENT, quite a jump from FreeBSD 12.3-STABLE. It will be interesting to see whether that fixes the issue.
 
Giving the guest VM the host CPU type makes everything dependent on the host, if I understand correctly... So if you update your host kernel to 5.19+ but run an older guest kernel (below 5.19, or the BSD equivalent), the guest might make a call that triggers the panic.

Just a wild guess here... But on the other hand, my Debian Docker VM is running kernel 5.10 with the host CPU type... and has never crashed. I am eager to find the reason for this.
 
>I'm kinda pissed about this, how could they advertise virtualization features on hardware that clearly cannot work reliably even for a few days?

@Edoardo396, who is "they"?
I think the hardware works. The problem is related to KVM/QEMU/the kernel; Jasper Lake is probably not a 100% supported platform at this moment.
 
Unfortunately I can not offer any actual help with those crashes on Odroid. I just want to add another data point, as it seems I had more luck: I have an ODROID-H3 with "Intel(R) Celeron(R) N5105 @ 2.00GHz", 64 GiB RAM and two cheap (consumer class, not recommended!) SSDs as rpool.

It has been up and running since the 15th of December with slowly increasing load. Currently it runs 10 Linux VMs; all are low-load but continuously in use. This node was on 6.1.0-1-pve for five weeks and has been running 6.1.2-1-pve since the 20th of January. I have had zero crashes so far.

Good luck!
 
Nice. Do you have any special config in Proxmox, the Odroid BIOS, the VMs or anything else? What kernel version is running in the VMs?
 
>Do you have any special config in Proxmox, the Odroid BIOS, the VMs or anything else?
No, nothing. Absolutely straightforward configuration, with the exception of the opt-in kernels. (And additional services for salt-minion and zabbix-agent. But those would possibly increase the risk, not decrease it...)
>What kernel version is running in the VMs?
My guests are Debian (Bullseye + some Buster) and Ubuntu (Jammy + some Focal). Nothing special.

That's why I said "can not offer any actual help" - it just works... for me.
 
Maybe you can help and just don't know it :D Did you go for the opt-in kernels straight after setting up the machine and _before_ creating any guest VMs?

Or did you use the default 5.15 kernel for some time and upgrade afterwards?
 
Installation came with 5.15, I upgraded immediately to 5.19. A few days later I went for 6.1:
Code:
reboot   system boot  6.1.2-1-pve      Fri Jan 20 11:23   still running
reboot   system boot  6.1.0-1-pve      Thu Dec 15 14:14 - 10:21 (22+20:07)
reboot   system boot  5.19.17-1-pve    Sat Dec 10 10:43 - 08:02  (21:18)
reboot   system boot  5.15.30-2-pve    Sat Dec 10 08:33 - 09:05  (00:32)
All my VMs were migrated from other nodes. Cold, not live. Unfortunately I have several different CPUs in my homelab and live-migration is not stable under all circumstances.

Just to verify I am on the correct node:
Code:
~# dmidecode | grep ODROID
        Product Name: ODROID-H3
 
