VM freezes irregularly

Hi!
A year ago I decided to virtualize a couple of my home servers.
I tried Proxmox but found it too complex for my purpose. I used Debian Bullseye
as the host OS, with two QEMU VMs running Ubuntu 20.04, an OpenWRT VM, and Open vSwitch.
This system worked without a problem until this September, when I decided to swap
my old Celeron 847-based "server" for a new shiny Beelink U59 Pro (N5105).

I installed Ubuntu 22.04 as the host OS and moved the old VMs to it.
Except OpenWRT:
I bought a dedicated router because I anticipated problems, and I have a wife too.

Of the two VMs, one worked fine and the other, which was more heavily loaded, hung within two hours.
I experienced at least 10 freezes of that VM, between which I tried changing various parameters
in the hope of stabilizing it.
I am very happy that I accidentally came across this thread, or I would still be trying
to find the problem in the wrong place.

So: I had the latest microcode from the Ubuntu sources (seemingly not the latest from Intel)
and the stock 5.15 kernel.
My VM froze quite predictably at around 1h 20m of uptime.
It happened more than 10 times over two days.
The VM never ran for more than two hours.
It always hung in an unresponsive state with its sole CPU at 100% load.

Now I have installed kernel 5.19 and I can say the kernel does matter.
When the time came for the VM to freeze, I only lost the video streams from my IP cameras.
Syslog shows that rtsp-simple-server, written in Go, died with
"fatal error: unknown caller pc" and was relaunched.

So, kernel 5.19.0 on Ubuntu 22.04 does improve stability a lot, but it is still far
from the reliability a server should have.

Next I will try installing the latest kernel, 5.19.14, to see how it goes, and then...
Then I will consider something I never thought I would be considering :)
If this hardware works well on Windows, and it came with Windows Pro preinstalled,
maybe I should use Hyper-V.

Either Windows with Hyper-V, or go back to my old Celeron 847, which was reliable as a brick, and wait until Linux catches up with Jasper Lake.
 
Which version of CHR are you using inside your VM, and what host kernel?
I'm having the exact opposite of your experience: my RouterOS CHR VM (version 7.5) is constantly crashing after about 16-48 hours of uptime.
I've already tried running the Proxmox Edge kernel 5.19.12 (which had the same instability as 5.15.x) and 5.19.7 (better stability, but still crashing at least once every two days).
Hi.

This is running CHR 6.48.6 (long-term). I'm running the latest official 5.19.7 kernel and the latest microcode (0x24000023).
But from memory, this CHR version was stable even before I made any changes (5.15 kernel).
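
(In case anyone wants to double-check which microcode revision is actually loaded on their host, it can be read on any Linux system like this:)
# Microcode revision as reported by the CPU
grep -m1 microcode /proc/cpuinfo
# What the kernel loaded at boot (or via late load)
dmesg | grep -i microcode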

I'll boot up a new RouterOS CHR version 7 and report back. I'll also try again with an Ubuntu VM.

edit: There must be a problem with newer kernel versions.

SliTaz Linux = 100% stable. Running kernel 3.16.55.
RouterOS CHR 6.48.6 = 100% stable. Some flavour of kernel 3.3.
 
Test disabling the low-power idle states; this fixed reboot issues on a Celeron J3455 (Apollo Lake):
# Edit /etc/default/grub and set:
GRUB_CMDLINE_LINUX_DEFAULT="intel_idle.max_cstate=1 processor.max_cstate=1"
# Then regenerate the GRUB config and reboot:
update-grub
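
(After the reboot, the limit can be verified through standard sysfs paths, for example:)
# Should print 1 once intel_idle.max_cstate=1 is active
cat /sys/module/intel_idle/parameters/max_cstate
# Lists the C-states the kernel still exposes for core 0
ls /sys/devices/system/cpu/cpu0/cpuidle/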
 
Well, after the "official" 5.19, which crashed my server when the VM started, I am trying PVE Edge 5.19.14-2.

I also disabled C-states in the BIOS, and ballooning for the VM. I hope it's more stable.

But I had to use cpupower to lower the frequency, otherwise it sat at 2800 MHz all the time (N5105) (kernel problem or C-states?).
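
(Roughly the kind of cpupower call I mean; the 2.0 GHz cap below is just an example value:)
# Cap the maximum frequency on all cores
cpupower frequency-set -u 2.0GHz
# Show the resulting policy limits
cpupower frequency-info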

More info to come
 
Quick update:

On an N5105 I ran into guest Linux (7 days until crash) and FreeBSD (3.5 days until crash) VM crashes with 5.19.8-edge w/ 0x24000023 microcode.

I moved to 5.19.7-edge w/ 0x24000023 microcode. 15 days with no crashes yet on the Ubuntu VM. The FreeBSD VM (pfSense) made it about 7 days before it crashed with "trap: General Protection Fault". So 5.19.7-edge seems maybe more stable, but still not there.

Is the 5.13 kernel still compatible with Proxmox 7.2? Wasn't the 5.13 line stable?
 
I've got an uptime of 24 days on my Ubuntu VM running Docker, and the same with my pfSense/FreeBSD VM. I haven't switched over to the official Proxmox 5.19 edge kernel yet.

root@pve:~# uname -a
Linux pve 5.19.4-edge #1 SMP PREEMPT_DYNAMIC PVE Edge 5.19.4-1 (2022-08-25) x86_64 GNU/Linux

john@docker:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
john@docker:~$ uptime
00:48:29 up 24 days, 4:34, 1 user, load average: 0.18, 0.27, 0.12

pfSense:

[2.6.0-RELEASE][root@router.local]/root: uptime
12:53AM up 24 days, 4:16, 2 users, load averages: 0.71, 0.45, 0.33
Linux seems to be stable, but Windows suddenly restarts.
 
Update to my post #281
I had no luck with kernels up to 6.0 on the Ubuntu 22.04 host OS.
My VM node01 froze with every kernel I tried. The longest uptime was 3.5 hours.

I even installed the Ubuntu 22.10 beta, where not only the kernel but QEMU itself has been updated.
node01 died after 3 hours.

Now I have installed Windows 10 Pro, stripped down to the bare minimum with MSMG Toolkit, to run Hyper-V only.
I currently have an uptime of 24 hours for node01.
I am not very comfortable with Windows as a host OS, but it seems I have no choice right now.

I will keep you updated on how it goes and apparently will wait for Linux to become stable on the N5105.
The only strange thing I have noticed so far is increased memory usage by Transmission 3.0. It began to use swap and even became unresponsive. I never saw anything like this on my previous setup.
 
I also had freezes with 5.15 and 5.19.7-1, but with 5.19.7-2, which was released some days ago, it seems it's gone now. My OPNsense and Debian VMs have now been running for over 4 days without crashes. Also a note: I updated the microcode in the same timespan.


Machine: Intel N6005 from CWWK
 
Unfortunately, I don't think 4 days is enough to call it stable; I also thought I had it fixed, but it crashed again after 6 days. Looking at the changelogs, there's nothing seemingly relevant to this issue between those versions.
 
I don't think 4 days is enough to call it stable
You are right, but...
It seems that I have a peculiar VM.
On a plain Ubuntu 22.04 host OS it consistently crashes in less than 2-3 hours.

On some kernels it sometimes doesn't crash, but one particular service dies with various kinds of memory-violation messages.
On Hyper-V this VM doesn't crash at all, but a few times a day I see that this particular service died and was relaunched.
On Hyper-V this service always dies with a "segmentation fault" message in syslog.

Of course my "always" is based on a couple of days of experience.

The service is rtsp-simple-server (https://github.com/aler9/rtsp-simple-server), which I have been using for a year or so.
I am not sure, but I don't think I ever saw it crash on my old hardware.
This service receives RTSP streams from three cheap IP cameras.
These streams are then used to record video with ffmpeg, and one stream I watch live.

When the service dies, the live stream pauses, so it is noticeable and annoying for me.
I think I would have noticed it on the old hardware because I play this stream often.

After seeing consistent "segmentation fault" errors in syslog, I decided to tweak some memory-related features in the BIOS.
There is a huge number of settings in my BIOS, most of which I don't understand and can't find on Google.

After some trial and error I switched off two options in the CPU section:
"hardware prefetcher" and "adjacent cache line prefetch".
I also disabled VT-d and enabled "all thermal functions".

After that I have 27 hours of uptime, completely without errors in syslog.

Now I'm thinking about what to do next:
either try to pinpoint exactly which option was responsible for the memory errors, or reinstall Ubuntu (I'm on Hyper-V) and see if it works now...
 
I am also beginning to suspect some kind of VM-related hardware / BIOS / firmware issue on the N6005 or FMI01 board. It's very strange that the host itself never has issues. I am still testing things one by one to try and isolate the root cause.

Your suggestion of disabling VT-d prevents VMs from accessing host hardware directly, so there will likely be a significant performance impact; depending on the use case, disabling it is probably a last-resort option.
 
depending on the use case
I don't use VT-d.
I intended to use it to pass through one NIC and the WiFi modules to the OpenWRT VM, but that didn't work as I expected, so I bought a hardware router.
BTW, VT-d is effectively disabled on stock Ubuntu 22.04: you need to add GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on" for it to work.

Anyway, I'm not suggesting disabling VT-d. I just gave the list of parameters that I changed in my first try.
There were other parameters which I tried to change.
For example, I tried to enable "MRC ULT Safe Config", but the PC didn't boot after that :)

Now I have axed my Hyper-V and installed Ubuntu 22.04 as the host OS again.
The first results are not very promising.
When the time came for node01 to freeze, the host OS rebooted :(

OK, let's say it was a high-energy solar particle reflected off the Moon.
I did a full system upgrade and will wait for the morning.
 
Unfortunately, my BIOS changes didn't improve the situation with the Ubuntu host OS.
My node01 VM keeps freezing at the same rate.
 
Ubuntu 20.04
See my first post #281
CPU(s): 4 x Intel(R) Celeron(R) N5105 @ 2.00GHz (1 Socket)
Kernel Version: Linux 5.19.7-2-pve #1 SMP PREEMPT_DYNAMIC PVE 5.19.7-2 (Tue, 04 Oct 2022 17:18:40 +
PVE Manager Version: pve-manager/7.2-11/b76d3178

With the above configuration, my Ubuntu 20.04 and Debian 10 VMs have been running for about 7 days without issue, but Windows 7 suddenly restarts.
 
Perhaps it depends on the instructions the VM uses or the conditions it creates for the CPU.
I have two VMs: node01 and node02.
They are both Ubuntu 20.04 but run different software:
node01 - rtsp-simple-server, ffmpeg, transmission
node02 - apache2/wordpress, ejabberd, dovecot/postfix
node01 also has a physical disk attached.

node01 barely lives for 3 hours.
node02 has never died.

I'm on Hyper-V again. We'll see how long node01 survives.
 
Perhaps it depends on the instructions the VM uses or the conditions it creates for the CPU.
Yes, same situation. I used to run GitLab and it would crash; when I don't run GitLab, it's stable. I just upgraded the kernel to 5.19.7 and am running GitLab on Ubuntu 20.04; it's stable for now.
 
stable for now
I didn't try the 5.19.7 kernel, and I am not sure we are talking about the same kernel because, as I understand it, you are using Proxmox and I use plain Ubuntu 22.04.
But I tried kernels from here: https://kernel.ubuntu.com/~kernel-ppa/mainline/
I installed 5.19.0, then 5.19.13 and then 6.0.
Though there were some improvements on the 5.19.0 kernel, eventually my node01 died on all of them.
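
(For anyone wanting to try the same, this is roughly how those mainline builds are installed by hand; the file names below are only the pattern, take the actual ones from the version directory on that page:)
# Download the amd64 .deb packages for the chosen version, then install them together:
sudo dpkg -i linux-headers-*.deb linux-modules-*.deb linux-image-unsigned-*.deb
# Reboot and confirm the running kernel
sudo reboot
uname -r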

Do you have a physical drive passed through into the VM somehow?
Before I found this thread I was certain that the problem was my second physical drive, because the host OS complained about "excessive interface errors".
Then I reduced the interface speed to 3 Gbps and the errors were gone, but the VM still freezes anyway.
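
(For reference, one common way to cap a SATA link at 3 Gbps on Linux is the libata.force kernel parameter; this is only a sketch, the port number depends on the system, and however the limit is set the idea is the same:)
# Example: limit ATA link 1 to 3.0 Gbps (add to GRUB_CMDLINE_LINUX_DEFAULT, then update-grub and reboot)
libata.force=1:3.0Gbps
# The kernel log should then show the link coming up at 3.0 Gbps
dmesg | grep -i "SATA link up"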
 
Do you have a physical drive passed through into the VM somehow? -- No, I only use simple virtual machine functions. The problem seems to be getting more confusing....
 
