VM freezes irregularly

Guys, I am running
root@prox2:~# uname -a
Linux prox2 5.19.7-2-pve #1 SMP PREEMPT_DYNAMIC PVE 5.19.7-2 (Tue, 04 Oct 2022 17:18:40 + x86_64 GNU/Linux
on an N5095A, and the VMs have now survived for more than 1 day. It is the first time my VMs have run for more than a day.
By the way, I am also using the microcode package, and I disabled extended C-states in the BIOS.

1668188923302.png
 
N6005, Odroid H3+, 64 GB RAM, default LVM SSD storage.
Only the VMs that use ZFS crash repeatedly. In my case that is Nextcloud (clean install, no load) and pfSense (used as main router).
All other LXCs and VMs are rock stable.
Side effect on the pfSense VM, maybe unrelated:
Used RAM slowly climbs to within ~1 GB of whatever amount is allocated, but only on the graph reported by Proxmox, not inside the VM itself.
After a crash it stays at the same high level. When I stop and start the VM manually, the reported RAM usage starts low and slowly increases again.
This happens whether ballooning is used or not.
 

1668202776612.png
I don't want to claim it's 100% fixed yet, but after upgrading to the latest microcode on my N6005 I'm at 2 weeks uptime on my OpenWrt VM, which had previously never reached a week. Note that you have to install this microcode version manually, because the version in the bullseye-backports repository is out of date.

Instructions:
Code:
wget http://http.us.debian.org/debian/pool/non-free/i/intel-microcode/intel-microcode_3.20220809.1_amd64.deb
dpkg -i intel-microcode_3.20220809.1_amd64.deb
update-initramfs -u -k all
reboot
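
To confirm after the reboot that the new microcode was actually loaded, something like the following may help. It is a minimal sketch, not part of the original instructions: the function name `microcode_revision` and its optional file argument are made up for illustration, and it assumes the kernel exposes a `microcode` field in /proc/cpuinfo, as it does on x86 Linux.

```shell
# Print the microcode revision the kernel reports for the first CPU.
# The optional argument (a cpuinfo-style file) is an illustrative addition
# that makes the function easy to try on a saved copy; by default it reads
# /proc/cpuinfo.
microcode_revision() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Lines look like "microcode<TAB>: 0x24000023"; print the first match only.
    awk -F': *' '/^microcode/ { print $2; exit }' "$cpuinfo"
}
```

Run `microcode_revision` once before installing the package and once after the reboot. If the value did not change, the update was probably not applied. `dmesg | grep -i microcode` should show the same revision from the kernel's point of view.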
 
What kernel are you running on the PVE host? Since upgrading to the 5.19.x kernel, my VMs (Ubuntu and pfSense) have had uptimes of 30+ days, marred only by a power outage.
Hi @gyrex, it looks like you have solved the problem. Are you using NVMe SSDs or SATA SSDs after switching back to Proxmox from ESXi?

I also run a host with an N5105-series CPU, and I'm also having problems with VMs rebooting irregularly:

The OpenWrt VM reboots, but PVE shows it as fine and running continuously, and the only entries in the logs seem to be related to NIC restarts. This often happens when other VMs running in PVE are doing PT downloads at high speed.
Sometimes the OpenWrt VM itself does not reboot, but the network interface still restarts.
In all these cases, the PVE host and the LXC containers keep running fine.


Seeing the discussion in this thread about the kernel version, I also tried to upgrade to the edge PVE kernel version 5.19, but no luck. Now I suspect a hardware-related cause:

The small box I'm using is very compact, with a high-temperature NVMe SSD and an Intel i225v King NIC in close proximity (see attached picture).

Given that the problems mostly show up during high-speed PT downloads, I suspect that the SSD and the NIC were both working under high load at the same time, causing the NIC to drop out. That then led to a series of further problems: I found that OpenWrt keeps sending out a lot of retries when the network is disconnected, which leads to a reboot after it exhausts all resources (or maybe for other reasons).

The above theory seems to explain the following phenomena:
1. Everything works fine at low load, but fails frequently during high-speed PT downloads
2. Migration to SATA SSDs alleviates the problem (this is yet to be verified, looking forward to your reply @gyrex)

I would also recommend that users running small, compact boxes stress test both the drive and the network, and describe their hardware when discussing the problem.

I plan to run more stress tests on SSDs and networks later, and will follow up with the results.

I'm not an expert in this area, so I'd like to ask @fabian whether a network card dropping out due to overheating could cause a VM to reboot. I'd also appreciate advice on how to locate the problem (e.g. how to get useful logs), and I'd be happy to contribute if possible.
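
One possible starting point for collecting logs is to filter the kernel log for messages that usually accompany a NIC dropping out. This is only a sketch: the function name `nic_events` is made up here, and the pattern list is a guess at typical link and watchdog messages, not an official or complete list for the i225 driver.

```shell
# Filter kernel log lines that commonly accompany a NIC dropping out.
# Reads from the files given as arguments, or from stdin if there are none.
# The pattern list is an assumption, not an exhaustive set of driver messages.
nic_events() {
    grep -iE 'link is (down|up)|watchdog|timed out|reset' "$@"
}
```

On the PVE host, `journalctl -k -b -1 --no-pager | nic_events` looks at the kernel messages from the previous boot (this only works if the journal is persistent). Inside an OpenWrt guest, `logread | nic_events` or `dmesg | nic_events` should do the same job. Comparing the timestamps with the time of the VM reboot would show whether the NIC dropped first.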

My hardware configuration is attached:
CPU: Intel N5095
NIC: Intel i225v-b3
Motherboard Brand: ChangWang
photo_2022-11-15 11.57.11.jpeg
 
Our hardware, OS, working conditions and workloads all differ, so we may each observe different phenomena. I therefore suggest that we:

1. Establish a standardized stress-testing process with sufficiently broad coverage
2. Briefly describe our hardware, software, etc. when discussing this problem


I wrote a simple stress test script that puts CPU, memory, I/O and disk under high load all at the same time. I hope it can help reproduce the problem. Maybe we can all use this script to do a little stress testing?

If you want to simulate a high load on the NIC, you can run iperf on the LAN or use other speed testing services.

Comments on the script are also welcome.

Find the script here:
Bash script for a comprehensive stress testing on CPU, memory, system I/O and disk (github.com)

I'm working on stress testing with this script, and will include the results later.
 
New CWWK 5105 v5 owner here (build/test details at STH forum). FWIW, I needed both the PVE kernel update (5.19.7-2-pve) and the microcode update (revision 0x24000023, date = 2022-02-19) to get stability. System is now stable under load for a total of ~3 hours of testing. Will report back here if I have issues in the coming days/weeks/months.
 
Rebooted yesterday for the latest kernel (5.19.17-1-pve) and the OpenWrt VM crashed again in < 12 hours :(. I think the comments about C-states might be on to something: I had switched to the powersave governor during the 20 day uptime, and the reboot reverted it to performance. Switched it again, let's see how long it lasts...
 
Add a crontab entry?
@reboot echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > /dev/null
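
If you would rather keep this in a small script than in a crontab one-liner, a loop over the per-CPU files does the same thing. This is a sketch under assumptions: the function name `set_governor` and its second argument are made up for illustration, and it assumes the usual cpufreq sysfs layout under /sys/devices/system/cpu.

```shell
# Write one governor name to every CPU's scaling_governor file.
# The second argument (the sysfs CPU root) is an illustrative addition so the
# function can be tried against a scratch directory; it defaults to the real path.
set_governor() {
    governor="$1"
    cpu_root="${2:-/sys/devices/system/cpu}"
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip the unexpanded pattern if no CPU exposes cpufreq.
        [ -e "$f" ] || continue
        echo "$governor" > "$f"
    done
}
```

Run it as root with `set_governor powersave`, then check the result with `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. The governors your system offers are listed in `scaling_available_governors` in the same directory.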

I've just ordered one of these devices, wish I'd seen this thread first!
 
My problem seems to be related to the Intel i225-v NIC.
The card has a hardware flaw that can cause severe packet loss during operation. As a result I occasionally see the card drop and then restart in the OpenWrt VM. In some cases OpenWrt keeps resending various requests while the card drops and restarts, which exhausts system resources and leads to a reboot.

This failure seems to occur mostly under heavy network traffic: my router managed to survive the above stress test for an hour without crashing, but after running a high-speed PT download (~60MB/s) for about half an hour the OpenWrt VM crashed and restarted again. Of course, the network card can also drop under low network traffic. One night my network card suddenly dropped and I lost Internet connection at home, but probably due to the low network load at that time, the router did not reboot.

For more information about Intel i225-v NIC failures, please refer to the following:
Intel Community Discussion:
Intel Network I225-V Nework issue persist even after FW upgrade - Intel Communities
Intel's press release:
Network Issues with Intel® Ethernet Controller I225-V
Intel's report:
https://cdrdv2-public.intel.com/621...Public External Specification Update-v1.2.pdf
 
Just a note for the future visitors:
The instability persisted with a J6413 Topton AliExpress box and the 5.15 kernel (reproducible with a CPU stress test).
Fortunately I found this forum and upgrading to 5.19.17-1-pve kernel solved the problem completely. Thanks to everyone here who posted.
 
I can confirm that Proxmox is unstable on N5105:
Code:
root@pve:~# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           156
Model name:                      Intel(R) Celeron(R) N5105 @ 2.00GHz
Stepping:                        0
CPU MHz:                         2000.000
CPU max MHz:                     2900,0000
CPU min MHz:                     800,0000
BogoMIPS:                        3993.60
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1,5 MiB
L3 cache:                        4 MiB
NUMA node0 CPU(s):               0-3

My great uptime on my pfSense router
Skjermbilde 2022-11-24 191005.jpg

I don't see any correlation between network load and crash rate.
Screenshot 2022-11-24 at 19-37-39 Observium.png

Edit:
Updated to 5.19 and it is currently stable, with 2 days of uptime.
https://forum.proxmox.com/threads/opt-in-linux-5-19-kernel-for-proxmox-ve-7-x-available.115090/
 
My 5.19 crashed after 2 weeks (N5105). System is down and I'm on the road, so I can't check the details until I get back. Probably going to dump Proxmox when I get back and just try running pfSense directly. I keep losing security cameras, etc. when I am on the road.
 
Hey guys
13 days without problems on my N5095A with Linux 5.19.7-2-pve #1 SMP PREEMPT_DYNAMIC PVE 5.19.7-2 (Tue, 04 Oct 2022 17:18:40
I will keep you posted on how it goes.
I have 4 Ubuntu instances running, all with similar behavior.

1669733106965.png
 
It seems that moving to PVE kernel 5.15.74-1-pve fixed the issues. I used to experience frequent crashes on an AMD 4900H platform, and since the end of Nov 2022 there have been no more crashes.
 
I had switched to 5.19, but still crashed after 3-6 days. On the same day that I swapped the original PSU for a Delta 12V 3A one, I also applied the microcode updates (Debian repository). One or the other (or both) fixed everything.

I'm at 28 days without any VM crash.

I have an OpenWrt VM as a router and AP with an MT7915E in full IOMMU passthrough (this VM was the one that crashed often), a HAOS VM (crashed only once before), a Debian 11 container (Pi-hole) and an Ubuntu 22.04 container (torrent/media/NAS).
 
Hello, I switched from a full Supermicro server board to Odroid H3+ x3 (Pentium Silver N6005).

I was also experiencing a crazy number of crashes, and updated the kernel to

5.15.74-1-pve
5.19.7-2-pve
6.0.12-edge

I also updated the microcode, but was still experiencing crashes.

What solved my problem was disabling the enhanced C-states in the BIOS. At least from my point of view, the enhanced C-states are the root cause of the problem.
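
To see which idle states the kernel is actually being offered, before and after changing the BIOS setting, the cpuidle entries in sysfs can be listed. This is a rough sketch: the function name `list_cstates` and its argument are made up for illustration, and whether the BIOS "enhanced C-states" option maps one-to-one onto these entries on your board is an assumption.

```shell
# List the idle states (C-states) the kernel exposes for one CPU,
# together with how often each state has been entered.
# The argument (a per-CPU sysfs directory) is an illustrative addition;
# it defaults to cpu0.
list_cstates() {
    cpu_dir="${1:-/sys/devices/system/cpu/cpu0}"
    for s in "$cpu_dir"/cpuidle/state[0-9]*; do
        # Skip the unexpanded pattern if cpuidle is not available.
        [ -d "$s" ] || continue
        printf '%s %s usage=%s\n' "${s##*/}" "$(cat "$s/name")" "$(cat "$s/usage")"
    done
}
```

Running `list_cstates` on the PVE host prints one line per state. State names differ between platforms, so the useful comparison is whether the deeper states disappear from the list, or stop accumulating usage, after the BIOS change.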
 
I am using a Celeron N5105 Mini-PC router to run an OpenWrt VM and other things.
The OpenWrt VM had random CPU spikes at 100% on Proxmox, but I was unable to pinpoint an issue in the VM itself. It happened at intervals of anywhere from a few hours to 3 days. I needed to force stop and start the VM to "fix" the issue.

Thanks to this thread, I updated the Intel microcode using the stable repo version (3.20220510.1~deb11u1), and the Linux kernel, first to 5.19, then to 6.1, and my OpenWrt VM seems more stable for now. I will report back here if the crashing continues; otherwise, consider it fixed. :)
Code:
Linux pve 6.1.0-1-pve #1 SMP PREEMPT_DYNAMIC PVE 6.1.0-1 (Tue, 13 Dec 2022 15:08:53 +0 x86_64 GNU/Linux

The issue seemed to be already fixed with 5.19 kernel but I wanted to jump on a supported LTS version.
I cannot tell however if the microcode update was enough. I made both upgrades almost at the same time.

Based on other things I saw in this topic and elsewhere, in my case:
I did not install the latest intel microcode from backport 3.20220207 or testing 3.20221108, only the one in stable 3.20220510.
I did not change anything in the BIOS itself (C-States for instance).
I did not change the PSU.
 