VM freezes irregularly

Well, my previous uptime was exactly 14 days (2 weeks). On the new setup it has been running properly for 2 days with no crashes. If the VM doesn't crash for a month (30 days), I will post a guide. If it crashes before that, you will see me complaining about it. So, here's hoping for the best.
 
After following this thread and others for several weeks, I've decided to write up my experience.

AliExpress box, Topton N5105 (4 NICs), with:
Corsair Vengeance SODIMM 16GB (2x8GB) DDR4 3200MHz CL22
Samsung 980 NVMe SSD, MZ-V8V500 (a firmware upgrade solved its high-temperature problem under Proxmox)
Samsung 2.5" HDD, 1.75TB

The stock BIOS defaults have C-states disabled.

Proxmox with:

3 VMs (the 1st NIC is for Proxmox management):
- pfSense (2 NICs passed through for WAN and LAN, with intel_iommu=on enabled)
- XigmaNAS on the last NIC
- Home Assistant on the last NIC

2 LXC containers:
- Ubuntu 21.10 for Plex Server
- Ubuntu 20.04.5 for the TP-Link Omada Controller

I have had it since July 2022.

Home Assistant, pfSense, and XigmaNAS reboot randomly every 1 to 10 days, even though the VMs are always on. The Home Assistant VM is the most affected.
(1-5 times the pfSense and Home Assistant VMs froze completely; this never happened with XigmaNAS.)
The containers have never had problems.

My tests:
I have tested every available kernel, from stock to 6.1.10, as well as the edge versions.
I disabled turbo boost, which changed the CPU from a fixed 2.8GHz to a fixed 2.0GHz.
I then enabled C-states in the BIOS and limited them in GRUB with intel_idle.max_cstate=1 processor.max_cstate=1.

I also updated the microcode to the latest version ([Fri Feb 24 18:11:21 2023] microcode: microcode updated early to revision 0x24000023, date = 2022-02-19).
I also tested changing the CPU type from host to kvm64 or qemu,
enabling/disabling the QEMU guest agent, and running with and without ballooning RAM.

Nothing helped!

Now I'm testing again with:
- updated microcode
- turbo boost enabled in the BIOS
- C-states enabled in the BIOS and GRUB
- all VMs with CPU type host
- ballooning RAM disabled

I have now put a "#" in front of the blacklist microcode line in intel-microcode-blacklist.conf, as @LiFE1688 suggested, and the system has been up for 21 hours.

This situation is very frustrating!
 
I also have the same issue, but my crash frequency is 4 days. I am going to install the microcode later today and report back.

Is there a way to detect the VM is unresponsive and reboot the VM perhaps automatically?
 
I asked ChatGPT to write a bash script for it. Just need to add a crontab for it to run every 5min or so.

Bash:
#!/bin/bash
# Proxmox virtual machine IDs to check
vmids=(100 200)

# Iterate over the virtual machine IDs
for vmid in "${vmids[@]}"
do
    # Check if the virtual machine is running
    vm_state=$(qm status "$vmid")

    # If the virtual machine is running, check its reachability
    if [ "$vm_state" == "status: running" ]; then

        # Check the reachability of the virtual machine
        # (qm guest exec needs the QEMU guest agent running inside the VM)
        check_result=$(qm guest exec "$vmid" -- /bin/sh -c "echo test")
        # If the check result does not contain "test", restart the virtual machine
        if ! echo "$check_result" | grep -q "test"; then
            echo "Virtual machine $vmid is not reachable. Restarting..."
            qm stop "$vmid" && qm start "$vmid"
        else
            echo "Virtual machine $vmid is reachable."
        fi
    else
        echo "Virtual machine $vmid is not running. Nothing to do."
    fi
done
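
Running the check every 5 minutes from cron could look roughly like the sketch below. The script location /root/vmcheck.sh, the log location /var/log/vmcheck.log, and the helper name cron_entry are assumptions for illustration, not something from the original post.

```shell
#!/bin/bash
# Sketch: build a cron.d-style entry that runs the watchdog script periodically.
# SCRIPT_PATH and LOG_PATH are assumed locations; adjust them to your setup.
SCRIPT_PATH="/root/vmcheck.sh"
LOG_PATH="/var/log/vmcheck.log"

# Print one /etc/cron.d line (entries there carry a user field, here root).
cron_entry() {
    local interval_min="$1" script="$2" log="$3"
    printf '*/%s * * * * root %s >> %s 2>&1\n' "$interval_min" "$script" "$log"
}

cron_entry 5 "$SCRIPT_PATH" "$LOG_PATH"
```

Redirecting that output into a file under /etc/cron.d (for example one named vmcheck, a hypothetical name without dots) and making the script executable should be enough for cron to pick it up.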
 
Does the script work reliably? Have you had any false positives? Where does it log these checks?
 
This worked like a charm. Even with the microcode change, my VM died within 24 hours; the script kicked in and restarted the VM after detecting it was unresponsive. The last "fi" in the script as posted (after "done") is not necessary. Thanks for sharing.
 
I cut my test short and rebooted because I figured I might as well test it with the microcode-20230214 release.

Code:
[    0.000000] microcode: microcode updated early to revision 0x24000024, date = 2022-09-02
[    0.145853] SRBDS: Vulnerable: No microcode
[    1.161258] microcode: sig=0x906c0, pf=0x1, revision=0x24000024
[    1.161274] microcode: Microcode Update Driver: v2.2.
 
How did you install it? I don't see it in the Debian repos.
 
No false triggers so far. I added a line that writes the date/time to a file once the script has passed "qm stop $vmid && qm start $vmid".
The line I added, just before the else line, is "echo $(date +"%F %T") >> /root/vmcheck.log".
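
A slightly more detailed variant of that logging line, which also records which VM was restarted, might look like this sketch. The function name log_restart and its optional second argument are hypothetical; only the timestamp format and the default log path come from the post above.

```shell
#!/bin/bash
# Sketch: append a timestamped line whenever the watchdog restarts a VM.
# log_restart is a hypothetical helper; the log path defaults to /root/vmcheck.log.
log_restart() {
    local vmid="$1" logfile="${2:-/root/vmcheck.log}"
    echo "$(date +"%F %T") restarted VM $vmid" >> "$logfile"
}
```

In the watchdog script it would be called as log_restart "$vmid" right after the qm stop / qm start line.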
 
For those who can't wait for the Debian repo to be updated and want to test the newest microcode, I've re-packaged the older version with the updated microcode data files (20230214). Install this over the top of the existing one and it should update for you. The version is intentionally kept old so that when the newer Debian package comes out it will take precedence over this hack job :).

Code:
wget https://r-1.ch/intel-microcode_3.20221108.2_amd64.deb
dpkg -i intel-microcode_3.20221108.2_amd64.deb

I just installed this on my N6005 and it looks like it updated. Before and after:
Code:
[    0.000000] microcode: microcode updated early to revision 0x24000023, date = 2022-02-19
[    0.000000] microcode: microcode updated early to revision 0x24000024, date = 2022-09-02

Fingers crossed it actually does something...
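
To compare the revision before and after an update without reading the whole kernel log, a small filter along these lines may help. It is only a sketch: extract_microcode_rev is a hypothetical name, and the pattern matches the "updated early to revision" message format shown in this thread, which other kernel versions may word differently.

```shell
#!/bin/bash
# Sketch: pull the microcode revision out of kernel log lines read from stdin.
# extract_microcode_rev is a hypothetical helper name.
extract_microcode_rev() {
    # Keep only the hex revision; if several lines match, report the last one.
    sed -n 's/.*microcode updated early to revision \(0x[0-9a-fA-F]*\).*/\1/p' | tail -n 1
}
```

Typical use would be dmesg | extract_microcode_rev on the PVE host after a reboot; the microcode field in /proc/cpuinfo should show the same revision.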
 
Seems there is a new pve-qemu-kvm package update that bumps QEMU from 7.1 to 7.2.

I'll likely update this package and manually install the 0x24000024 microcode next time my pfSense VM crashes.

Trying to figure out if moving from pfSense 22.05 to 23.01 did anything as it's now FreeBSD 14 instead of 12.3. Has been running a week so far, it tends to crash at the two week mark.
 
Wow, that's great, nice find with ChatGPT :)
I'll be gone for a couple of months, and since my box takes care of the lights and curtains it needs to keep working, so I'll add this script (and maybe add more logging).

Today I at least installed a Cloudflare tunnel on the PVE host, so I can securely access it remotely should the VM crash. I figured running WireGuard in a VM would be a no-go.

More on topic: I've had some crashes over the past couple of weeks, mostly right after I had finished watching shows via Plex. Also, after updating the VM (Ubuntu 22.04) it would crash within a day.
 
I had the same problem.
I only updated the microcode, and it has been running for more than 24 hours with no problems.
 
Hey All,

I've been a lurker on this thread for a while. I managed to find a fix that worked, which involves disabling the Intel P-State driver with a kernel option. Since applying the fix about a month ago (I wanted to test it thoroughly), I haven't had any VM crashes on my N5105 Topton-series PCs, and PVE has been completely stable. I'm running the standard Linux kernel on the latest PVE release.

Here's what I did:
  • Edit /etc/default/grub and modify the value of GRUB_CMDLINE_LINUX_DEFAULT to be "intel_pstate=disable quiet"
  • Save the changes and run the update-grub command
  • Reboot
Perform these steps on each hypervisor host.
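
For anyone repeating this on several hosts, the GRUB edit could be scripted roughly as follows. This is a sketch under assumptions: add_kernel_param is a hypothetical helper, it expects a GRUB_CMDLINE_LINUX_DEFAULT="..." line to exist already, and the parameter must not contain "/" or "&" because it is pasted into a sed expression.

```shell
#!/bin/bash
# Sketch: add a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT if it is missing.
# add_kernel_param is a hypothetical helper; $1 is the grub defaults file.
add_kernel_param() {
    local file="$1" param="$2"
    # Do nothing if the parameter is already present on that line.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi
    # Keep a backup, then insert the parameter right after the opening quote.
    cp "$file" "$file.bak"
    sed "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"/&${param} /" "$file.bak" > "$file"
}
```

On the host this would be run as add_kernel_param /etc/default/grub intel_pstate=disable, followed by update-grub and a reboot. Hosts that boot through systemd-boot rather than GRUB take their kernel command line from a different file, so this sketch would not apply there.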

I hope this helps others as well.

Cheers!
 
Are you on kernel 5.15? Can you post your entire GRUB command line? Do you have
- intel_iommu=on?
- intel_idle.max_cstate=1?
- processor.max_cstate=1?

Thank you.
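
A quick way to answer this kind of question on a running host is to check the active kernel command line for each option. The sketch below does that; has_kernel_param and report_params are hypothetical helper names, and the option list is simply the set of parameters mentioned in this thread.

```shell
#!/bin/bash
# Sketch: report which of the kernel options discussed here are on a command line.
# has_kernel_param and report_params are hypothetical helpers.
has_kernel_param() {
    local cmdline="$1" param="$2"
    # Pad with spaces so only whole, space-separated options match.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

report_params() {
    local cmdline="$1" p
    for p in intel_iommu=on intel_idle.max_cstate=1 processor.max_cstate=1 intel_pstate=disable; do
        if has_kernel_param "$cmdline" "$p"; then
            echo "$p: set"
        else
            echo "$p: not set"
        fi
    done
}
```

On the host it would be called as report_params "$(cat /proc/cmdline)".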
 
What are your thermals? Does your CPU frequency still scale up and down? Doesn't this completely disable CPU power management?

What is the output of these two commands:
Code:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
cat /sys/devices/system/cpu/cpuidle/current_driver
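
To see whether the cores still scale and how warm the box runs, the per-core frequency and thermal zone readings in sysfs can be printed together. This is a sketch only: show_freq_and_temp is a hypothetical name, its optional argument (an alternative sysfs root) exists purely for testing, and which thermal zones a given box exposes will vary.

```shell
#!/bin/bash
# Sketch: print per-core frequency and thermal zone temperatures from sysfs.
# show_freq_and_temp is a hypothetical helper; $1 optionally overrides /sys.
show_freq_and_temp() {
    local root="${1:-/sys}" f khz milli
    # scaling_cur_freq is reported in kHz.
    for f in "$root"/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue
        khz=$(cat "$f")
        echo "$(basename "$(dirname "$(dirname "$f")")"): $((khz / 1000)) MHz"
    done
    # Thermal zone temperatures are reported in millidegrees Celsius.
    for f in "$root"/class/thermal/thermal_zone*/temp; do
        [ -r "$f" ] || continue
        milli=$(cat "$f")
        echo "$(basename "$(dirname "$f")"): $((milli / 1000)) C"
    done
}
```

Calling show_freq_and_temp once while idle and once under load should show whether the frequencies still move with intel_pstate disabled.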
 
Yes, I'm running everything stock apart from the added kernel options I posted. Here's a snippet from my /boot/grub/grub.cfg file, taken from the latest kernel entry:

Code:
menuentry 'Proxmox VE GNU/Linux' --class proxmox --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-52269be04257979a' {
    load_video
    insmod gzio
    if [ x$grub_platform = xxen ]; then insmod xzio; insmod lzopio; fi
    insmod part_gpt
    insmod zfs
    search --no-floppy --fs-uuid --set=root 52269be04257979a
    echo    'Loading Linux 5.15.85-1-pve ...'
    linux    /ROOT/pve-1@/boot/vmlinuz-5.15.85-1-pve root=ZFS=rpool/ROOT/pve-1 ro  root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_pstate=disable quiet
    echo    'Loading initial ramdisk ...'
    initrd    /ROOT/pve-1@/boot/initrd.img-5.15.85-1-pve

Here's the current version information:

Code:
root@pve1:~# pveversion
pve-manager/7.3-6/723bb6ec (running kernel: 5.15.85-1-pve)
 
