Is there a way to instruct proxmox to automatically restart a VM after a crash?

yonoss · Oct 3, 2023

Hi,
Is there a way to instruct proxmox to automatically restart a VM after a crash? Something similar to "Automatically start VM on boot", but to trigger the VM restart in case of a VM failure/crash.

By default, if a VM crashes, Proxmox does not restart it, so I have to log into the console and start it manually, which is not acceptable for a production environment.

Thanks!

bbgeek17 · Oct 3, 2023

It'd be helpful to define what exactly "VM crash" means. In general, there is no mechanism in PVE that restarts a VM because the VM's OS failed. The HA mechanism may notice that the "kvm" process died (e.g. killed by the OOM killer) and restart the VM. However, that's probably not what you are looking to protect against.

To sum up - PVE is not the right tool to monitor VM OS health. You need to implement things like: watchdogs, app monitors, API monitors, health checks etc. Which ones to implement and how depends on the application you are trying to protect.

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

yonoss · Oct 3, 2023

I'm looking for a VM restart mechanism that will work for any type of VM failure. Most of the VM crashes are indeed caused by OOM errors.

bbgeek17 · Oct 3, 2023

yonoss said:
I'm looking for a VM restart mechanism that will work for any type of VM failure

There is no off-the-shelf single mechanism to achieve it. You will need to create a custom script that monitors the health of your VM/application.

yonoss said:
Most of the VM crashes are indeed caused by OOM errors

That is an infrastructure problem that should never happen in production. It is best solved by having sufficient RAM in your hypervisor to cover the VMs' needs. Critical production environments should never be overprovisioned.

Good luck


LnxBil · Oct 3, 2023

yonoss said:
Most of the VM crashes are indeed caused by OOM errors.

AFAIK, those are handled by defining the VM as an HA VM, which should restart the VM in case of an external failure.

bbgeek17 said:
there is no off-the-shelf single mechanism to achieve it. You will need to create a custom script that monitors health of your VM/application.

For some guest OSes there is a mechanism called a watchdog timer, which does what you want, yet in the end it depends on the type of error in the VM. As @bbgeek17 already mentioned, set up proper service health monitoring and act on the results accordingly.
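As a rough sketch of the watchdog route: Proxmox VE can attach an emulated watchdog device to a guest through the VM's `watchdog` option (`qm set <vmid> --watchdog ...`). The helper below only builds and validates the option string; `watchdog_opt` is a hypothetical name, VMID 100 in the usage comment is a placeholder, and the guest still needs a watchdog daemon of its own feeding the device, which is not shown here.

```bash
#!/bin/bash
# Hypothetical helper: prints a value for the VM "watchdog" option.
# Assumes the emulated i6300esb model; the action defaults to "reset".
watchdog_opt() {
    local action="${1:-reset}"
    case "$action" in
        reset|shutdown|poweroff|pause|debug|none)
            echo "model=i6300esb,action=${action}"
            ;;
        *)
            echo "unsupported watchdog action: ${action}" >&2
            return 1
            ;;
    esac
}

# Usage on the PVE host (placeholder VMID):
#   qm set 100 --watchdog "$(watchdog_opt reset)"
```

If the guest stops feeding the device (for example because it froze), the configured action is applied to the VM.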

wishy · Mar 28, 2024

I've had a problem today where the power isn't running smoothly due to high winds and, despite a UPS, guests have randomly terminated, one at a time, during various power issues. The timestamps fit with the UPS kicking in. Oddly, this wasn't a problem under ESXi (even if I am glad to see the back of it).

Anyway, I wrote this script and minimally tested it. I've set it to run as a cron job every 10 minutes. Each run restarts one guest that is in a stopped state.

Bash:

#!/bin/bash

# Count the guests that qm reports as stopped
stopped_count=$(/usr/sbin/qm list | grep -c stopped)

# Check if the count is greater than 0
if [ "$stopped_count" -gt 0 ]; then
    # At least one guest has stopped
    #echo "Oh Dear.."
    # Take the VMIDs of the stopped guests, shuffle them, and pick the first one
    first_failed_guest=$(/usr/sbin/qm list | grep stopped | awk '{print $1}' | sort -R | head -n 1)
    # Send the qm list by email, indicating which guest will restart
    /usr/sbin/qm list | mail -s "Guest Stopped, restarting $first_failed_guest" your@email.here
    # Actually start the selected guest
    /usr/sbin/qm start "$first_failed_guest"
fi
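One caveat with `grep stopped` is that it also matches a guest whose name happens to contain the word "stopped". A minimal sketch of a stricter filter, assuming `qm list` prints VMID, NAME and STATUS as its first three columns; `stopped_vmids` is a hypothetical helper name:

```bash
#!/bin/bash
# Hypothetical helper: reads `qm list` output on stdin and prints the VMID of
# every guest whose STATUS column (assumed to be column 3) is exactly "stopped".
stopped_vmids() {
    awk '$3 == "stopped" { print $1 }'
}

# Usage on the PVE host, starting every stopped guest instead of one per run:
#   /usr/sbin/qm list | stopped_vmids | while read -r id; do
#       /usr/sbin/qm start "$id"
#   done
```

Matching on the status column also skips the header line, since its third field is the literal word STATUS.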

zombie-man · Dec 1, 2024

wishy, just out of curiosity: why is the built-in HA feature not enough?

wishy · Dec 1, 2024

zombie-man said:
wishy, just out of curiosity: why is the built-in HA feature not enough?

I guess it would be. It's a home server and I don't particularly want to pay for power for 3 nodes, so I just make the 1 node reasonably redundant, have backups, and keep spare hardware in case the main node goes pop.

LnxBil · Dec 3, 2024

wishy said:
I guess it would be. It's a home server and I don't particularly want to pay for power for 3 nodes, so I just make the 1 node reasonably redundant, have backups, and keep spare hardware in case the main node goes pop.

HA stands for a lot of things, yet a simple "keep the VM online" is also doable on a single node with the help of HA. If the VM gets stopped from inside the guest (e.g. a poweroff), it will be started again automatically. Besides that, everything else that has been written in this thread is still true. If the VM crashes (e.g. a kernel panic), it'll be restarted from the inside, by the guest itself. If you use a watchdog and the VM freezes, it'll be restarted. Anything else that is not covered here has to be monitored from the outside.
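For reference, a guest is registered as an HA resource with `ha-manager add`. A minimal sketch, assuming VMID 100 and a restart limit of 2 (both placeholder values); the `register_ha` wrapper and its `DRY_RUN` switch are hypothetical conveniences so the command can be inspected before it is run:

```bash
#!/bin/bash
# Hypothetical wrapper around `ha-manager add`.
# With DRY_RUN=1 it only prints the command instead of executing it.
register_ha() {
    local vmid="$1"
    local max_restart="${2:-1}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ha-manager add vm:${vmid} --state started --max_restart ${max_restart}"
    else
        ha-manager add "vm:${vmid}" --state started --max_restart "${max_restart}"
    fi
}

# Usage on the PVE host (placeholder values):
#   DRY_RUN=1 register_ha 100 2    # show what would be run
#   register_ha 100 2              # actually register the guest
```

Afterwards, `ha-manager status` should list the resource together with its requested state.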

waynej · Feb 14, 2025

Sorry to revive an old thread. Just posting this here in case it's useful to someone else. I wrote a similar script.

The script only acts if host uptime is more than 10 minutes, so it doesn't try starting VMs when the host has just restarted. It looks for VMs with "Start at boot" enabled, checks their status, and attempts to start any that are stopped.

Bash:

#!/bin/bash

# Only act once the host has been up for more than 10 minutes
if [ "$(cut -d '.' -f1 /proc/uptime)" -gt 600 ]
then
  # jq -r prints raw strings, so the node names come out without quotes
  for node in $(pvesh get /nodes --output-format json | jq -r ".[].node")
  do
    for vmid in $(pvesh get "/nodes/$node/qemu" --output-format json | jq ".[].vmid")
    do
      # Only consider VMs with "Start at boot" enabled
      if [ "$(pvesh get "/nodes/$node/qemu/$vmid/config" --output-format json | jq ".onboot")" == "1" ]
      then
        if [ "$(/usr/sbin/qm status "$vmid")" == "status: stopped" ]; then
          /usr/sbin/qm start "$vmid"
        fi
      fi
    done
  done
fi

It works on a single node. I don't think it would work with more than one node, since I believe "qm" only works for VMs on the node where it is run.

I believe LRM can do the same job on a single node, but I wasn't keen to test it on a production system. Has anyone used LRM on a single node to keep VMs running?

elmarconi · Feb 14, 2025

Single node: https://forum.proxmox.com/threads/i6300esb-watchdog-in-windows-help-needed.37990/

ivenae · Feb 14, 2025

Aaand another one:

Code:

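#!/bin/bash
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

HOSTS="10.173.245.51"
COUNT=1

# Returns 0 (true) only if none of the given hosts answers a ping
pingtest(){
  for myHost in "$@"
  do
    ping -c "$COUNT" "$myHost" && return 1
  done
  return 0
}

# $HOSTS is left unquoted on purpose so a space-separated list is split
if pingtest $HOSTS
then
  # Note: pkill kvm kills every running VM on the host, not only VM 100
  pkill kvm
  qm start 100
  echo "$(date) RESTARTED" >> /root/restart.log
fi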

The machine sometimes hung between 1:00 and 2:00 o'clock and I could not find the cause.
During the analysis I used this script to prevent further "damage" (a non-working customer environment).
After implementing the script, the error did not occur anymore.

Six months later I removed the script and the problem reoccurred, so I put it back.
I still do not know what is happening here.

The script runs via cron:
*/5 * * * * /root/pingtest.sh > /dev/null
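With `COUNT=1`, a single lost ping is enough to trigger the restart. A sketch of a more conservative check that requires several failed pings in a row, and that stops only the affected guest with `qm stop` rather than killing all kvm processes. The function name `host_is_down`, the `PING_CMD` override and the retry count of 3 are assumptions for illustration; the `-W` timeout flag assumes the iputils `ping` shipped with Debian:

```bash
#!/bin/bash
# Returns 0 (true) only if the host failed to answer `tries` pings in a row.
# PING_CMD is a hypothetical override so the check can be tried without a network.
host_is_down() {
    local host="$1"
    local tries="${2:-3}"
    local i=0
    while [ "$i" -lt "$tries" ]; do
        # One ping with a 2 second timeout; any answer means the host is up
        if ${PING_CMD:-ping} -c 1 -W 2 "$host" >/dev/null 2>&1; then
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}

# Usage on the PVE host (IP and VMID taken from the script above):
#   if host_is_down 10.173.245.51 3; then
#       qm stop 100 && qm start 100
#       echo "$(date) RESTARTED" >> /root/restart.log
#   fi
```

This only changes how the failure is detected and which guest is touched; it does not explain the hang itself.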

ness1602 · Feb 14, 2025

Why didn't you just restart it every night at 1?
