Is there a way to instruct proxmox to automatically restart a VM after a crash?

yonoss

Member
Jul 5, 2023
Hi,
Is there a way to instruct proxmox to automatically restart a VM after a crash? Something similar to "Automatically start VM on boot", but to trigger the VM restart in case of a VM failure/crash.

By default, if a VM crashes, Proxmox does not restart it, and I have to log into the console and start it manually, which is not OK for a production environment.

Thanks!
 
It would help to define what exactly "VM crash" means. In general, there is no mechanism in PVE that restarts a VM because the VM's OS failed. The HA mechanism may notice that the "kvm" process has died (e.g. killed by the OOM killer) and restart the VM. However, that's probably not what you are looking to protect against.

To sum up - PVE is not the right tool to monitor VM OS health. You need to implement things like: watchdogs, app monitors, API monitors, health checks etc. Which ones to implement and how depends on the application you are trying to protect.
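As one hedged illustration of an external health check, here is a minimal sketch that pings the QEMU guest agent and resets the VM when it does not answer. It assumes the guest agent is installed in the guest and enabled in the VM options; the function name `check_and_reset` is made up for this example.

```shell
#!/bin/sh
# Sketch only: reset a running VM whose guest agent stops answering.
# Assumption: qemu-guest-agent runs inside the guest and is enabled for the VM.

# Returns 0 if nothing had to be done, 1 if a reset was issued.
check_and_reset() {
    vmid="$1"
    # Only act on VMs that PVE reports as running; a stopped VM is a different case.
    if [ "$(qm status "$vmid")" != "status: running" ]; then
        return 0
    fi
    # "qm guest cmd <vmid> ping" exits non-zero when the agent does not respond.
    if qm guest cmd "$vmid" ping >/dev/null 2>&1; then
        return 0
    fi
    echo "$(date) VM $vmid unresponsive, resetting"
    qm reset "$vmid"
    return 1
}
```

A single missed ping can simply mean the guest is still booting, so a real check would probably require several consecutive failures before resetting.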


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I'm looking for a VM restart mechanism that will work for any type of VM failures. Most of the VM crashes are indeed caused by OOM errors.
 
I'm looking for a VM restart mechanism that will work for any type of VM failures
there is no off-the-shelf single mechanism to achieve it. You will need to create a custom script that monitors health of your VM/application.
Most of the VM crashes are indeed caused by OOM errors
That is an infrastructure problem that should never happen in production. It is best solved by having enough RAM in your hypervisor to cover the VMs' needs. A critical production environment should never be overprovisioned.

Good luck


 
Most of the VM crashes are indeed caused by OOM errors.
AFAIK, those are fixed by defining the VM as an HA VM, which should restart the VM in case of an external failure.

there is no off-the-shelf single mechanism to achieve it. You will need to create a custom script that monitors health of your VM/application.
For some guest OSes, there is a mechanism called a watchdog timer, which does what you want, yet in the end it depends on the type of error in the VM. As @bbgeek17 already mentioned, set up proper service health monitoring and act on it accordingly.
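For reference, a minimal sketch of attaching an emulated watchdog device to a VM from the host. The wrapper name `enable_watchdog` is hypothetical; `i6300esb` with `action=reset` is one of the combinations `qm set --watchdog` accepts.

```shell
#!/bin/sh
# Sketch only: add an emulated watchdog device to a VM.
# action=reset makes the hypervisor reset the guest when the timer is not fed.
enable_watchdog() {
    vmid="$1"
    qm set "$vmid" --watchdog model=i6300esb,action=reset
}
```

The device alone does nothing: the guest needs a daemon that keeps feeding `/dev/watchdog`, and the VM typically has to be fully stopped and started once before the new device shows up.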
 
I've had a problem today where the power isn't running smoothly due to high winds, and despite a UPS, guests have randomly terminated, one at a time during various power issues. The timestamps fit with the UPS kicking in. Oddly, wasn't a problem under ESXi (even if I am glad to see the back of it)

Anyway, I wrote this script and minimally tested it. I've set it to run as a cron job every 10 minutes. Each run restarts one guest that is in a stopped state.
Bash:
#!/bin/bash

# Count the lines of qm list that mention "stopped"
stopped_count=$(/usr/sbin/qm list | grep -c stopped)

# Check if the line count is greater than 0
if [ "$stopped_count" -gt 0 ]; then
    # Servers have stopped
    #echo "Oh Dear.."
    # Search qm list for stopped VMs, print the VMID column, sort randomly, and pick the first entry
    first_failed_guest=$(/usr/sbin/qm list | grep stopped | awk '{print $1}' | sort -R | head -1)
    # Send the qm list by email, indicating which guest will restart
    /usr/sbin/qm list | mail -s "Guest Stopped, restarting $first_failed_guest" your@email.here
    # Actually start the selected guest
    /usr/sbin/qm start "$first_failed_guest"
fi
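A possible variation, offered as a sketch rather than a drop-in replacement: match the STATUS column exactly (so a guest whose name happens to contain "stopped" is not picked up) and start every stopped guest in one run. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch only: list VMIDs whose STATUS column is exactly "stopped".
# Reads "qm list" output on stdin so it can be tried without a PVE host.
stopped_vmids() {
    # Skip the header row; column 1 is VMID, column 3 is STATUS.
    awk 'NR > 1 && $3 == "stopped" { print $1 }'
}

# Start every stopped VM found by stopped_vmids.
start_all_stopped() {
    qm list | stopped_vmids | while read -r vmid; do
        # </dev/null keeps the command from consuming the loop's input
        qm start "$vmid" </dev/null
    done
}
```

Note that this also starts guests that were shut down on purpose, so it only fits hosts where every VM is meant to be running.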
 
wishy, just out of curiosity: why is the built-in HA feature not enough?
I guess they would be. It's a home server, I don't particularly want to pay for power for 3 nodes, so I just make the 1 node reasonably redundant, have backups, and spare hardware if the main node goes pop
 
I guess they would be. It's a home server, I don't particularly want to pay for power for 3 nodes, so I just make the 1 node reasonably redundant, have backups, and spare hardware if the main node goes pop
HA stands for a lot of stuff, yet a simple "keep-the-VM-online" is also doable on a single node with the help of HA. If the VM gets stopped from inside the VM (e.g. a poweroff), it will get started again automatically. Besides that, everything else that has been written in this thread is still true. If the VM crashes (e.g. kernel panic), it'll be restarted from the inside. If you use a watchdog and the VM freezes, it'll be restarted. Anything else that is not covered by this has to be monitored from the outside.
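A minimal sketch of putting a VM under HA management from the shell. The wrapper name `keep_vm_started` and the `HA_MANAGER` override variable are assumptions made for this example; `ha-manager add` with `--state` and `--max_restart` is the actual CLI.

```shell
#!/bin/sh
# Sketch only: register a VM as an HA resource that should stay started.
# max_restart limits how often a failed start is retried on the same node.
# HA_MANAGER is a hypothetical override, handy for dry runs (e.g. HA_MANAGER=echo).
keep_vm_started() {
    vmid="$1"
    "${HA_MANAGER:-ha-manager}" add "vm:$vmid" --state started --max_restart 3
}
```

Calling `keep_vm_started 100` would register VM 100; `ha-manager status` should then list it as a managed resource.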
 
Sorry to revive an old thread. Just posting this here in case it's useful to someone else. I wrote a similar script.

The script only runs if host uptime is more than 10 minutes, so it doesn't try starting VMs when the host has just restarted. It looks for VMs with "Start at boot" enabled, checks their status, and attempts to start any VM that is shut down.

Bash:
#!/bin/bash

if [ $(cut -d '.' -f1 /proc/uptime) -gt 600 ]
then
  for node in `pvesh get /nodes --output-format json | jq ".[].node" | cut -f 2 -d \"`
  do
    for vmid in `pvesh get /nodes/$node/qemu --output-format json | jq ".[].vmid"`
    do
      if [ "$(pvesh get /nodes/$node/qemu/$vmid/config --output-format json | jq ".onboot")" == "1" ]
      then
        if [ "$(/usr/sbin/qm status $vmid)" == "status: stopped" ]; then
          /usr/sbin/qm start $vmid
        fi
      fi
    done
  done
fi

It works on a single node. I don't think it would work if there was more than one node; I believe "qm" only works for VMs on the node it's run on.
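For a cluster, one untested idea is to skip `qm` and go through the API for both listing and starting, since `pvesh` addresses guests by node. The sketch below assumes `jq` is installed and that `pvesh` can reach the other nodes; the function names are made up, and unlike the script above it does not check the "Start at boot" flag.

```shell
#!/bin/sh
# Sketch only: print "node vmid" for every stopped, non-template QEMU guest.
# Reads the JSON of "pvesh get /cluster/resources --type vm" on stdin.
stopped_guests() {
    jq -r '.[]
        | select(.type == "qemu" and .status == "stopped" and (.template // 0) != 1)
        | "\(.node) \(.vmid)"'
}

# Start each stopped guest through the API path of the node that owns it.
start_stopped_cluster_wide() {
    pvesh get /cluster/resources --type vm --output-format json | stopped_guests |
    while read -r node vmid; do
        # </dev/null keeps the command from consuming the loop's input
        pvesh create "/nodes/$node/qemu/$vmid/status/start" </dev/null
    done
}
```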

I believe LRM can do the same job on a single node, but wasn't keen to test on production system. Anyone used LRM on a single node to keep VMs running?
 
Aaand another one:
Code:
#!/bin/bash
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/sbin:/usr/local/bin/"

HOSTS="10.173.245.51"
COUNT=1

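# Returns 1 as soon as any host answers, 0 if none of them answered
pingtest(){
  for myHost in "$@"
  do
    ping -c "$COUNT" "$myHost" && return 1
  done
  return 0
}

# The branch runs only when no host answered the ping
if pingtest $HOSTS
then
  # Note: this kills every kvm process on the host, not only VM 100
  pkill kvm
  qm start 100
  echo "$(date) RESTARTED" >> /root/restart.log
fi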

The machine sometimes hung between 1:00 and 2:00 o'clock and I could not find the cause.
During the analysis I used this script to prevent further "damage" (a non-working customer environment).
After implementing the script, the error did not occur anymore.

Six months later I removed the script and the problem recurred, so I put it back.
I still do not know what is happening here.

The script runs via cron:
*/5 * * * * /root/pingtest.sh > /dev/null
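A narrower variant, as a sketch only: stop and start just the affected VMID with `qm stop` instead of `pkill kvm`, so other guests on the host are left alone. The function names and the two-second timeout are assumptions for this example.

```shell
#!/bin/sh
# Sketch only: restart one specific VM when none of its addresses answer ping.

# Returns 0 when no host answered, 1 as soon as one host replies.
all_hosts_down() {
    for host in "$@"; do
        # -c 1: one probe; -W 2: wait at most 2 seconds for a reply (Linux iputils ping)
        if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
            return 1
        fi
    done
    return 0
}

# Usage: restart_if_down <vmid> <host> [<host> ...]
restart_if_down() {
    vmid="$1"; shift
    if all_hosts_down "$@"; then
        qm stop "$vmid"
        qm start "$vmid"
        echo "$(date) RESTARTED $vmid"
    fi
}
```

For the setup above this would be called as `restart_if_down 100 10.173.245.51`, with the output appended to a log file from cron.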
 