Is there a way to instruct proxmox to automatically restart a VM after a crash?

yonoss

Member
Jul 5, 2023
Hi,
Is there a way to instruct Proxmox to automatically restart a VM after a crash? Something similar to "Automatically start VM on boot", but triggering a restart when the VM fails or crashes.

By default, if a VM crashes, Proxmox does not restart it, and I have to log into the console and start it manually, which is not OK for a production environment.

Thanks!
 
It'd be helpful to define what exactly "VM crash" means. In general, there is no mechanism in PVE that would restart a VM because the VM's OS failed. It may be possible for the HA mechanism to notice that the "kvm" process died (e.g. was killed by the OOM killer) and restart the VM. However, that's probably not what you are looking to protect against.

To sum up: PVE is not the right tool to monitor VM OS health. You need to implement things like watchdogs, app monitors, API monitors, health checks, etc. Which ones to implement, and how, depends on the application you are trying to protect.
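As a rough illustration of an outside health check, here is a minimal sketch that looks at the guest's state and pings the QEMU guest agent. It assumes the guest agent is installed and enabled in the VM; the function names, the `QM` override variable, and VMID 100 are hypothetical, and a real check would more likely probe the application itself.

```shell
#!/bin/bash
# Sketch only: start a stopped guest, reset one whose guest agent is silent.
# QM can be overridden so the logic can be tried without a Proxmox host.
QM="${QM:-/usr/sbin/qm}"

# Prints the state word from `qm status <vmid>` (e.g. "running", "stopped").
vm_state() {
    "$QM" status "$1" | awk '{print $2}'
}

# Checks one guest and prints the action taken.
check_guest() {
    local vmid="$1" state
    state=$(vm_state "$vmid")
    if [ "$state" = "stopped" ]; then
        # The guest is not running at all: start it again.
        "$QM" start "$vmid" >/dev/null && echo "started $vmid"
    elif ! "$QM" guest cmd "$vmid" ping >/dev/null 2>&1; then
        # The guest runs but the agent does not answer: hard reset.
        "$QM" reset "$vmid" >/dev/null && echo "reset $vmid"
    else
        echo "ok $vmid"
    fi
}

# Usage (hypothetical VMID): check_guest 100
```

A hard reset is a blunt instrument, so a real deployment would probably want several failed pings in a row before acting.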


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I'm looking for a VM restart mechanism that will work for any type of VM failure. Most of the VM crashes are indeed caused by OOM errors.
 
I'm looking for a VM restart mechanism that will work for any type of VM failure
There is no single off-the-shelf mechanism to achieve it. You will need to create a custom script that monitors the health of your VM/application.
Most of the VM crashes are indeed caused by OOM errors
That is an infrastructure problem that should never happen in production. It is best solved by having sufficient RAM in your hypervisor to cover the VMs' needs. A critical production environment should never be overprovisioned.
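To get a rough idea of whether a host is overprovisioned on memory, something like the following sketch could be used. It sums the MEM(MB) column of `qm list` for running guests and compares it with `MemTotal` from `/proc/meminfo`. The function names are hypothetical, and the figure ignores containers, ballooning, and the host's own memory use, so treat it as an estimate only.

```shell
#!/bin/bash
# Sketch only: compare memory assigned to running VMs with physical host RAM.

# Reads `qm list` output on stdin and sums column 4 (MEM in MB)
# for guests whose status column is "running". The header line is skipped.
assigned_mb() {
    awk 'NR > 1 && $3 == "running" { sum += $4 } END { print sum + 0 }'
}

# Prints host RAM in MB from a meminfo-style file (default: /proc/meminfo).
host_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Usage on a PVE host:
#   a=$(/usr/sbin/qm list | assigned_mb); h=$(host_mb)
#   [ "$a" -gt "$h" ] && echo "overcommitted: ${a} MB assigned, ${h} MB physical"
```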

Good luck


 
Most of the VM crashes are indeed caused by OOM errors.
AFAIK, those are handled by defining the VM as an HA VM, which should restart the VM in case of an external failure.
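For reference, registering a VM as an HA resource from the shell might look like the sketch below. The wrapper function, the `HA_MANAGER` override variable, VMID 100, and the restart limit of 3 are illustrative assumptions, not recommendations.

```shell
#!/bin/bash
# Sketch only: register a VM as an HA resource so the HA stack keeps it started.
# HA_MANAGER can be overridden (e.g. with "echo") for a dry run.
HA_MANAGER="${HA_MANAGER:-ha-manager}"

keep_vm_started() {
    local vmid="$1"
    # --state started asks HA to keep the guest running;
    # --max_restart bounds restart attempts on the same node.
    "$HA_MANAGER" add "vm:${vmid}" --state started --max_restart 3
}

# Usage (hypothetical VMID): keep_vm_started 100
# Afterwards, `ha-manager status` should list the resource.
```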

there is no off-the-shelf single mechanism to achieve it. You will need to create a custom script that monitors health of your VM/application.
For some guest OSes, there is a mechanism called a watchdog timer, which does what you want; yet in the end, it depends on the type of error in the VM. As @bbgeek17 already mentioned, set up proper service health monitoring and act on it accordingly.
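A minimal sketch of attaching a virtual watchdog device to a VM follows. The wrapper function, the `QM` override variable, and VMID 100 are hypothetical. It also assumes that a watchdog daemon inside the guest feeds the device, and that the VM is fully stopped and started afterwards so the new hardware shows up.

```shell
#!/bin/bash
# Sketch only: add an emulated i6300esb watchdog to a VM.
# QM can be overridden (e.g. with "echo") for a dry run.
QM="${QM:-qm}"

enable_watchdog() {
    local vmid="$1" action="${2:-reset}"
    # action is what happens when the guest stops feeding the watchdog,
    # e.g. reset, shutdown, poweroff, pause.
    "$QM" set "$vmid" --watchdog "model=i6300esb,action=${action}"
}

# Usage (hypothetical VMID): enable_watchdog 100 reset
```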
 
I've had a problem today where the power isn't running smoothly due to high winds, and despite a UPS, guests have randomly terminated, one at a time during various power issues. The timestamps fit with the UPS kicking in. Oddly, wasn't a problem under ESXi (even if I am glad to see the back of it)

Anyway, wrote this script and minimally tested it. I've set it to run as a cron job every 10 minutes. The script restarts 1 server in a stopped state
Bash:
#!/bin/bash
# Restart one stopped guest per run; intended to be run from cron.

# List the VMIDs whose status column (3rd field of `qm list`) is "stopped",
# shuffle them, and pick the first one. Matching on the status column avoids
# false hits on guests that merely have "stopped" in their name.
first_failed_guest=$(/usr/sbin/qm list | awk '$3 == "stopped" {print $1}' | sort -R | head -n 1)

# Act only if at least one stopped guest was found
if [ -n "$first_failed_guest" ]; then
    # Send the qm list by email, indicating which guest will restart
    /usr/sbin/qm list | mail -s "Guest Stopped, restarting $first_failed_guest" your@email.here
    # Actually start the selected guest
    /usr/sbin/qm start "$first_failed_guest"
fi
 
wishy, just out of curiosity: why is the built-in HA feature not enough?
I guess they would be. It's a home server, I don't particularly want to pay for power for 3 nodes, so I just make the 1 node reasonably redundant, have backups, and spare hardware if the main node goes pop
 
I guess they would be. It's a home server, I don't particularly want to pay for power for 3 nodes, so I just make the 1 node reasonably redundant, have backups, and spare hardware if the main node goes pop
HA stands for a lot of things, yet a simple "keep-the-VM-online" is also doable on a single node with the help of HA. If the VM gets stopped from inside the VM (e.g. a poweroff), it will get started automatically. Besides that, everything else that has been written in this thread is still true. If the VM crashes (e.g. a kernel panic), it'll be restarted from the inside. If you use a watchdog and the VM freezes, it'll be restarted. Anything else that is not covered here has to be monitored from the outside.
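On the kernel panic case: a Linux guest only reboots itself after a panic if the `kernel.panic` sysctl is set to a non-zero number of seconds. Assuming a Linux guest, a sketch of generating that setting could look like this. The function name and the file name `90-panic-reboot.conf` are hypothetical.

```shell
#!/bin/bash
# Sketch only: print a sysctl.d fragment that makes a Linux guest reboot
# N seconds after a kernel panic (default 10).
panic_reboot_conf() {
    local seconds="${1:-10}"
    printf 'kernel.panic = %s\n' "$seconds"
}

# Usage inside the guest, as root:
#   panic_reboot_conf 10 > /etc/sysctl.d/90-panic-reboot.conf && sysctl --system
```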
 
