[TUTORIAL] Hardware watchdog at a per-VM level

chrispage1 · Jan 31, 2022

One frustration I found while testing Proxmox is that, unlike my previous Xen environment, it does not detect when a VM has panicked, crashed or frozen, so it won't reboot the VM. That can mean hours of downtime until the issue is noticed and resolved.

After a bit of digging on various sites and pulling together a few sources, I wrote my own guide for this and thought it'd be helpful to share with the Proxmox community. It is possible to enable a watchdog service on your VMs that integrates with Proxmox, effectively mimicking a physical hardware watchdog that would reset bare metal in the event of a panic.

Of course, some care should be taken with this, as a misconfiguration could put your VM into a cycle of resets (but if you follow the steps carefully you should be fine). I've had this configuration running on 12 of my VMs for a couple of months now without issue.

The steps below use apt on Ubuntu 20.04, but I'm sure other OSes will have a similar flow.

1. On the Proxmox node, open your VM config file with nano /etc/pve/qemu-server/[server_id].conf and add our virtual watchdog device. We'll be using the i6300esb watchdog because, although old, it is supported by KVM and provides the functionality we need. To add the watchdog device, append the below to the config file and save:

Code:

watchdog: model=i6300esb,action=reset
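
As an alternative to editing the file by hand, the same setting can probably be applied through the qm CLI on the PVE node. The sketch below only prints the command (a dry run); it assumes that qm set accepts a --watchdog option with the same value as the config line, and both the helper name watchdog_cmd and the VMID 101 are placeholders, not part of the original guide.

```shell
#!/bin/sh
# Sketch: build the qm command that adds the virtual watchdog to a VM.
# Assumptions: run on the PVE node as root; "qm set <vmid> --watchdog ..."
# mirrors the "watchdog:" config line; VMID 101 is a placeholder.

# Print the command for a given VMID (hypothetical helper, dry run only).
watchdog_cmd() {
    printf 'qm set %s --watchdog model=i6300esb,action=reset\n' "$1"
}

# Show the command; to apply it, run the printed line on the node.
watchdog_cmd 101
```

Going through qm may also be safer on VMs that have snapshots, since their config files contain extra snapshot sections after the main one and a line appended at the very end could land in the wrong section.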

**Anything below this line should be performed on the VM, NOT the PVE node**

2. Install watchdog on the VM with apt install watchdog

3. Configure the watchdog service by appending the below options to /etc/watchdog.conf. This tells the watchdog service which device it should send its heartbeat to.

Code:

watchdog-device = /dev/watchdog
log-dir = /var/log/watchdog
realtime = yes
priority = 1

4. By default, the i6300esb kernel module is blacklisted (at least on Ubuntu), so it is not loaded automatically. To work around this, edit the newly created /etc/default/watchdog file and set watchdog_module to i6300esb.
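
A minimal sketch of that edit as a script, in case you want to repeat it on several VMs. The helper name set_watchdog_module is made up for illustration, and it assumes the file uses a plain watchdog_module="..." assignment, as the Ubuntu package does as far as I know.

```shell
#!/bin/sh
# Set watchdog_module in a defaults file (hypothetical helper).
# $1 = path to the file, $2 = module name.
set_watchdog_module() {
    file=$1
    module=$2
    if grep -q '^watchdog_module=' "$file"; then
        # Replace the existing assignment; write via a temp file so the
        # sketch does not depend on a particular "sed -i" flavour.
        sed "s/^watchdog_module=.*/watchdog_module=\"$module\"/" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        # No assignment yet: append one.
        printf 'watchdog_module="%s"\n' "$module" >> "$file"
    fi
}

# Usage on the VM (as root):
#   set_watchdog_module /etc/default/watchdog i6300esb
```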

5. Enable the watchdog service to start at next boot with systemctl enable watchdog

6. Fully power off the VM (not restart). This is important as it'll allow it to adopt the new hardware configuration.

7. Power the VM back on and check the watchdog module is up and working by running dmesg | grep i6300. You should see something like the below:

Code:

[    7.249538] i6300ESB timer 0000:00:04.0: initialized. heartbeat=30 sec (nowayout=0)
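
If you'd rather script that check than read dmesg by eye, something like the following could work. It is only a sketch: the function names are made up, and it assumes the module shows up in /proc/modules under the name i6300esb and that the device node is /dev/watchdog, as configured above.

```shell
#!/bin/sh
# Return 0 if the named module appears in a /proc/modules-style listing.
# $1 = module name, $2 = listing file (defaults to /proc/modules).
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Report the state of the watchdog pieces on this VM (hypothetical helper).
check_watchdog() {
    if module_loaded i6300esb; then echo "module: loaded"; else echo "module: missing"; fi
    if [ -c /dev/watchdog ]; then echo "device: present"; else echo "device: missing"; fi
}

# Usage on the VM:
#   check_watchdog
```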

Everything is now configured and the only thing left to do is test it. To do so, trigger a kernel panic by running echo c > /proc/sysrq-trigger. After a short while (60 seconds or so) you should see the VM automatically reset, and you're done! I hope you find this useful.

Sidenote: it'd be great if in the Proxmox UI under hardware you could manually add custom lines. It's a shame that once configured you can't see this on the VM hardware page.

cferguson · Apr 5, 2022

Thank you for this. I've been having issues with a pfSense VM crashing or something while I am away working. The wife wants to kill me, lol.

That being said, do you know how I would get this working on pfSense?

I would love to have it restart if it detects an issue.

guruevi · Apr 9, 2022

This doesn't work for other OSes. pfSense and Windows both need drivers and support for a watchdog. There are scripts 'out there' that run a check from the host through various means. Basically you need a monitoring system with an automated remote reboot.

cferguson · Apr 10, 2022

guruevi said:
This doesn't work for other OSes. pfSense and Windows both need drivers and support for a watchdog. There are scripts 'out there' that run a check from the host through various means. Basically you need a monitoring system with an automated remote reboot.

Thanks. I've been working on a bash ping script run as a cron job. Guess it will have to do.

Lefuneste · May 14, 2022

Thank you VERY MUCH. I've just come to the stage where HA migration of VMs is not stable enough within Proxmox to ensure my critical VMs (Docker VM and Home Assistant VM) always stay operational. This is perfect. It is working wonderfully. THIS SHOULD BE NATIVE IN PROXMOX!!!!

HellrazorX · Sep 14, 2022

I did the whole thing in a Manjaro VM. The test worked well, but in real life it did not restart the VM when it (presumably) crashed.

darknezz · Dec 7, 2022

Lefuneste said:
Thank you VERY MUCH. I've just come to the stage where HA migration of VMs is not stable enough within Proxmox to ensure my critical VMs (Docker VM and Home Assistant VM) always stay operational. This is perfect. It is working wonderfully. THIS SHOULD BE NATIVE IN PROXMOX!!!!

How did you add the watchdog to the Home Assistant VM?

timoverbrugghe · Jan 13, 2023

Thanks @chrispage1 for this great tutorial.

Just wanted to add that if you're using the Ubuntu cloud images, these don't include the i6300esb watchdog kernel module (which means that watchdog won't start in your VM).

You can install the standard modules from Ubuntu Server with

Code:

apt-get install linux-image-generic

For those using Ansible, I created a simple playbook to do these tasks: https://github.com/TimoVerbrugghe/h...ible/roles/routervm/tasks/enable-watchdog.yml

mbc · Sep 11, 2023

timoverbrugghe said:
Thanks @chrispage1 for this great tutorial.

Just wanted to add that if you're using the Ubuntu cloud images, these don't include the i6300esb watchdog kernel module (which means that watchdog won't start in your VM).

You can install the standard modules from Ubuntu Server with

Code:

apt-get install linux-image-generic

For those using Ansible, I created a simple playbook to do these tasks: https://github.com/TimoVerbrugghe/h...ible/roles/routervm/tasks/enable-watchdog.yml

Hi,

Isn't there an alternative to installing the full linux-image-generic?

Having to install an extra 1.4 GB just for the i6300esb watchdog seems a little too much.

BTW, the GitHub link is broken. It seems you have reorganized the repository and this link is no longer valid.

guruevi · Sep 11, 2023

Per the KVM page: Intel's WDT driver is obsolete and broken, and shouldn't be used. There are no WDT drivers for Windows that I know of.

The reason the module is no longer included in modern/cloud distros by default is that the hardware it emulates is a 32-bit PCI device. If you don't want a large package (basically replacing the Ubuntu Cloud kernel with the Ubuntu Server/Desktop kernel), you can compile it separately as a module, then copy and insert the module into your kernel. Alternatively, compile a custom kernel with the hardware support in it, then package and distribute it. Whether that is feasible in a production environment is left as an exercise to the reader.

My suggestion would be to use the QEMU Guest Agent:
qm agent <id> ping
If the QEMU guest agent is reachable, the command completes without any output; otherwise you can use the exit code to reset the machine. Run this from a cron job every 60s if you want.
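
A rough sketch of what such a cron script might look like on the PVE node. Only qm agent <id> ping comes from the post above; the use of qm reset, the state directory, the threshold of three consecutive failures and the function names are all assumptions for illustration. The threshold is there so that a VM which is still booting (agent not yet up) is not reset on the first missed ping.

```shell
#!/bin/sh
# Sketch of a host-side check: ping the guest agent and reset the VM after
# repeated failures. Assumptions: runs as root on the PVE node from cron;
# STATE_DIR, MAX_FAILS and the function names are illustrative only.

STATE_DIR=${STATE_DIR:-/var/tmp/agent-check}
MAX_FAILS=${MAX_FAILS:-3}

# Update the failure counter for a VM and print the action to take.
# $1 = VMID, $2 = exit code of "qm agent <id> ping".
next_action() {
    counter="$STATE_DIR/$1.fails"
    mkdir -p "$STATE_DIR"
    if [ "$2" -eq 0 ]; then
        rm -f "$counter"
        echo ok
        return
    fi
    fails=$(( $(cat "$counter" 2>/dev/null || echo 0) + 1 ))
    if [ "$fails" -ge "$MAX_FAILS" ]; then
        rm -f "$counter"
        echo reset
    else
        echo "$fails" > "$counter"
        echo wait
    fi
}

# Ping one VM and reset it if the agent has been silent for too long.
check_vm() {
    rc=0
    qm agent "$1" ping >/dev/null 2>&1 || rc=$?
    if [ "$(next_action "$1" "$rc")" = "reset" ]; then
        qm reset "$1"
    fi
}

# Only act when a VMID is passed, e.g. from cron:
#   * * * * * /root/agent-check.sh 101
if [ -n "${1:-}" ]; then
    check_vm "$1"
fi
```

Note that this also fires for a VM that was shut down on purpose, so you would want to exclude those VMs or check their status first.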

mbc · Sep 11, 2023

guruevi said:
Per the KVM page: Intel's WDT driver is obsolete and broken, and shouldn't be used. There are no WDT drivers for Windows that I know of.

The reason the module is no longer included in modern/cloud distros by default is that the hardware it emulates is a 32-bit PCI device. If you don't want a large package (basically replacing the Ubuntu Cloud kernel with the Ubuntu Server/Desktop kernel), you can compile it separately as a module, then copy and insert the module into your kernel. Alternatively, compile a custom kernel with the hardware support in it, then package and distribute it. Whether that is feasible in a production environment is left as an exercise to the reader.

My suggestion would be to use the QEMU Guest Agent:
qm agent <id> ping
If the QEMU guest agent is reachable, the command completes without any output; otherwise you can use the exit code to reset the machine. Run this from a cron job every 60s if you want.

Thanks for the detailed response @guruevi.

I will definitely take a look at your approach!

LnxBil · Sep 11, 2023

guruevi said:
If the QEMU guest agent is reachable, the command completes without any output; otherwise you can use the exit code to reset the machine. Run this from a cron job every 60s if you want.

Shouldn't proper monitoring, including automated actions, be used? The service itself should be monitored and taken care of if it does not work as it should. I personally like that approach much better.

frijsdijk · Mar 30, 2024

Very nice, this works like a charm!
The only thing is, if it's triggered (using the test in the tutorial), the VM does reset and boots fine, but there is no log anywhere on the PVE host. Or is there? (I'm assuming that it's not possible on the VM itself, or perhaps post-reset?)

frijsdijk · Apr 19, 2024

Just checking: is there no way to check this in logs post-watchdog-reset?

LnxBil · Apr 20, 2024

frijsdijk said:
Just checking: is there no way to check this in logs post-watchdog-reset?

You have to ask in a "guest related" forum. A watchdog is triggered in the OS, and if the OS does not write this somewhere, you're out of luck. Normally, in case of a watchdog event, the OS is completely fucked up, and that normally means that it will not write anything to any log.

For Linux guests, you could achieve this by having kernel logs sent to another machine, e.g. via netconsole. The crash or watchdog event is probably logged there, though I haven't tried it. We've been using watchdog (i6300) based VMs for almost 10 years, yet never had a watchdog event triggered. You may ask how we could know that: we monitor the uptime and investigate if we find something unexpected. You can automate this if you have a central logging host that "knows" when and how a machine is rebooted and can alert if no "shutdown messages" are registered for a specific host.
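
As a very small illustration of the uptime idea, a monitoring job inside the guest could flag a machine that came up recently, and you could then compare that against your planned reboots. This is only a sketch under my own assumptions: the function name and the 300 second window are arbitrary, and it reads the first field of /proc/uptime.

```shell
#!/bin/sh
# Succeed if the uptime is below a threshold.
# $1 = threshold in seconds,
# $2 = /proc/uptime-style file (defaults to /proc/uptime).
recently_rebooted() {
    awk -v limit="$1" '
        BEGIN   { rc = 1 }
        NR == 1 { rc = !($1 + 0 < limit + 0) }
        END     { exit rc }
    ' "${2:-/proc/uptime}"
}

# Usage, e.g. from a monitoring check on the guest:
#   if recently_rebooted 300; then echo "booted within the last 5 minutes"; fi
```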

frijsdijk · Apr 21, 2024

LnxBil said:
You have to ask in a "guest related" forum. A watchdog is triggered in the OS, and if the OS does not write this somewhere, you're out of luck. Normally, in case of a watchdog event, the OS is completely fucked up, and that normally means that it will not write anything to any log.

For Linux guests, you could achieve this by having kernel logs sent to another machine, e.g. via netconsole. The crash or watchdog event is probably logged there, though I haven't tried it. We've been using watchdog (i6300) based VMs for almost 10 years, yet never had a watchdog event triggered. You may ask how we could know that: we monitor the uptime and investigate if we find something unexpected. You can automate this if you have a central logging host that "knows" when and how a machine is rebooted and can alert if no "shutdown messages" are registered for a specific host.

Uh, are you sure it's "guest related"? As I understand it, in this case it's a daemon in the guest OS that stops sending "I'm alive" to Proxmox because it crashed or is hanging, and it's Proxmox that resets the VM. And I'm not looking for reasons why the VM crashed, I'm just looking for logs showing that this mechanism in Proxmox triggered the reset of the VM.

LnxBil · Apr 21, 2024

frijsdijk said:
Uh, are you sure it's "guest related"? As I understand it, in this case it's a daemon in the guest OS that stops sending "I'm alive" to Proxmox because it crashed or is hanging, and it's Proxmox that resets the VM. And I'm not looking for reasons why the VM crashed, I'm just looking for logs showing that this mechanism in Proxmox triggered the reset of the VM.

No, it's a virtualized hardware watchdog inside of the VM that monitors and resets the VM. PVE is out of the loop completely.

frijsdijk · Apr 21, 2024

LnxBil said:
No, it's a virtualized hardware watchdog inside of the VM that monitors and resets the VM. PVE is out of the loop completely.

Forgive me for my ignorance, but I don't understand how this works. In the example we do an "echo c > /proc/sysrq-trigger" to trigger the reset. If we do this, everything 'hangs' inside the VM, so how can hardware then still reset the VM? I'll have to do some digging because I'm clearly missing some knowledge.

Thanks!

LnxBil · Apr 23, 2024

A watchdog is (in this case virtualized) hardware that can reset your machine and works alongside the CPU. Once activated, it needs to be updated constantly from the OS in order not to trigger a hardware reset. This is done by the watchdog daemon. Normally it is just incrementing a value. If this update is not given within e.g. 60 seconds, the watchdog assumes an OS crash and resets your machine. AFAIK, this can be simulated by killing the watchdog daemon inside of your guest, which would otherwise update the counter regularly.

There is also a Wikipedia article that gives a more complete picture.
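
To make the mechanism concrete, here is a toy model of that timer in shell. It uses virtual time only and touches no real hardware; the 60 second timeout matches the example above, while the 10 second update interval and the function names are assumptions for illustration.

```shell
#!/bin/sh
# Toy model of a watchdog timer using virtual time (no real hardware).
# TIMEOUT and the function names are illustrative assumptions.

TIMEOUT=${TIMEOUT:-60}

# Succeed if the timer has run out.
# $1 = time of the last update from the daemon, $2 = current time.
watchdog_expired() {
    [ $(( $2 - $1 )) -ge "$TIMEOUT" ]
}

# Step through virtual time one second at a time. The daemon updates the
# timer every 10 s until it "dies" at second $1. Prints when the reset fires.
simulate() {
    daemon_dies_at=$1
    last_kick=0
    now=0
    while ! watchdog_expired "$last_kick" "$now"; do
        now=$(( now + 1 ))
        if [ "$now" -lt "$daemon_dies_at" ] && [ $(( now % 10 )) -eq 0 ]; then
            last_kick=$now
        fi
    done
    echo "reset at t=$now"
}

simulate 25   # prints "reset at t=80": last update at t=20, plus 60 s
```

The point of the model is that the reset decision depends only on the time since the last update, so it keeps working even when the guest OS that was supposed to send the updates has stopped running.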

frijsdijk · Apr 24, 2024

Thanks, I never considered it could be a separate piece of (virtual) hardware that works independently from the CPU of the guest.
