Possible to detect guest hangs in HA using Qemu Agent?

casparsmit

Renowned Member
Feb 24, 2015
Hi,

In a 3-node Ceph/Proxmox 4 HA cluster, I recently had a Windows 7 guest VM hang (BSOD).
As expected, HA never kicked in, because from Proxmox's point of view the VM was up and running.

I thought the QEMU Guest Agent might help with checking for hung VMs, but according to the wiki page it only helps with properly shutting down a VM and creating consistent (VSS) backups.

Shouldn't it be possible, or will it become possible, to use the QEMU Guest Agent to check whether the VM's OS is working properly and, if not (no connection to the QEMU Guest Agent), use the HA stack to stop the VM (first try a guest agent shutdown, then an ACPI shutdown, and lastly a forced stop) and then start it again (with a maximum number of retries)?

Or is there another way to check the status of the OS running inside the VM?

Caspar
 
In theory, this should work via a guest ping. I have never tried it, but it should be enough to check whether the guest agent is "ponging". Note that this does not tell you whether the OS has crashed, only whether the agent is still running.
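
A guest ping check along these lines could be sketched as follows. It is only a sketch: it assumes the agent option is enabled for the VM and that `qm guest cmd <vmid> ping` is available on the node (older releases exposed the same call as `qm agent <vmid> ping`). The function names, retry count and delay are made up for illustration.

```shell
#!/bin/sh
# Sketch: report whether the guest agent of a VM answers a ping.
# Assumes the Proxmox CLI `qm` is on PATH; QM can be overridden for testing.
QM="${QM:-qm}"

# Returns 0 if the agent answered the ping, non-zero otherwise.
agent_alive() {
    vmid="$1"
    # On older releases this would be: qm agent "$vmid" ping
    "$QM" guest cmd "$vmid" ping >/dev/null 2>&1
}

# Hypothetical helper: only report a problem after several missed pings,
# so that a single slow answer is not treated as a hang.
check_vm() {
    vmid="$1"
    retries="${2:-3}"
    delay="${3:-10}"
    i=0
    while [ "$i" -lt "$retries" ]; do
        if agent_alive "$vmid"; then
            echo "VM $vmid: agent responding"
            return 0
        fi
        i=$((i + 1))
        if [ "$i" -lt "$retries" ]; then
            sleep "$delay"
        fi
    done
    echo "VM $vmid: no agent response after $retries attempts"
    return 1
}
```

Something like `check_vm 100 3 10` could then be run periodically on the node. What to do on a non-zero result (for example `qm reset`) is deliberately left out, since a missing pong only shows that the agent is not answering.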

For all of my Linux boxes I use the QEMU watchdog to automatically reset the machine on a kernel crash. I do not know about the Windows 64-bit integration of the watchdog timers; there was 32-bit support for old MS products, but not for newer ones.
 
Hi @LnxBil, I came across this post because I've recently had a couple of kernel panics in Ubuntu guest VMs. I had also assumed that PVE HA would detect this (given that qemu-guest-agent is installed in the guest VM) and restart the VM, but from the above it appears that this is not the intended purpose of qemu-guest-agent. As such, could you describe how you installed the "QEMU watchdog" you mention to identify when there is a Linux guest VM freeze / kernel panic and restart the VM?

Also, I'd be keen to understand whether there is the option to automatically save a screenshot of the guest VM's console before rebooting it on a hang so that the kernel panic trace could be saved for future investigations? Perhaps the "QEMU Watchdog" allows the running of a "VM reboot script" on the host to also capture the VM console screenshot before rebooting the VM rather than just hard restarting it?

Many Thanks!
 
As such, could you describe how you installed the "QEMU watchdog" you mention to identify when there is a Linux guest VM freeze / kernel panic and restart the VM?

It's simple:
- add the configuration line manually to the VM's config file, then stop and start your VM: watchdog: model=i6300esb,action=reset
- install the watchdog package inside the guest
- test it (this will crash your machine): echo c > /proc/sysrq-trigger
- wait until the machine is reset automatically
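
Scripted from the host, the steps above could look roughly like this. It is a sketch, not a tested recipe: it assumes `qm set --watchdog` writes the same `watchdog:` line as editing the config by hand, the function names are hypothetical, and the guest-side commands in the comments assume a Debian-based guest with the default file locations of its watchdog package.

```shell
#!/bin/sh
# Sketch: enable the emulated watchdog for a VM from the Proxmox host.
# QM can be overridden for testing; by default the real `qm` CLI is used.
QM="${QM:-qm}"

# Add the emulated i6300esb watchdog to the VM configuration.
enable_watchdog() {
    "$QM" set "$1" --watchdog model=i6300esb,action=reset
}

# The new device only shows up after a full stop and start of the VM,
# not after a reboot from inside the guest.
cold_restart() {
    "$QM" shutdown "$1" && "$QM" start "$1"
}

# Inside the guest (Debian/Ubuntu assumed), roughly:
#   apt-get install watchdog
#   # let the daemon load the driver itself (assumed defaults file)
#   sed -i 's/^watchdog_module=.*/watchdog_module="i6300esb"/' /etc/default/watchdog
#   # make sure /etc/watchdog.conf contains: watchdog-device = /dev/watchdog
#   systemctl restart watchdog
```

For example, `enable_watchdog 100 && cold_restart 100` would cover the host side for VM 100. The crash test from the list is then run inside the guest.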

Also, I'd be keen to understand whether there is the option to automatically save a screenshot of the guest VM's console before rebooting it on a hang so that the kernel panic trace could be saved for future investigations? Perhaps the "QEMU Watchdog" allows the running of a "VM reboot script" on the host to also capture the VM console screenshot before rebooting the VM rather than just hard restarting it?

If you want to debug your running guest OS, I suggest you read about its crash dump capabilities:
https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html

I use this quite often to debug "real" machines. I have never encountered crashes inside VMs, but I also do not use Ubuntu :p
 
Thanks @LnxBil,

I had a quick look inside the VM after adding "watchdog: model=i6300esb,action=reset" to the VMID.conf file, but noticed that the i6300esb module is blacklisted by default in Ubuntu. It is actually blacklisted not under /etc/modprobe.d but in the /lib/modprobe.d/blacklist_<KERNEL VERSION>.conf files, which makes me believe that even if I comment the module out of these files (or enable it in /etc/modules), it would be re-blacklisted on each kernel update. Is there an easier way of un-blacklisting the i6300esb module that persists across kernel updates?

Also, on Ubuntu at least, to perform a simulated test crash you first need to enable sysrq with the first command below before triggering the crash:

Code:
echo 1 > /proc/sys/kernel/sysrq; echo c > /proc/sysrq-trigger

I'll take a look at the kernel crash dump capabilities but, for the time being, I've simply added
Code:
kernel.panic = 60
to /etc/sysctl.conf in the VM to auto-reboot the VM 60 seconds after a kernel panic occurs.
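
For anyone wanting to script this inside the guest, a minimal sketch could look like the following. The drop-in file name and the function are hypothetical; appending the line to /etc/sysctl.conf by hand works just as well.

```shell
#!/bin/sh
# Sketch: write a kernel.panic setting to a sysctl file,
# replacing any earlier kernel.panic value in that file.
set_panic_timeout() {
    seconds="$1"
    file="${2:-/etc/sysctl.d/90-panic.conf}"   # hypothetical drop-in name
    touch "$file"
    # Drop an existing kernel.panic line (but keep e.g. kernel.panic_on_oops),
    # then append the new value.
    grep -v '^kernel\.panic[ =]' "$file" > "$file.tmp" || true
    echo "kernel.panic = $seconds" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# To apply the setting without rebooting (as root): sysctl -p <file>
```

Calling `set_panic_timeout 60` as root would write the same setting as above into the drop-in file.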

Thanks
 
