Possible to detect guest hangs in HA using Qemu Agent?

casparsmit

Hi,

In a 3-node Ceph/Proxmox 4 HA cluster, I recently had a Windows 7 guest VM hang (BSOD).
As expected, HA never kicked in, because from Proxmox's point of view the VM was still up and running.

I thought the QEMU Guest Agent might help with checking for hung VMs, but according to the wiki page it only helps with properly shutting down a VM and creating consistent (VSS) backups.

Shouldn't it be (or will it become) possible to use the QEMU Guest Agent to check whether the VM's OS is working properly? If it is not (no connection to the QEMU Guest Agent), the HA stack could stop the VM (first try a guest agent shutdown, then an ACPI shutdown, and lastly a forced stop) and then start it again (with a maximum number of retries).

Or is there another way to check the status of the OS running inside the VM?

Caspar
 

LnxBil

In theory, this should work via a guest ping. I have never tried it, but it should work fine for checking whether the guest agent is "ponging". Note that this does not check whether the guest OS has crashed, only whether the agent is running.
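A minimal sketch of such a guest ping check, run on the Proxmox host. It assumes a PVE release where the agent is reachable through "qm guest cmd <vmid> ping" (older releases expose the same call as "qm agent <vmid> ping"). The variable names QM, MAX_FAILS and INTERVAL and both function names are made up for this example and are not PVE settings.

```shell
#!/usr/bin/env bash
# Sketch: poll the QEMU guest agent of a VM from the Proxmox host.
# QM, MAX_FAILS and INTERVAL are illustrative names, not PVE options.
QM="${QM:-qm}"              # qm CLI to call (overridable, e.g. for testing)
MAX_FAILS="${MAX_FAILS:-3}" # consecutive missed pings before giving up
INTERVAL="${INTERVAL:-10}"  # seconds to wait between pings

# Return 0 if the agent inside VM $1 answers a ping, non-zero otherwise.
agent_alive() {
    "$QM" guest cmd "$1" ping >/dev/null 2>&1
}

# Return 0 as soon as the agent answers once;
# return 1 after MAX_FAILS missed pings in a row.
agent_responsive() {
    local vmid="$1" fails=0
    while [ "$fails" -lt "$MAX_FAILS" ]; do
        if agent_alive "$vmid"; then
            return 0
        fi
        fails=$((fails + 1))
        # Only sleep if another attempt follows.
        if [ "$fails" -lt "$MAX_FAILS" ]; then
            sleep "$INTERVAL"
        fi
    done
    return 1
}

# Example use (VMID 100 is a placeholder):
#   agent_responsive 100 || echo "agent in VM 100 is not answering"
```

As noted above, a missing pong only says that the agent is not answering; the guest may simply be booting or the agent service may have stopped, so any automatic reset built on this should be treated with care.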

For all of my Linux boxes I use the QEMU watchdog to automatically reset the machine on a kernel crash. I do not know about 64-bit Windows integration of the watchdog timers, though; there was 32-bit support for old Microsoft products, but not for newer ones.
 

n1nj4888

LnxBil said:
For all of my Linux boxes I use the QEMU watchdog to automatically reset the machine on a kernel crash. I do not know about 64-bit Windows integration of the watchdog timers, though; there was 32-bit support for old Microsoft products, but not for newer ones.

Hi @LnxBil, I came across this post because I've recently had a couple of kernel panics in Ubuntu guest VMs. I had assumed the same thing: that PVE HA would notice this (given that qemu-guest-agent is installed in the guest VM) and restart the VM. From the above, it appears that this is not the intended purpose of qemu-guest-agent. As such, could you describe how you installed the "QEMU watchdog" you mention, so that a Linux guest VM freeze / kernel panic is detected and the VM is restarted?

Also, I'd be keen to understand whether there is an option to automatically save a screenshot of the guest VM's console before rebooting it on a hang, so that the kernel panic trace is kept for future investigation. Perhaps the "QEMU watchdog" allows running a "VM reboot script" on the host that captures the console screenshot before rebooting the VM, rather than just hard-resetting it?

Many Thanks!
 

LnxBil

n1nj4888 said:
As such, could you describe how you installed the "QEMU watchdog" you mention, so that a Linux guest VM freeze / kernel panic is detected and the VM is restarted?

It's simple:
- Add the following line to the VM's configuration file manually, then stop and start the VM: watchdog: model=i6300esb,action=reset
- Install the watchdog package inside the guest
- Test it (this will crash your machine): echo c > /proc/sysrq-trigger
- Wait until the machine is reset automatically
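The steps above could be scripted roughly as follows. This is a sketch under a few assumptions: the guest is Debian/Ubuntu, where both the package and the service are called "watchdog", and the VM option is set with "qm set" instead of editing the config file by hand. DRY_RUN, run, host_setup and guest_setup are names invented for this example. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of the watchdog setup steps. DRY_RUN and the function names
# are illustrative; nothing is executed unless DRY_RUN is set to 0.
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# On the Proxmox host: add the virtual watchdog device to VM $1.
# The device only appears after a full stop and start of the VM.
host_setup() {
    local vmid="$1"
    run qm set "$vmid" --watchdog model=i6300esb,action=reset
    run qm shutdown "$vmid"
    run qm start "$vmid"
}

# Inside the guest (Debian/Ubuntu assumed): install and start the daemon.
guest_setup() {
    run apt-get install -y watchdog
    run systemctl enable --now watchdog
}

# Example use:
#   host_setup 100        # on the host, VMID 100 is a placeholder
#   guest_setup           # inside the guest
```

Depending on the distribution, the daemon may also need "watchdog-device = /dev/watchdog" enabled in /etc/watchdog.conf and the i6300esb kernel module loaded before it actually arms the virtual device.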

n1nj4888 said:
Also, I'd be keen to understand whether there is an option to automatically save a screenshot of the guest VM's console before rebooting it on a hang, so that the kernel panic trace is kept for future investigation. Perhaps the "QEMU watchdog" allows running a "VM reboot script" on the host that captures the console screenshot before rebooting the VM, rather than just hard-resetting it?

If you want to debug your running guest OS, I suggest you read about its crash dump capabilities:
https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html

I use this quite often to debug "real" machines. I have never encountered crashes inside of VMs, but I also do not use Ubuntu :p
 

n1nj4888

Thanks @LnxBil,

I had a quick look inside the VM after adding "watchdog: model=i6300esb,action=reset" to the VMID.conf file, and noticed that the i6300esb module is blacklisted by default in Ubuntu. It is blacklisted not under /etc/modprobe.d but in the /lib/modprobe.d/blacklist_<KERNEL VERSION>.conf files. That makes me believe that even if I comment the module out of these files (or enable it in /etc/modules), it would be re-blacklisted on each kernel update. Is there an easier way of un-blacklisting the i6300esb module that persists across kernel updates?

Also, on Ubuntu at least, a simulated test crash only works once sysrq is enabled, so you need to run the first command below before triggering the crash:

Code:
echo 1 > /proc/sys/kernel/sysrq; echo c > /proc/sysrq-trigger

I'll take a look at the kernel crash dump capabilities but, for the time being, I've simply added
Code:
kernel.panic = 60
to /etc/sysctl.conf in the VM, so that the VM reboots automatically 60 seconds after a kernel panic occurs.
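For anyone who wants to script the kernel.panic change, here is a small sketch that writes the setting idempotently, so running it twice does not leave duplicate lines. The function name persist_sysctl and its optional third argument (the target file) are my own choices for illustration; the setting itself is the one mentioned above.

```shell
#!/usr/bin/env bash
# Sketch: persist a sysctl setting without creating duplicate entries.
# persist_sysctl is an illustrative helper, not part of any package.
persist_sysctl() {
    local key="$1" value="$2" file="${3:-/etc/sysctl.conf}"
    local pattern
    # Escape the dots in the key so they match literally in the regex.
    pattern="$(printf '%s' "$key" | sed 's/\./\\./g')"
    touch "$file"
    # Copy every line except existing assignments of this key
    # (grep exits non-zero when nothing is left, which is fine here).
    grep -v -E "^[[:space:]]*${pattern}[[:space:]]*=" "$file" > "${file}.tmp" || true
    # Append the new assignment and swap the file into place.
    printf '%s = %s\n' "$key" "$value" >> "${file}.tmp"
    mv "${file}.tmp" "$file"
}

# Example use inside the guest, as root:
#   persist_sysctl kernel.panic 60
#   sysctl -p        # reload /etc/sysctl.conf without rebooting
```

Keep in mind that this only covers kernel panics. A guest that hangs without panicking will not reboot by itself this way, which is the case the watchdog device is meant for.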

Thanks
 
