Possible to detect guest hangs in HA using Qemu Agent?

casparsmit · Apr 14, 2017

Hi,

In a 3-node Ceph/Proxmox 4 HA cluster, I recently had a Windows 7 guest VM hang (BSOD).
As expected, HA never kicked in, because from Proxmox's point of view the VM was up and running.

I thought the QEMU Guest Agent might help with checking for hung VMs, but according to the wiki page it only helps with properly shutting down a VM and creating consistent (VSS) backups.

Shouldn't it be (or will it become) possible to use the QEMU Guest Agent to check whether the VM's OS is working properly and, if not (no connection to the QEMU Guest Agent), use the HA stack to stop the VM (first try a guest agent shutdown, then an ACPI shutdown, and lastly a forced stop) and then start it again (with a maximum number of retries)?

Or is there another way to check the status of the OS running inside the VM?

Caspar

LnxBil · Apr 14, 2017

In theory, this should work via a guest ping. I have never tried it, but it should work well for checking whether the guest agent is "ponging". Note that this does not check whether the guest crashed, only whether the agent is running.
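A minimal sketch of such a ping check, run on the Proxmox host. The function name probe_guest, the retry count, the delay, and VMID 100 are all illustrative assumptions; on PVE 4/5 the agent ping is available as "qm agent <vmid> ping", and on newer releases as "qm guest cmd <vmid> ping". Nothing here is wired into the HA stack; it only reports whether the agent answered.

```shell
#!/bin/sh
# Retry a liveness probe a few times before declaring the guest unresponsive.
# Usage: probe_guest <retries> <delay-seconds> <command...>
probe_guest() {
    retries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$retries" ]; do
        # Run the probe command; any zero exit status counts as "alive".
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only between attempts, not after the last one.
        if [ "$i" -lt "$retries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example on a Proxmox host (VMID 100 is a placeholder):
#   probe_guest 3 10 qm agent 100 ping || echo "guest agent not responding"
```

What to do on failure (shutdown, stop, restart) is left open, since a missing pong can also mean the agent service simply is not running.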

For all of my Linux boxes I use the QEMU watchdog to automatically reset the VM on a kernel crash. I do not know about Windows 64-bit integration of the watchdog timers, though: there was 32-bit support for old MS products, but not for newer ones.

n1nj4888 · Oct 7, 2019

LnxBil said:
For all of my Linux boxes I use the QEMU watchdog to automatically reset the VM on a kernel crash. I do not know about Windows 64-bit integration of the watchdog timers, though: there was 32-bit support for old MS products, but not for newer ones.

Hi @LnxBil, I came across this post because I've recently had a couple of kernel panics in Ubuntu guest VMs. Like the original poster, I assumed PVE HA would detect this (given that qemu-guest-agent is installed in the guest) and restart the VM, but from the above it appears that this is not the intended purpose of qemu-guest-agent. As such, could you describe how you installed the "QEMU watchdog" you mention to identify when there is a Linux guest VM freeze / kernel panic and restart the VM?

Also, I'd be keen to understand whether there is the option to automatically save a screenshot of the guest VM's console before rebooting it on a hang so that the kernel panic trace could be saved for future investigations? Perhaps the "QEMU Watchdog" allows the running of a "VM reboot script" on the host to also capture the VM console screenshot before rebooting the VM rather than just hard restarting it?

Many Thanks!

LnxBil · Oct 16, 2019

n1nj4888 said:
As such, could you describe how you installed the "QEMU watchdog" you mention to identify when there is a Linux guest VM freeze / kernel panic and restart the VM?

It's simple:
- add the configuration manually and stop and start your VM: watchdog: model=i6300esb,action=reset
- install the watchdog package
- test it (this will crash your machine): echo c > /proc/sysrq-trigger
- wait until the machine is reset automatically
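Before running the crash test, it can help to confirm inside the guest that the emulated watchdog is actually visible. The sketch below assumes the device shows up as /dev/watchdog; the function name check_watchdog is hypothetical. If the device is missing, the i6300esb module is probably not loaded ("lsmod | grep i6300esb" shows whether it is). Depending on the distribution, the watchdog daemon's config (usually /etc/watchdog.conf) may also need its watchdog-device line pointing at that device.

```shell
#!/bin/sh
# Report whether a watchdog device node exists in the guest.
# Usage: check_watchdog [device-path]   (defaults to /dev/watchdog)
check_watchdog() {
    dev=${1:-/dev/watchdog}
    if [ -e "$dev" ]; then
        echo "watchdog device present: $dev"
        return 0
    fi
    echo "no watchdog device at $dev; is the i6300esb module loaded?" >&2
    return 1
}

# Example, inside the guest:
#   check_watchdog && echo "safe to run the crash test"
```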

n1nj4888 said:
Also, I'd be keen to understand whether there is the option to automatically save a screenshot of the guest VM's console before rebooting it on a hang so that the kernel panic trace could be saved for future investigations? Perhaps the "QEMU Watchdog" allows the running of a "VM reboot script" on the host to also capture the VM console screenshot before rebooting the VM rather than just hard restarting it?

If you want to debug your running guest OS, I suggest you read about its crash dump capabilities:
https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html

I use this quite often to debug "real" machines. I have never encountered crashes inside VMs, but I also do not use Ubuntu.

n1nj4888 · Oct 17, 2019

Thanks @LnxBil,

I had a quick look inside the VM after adding "watchdog: model=i6300esb,action=reset" to the VMID.conf file, but noted that the i6300esb module is blacklisted by default in Ubuntu. It's actually blacklisted not under /etc/modprobe.d but in the /lib/modprobe.d/blacklist_<KERNEL VERSION>.conf files, which makes me believe that even if I commented the module out of these files (or enabled it in /etc/modules), it would be re-blacklisted on each kernel update. Is there an easier way of un-blacklisting the i6300esb module that persists across kernel updates?

Also, on Ubuntu at least, to perform a simulated test crash you first need to enable the sysrq trigger (the first command below) before executing the crash itself:

Code:

echo 1 > /proc/sys/kernel/sysrq; echo c > /proc/sysrq-trigger

I'll take a look at the kernel crash dump capabilities but, for the time being, I've simply added

Code:

kernel.panic = 60

to

Code:

/etc/sysctl.conf

in the VM, so that it reboots automatically 60 seconds after a kernel panic occurs.
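The same setting can be sketched as a small helper that writes a sysctl drop-in instead of editing /etc/sysctl.conf by hand. The function name write_panic_conf and the file name 90-panic-reboot.conf are assumptions for illustration; only the kernel.panic key and the 60-second value come from the post.

```shell
#!/bin/sh
# Write a sysctl fragment that reboots the guest N seconds after a kernel panic.
# Usage: write_panic_conf <seconds> [file]
write_panic_conf() {
    seconds=$1
    # Hypothetical drop-in name; any *.conf file under /etc/sysctl.d is read at boot.
    file=${2:-/etc/sysctl.d/90-panic-reboot.conf}
    printf 'kernel.panic = %s\n' "$seconds" > "$file"
}

# Example, as root inside the guest:
#   write_panic_conf 60 && sysctl --system   # reload all sysctl config files
```

This only covers kernel panics that the guest kernel itself notices; a hard freeze with no panic would still need something like the watchdog.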

Thanks
