guest-ping QMP command missing or unhandled causes VM to remain in "running (shutdown)" state

user7500

New Member
Jul 16, 2025
Summary:
We're encountering a persistent problem with Proxmox (tested on latest 8.4.x) where Debian 12 and Ubuntu 24.04 guests equipped with the QEMU guest agent do not respond to `guest-ping`. This leads to a situation where Proxmox attempts a clean shutdown, but the VM remains stuck in the `running (shutdown)` state indefinitely — even though the guest system has already shut down properly.

How to reproduce:
1. Install Debian 12 or Ubuntu 24.04 in a VM.
2. Enable the QEMU guest agent in VM options.
3. Install `qemu-guest-agent` in the guest OS.
4. Try to shut down the VM using:
`qm shutdown <vmid>`
5. The VM never transitions to `status = stopped` — it stays in `running (shutdown)`.
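To watch for the stuck state from the shell, the two relevant fields of `qm status <vmid> --verbose` can be combined. This is only a sketch. It assumes that the displayed `running (shutdown)` corresponds to `status: running` together with `qmpstatus: shutdown`. The function name `describe_vm_state` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: reads `qm status <vmid> --verbose` output on stdin
# and prints the state in the "status (qmpstatus)" form.
describe_vm_state() {
    awk '
        /^status:/    { status = $2 }
        /^qmpstatus:/ { qmp = $2 }
        END {
            # Show the QMP state only when it differs from the plain status
            if (qmp != "" && qmp != status) print status " (" qmp ")"
            else print status
        }'
}

# Usage on a Proxmox host:
#   qm status <vmid> --verbose | describe_vm_state
```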

Diagnostic Details:
- The agent is correctly installed and appears to work:
`qm agent <vmid> get-osinfo` returns valid data.
- However, `qm agent <vmid> ping` returns nothing at all.
- Direct socket testing with:
`echo '{"execute":"guest-ping"}' | socat - UNIX-CONNECT:/var/run/qemu-server/<vmid>.qga`
also results in an empty response.

- Proxmox logs show:
`VM <vmid> qmp command 'guest-ping' failed - unable to connect to VM <vmid> qga socket - timeout after 51 retries`
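One thing that may be worth ruling out: as far as I know, a successful `guest-ping` carries no payload, so `qm agent <vmid> ping` printing nothing is not by itself proof of failure. The exit status is the more reliable signal. A small sketch of such a check, where the helper name `check_ping` is hypothetical and the command to run is passed in as arguments:

```shell
#!/bin/bash
# Hypothetical helper: run the given command and judge it by its exit
# status instead of by whether it printed anything.
check_ping() {
    local output
    if output=$("$@" 2>&1); then
        echo "ok (exit 0, output: '${output}')"
    else
        # In the else branch, $? still holds the exit status of the command
        echo "failed (exit $?, output: '${output}')"
    fi
}

# Usage on a Proxmox host:
#   check_ping qm agent <vmid> ping
```

If this reports `ok` with empty output while `qm shutdown` still hangs, the ping itself is probably fine and the timeout in the log comes from somewhere else, for example the socket being unreachable at that moment.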

Tested with:
- Debian 12.11 netinst ISO with default and backports kernel
- Ubuntu 24.04 LTS
- Fully up-to-date Proxmox 8.4 with `pve-qemu-kvm 9.2.0-6`

What we suspect:
It seems that `guest-ping` is either not implemented in the version of `qemu-guest-agent` shipped with Debian/Ubuntu, or the agent does not behave as expected by Proxmox (i.e., returning `{}` or at least something that makes the shutdown logic happy).

We explored patching Proxmox here:
- `/usr/share/perl5/PVE/QemuServer.pm`
- `/usr/share/perl5/PVE/QemuServer/Agent.pm`

Specifically:
- `sub qga_check_running`
- `sub agent_available`

But we could not make it reliably detect the broken/missing ping and fall back to SIGTERM or other means.

Workaround:
Using `qm shutdown <vmid> --timeout 30 --forceStop 1` eventually falls back to SIGTERM and stops the VM cleanly. At other times even that didn't work, and the only option was to find the PIDs of the kvm processes and `kill -9` them.
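The escalation described above could be wrapped roughly like this. It is a sketch, not a tested procedure: `stop_vm_escalating` is a made-up name, and the `QM` variable exists only so the function can be dry-run with a stand-in for the real `qm` binary.

```shell
#!/bin/bash
# QM can be overridden for dry runs; it defaults to the real `qm` binary.
QM="${QM:-qm}"

# Hypothetical helper: try a clean shutdown first, then a hard stop.
stop_vm_escalating() {
    local vmid="$1"

    # Step 1: clean shutdown that is forced after the timeout
    if "$QM" shutdown "$vmid" --timeout 30 --forceStop 1; then
        echo "VM $vmid: shutdown succeeded"
        return 0
    fi

    # Step 2: explicit hard stop
    if "$QM" stop "$vmid"; then
        echo "VM $vmid: stopped with qm stop"
        return 0
    fi

    # Step 3 (killing the kvm process) is deliberately left to the operator
    echo "VM $vmid: still running, manual intervention needed"
    return 1
}

# Usage on a Proxmox host:
#   stop_vm_escalating <vmid>
```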

Suggestion:
We propose to:
1. Treat missing `guest-ping` as "not implemented" (instead of a failure).
2. Fall back automatically if the agent returns an empty response, not just an error hash.
3. Optionally improve log messages to distinguish between guest agent not running vs. guest agent not implementing `guest-ping`.
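To illustrate the distinction we have in mind, here is a rough classifier for a raw agent reply, written in shell rather than in the Perl that Proxmox uses. It assumes that a successful reply contains a `"return"` key and that an unknown command is reported with an error class of `CommandNotFound`. Both are assumptions about the agent's wire format, and `classify_qga_reply` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: map a raw guest agent reply (a JSON string) to one
# of the cases that the log messages could distinguish.
classify_qga_reply() {
    local reply="$1"
    case "$reply" in
        "")                    echo "no-reply" ;;         # agent silent or not running
        *'"CommandNotFound"'*) echo "not-implemented" ;;  # assumed error class
        *'"error"'*)           echo "error" ;;            # any other error hash
        *'"return"'*)          echo "ok" ;;               # normal reply
        *)                     echo "unknown" ;;
    esac
}

# Usage with the socat test from above:
#   reply=$(echo '{"execute":"guest-ping"}' | socat - UNIX-CONNECT:/var/run/qemu-server/<vmid>.qga)
#   classify_qga_reply "$reply"
```

Plain string matching is used here to avoid a dependency on a JSON parser; a real implementation would parse the reply properly.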

Question:
Can the Proxmox team consider adding support for this edge case in upcoming updates?
Or is there a preferred method for handling this in a Proxmox-compliant way?

Related: https://unix.stackexchange.com/ques...-guest-agent-responds-to-commands-but-qm-shut

Bad Solution for now: https://www.reddit.com/r/Proxmox/co...en_neither_the_gui_nor_qm_stop/?show=original

Other Bad Solution I'm currently using:

Code:
nano /root/check_stuck_vm.sh
Code:
#!/bin/bash
# Force-kill VMs that stay in the QMP "shutdown" state for more than 30 seconds.

# Print a message with a timestamp taken at the time of logging
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
}

# Print the qmpstatus field of the given VM
get_qmpstatus() {
    qm status "$1" --verbose | awk '/^qmpstatus:/ {print $2}'
}

log "Starting stuck VM check..."

# Get list of VMs in running state (STATUS is the third column of `qm list`)
running_vms=$(qm list | awk '$3 == "running" {print $1}')

for vmid in $running_vms; do
    # Check if VM is in 'shutdown' phase
    if [ "$(get_qmpstatus "$vmid")" = "shutdown" ]; then
        log "VM $vmid is in shutdown phase (running/shutdown detected)..."

        # Wait 30 seconds
        sleep 30

        # Re-check qmpstatus
        if [ "$(get_qmpstatus "$vmid")" = "shutdown" ]; then
            log "VM $vmid is still in shutdown phase, forcing with kill -9..."
            # The trailing space keeps "-id 10" from also matching "-id 100"
            pid=$(pgrep -f "kvm.*-id $vmid ")
            if [ -n "$pid" ]; then
                # Unquoted on purpose in case more than one PID is found
                kill -9 $pid
                log "VM $vmid process" $pid "killed."
            else
                log "Unable to find PID for VM $vmid."
            fi
        else
            log "VM $vmid shutdown completed normally during wait."
        fi
    fi
done

Code:
chmod +x /root/check_stuck_vm.sh

Code:
crontab -e

Code:
* * * * * /root/check_stuck_vm.sh
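One caveat with running this every minute: the script sleeps for 30 seconds per stuck VM, so with several stuck VMs a run can still be active when cron starts the next one. A possible guard, assuming `flock` from util-linux is installed, is sketched here. The lock path and the function name `acquire_lock` are hypothetical.

```shell
#!/bin/bash
# Hypothetical lock path; any location writable by root works.
LOCKFILE="${LOCKFILE:-/run/lock/check_stuck_vm.lock}"

acquire_lock() {
    # Open the lock file on file descriptor 9 and keep it open for the
    # lifetime of the script; the lock is released when the script exits.
    exec 9>"$1"
    # -n: fail immediately instead of waiting if another run holds the lock
    flock -n 9
}

# At the top of check_stuck_vm.sh one could then write:
#   acquire_lock "$LOCKFILE" || { echo "previous run still active, skipping"; exit 0; }
```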

Thanks!
 
I can confirm that I have a similar issue. In my case, a VM fails to migrate between cluster nodes because the guest does not respond to `guest-ping`. Our guest VM is running Windows 10.