VM QMP Timeout Issues After Rejoining Node to Proxmox Cluster

Kuldeep.Roy

Feb 3, 2025
Hello everyone,

I'm experiencing issues with my Proxmox cluster after one of my nodes (Node 3) went down physically. After bringing the node back and rejoining it to the cluster, all VMs on that node are behaving erratically.

System Details:

  • Proxmox Version: 8.1.4
  • Cluster Setup: 3 nodes
  • Storage: ZFS (local), Ceph (shared)
  • VM Configuration: Mixed (some local disk, some on Ceph)

Issue Description:

After Node 3 was powered back on and rejoined the cluster, I started seeing QMP timeout errors whenever I interact with its VMs. Any action (start, stop, opening the console, querying status) returns the following error:

VM 180 qmp command 'query-version' failed - unable to connect to VM 180 qmp socket - timeout after 51 retries (500)

Some VMs fail to start, while others become unresponsive. Restarting the Proxmox services (pve-cluster, pveproxy, pvedaemon, qemu-server) doesn't seem to resolve the issue.

Troubleshooting Steps Taken:

  1. Checked Node & Cluster Status:
    • pvecm status shows the cluster is online, and quorum is present.
    • pveceph status confirms that Ceph is healthy.
    • systemctl status corosync shows it is running fine.
  2. Checked VM Processes & QMP Sockets:
    • Found that affected VMs do not have an active QMP socket under /var/run/qemu-server/.
    • Manually running qm list sometimes hangs or takes a long time.
  3. Restarted Services & Nodes:
    • Restarting pve-cluster and pvedaemon had no effect.
    • Rebooted the affected node, but the problem persists.
  4. Checked Disk & Storage Issues:
    • zpool status (for ZFS) and ceph -s (for Ceph) report no critical errors.
    • VMs on local ZFS storage fail to start with the same QMP error.
  5. Manually Tried Starting VM with Debugging:
    • Running qm start 180 -debug results in the same QMP timeout.
    • Manually checking ps aux | grep qemu shows no running process for the VM.
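
For reference, the socket and process checks from steps 2 and 5 can be sketched as a small shell function. This is only a minimal sketch, not an official Proxmox tool: check_vm and RUN_DIR are names made up for illustration, and it assumes the runtime files are named <vmid>.qmp and <vmid>.pid under /var/run/qemu-server/.

```shell
#!/bin/sh
# Sketch: report whether a VM has a QMP socket and a live QEMU process.
# Assumption: qemu-server keeps <vmid>.qmp and <vmid>.pid in RUN_DIR.
# check_vm and RUN_DIR are hypothetical names, not part of Proxmox.

RUN_DIR="${RUN_DIR:-/var/run/qemu-server}"

check_vm() {
    vmid="$1"
    sock="$RUN_DIR/$vmid.qmp"
    pidfile="$RUN_DIR/$vmid.pid"

    # -S is true only if the path exists and is a Unix socket.
    if [ -S "$sock" ]; then
        echo "VM $vmid: QMP socket present"
    else
        echo "VM $vmid: QMP socket missing"
    fi

    # kill -0 sends no signal; it only tests that the PID exists.
    if [ -r "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        echo "VM $vmid: process running"
    else
        echo "VM $vmid: no running process"
    fi
}
```

Calling check_vm 180 on the affected node prints one line for the socket and one for the process. In my case both checks come back negative, which matches the ps aux result above.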

Questions & Assistance Needed:

  • How can I properly recover the QMP socket and get the VMs working again?
  • Could this be related to a cluster sync issue after Node 3 was rejoined?
  • Are there specific logs or recovery steps I should follow to debug this further?

Any insights or suggestions would be greatly appreciated. Thanks in advance!