[SOLVED] Repeated HA Failure When Starting VM After Reboot

Jul 6, 2024
Hello

I'm facing a recurring issue with High Availability on my Proxmox cluster. If I stop a VM and later try to start it again, the start fails and the VM's HA status shows as "failed." This happens several times a week: the HA resource sits in state "started," but as soon as I stop and start the VM it flips to "failed." It doesn't make much sense to me.

Here’s the scenario:
1. I stop the VM without any issues.
2. When I try to start it again, it won’t boot because HA status shows as "failed."
3. To get the VM running again, I have to delete its HA configuration. Once HA is removed, the VM boots without any problems.
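Rather than deleting the HA configuration each time, the documented error-recovery path is to clear the error flag by disabling and re-enabling the HA resource. A sketch of both approaches, using VM ID 1162 from the logs below as the example:

```shell
# Clear the HA error state by disabling the resource (documented "error recovery"),
# then request a start again once the error flag is gone.
ha-manager set vm:1162 --state disabled
ha-manager set vm:1162 --state started

# Alternative: drop the resource from HA entirely and start the VM directly,
# which is effectively what deleting the HA configuration does.
ha-manager remove vm:1162
qm start 1162
```

Note that these commands act on a live cluster, so run them on the node that owns the resource.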

I’ve attached relevant logs below to help diagnose the problem. Any insights or suggestions on how to resolve this would be greatly appreciated.

Thank you!

Code:
Nov 18 21:47:39 localhost pvedaemon[3790683]: <root@pam> starting task UPID:localhost:00119DE9:11334410:673BA7EB:hastop:1162:root@pam:
Nov 18 21:47:39 localhost pvedaemon[3790683]: <root@pam> end task UPID:localhost:00119DE9:11334410:673BA7EB:hastop:1162:root@pam: OK
Nov 18 21:47:44 localhost pve-ha-lrm[1154629]: stopping service vm:1162 (timeout=0)
Nov 18 21:47:44 localhost pve-ha-lrm[1154629]: <root@pam> starting task UPID:localhost:00119E49:1133464A:673BA7F0:qmstop:1162:root@pam:
Nov 18 21:47:44 localhost pve-ha-lrm[1154633]: stop VM 1162: UPID:localhost:00119E49:1133464A:673BA7F0:qmstop:1162:root@pam:
Nov 18 21:47:49 localhost pve-ha-lrm[1154629]: Task 'UPID:localhost:00119E49:1133464A:673BA7F0:qmstop:1162:root@pam:' still active, waiting
Nov 18 21:47:49 localhost pve-ha-lrm[1154633]: VM 1162 qmp command failed - VM 1162 qmp command 'quit' failed - got timeout
Nov 18 21:47:49 localhost pve-ha-lrm[1154633]: VM quit/powerdown failed - terminating now with SIGTERM
Nov 18 21:47:54 localhost pve-ha-lrm[1154629]: Task 'UPID:localhost:00119E49:1133464A:673BA7F0:qmstop:1162:root@pam:' still active, waiting
Nov 18 21:47:56 localhost pvestatd[2695]: VM 1162 qmp command failed - VM 1162 qmp command 'query-proxmox-support' failed - got timeout
Nov 18 21:47:57 localhost pvestatd[2695]: status update time (9.566 seconds)
Nov 18 21:47:57 localhost pvestatd[2695]: restarting server after 29 cycles to reduce memory usage (free 156628 (15556) KB)
Nov 18 21:47:57 localhost pvestatd[2695]: server shutdown (restart)
Nov 18 21:47:58 localhost pvestatd[2695]: restarting server
Nov 18 21:47:59 localhost pve-ha-lrm[1154629]: Task 'UPID:localhost:00119E49:1133464A:673BA7F0:qmstop:1162:root@pam:' still active, waiting
Nov 18 21:47:59 localhost pve-ha-lrm[1154633]: VM still running - terminating now with SIGKILL
Nov 18 21:48:01 localhost pve-ha-lrm[1154633]: can't unmap rbd device /dev/rbd-pve/6f2f3b31-bcba-47a5-ac2b-e08a7aed1b63/block-storage-metadata/vm-1162-disk-0: rbd: sysfs write failed
Nov 18 21:48:01 localhost pve-ha-lrm[1154633]: volume deactivation failed: block-storage:vm-1162-disk-0 at /usr/share/perl5/PVE/Storage.pm line 1258.
Nov 18 21:48:01 localhost pve-ha-lrm[1154629]: <root@pam> end task UPID:localhost:00119E49:1133464A:673BA7F0:qmstop:1162:root@pam: OK
Nov 18 21:48:01 localhost pve-ha-lrm[1154629]: unable to stop stop service vm:1162 (still running)
Nov 18 21:48:04 localhost pve-ha-lrm[1155163]: service vm:1162 is in an error state and needs manual intervention. Look up 'ERROR RECOVERY' in the documentation.
Nov 18 21:48:16 localhost pvestatd[2695]: VM 1162 qmp command failed - VM 1162 qmp command 'query-proxmox-support' failed - unable to connect to VM 1162 qmp socket - timeout after 51 retries
 
Resolved!

I ended up writing a monitoring script that disables HA for the resource whenever it goes into an error state, which lets the VM boot normally again.

Code:
#!/bin/bash

# Description:
# Monitors the system journal in real time using journalctl. When it
# detects that an 'ha-manager set vm:<id> --state disabled' command
# failed with exit code 255, it extracts the VM ID and retries
# disabling HA for that VM so it can be started normally.

# Ensure the script is run as root
if [[ "$EUID" -ne 0 ]]; then
  echo "Please run as root." >&2
  exit 1
fi

# Disable HA for the affected VM
handle_failure() {
  local vm_id=$1
  echo "$(date): Detected HA failure for vm:$vm_id. Disabling HA for the resource..."

  if ha-manager set vm:"$vm_id" --state disabled; then
    echo "$(date): Successfully executed 'ha-manager set vm:$vm_id --state disabled'."
  else
    echo "$(date): Failed to execute 'ha-manager set vm:$vm_id --state disabled'."
  fi
}

# Follow the journal and react to matching failure lines
journalctl -f | while read -r line; do
  if [[ "$line" =~ command\ \'ha-manager\ set\ vm:([0-9]+)\ --state\ disabled\'\ failed:\ exit\ code\ 255 ]]; then
    vm_id="${BASH_REMATCH[1]}"
    handle_failure "$vm_id"
  fi
done
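The regex-and-extraction logic in the script can be sanity-checked offline by feeding it a sample line, without touching `ha-manager` at all. A minimal sketch (the sample log text below is illustrative, not a real journal entry):

```shell
#!/bin/bash
# Offline sanity check of the VM-ID extraction regex used by the watcher
# script above. The sample line is made up to mimic the failure message.
sample="Nov 18 21:48:04 localhost pve-ha-crm[1234]: command 'ha-manager set vm:1162 --state disabled' failed: exit code 255"

vm_id=""
if [[ "$sample" =~ command\ \'ha-manager\ set\ vm:([0-9]+)\ --state\ disabled\'\ failed:\ exit\ code\ 255 ]]; then
  vm_id="${BASH_REMATCH[1]}"   # first capture group: the numeric VM ID
fi
echo "extracted vm id: $vm_id"   # prints 1162 for the sample line
```

If you rely on this script long-term, consider running it as a systemd service so it survives reboots instead of starting it by hand.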