PVE - Cluster - HA error recovery state

pena · Apr 22, 2021

Hi!

I have been working on recovery after a sudden shutdown of a Proxmox server, where HA-managed VMs are left in the error state.

Bash:

# ha-manager status | grep pve-01
lrm pve-01 (active, Tue Apr 13 17:13:31 2021)
service vm:100 (pve-01, error)
service vm:101 (pve-01, error)
service vm:102 (pve-01, started)

When the Proxmox server suffers a sudden shutdown, VMs that are managed by HA and belong to a group restricting them to this server (because they use its local storage) end up in the error state once the server is back up.
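
As a rough sketch, the affected services can be picked out of the status output above by matching the node name and the state exactly. The function name `list_error_services` is hypothetical, and the parsing assumes the `service <sid> (<node>, <state>)` line layout shown in the output above.

```shell
#!/usr/bin/env bash
# Sketch: list HA service IDs that are in the "error" state on one node.
# Assumes input lines of the form: service vm:100 (pve-01, error)

# list_error_services NODE  (hypothetical helper; reads status text on stdin)
list_error_services() {
    local node="$1"
    awk -v node="$node" '
        # Fields: $1="service" $2="vm:100" $3="(pve-01," $4="error)"
        $1 == "service" && $3 == ("(" node ",") && $4 == "error)" { print $2 }
    '
}

# Sample copied from the status output above. On a real node you would pipe
# the output of "ha-manager status" into the function instead.
sample='lrm pve-01 (active, Tue Apr 13 17:13:31 2021)
service vm:100 (pve-01, error)
service vm:101 (pve-01, error)
service vm:102 (pve-01, started)'

printf '%s\n' "$sample" | list_error_services pve-01
```

Comparing whole fields rather than using a plain `grep pve-01` avoids also matching a node with a longer name such as `pve-010`.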

Is there a feature that can help recover VMs from the error state, or simply ensure that they are always up?
I understand this is usually done by the HA manager, but because our VMs are restricted to local storage, we were wondering whether there is something else that helps keep a VM always up.

If no such feature exists, is there a process to request it?

Below is the script we are planning to use in our environment to help recover those VMs after a sudden shutdown. I would appreciate any feedback from your side.

Bash:

#!/usr/bin/env bash

# The script should be added to crontab, for example: "@reboot sleep 120 && /root/scripts/vms-ha-recovery.sh"

# Variables

pve_host=$(hostname -s)
pve_uptime=$(uptime -p)
remote_storage="^(acc|test|prod|dev)|management"
tmpfile_ha_vms="/tmp/pve-ha-error-vm-disk.tmp"
ha_manager_cmd="/usr/sbin/ha-manager"
ha_manager_status=$($ha_manager_cmd status | grep "${pve_host}\|quorum")
ha_manager_quorum=$(echo "$ha_manager_status" | grep "quorum" | awk '{print $2}')
# IDs of the HA-managed VMs on this node that are in the error state
ha_manager_error=$(echo "$ha_manager_status" | grep error | grep vm: | awk '{print $2}' | cut -d ":" -f 2)
ha_manager_error_count=$(echo "$ha_manager_error" | grep -v "^$" | wc -l)
ha_manager_log="/var/log/pve-ha-vm-recovery.log"

# Print that all VMs are OK and exit if no VM is in the error state

function start_process {
    if [ "$ha_manager_error_count" -le 0 ]; then
        echo "Proxmox $pve_host has been $pve_uptime and HA-manager VMs state is OK" | tee -a "$ha_manager_log"
        exit 0
    fi
}

# (OPTIONAL) Quorum check: continue only if quorum is OK

function quorum_check {
    # Quoted so that an empty value does not break the test
    if [ "$ha_manager_quorum" != "OK" ]; then
        echo "Proxmox $pve_host shows quorum is $ha_manager_quorum, stopping process" | tee -a "$ha_manager_log"
        exit 1
    fi
}

# List only the VMs that use local storage

function storage_local {
    # Start from an empty list so entries from an earlier run are not reused
    : > "${tmpfile_ha_vms}"
    echo "Checking for HA vms using local storage" | tee -a "$ha_manager_log"
    for vm in $ha_manager_error; do
        vmcfg=$(qm config "${vm}")
        name=$(echo "${vmcfg}" | sed -nE "s/^name:\s+//p")
        # Count scsiN disks whose storage name does not match remote_storage
        localdisks=$(echo "${vmcfg}" | sed -nE "s/^scsi[0-9]+:\s+([^:]+).*/\1/p" | grep -vE "${remote_storage}" | wc -l)
        if [ "${localdisks}" -gt 0 ]; then
            echo "[+] ${vm} (${name}) uses ${localdisks} local disk(s), adding to state recovery list" | tee -a "$ha_manager_log"
            echo "vm:${vm}" >> "${tmpfile_ha_vms}"
        else
            echo "[-] ${vm} (${name}) uses 0 local disks, not adding to state recovery list" | tee -a "$ha_manager_log"
        fi
    done
}

# Exit if no VM in the error state uses local disks

function confirm_recovery_list {
    tmpfile_ha_vms_count=$(wc -l < "${tmpfile_ha_vms}")
    if [ "$tmpfile_ha_vms_count" -le 0 ]; then
        echo "There are no HA vms using local storage on the list, stopping recovery" | tee -a "$ha_manager_log"
        # Clean up here as well, because the script exits before the final rm
        rm -f "${tmpfile_ha_vms}"
        exit 0
    fi
}

# Function to recover VMs in the HA error state

function recover_vms {
    echo "Starting recovery of HA-manager VMs in error state" | tee -a $ha_manager_log
    for vm in $(cat ${tmpfile_ha_vms}) ; do
        echo "Recovering ${vm}" | tee -a $ha_manager_log
        $ha_manager_cmd set ${vm} --state disabled
        sleep 10
        $ha_manager_cmd set ${vm} --state started
    done
    echo "Proxmox HA error state recovery completed at $(date +%T)!" | tee -a $ha_manager_log
}

# Print date and vms ha state

date | tee -a $ha_manager_log
echo "$ha_manager_status" | tee -a $ha_manager_log

# Execute function

start_process
quorum_check
storage_local
confirm_recovery_list
recover_vms

# Clean up tmp file
rm ${tmpfile_ha_vms}

Note: Depending on your storage pool names, change the "remote_storage" variable so that it matches the pools that should be ignored.
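
To check a pattern before relying on it, a small sketch like the following can classify storage names the same way the script does (a name is local when it does not match `remote_storage`). The helper `is_local_storage` and the storage names in the loop are hypothetical and only serve as illustration.

```shell
#!/usr/bin/env bash
# Pattern from the script above: matching names are treated as remote storage.
remote_storage="^(acc|test|prod|dev)|management"

# is_local_storage NAME (hypothetical helper)
# Succeeds when NAME does NOT match the remote_storage pattern.
is_local_storage() {
    ! printf '%s\n' "$1" | grep -qE "$remote_storage"
}

# Hypothetical storage names, for illustration only.
for storage in local-lvm prod-ceph management-nfs devpool; do
    if is_local_storage "$storage"; then
        echo "$storage: local"
    else
        echo "$storage: remote"
    fi
done
```

Note that the first alternative is only a prefix match and `management` is not anchored at all, so a name such as `devpool` or `my-management` would also be treated as remote. Tightening the pattern may be worthwhile if your pool names overlap.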

Thank you!

aaron · Apr 27, 2021

pena said:
When the Proxmox server suffers a sudden shutdown, VMs that are managed by HA and belong to a group restricting them to this server (because they use its local storage) end up in the error state once the server is back up.

Is there a feature that can help recover VMs from the error state, or simply ensure that they are always up?
I understand this is usually done by the HA manager, but because our VMs are restricted to local storage, we were wondering whether there is something else that helps keep a VM always up.

Can those VMs only stay on one node in the cluster? Then why not remove them from the HA stack and enable the "Start on boot" option for the VMs? This way they will be started whenever the node boots.

Without being configured as HA resources, they will also not be migrated automatically to another node.
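
On the command line, this suggestion could look roughly like the sketch below. It uses `ha-manager remove` and `qm set --onboot 1`; the wrapper functions `run` and `pin_vm_to_node`, the `DRY_RUN` switch, and the VM IDs 100 and 101 (taken from the status output earlier in the thread) are assumptions for illustration. By default it only prints the commands, so nothing changes until `DRY_RUN=0` is set on a PVE node.

```shell
#!/usr/bin/env bash
# Sketch: take VMs out of HA management and enable "Start on boot".
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...  (hypothetical helper: print or execute a command)
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# pin_vm_to_node VMID...  (hypothetical helper)
pin_vm_to_node() {
    local vmid
    for vmid in "$@"; do
        run ha-manager remove "vm:${vmid}"   # remove the HA resource
        run qm set "${vmid}" --onboot 1      # start the VM when the node boots
    done
}

pin_vm_to_node 100 101
```

Reviewing the printed commands first and then rerunning with `DRY_RUN=0` keeps the change deliberate, since the VMs lose HA handling once they are removed.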

pena · Jun 30, 2021

Thank you! I have filed a bug report: https://bugzilla.proxmox.com/show_bug.cgi?id=3415
