Hi !
I have been working on an issue of error recovery after sudden shutdown of the proxmox server, where VMs show HA on error state .
when the proxmox server suffers a sudden shutdown we are seeing those VMs that are using HA and a group that restrict those using local storage to the server to be set on error recovery state after the server state is up.
Is there feature that can help to recover those VMs on error state or just to ensure that those are always up ?
I understand this is usually done by HA manager but because we have VMs restricted to use local storage we were wondering if there is some else that helps to keeps a VM always up.
If not such feature exist is there a process to request it ?
Here under is a snapshot of the script we are planning to use on our environment to helps to recover those VMs after a sudden shutdown. I would appreciate any feedback form your side.
Note: Depending on your storage pools names the variable "remote_storage" should be changed to ignore those pools.
Thanks you!
I have been working on an issue of error recovery after sudden shutdown of the proxmox server, where VMs show HA on error state .
Bash:
# ha-manager status | grep pve-01
lrm pve-01 (active, Tue Apr 13 17:13:31 2021)
service vm:100 (pve-01, error)
service vm:101 (pve-01, error)
service vm:102 (pve-01, started)
when the proxmox server suffers a sudden shutdown we are seeing those VMs that are using HA and a group that restrict those using local storage to the server to be set on error recovery state after the server state is up.
Is there feature that can help to recover those VMs on error state or just to ensure that those are always up ?
I understand this is usually done by HA manager but because we have VMs restricted to use local storage we were wondering if there is some else that helps to keeps a VM always up.
If not such feature exist is there a process to request it ?
Here under is a snapshot of the script we are planning to use on our environment to helps to recover those VMs after a sudden shutdown. I would appreciate any feedback form your side.
Bash:
#!/usr/bin/env bash
# To script should be added to crontab as example "@reboot sleep 120 /root/scripts/vms-ha-recovery.sh"
# Variables
pve_host=$(hostname -s)
pve_uptime=$(uptime -p)
remote_storage="^(acc|test|prod|dev)|management"
tmpfile_ha_vms="/tmp/pve-ha-error-vm-disk.tmp"
ha_manager_cmd="/usr/sbin/ha-manager"
ha_manager_status=$($ha_manager_cmd status | grep "${pve_host}\|quorum")
ha_manager_quorum=$(echo "$ha_manager_status" | grep "quorum" | awk '{print $2}')
ha_manager_error=$(echo "$ha_manager_status" | grep started | grep vm: | awk '{print $2}' | cut -d ":" -f 2)
ha_manager_error_count=$(echo "$ha_manager_error" | grep -v "^$" | wc -l)
ha_manager_log="/var/log/pve-ha-vm-recovery.log"
# Function that prints all vms are OK and exit if condition matches
function start_process {
if [ $ha_manager_error_count -le 0 ]; then
echo "Proxmox $pve_host has been $pve_uptime and HA-manager VMs state is OK" | tee -a $ha_manager_log
exit 0
fi
}
# Function (OPTIONAL) quorum check, if ok continue
function quorum_check {
if [ $ha_manager_quorum != "OK" ]; then
echo "Proxmox $pve_host shows quorum is $ha_manager_quorum , stopping process" | tee -a $ha_manager_log
exit 1
fi
}
# Function to list only vms using local storage
function storage_local {
touch ${tmpfile_ha_vms}
echo "Checking for HA vms using local storage" | tee -a $ha_manager_log
for vm in $ha_manager_error ; do
vmcfg=$(qm config ${vm})
localdisks=$(echo -e "${vmcfg}" | sed -nE "s/^scsi[0-9]+:\s+([^:]+).*/\1/p" | grep -vE "${remote_storage}" | wc -l)
if [ ${localdisks} -gt 0 ]; then
name=$(echo -e "${vmcfg}" | sed -nE "s/name:\s+//p")
echo "[+] ${vm} (${name}) uses ${localdisks} local disk(s) adding to state recovery list" | tee -a $ha_manager_log
echo "vm:${vm}" >> ${tmpfile_ha_vms}
else
name=$(echo -e "${vmcfg}" | sed -nE "s/name:\s+//p")
echo "[-] ${vm} (${name}) uses 0 local disk, not adding to state recovery list" | tee -a $ha_manager_log
fi
done
}
# Function that if there are vms on error state using local disk to be recovered
function confirm_recovery_list {
tmpfile_ha_vms_count=$(cat ${tmpfile_ha_vms} | wc -l)
if [ $tmpfile_ha_vms_count -le 0 ]; then
echo "There are not HA vms using local storage on list, stopping recovery " | tee -a $ha_manager_log
exit 0
fi
}
# Funtion to recover vms with ha error state
function recover_vms {
echo "Starting recovery of HA-manager VMs in error state" | tee -a $ha_manager_log
for vm in $(cat ${tmpfile_ha_vms}) ; do
echo "Recovering ${vm}" | tee -a $ha_manager_log
$ha_manager_cmd set ${vm} --state disabled
sleep 10
$ha_manager_cmd set ${vm} --state started
done
echo "Proxmox HA error state recovery completed at $(date +%T)!" | tee -a $ha_manager_log
}
# Print date and vms ha state
date | tee -a $ha_manager_log
echo "$ha_manager_status" | tee -a $ha_manager_log
# Execute function
start_process
quorum_check
storage_local
confirm_recovery_list
recover_vms
# Clean up tmp file
rm ${tmpfile_ha_vms}
Note: Depending on your storage pools names the variable "remote_storage" should be changed to ignore those pools.
Thanks you!
Last edited: