Hey everyone, I wanted to share a solution I've been running for a few weeks that solved a persistent headache with Veeam worker VMs on Proxmox.
My setup is Veeam Backup & Replication 13 (build 13.01.1071) running against Proxmox VE 9.1.4 with daily scheduled backup jobs. Veeam deploys worker VMs on the Proxmox host to handle backup processing and shuts them down when each job finishes. The problem is they don't always come back up cleanly for the next job: sometimes a stale lock file gets left behind, or a start task gets stuck in pvedaemon, and the workers just sit offline until someone manually intervenes.
Note: I have deployed two Veeam Proxmox workers, as I like to have a little redundancy.
The error you'll see in the Proxmox task log is either:
Error: timeout waiting on systemd
or
can't lock file '/var/lock/qemu-server/lock-104.conf'
What's actually happening is Proxmox starts the QEMU process fine, but then waits for the guest agent inside the VM to respond and times out. The stuck task holds a lock, every subsequent start attempt fails for the same reason, and without manual intervention the workers stay offline indefinitely.
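The stale-lock half of the problem comes down to one question: is the lock file still held open by a live process, or is it leftover debris? The watchdog answers that with fuser, and you can see the check in isolation below. This is a sketch against a temp file; on the host you would point it at the real `/var/lock/qemu-server/lock-<vmid>.conf` path instead.

```shell
# Stand-in for /var/lock/qemu-server/lock-<vmid>.conf; substitute the
# real lock path of the stuck VM on an actual Proxmox host.
lock=$(mktemp)
if fuser "$lock" >/dev/null 2>&1; then
    state="held"    # a live process still has the file open; don't delete it
else
    state="stale"   # nothing holds it; removing the lock is safe
fi
echo "lock is $state"
rm -f "$lock"
```
A freshly created temp file that nothing has open will always report stale, which is exactly the case where the watchdog deletes the lock.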
My fix was a simple bash watchdog script running as a cron job every hour. It checks if the worker VMs are running, clears any stale locks, kills stuck start tasks, and brings the VMs back up automatically. I've been running it for a couple of weeks now and haven't had to SSH in manually once. Sharing it here in case it saves another admin the same frustration.
The Solution
A watchdog script that runs every hour via cron. It checks whether each worker VM is running, handles stale locks, kills zombie qmstart tasks, and starts the VM if needed. It also uses a startup timeout so it can never hang indefinitely like the default qm start behavior.
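The "never hang indefinitely" part leans on coreutils `timeout`, which kills the child process and returns exit code 124 when the deadline passes, so the script can tell "took too long" apart from an ordinary failure. A quick sketch, with `sleep` standing in for a hung `qm start`:

```shell
rc=0
timeout 1 sleep 5 || rc=$?   # `sleep 5` stands in for a hung `qm start`
echo "exit code: $rc"        # 124 means the deadline was hit, not a normal failure
```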
Setup
Step 1: Create the script
Code:
nano /usr/local/bin/vm-watchdog.sh
Paste the following, then update the start_vm lines at the bottom to match your VM IDs:
Code:
#!/bin/bash
RETRY_FILE="/tmp/vm-watchdog-retries"
LOG_PREFIX="vm-watchdog"
QM_START_TIMEOUT=60 # seconds before we consider qm start hung

log() {
    echo "$(date): $1"
    logger -t "$LOG_PREFIX" "$1"
}

kill_stuck_qmstart() {
    local vm_id=$1
    local stuck_pids
    stuck_pids=$(ps aux | grep "task UPID.*qmstart:${vm_id}:" | grep -v grep | awk '{print $2}')
    if [ -n "$stuck_pids" ]; then
        log "VM $vm_id - Found stuck qmstart task(s): $stuck_pids — killing"
        kill -9 $stuck_pids 2>/dev/null
        sleep 2
        return 0
    fi
    return 1
}

is_qemu_running() {
    local vm_id=$1
    pgrep -f "kvm.*-id ${vm_id} " > /dev/null 2>&1
}

start_vm() {
    local vm_id=$1
    local lock_file="/var/lock/qemu-server/lock-${vm_id}.conf"
    local retry_count=0

    # --- Step 1: Check if QEMU is actually running at the process level ---
    if is_qemu_running "$vm_id"; then
        local qm_status
        qm_status=$(/usr/sbin/qm status "$vm_id" 2>/dev/null)
        if echo "$qm_status" | grep -q "running"; then
            sed -i "/^vm${vm_id}=/d" "$RETRY_FILE" 2>/dev/null
            return 0
        else
            log "VM $vm_id - QEMU process exists but Proxmox shows '$qm_status' — checking for stuck tasks"
            kill_stuck_qmstart "$vm_id"
            sleep 5
            if /usr/sbin/qm status "$vm_id" 2>/dev/null | grep -q "running"; then
                log "VM $vm_id - Proxmox now shows running after cleanup"
                sed -i "/^vm${vm_id}=/d" "$RETRY_FILE" 2>/dev/null
                return 0
            fi
        fi
    fi

    # --- Step 2: Check if VM is stopped per Proxmox ---
    if ! /usr/sbin/qm status "$vm_id" 2>/dev/null | grep -q "stopped"; then
        sed -i "/^vm${vm_id}=/d" "$RETRY_FILE" 2>/dev/null
        return 0
    fi

    # --- Step 3: Kill any stuck qmstart tasks before doing anything else ---
    if kill_stuck_qmstart "$vm_id"; then
        log "VM $vm_id - Killed stuck qmstart task(s), clearing state"
        rm -f "$lock_file"
        sed -i "/^vm${vm_id}=/d" "$RETRY_FILE" 2>/dev/null
        sleep 3
    fi

    # --- Step 4: Handle lock file ---
    if [ -f "$lock_file" ]; then
        if fuser "$lock_file" > /dev/null 2>&1; then
            retry_count=$(grep "^vm${vm_id}=" "$RETRY_FILE" 2>/dev/null | cut -d'=' -f2)
            retry_count=${retry_count:-0}
            retry_count=$((retry_count + 1))
            sed -i "/^vm${vm_id}=/d" "$RETRY_FILE" 2>/dev/null
            echo "vm${vm_id}=${retry_count}" >> "$RETRY_FILE"
            log "VM $vm_id - Lock held by active process (attempt $retry_count of 2)"
            if [ "$retry_count" -ge 2 ]; then
                log "VM $vm_id - 2 failed attempts, forcing lock removal"
                rm -f "$lock_file"
                sed -i "/^vm${vm_id}=/d" "$RETRY_FILE" 2>/dev/null
            else
                return 1
            fi
        else
            log "VM $vm_id - Removing stale lock"
            rm -f "$lock_file"
            sed -i "/^vm${vm_id}=/d" "$RETRY_FILE" 2>/dev/null
        fi
    fi

    # --- Step 5: Start the VM with a timeout so we never hang ---
    log "VM $vm_id - Starting"
    local start_output
    start_output=$(timeout "$QM_START_TIMEOUT" /usr/sbin/qm start "$vm_id" 2>&1)
    local exit_code=$?
    if [ $exit_code -eq 124 ]; then
        log "VM $vm_id - qm start timed out after ${QM_START_TIMEOUT}s — will retry next cycle"
        kill_stuck_qmstart "$vm_id"
        return 1
    elif echo "$start_output" | grep -qi "already running"; then
        log "VM $vm_id - Already running (caught by qm start output)"
        sed -i "/^vm${vm_id}=/d" "$RETRY_FILE" 2>/dev/null
        return 0
    elif [ $exit_code -ne 0 ]; then
        log "VM $vm_id - Start failed (exit $exit_code): $start_output"
        return 1
    else
        log "VM $vm_id - Started successfully"
        sed -i "/^vm${vm_id}=/d" "$RETRY_FILE" 2>/dev/null
        return 0
    fi
}

# Create retry file if it doesn't exist
touch "$RETRY_FILE"

# Use a lock so only one instance of this script runs at a time
exec 9>/tmp/vm-watchdog.lock
if ! flock -n 9; then
    log "Another instance of vm-watchdog is already running — exiting"
    exit 1
fi

# --- Update these VM IDs to match your Veeam worker VMs ---
start_vm 100
start_vm 101
Step 2: Set permissions
Code:
chmod +x /usr/local/bin/vm-watchdog.sh
Step 3: Verify syntax
Code:
bash -n /usr/local/bin/vm-watchdog.sh && echo "Syntax OK"
Step 4: Test run
Code:
bash /usr/local/bin/vm-watchdog.sh
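On a healthy host the run is mostly silent, so if you want to convince yourself the retry bookkeeping from Step 4 works (the counters in /tmp/vm-watchdog-retries), you can exercise the same grep/sed pattern against a scratch file without touching any VM. The VM ID 104 here is just a stand-in:

```shell
RETRY_FILE=$(mktemp)   # scratch file; the script uses /tmp/vm-watchdog-retries
vm_id=104              # hypothetical worker VM ID for the demo

# First failed attempt: read the counter, bump it, rewrite the line.
retry_count=$(grep "^vm${vm_id}=" "$RETRY_FILE" 2>/dev/null | cut -d'=' -f2)
retry_count=$(( ${retry_count:-0} + 1 ))
sed -i "/^vm${vm_id}=/d" "$RETRY_FILE"
echo "vm${vm_id}=${retry_count}" >> "$RETRY_FILE"
after_one=$(cat "$RETRY_FILE")
echo "after one failure: $after_one"

# A successful start clears the entry again.
sed -i "/^vm${vm_id}=/d" "$RETRY_FILE"
entries=$(wc -l < "$RETRY_FILE")
echo "after success: $entries entries"
rm -f "$RETRY_FILE"
```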
Step 5: Set up the cron job
Add the following line to root's crontab (crontab -e). In my case it runs every hour; you can make the interval longer or shorter, but hourly gives each backup job enough time to finish between checks.
Code:
# Hourly watchdog - checks and restarts VMs if stopped, cleans stale locks
0 * * * * /usr/local/bin/vm-watchdog.sh
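If you prefer to script the install rather than edit the crontab by hand, the usual pattern is append-if-missing so repeated runs don't duplicate the entry. This is a hedged sketch: the comment shows the real-host one-liner, while the runnable part demonstrates the guard against a scratch file.

```shell
# On the actual host you would typically do:
#   ( crontab -l 2>/dev/null; echo '0 * * * * /usr/local/bin/vm-watchdog.sh' ) | crontab -
# Below, a scratch file stands in for the crontab to show the guard is idempotent.
CRON_LINE='0 * * * * /usr/local/bin/vm-watchdog.sh'
CRON_FILE=$(mktemp)
grep -qF "$CRON_LINE" "$CRON_FILE" || echo "$CRON_LINE" >> "$CRON_FILE"
grep -qF "$CRON_LINE" "$CRON_FILE" || echo "$CRON_LINE" >> "$CRON_FILE"  # second run adds nothing
count=$(grep -cF "$CRON_LINE" "$CRON_FILE")
echo "entries after two runs: $count"
rm -f "$CRON_FILE"
```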
Step 6: Monitor the logs
Code:
journalctl -t vm-watchdog --since "today"
What the script handles
- Stale lock files — detects and removes them automatically
- Zombie qmstart tasks — finds and kills stuck pvedaemon workers that hold locks
- QEMU/Proxmox state mismatch — detects when QEMU is running but Proxmox doesn't know about it
- Hung qm start — enforces a 60 second timeout so the script never blocks indefinitely
- Overlapping runs — uses flock to prevent multiple instances running simultaneously
- Logging — writes to both stdout and syslog for easy monitoring
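The flock guard in the last bullet is worth seeing on its own: flock locks conflict between separate opens of the same file, so while one copy of the script holds the lock on fd 9, any second copy's non-blocking attempt is refused and it exits immediately. A self-contained demonstration on a temp file:

```shell
lock=$(mktemp)        # stand-in for /tmp/vm-watchdog.lock
exec 9>"$lock"
first="no"
flock -n 9 && first="yes"
echo "first instance acquired lock: $first"

# A second open of the same file gets its own lock slot and is refused
# while fd 9 still holds the exclusive lock:
second=$(exec 8>"$lock"; flock -n 8 && echo "acquired" || echo "refused")
echo "second instance: $second"

exec 9>&-             # closing the fd releases the lock
rm -f "$lock"
```
This is why overlapping cron runs can't race each other: the second run never gets past the flock check.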
Hope this saves someone else the headache.