vzdump-hook-script and clustering

Jun 17, 2023
Hi everybody, I've got a small PVE cluster in my home lab, and I started toying around with Proxmox Backup Server. I want a setup where, at the end of a backup job, the PBS machine is shut down and powered off, more or less as described here:

https://forum.proxmox.com/threads/a...er-backup-finishes-hook-script-option.105493/

This works fine (as far as I can tell) if I run it on one node (e.g. backing up several virtual machines on the first node, pve1). But when I back up several PVE nodes at once, the slowest backup is not able to finish, seemingly because another machine in the cluster decided to power down the PBS because its own backup job was done.

Can that interpretation be right, or should I look for a problem elsewhere?

If this interpretation of my problem is correct: Is there a way to check in the hook script whether other nodes in the cluster are still running their parts of the backup job? How does this actually work? Is the backup job executed on each node as a separate process, each calling its own local copy of the hook script?

Part of the log of the failing job:
Code:
INFO:  59% (18.9 GiB of 32.0 GiB) in 7m 30s, read: 76.0 MiB/s, write: 69.0 MiB/s
INFO:  59% (19.0 GiB of 32.0 GiB) in 7m 32s, read: 50.0 MiB/s, write: 36.0 MiB/s
ERROR: backup write data failed: command error: connection reset
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 107 failed - backup write data failed: command error: connection reset
INFO: Failed at 2024-01-30 14:37:07

Any hints on how best to shut down the PBS only after the job has finished on all cluster nodes? And is the invocation of the hook scripts logged somewhere in a more verbose way?

Thanks in advance,
Ralph
 
Hi,
If this interpretation of my problem is correct: Is there a way to check in the hook script whether other nodes in the cluster are still running their parts of the backup job? How does this actually work?
What comes to mind quickly is using e.g. proxmox-backup-client task list --repository USER@IP:8007:datastore to see if there are still any active tasks on the server.
Is the backup job executed on each node as a separate process, each calling its own local copy of the hook script?
Yes.
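
A minimal, untested sketch of what such a check could look like in the job-end phase of a hook script. The repository, host, and credentials are placeholders, and the assumption that still-running tasks show "endtime":null in the JSON output should be verified against the actual output of your PBS/client version:
Code:
#!/bin/bash
# Untested sketch of a job-end hook: wait until the PBS reports no running
# tasks, then power it off. All names and addresses below are placeholders.
REPO="root@pam@192.168.1.10:8007:datastore"
export PBS_PASSWORD="secret"  # password (or API token secret) for the user

if [ "$1" == "job-end" ]; then
    # Assumption: tasks that are still running show "endtime":null in the
    # JSON output -- check the actual output of your client version first.
    while proxmox-backup-client task list --repository "$REPO" \
            --output-format json | grep -q '"endtime":null'; do
        sleep 60
    done
    ssh root@192.168.1.10 poweroff
fi

exit 0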
 
I am wondering if this worked for you, @ausserirdischesindgesund? If it did, could you share the part of the hook script that powers down the PBS?

Here, I am in the exact same situation. However, proxmox-backup-client task list --repository USER@IP:8007:datastore returns Error: permission check failed. every time I try it. Even though it prompts me for the password and then the fingerprint, it keeps giving the same error. @fiona, would you have any suggestions for this? I have tried running it as different users (including root@pam with all permissions) and verified that the IP, port, and datastore are all correct. I tried running the command on both PVE and PBS; both return the same error. I am not using proxmox-backup-client directly, only the backup feature in PVE (unless proxmox-backup-client is what is used under the hood).
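
For reference, this is the shape of the invocation I have been trying (all values are placeholders; note that the userid in the repository string has to include the realm, e.g. root@pam, not just root):
Code:
# Placeholder values throughout; the userid must include the realm
export PBS_FINGERPRINT="<server fingerprint>"         # optional once the cert is trusted
export PBS_PASSWORD="<password or API token secret>"
proxmox-backup-client task list --repository 'root@pam@192.168.1.10:8007:datastore'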
 
It worked, but I later decided that I want to do it time-based, because I want to back up other clients anyway. I would have to check how far I got with the script, but in principle the strategy suggested by fiona worked very well in my testing.
 
Okay, I see. Something in my setup must be causing that error.

For anyone running into this in the future, I solved the issue (of running the script with a poweroff command only once, even with multiple nodes / backup jobs) with the following script. It's a hacky workaround that creates a lock file for each node (just two in my case, pve01 and pve02) and only runs the poweroff command once all lock files are gone again, indicating that the backup has finished on all nodes.

Code:
#!/bin/bash

HOSTNAME1=pve01
HOSTNAME2=pve02

if [ "$1" == "job-init" ]; then
    ssh root@IP "set -C; 2>/dev/null >/tmp/pve-backupjob-$HOSTNAME.lock"
fi


if [ "$1" == "job-end" ]; then
    ssh root@IP /bin/bash << EOF

        # This trap runs when the remote script exits for any reason,
        # including on failure
        trap "rm -f /tmp/pve-backupjob-$HOSTNAME.lock" EXIT

        # Remove this node's lock file now, before the check below -- the
        # EXIT trap only fires after the check, so an explicit rm is needed
        rm -f /tmp/pve-backupjob-$HOSTNAME.lock

        # Only power off once no node's lock file remains, i.e. the backup
        # job has finished on all nodes
        if [ ! -f "/tmp/pve-backupjob-$HOSTNAME1.lock" ] && [ ! -f "/tmp/pve-backupjob-$HOSTNAME2.lock" ]; then
            shutdown +5 < /dev/null &
        fi
EOF
fi

exit 0

Suggestions for improvements are of course welcome.
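
One further idea I have not tested: replace the two hardcoded hostname checks with a glob on the PBS, so adding a node to the cluster doesn't require touching the script:
Code:
        # Untested variant of the check above (runs on the PBS): power off
        # only if no node's lock file remains; ls fails when the glob is empty
        if ! ls /tmp/pve-backupjob-*.lock >/dev/null 2>&1; then
            shutdown +5 < /dev/null &
        fi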
 
