Simple way to monitor backup jobs using healthchecks.io

May 4, 2021
Hello Proxmox community.

I have Proxmox servers (7.4-3) where I've set up three different backup jobs for a 3-2-1 backup setup. I've set them to send email on failure only, which I thought would be just fine.

One of those servers failed (it froze completely and the only way out was to cold boot the machine), and after the reboot Proxmox would not boot up (stuck on cleaning up ZFS or something like that). I live-booted a Debian OS and tried to retrieve at least the config files of the VMs, but nothing was there at all. So I told myself it would be fine, I have backups.

That's where I nearly had a stroke. Backups had been missing for almost a year! I don't think the backups simply didn't happen; I think they were failing, but the emails were never sent out. (This is happening on some of my Proxmox machines: some of them will send email, some of them won't. It's always the same setup, so I don't know why.) I can't confirm it, since a new PVE OS was already installed. I was very, very lucky: I hadn't touched the VMs' configuration in over a year, so I could use the year-old config files to restore the VM IDs etc., and my ZFS pool was fine, so I imported the pool back and re-scanned the storage to import the disks. The day was saved, no data lost at all. Again, I was very lucky.

However, that left me seriously concerned about knowing whether a backup happened and whether it was successful, using a third-party solution as well and not just relying on emails, which in my scenario proved to fail. I use healthchecks.io to monitor almost everything in my IT business.

With healthchecks.io you can signal start, end and also failure, if you wish, with simple wget or curl commands. Examples below:

Code:
wget https://mydomain.com/ping/d353a7c6-d8c9d1f-ae963d23d566/start   # signals the start of the check
wget https://mydomain.com/ping/d353a7c6-d8c9d1f-ae963d23d566         # signals the successful end of the check
wget https://mydomain.com/ping/d353a7c6-d8c9d1f-ae963d23d566/5       # signals a failure with exit code 5
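
In practice it can be worth calling the ping with a timeout and retries, so that a transient network hiccup isn't recorded as a missed backup. A minimal sketch using curl (same placeholder URL as above):

Code:
# -f fails on HTTP errors, -m sets a 10 s timeout, --retry retries transient failures
curl -fsS -m 10 --retry 5 -o /dev/null https://mydomain.com/ping/d353a7c6-d8c9d1f-ae963d23d566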


My question is:
How can I implement this for my configured backup jobs (these were set up using the web GUI only)?
From my research I understand there is a hook script option available for manually created backup jobs, which, if I understand correctly, then have to be put into crontab in order to be executed. One possible solution would be to write a shell script that runs vzdump, but that defeats the purpose of the nice, simple web GUI configuration. If I'm right, I would also have to run prune jobs manually. If this is the only way of doing it, could the Proxmox staff please consider adding healthchecks.io support in a future release? There is the checkmk company that seems to be able to monitor backups, but I've read here on the forum that it doesn't monitor backups well.

https://healthchecks.io

Any help from staff or someone else would be greatly appreciated,
thank you.
Ladislav
 
Hi,
please check that the email address of your root@pam user is correct in Datacenter > (Permissions >) Users > root@pam > Edit. You can test if it works with e.g.
Code:
root@pve701 ~ # sendmail root                        
Subject: test
text
.
Make sure to press enter after sendmail root rather than paste everything at once.

Regarding the hook script: you just need to set the script for an existing job, and it will be invoked whenever the job is executed as scheduled. Use cat /etc/pve/jobs.cfg to get the ID of your job and then: pvesh set /cluster/backup/backup-<ID> --script /path/to/script. See /usr/share/doc/pve-manager/examples/vzdump-hook-script.pl for an example script. Make sure your script is executable (with chmod +x /path/to/script).
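
For example, the full sequence could look like this (a sketch; the job ID and the script path are placeholders, not values from your system):

Code:
# find the ID of the backup job (entries look like "vzdump: backup-<ID>")
cat /etc/pve/jobs.cfg

# copy the example hook script, make it executable and attach it to the job
cp /usr/share/doc/pve-manager/examples/vzdump-hook-script.pl /root/vzdump-hook.pl
chmod +x /root/vzdump-hook.pl
pvesh set /cluster/backup/backup-<ID> --script /root/vzdump-hook.pl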
 
Hi Fiona,
thank you for coming back to me. I've tried to send an email out and nothing arrived, but it works on other Proxmox machines just fine. I'm not overly worried about the email right now; I would rather implement your hook script idea first. This is what I did, however no joy.

# creating directory to store my scripts
mkdir /etc/lss

# I've copied the example .pl file
cp /usr/share/doc/pve-manager/examples/vzdump-hook-script.pl /etc/lss/monitoring.pl

# I've made it executable
chmod +x /etc/lss/monitoring.pl

# I've edited the file and added a line to this section (the wget line below is what I added):
nano /etc/lss/monitoring.pl

# example: wake up remote storage node and enable storage
if ($phase eq 'job-init') {
    #system("wakeonlan AA:BB:CC:DD:EE:FF");
    #sleep(30);
    #system ("/sbin/pvesm set $storeid --disable 0") == 0 ||
    #    die "enabling storage $storeid failed";
    system "wget https://mydomain/ping/myuuid/start";
}

# do what you want


and into this section I've added this line:

# example: copy resulting backup file to another host using scp
if ($phase eq 'backup-end') {
    #system ("scp $target backup-host:/backup-dir") == 0 ||
    #    die "copy tar file to backup-host failed";
    system "wget https://mydomain/ping/myuuid";
}

# After this edit I've added script to the config:
pvesh set /cluster/backup/backup-3604134d-4686 --script /etc/lss/monitoring.pl

# here is the output of the backup job's config; you can see that the script was added

vzdump: backup-3604134d-4686
        schedule 21:00
        compress zstd
        enabled 1
        mailnotification failure
        mailto ladia@lssolutions.ie
        mode snapshot
        notes-template {{guestname}}
        repeat-missed 0
        script /etc/lss/monitoring.pl
        storage Local-HDD
        vmid 10

I go to my Proxmox web GUI and start the backup job manually to see if it works, however nothing gets signalled in my healthchecks.io dashboard at all.

I would greatly appreciate your help with this one.

EDIT: I've tried running wget https://mydomain/ping/myuuid in the Proxmox shell to confirm that Proxmox can reach my healthchecks dashboard, and I can confirm that yes, it does reach my dashboard.

Thank you
Ladislav
 
Hi Fiona,

I've managed to get it working. I didn't enclose the command in parentheses; it should have been like this:

system ("wget https://mydomain/ping/myuuid/start");

I now have a proper way to monitor backups. Thank you for your help.
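
For reference, after applying the same fix to both lines, the edited sections of the hook script presumably look like this (same placeholder URL as before):

Code:
if ($phase eq 'job-init') {
    system ("wget https://mydomain/ping/myuuid/start");   # signal the start of the check
}

if ($phase eq 'backup-end') {
    system ("wget https://mydomain/ping/myuuid");         # signal the successful end of the check
}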

PS: Regarding the emails not being sent, the issue is with Microsoft's servers; they are blocking my IP address. I will contact Microsoft instead.
 
Hi

I followed along from LS Solutions and got a similar ping working. I was wondering if I could take it further and obtain the healthchecks.io UUID from the container/VM config so that each item could be tracked. At the moment, my script fires off about 4 or 5 times depending on the number of machines being backed up. So if machine 2 failed, the overall Proxmox healthcheck would not show that state, because machine 3 would come along and set the ping back to success.

Any ideas how I might pass a UUID to the monitoring script? I could put it in the VM/LXC definition if needed.
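
One way this could be done (a sketch; the mapping file path, the UUIDs and the base URL are made up) is to keep a vmid-to-UUID mapping on the node and have the hook script look it up for the per-guest phases:

Code:
#!/bin/bash
# vzdump passes: $1 = phase, $2 = mode, $3 = vmid (for the per-guest phases)
PHASE="$1"
VMID="$3"

# hypothetical mapping file with lines like "101 d353a7c6-d8c9d1f-ae963d23d566"
UUID=$(awk -v id="$VMID" '$1 == id {print $2}' /etc/pve/hc-uuids.conf)
[ -z "$UUID" ] && exit 0   # no mapping for this guest, nothing to ping

case "$PHASE" in
    backup-start) curl -fsS -m 10 --retry 5 -o /dev/null "https://mydomain/ping/$UUID/start" ;;
    backup-end)   curl -fsS -m 10 --retry 5 -o /dev/null "https://mydomain/ping/$UUID" ;;
    backup-abort) curl -fsS -m 10 --retry 5 -o /dev/null "https://mydomain/ping/$UUID/fail" ;;
esac

exit 0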
 
Hello Ladislav,
Note that /sbin/pvesm set $storeid --disable 0 always returns exit code 0, even if the backup server is down, as you are free to enable/disable a datastore even if it is not reachable (at least on pve-manager/8.0.4).

I use the following to determine if the backup server is up:
status=$(/usr/sbin/pvesm status 2> /dev/null |grep ${PVE_BUDS} | awk '{print $3}')
where PVE_BUDS is the 'Name' of the datastore from the PVE point of view. To list the PBS datastores, check the output of pvesm status | egrep '(Status|pbs)' on the PVE.
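
In a hook script that check could be used like this (a sketch; the datastore name and the ping URL are placeholders):

Code:
PVE_BUDS="my-pbs-datastore"   # the 'Name' column from pvesm status
status=$(/usr/sbin/pvesm status 2> /dev/null | grep "${PVE_BUDS}" | awk '{print $3}')

if [ "$status" != "active" ]; then
    # the backup server is not reachable, so signal a failure instead of a success
    curl -fsS -m 10 --retry 5 -o /dev/null "https://mydomain/ping/myuuid/fail"
    exit 1
fi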
 
I've successfully used this script (https://gist.github.com/djarbz/28b24e6fc792bab47be5fe42486afa25) with a few adjustments on my setup.

Here's my version of the script adapted for public healthchecks.io and a PVE 8 cluster setup: https://gist.github.com/waza-ari/8fb8375ec5770a50486abeb2a7bb9c52

To use the script:
  • Place the script on each node as
    Code:
    /usr/local/bin/vzdump-hook-script.sh
  • Adjust the four variables in lines 33-36: lines 33 and 34 can stay as they are for public healthchecks.io, and the values for lines 35 and 36 you can generate in your project settings
  • Modify the job config located at
    Code:
    /etc/pve/jobs.cfg
    and add one more line to call the script:
Code:
vzdump: backup
        # schedule ...
        # ...
        script /usr/local/bin/vzdump-hook-script.sh
 
Hi

If I read your code correctly, are you creating a new check for each machine?

I can't quite see what information it sends; could you maybe post a log from it running? I really like this approach.
 
Hi,

yes, that's correct, it's creating one check per VM and, in addition, one check per physical node. I'm not sure why the per-node check is needed myself, though. Also, I haven't written the script myself; I just spent quite some time making it work in my cluster environment and on public hc.io endpoints, so I figured I'd save others from the same headache. Credits go to the original script author!

This is how it would look on hc.io (two nodes and two VMs only, domain name and UUIDs removed). The name syntax for nodes would be nodename.clustername.domain; for VMs it would be id.qemu.nodename.clustername.domain.

What logs would you be interested in?

Screenshot 2024-01-08 at 10.07.11.png
 
No need now. I managed to implement it and see it working. The only thing I had to work out was the ping side. By default I’d assumed the domain for ping was all that was needed, but for my instance I had to add /ping.

I also had to ensure I was running v3 of Healthchecks, for the v3 API.
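
In other words, for my self-hosted instance the ping base URL has to include the /ping path (the variable name and hostname here are just illustrative):

Code:
# public healthchecks.io
HC_PING_BASE="https://hc-ping.com"
# self-hosted instance: the /ping path is part of the base URL
HC_PING_BASE="https://healthchecks.example.com/ping"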
 
Are you using a self-hosted Healthchecks instance? Then the initial version could fit better; I adjusted mine for the public Healthchecks endpoint.
 
Yes, mine is self-hosted. OK, I'll investigate. But I do have it running now with only one tweak needed on my side.
 
Me again. Any ideas how to obtain the timeout/schedule of the VM backup? The script seems to default to creating an HC check with a 1-day period, and even if it's modified manually it seems to get reset.
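
One thing that might work (a sketch against the Healthchecks Management API; I'm assuming the script creates its checks through that API, and the API key, check name and timeout values below are made up) is to create or update each check with an explicit timeout and grace so it matches the backup schedule:

Code:
# create-or-update a check with a 7-day period and 6-hour grace (values in seconds)
curl -fsS https://healthchecks.example.com/api/v3/checks/ \
    --header "X-Api-Key: YOUR_PROJECT_API_KEY" \
    --data '{"name": "100.qemu.node1.cluster.example.com", "timeout": 604800, "grace": 21600, "unique": ["name"]}'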