How to monitor replication?

udo

Hi,
I use replication between two nodes and want to monitor it with Icinga.

On some VMs I sometimes see an error (import failed: exit code 29),
but "pvesh get cluster/replication" shows no error or hint.

How can I monitor all replication jobs (10 in my case)?

Log:
Code:
2018-03-06 19:30:06 102-0: start replication job
2018-03-06 19:30:06 102-0: guest => VM 102, running => 26513
2018-03-06 19:30:06 102-0: volumes => pve01pool:vm-102-disk-1
2018-03-06 19:30:06 102-0: create snapshot '__replicate_102-0_1520361006__' on pve01pool:vm-102-disk-1
2018-03-06 19:30:06 102-0: incremental sync 'pve01pool:vm-102-disk-1' (__replicate_102-0_1520360100__ => __replicate_102-0_1520361006__)
2018-03-06 19:30:08 102-0: delete previous replication snapshot '__replicate_102-0_1520361006__' on pve01pool:vm-102-disk-1
2018-03-06 19:30:08 102-0: end replication job with error: import failed: exit code 29
And is it normal that some jobs fail with such an error?

Udo
 
Hi,

That path only returns the configuration of the jobs.
You get the status with this path:
Code:
pvesh get /nodes/<nodename>/replication/<JOBID>/status
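To check several jobs at once, the path above can be wrapped in a small loop. This is only a minimal sketch: the node name `pve01` and the job IDs in the example call are placeholders, and the function names are made up for illustration.

```shell
#!/bin/bash
# Sketch: print the replication status of several jobs on one node.
# The node name and job IDs passed in are placeholders (assumptions).

# Build the API path quoted above for one node and one job ID.
replication_status_path() {
    local node="$1" jobid="$2"
    printf '/nodes/%s/replication/%s/status\n' "$node" "$jobid"
}

# Query every given job ID; needs pvesh, so run it on a Proxmox VE node.
show_replication_status() {
    local node="$1"; shift
    local jobid
    for jobid in "$@"; do
        echo "== $jobid =="
        pvesh get "$(replication_status_path "$node" "$jobid")"
    done
}

# Example call (hypothetical node name and job IDs):
# show_replication_status pve01 102-0 105-0
```

The path builder is kept separate so it can be checked without a cluster at hand.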
 
Hi Wolfgang,
thanks, that will help for monitoring.

BTW, do you know what I can do about exit code 29?
Code:
   {
      "duration" : 1.772466,
      "error" : "import failed: exit code 29",
      "fail_count" : 1,
      "guest" : "105",
      "id" : "105-0",
      "jobnum" : "0",
      "last_sync" : 1520521143,
      "last_try" : 1520521209,
      "next_sync" : 1520521509,
      "schedule" : "*/18",
      "target" : "pve02-xxxx",
      "type" : "local",
      "vmtype" : "qemu"
   }
Udo
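For monitoring, the "fail_count" field of a status reply like the one above is probably the most useful value. The following is a rough sketch that pulls it out with sed and maps it to a plugin return code. The function names and the thresholds (warning from 1, critical above 5) are assumptions, and the node name and job ID in the example are placeholders.

```shell
#!/bin/bash
# Sketch: map the "fail_count" field of a status reply to a Nagios/Icinga return code.
# Thresholds (warning from 1, critical above 5) are assumptions.

# Read the status JSON on stdin and print the numeric fail_count (0 if absent).
fail_count_from_status() {
    local n
    n=$(sed -n 's/.*"fail_count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1)
    echo "${n:-0}"
}

# Translate a fail count into a return code: 0 = OK, 1 = warning, 2 = critical.
fail_count_to_exitcode() {
    local n="$1"
    if [ "$n" -gt 5 ]; then
        echo 2
    elif [ "$n" -gt 0 ]; then
        echo 1
    else
        echo 0
    fi
}

# Example on a Proxmox VE node (node name and job ID are placeholders):
# count=$(pvesh get /nodes/pve01/replication/105-0/status | fail_count_from_status)
# exit "$(fail_count_to_exitcode "$count")"
```

Parsing with sed avoids an extra dependency; a real JSON parser would be more robust if one is installed.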
 
Hi Udo,

You can use pve-zsync, which is more reliable compared with replication. The main advantages:
- it is rock solid
- it will send a mail if the task is not successful
 
- it will send a mail if the task is not successful
I get email notifications for failed pvesr jobs, e.g.:
Code:
Replication Job: 100-0 failed

 command 'zfs snapshot rpool/data/vm-100-disk-1@__replicate_100-0_1515895201__' failed: got timeout
 
BTW, do you know what I can do about exit code 29?
I'm not sure and will look into it next week, but I think it is caused by a slow or overloaded target zpool that is not responding.
 
Here's a simple script I wrote for our Nagios that monitors the replication jobs for errors:

Code:
#!/bin/bash
# Script to check Proxmox storage replication
# Exit codes:
# 0 = OK (also when no replication jobs are configured)
# 1 = Warning
# 2 = Critical

RESULTS=($(/usr/bin/pvesr status | awk 'NR>1 {print $7}'))
EXITCODE=0

for i in "${RESULTS[@]}"
do
    if [ "$i" -gt 5 ]
    then
        EXITCODE=2
        break
    elif [ "$i" -gt 0 ]
    then
        # No break here: a later job may still be critical
        EXITCODE=1
    fi
done

if [ "${#RESULTS[@]}" -eq 0 ]
then
    EXITCODE=4
fi

if [ $EXITCODE -eq 2 ]
then
    echo "CRITICAL: Some replication jobs failed !"
    exit 2
elif [ $EXITCODE -eq 1 ]
then
    echo "WARNING: There are errors in some replication jobs"
    exit 1
elif [ $EXITCODE -eq 4 ]
then
    echo "OK: No replication jobs configured"
    exit 0
elif [ $EXITCODE -eq 0 ]
then
    echo "OK: All replication jobs working as intended"
    exit 0
fi
 