How to monitor replication?

udo

Hi,
I use replication between two nodes and want to monitor it with Icinga.

I sometimes see an error on some VMs (import failed: exit code 29),
but "pvesh get cluster/replication" shows no error or hint.

How can I monitor all replication jobs (10 in my case)?

Log:
Code:
2018-03-06 19:30:06 102-0: start replication job
2018-03-06 19:30:06 102-0: guest => VM 102, running => 26513
2018-03-06 19:30:06 102-0: volumes => pve01pool:vm-102-disk-1
2018-03-06 19:30:06 102-0: create snapshot '__replicate_102-0_1520361006__' on pve01pool:vm-102-disk-1
2018-03-06 19:30:06 102-0: incremental sync 'pve01pool:vm-102-disk-1' (__replicate_102-0_1520360100__ => __replicate_102-0_1520361006__)
2018-03-06 19:30:08 102-0: delete previous replication snapshot '__replicate_102-0_1520361006__' on pve01pool:vm-102-disk-1
2018-03-06 19:30:08 102-0: end replication job with error: import failed: exit code 29
And is it normal that some jobs fail with this error?

Udo
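
The failure line in the log above has a fixed shape, so failed runs can be pulled out of a replication log with a small filter. This is only a sketch: the function name `failed_runs` is made up for illustration, and it assumes the log lines look exactly like the excerpt posted here.

```shell
#!/bin/bash
# Sketch: list failed runs found in a replication log read from stdin.
# Assumes lines of the form
#   <date> <time> <jobid>: end replication job with error: <message>
# as in the log excerpt above. Prints "<date> <time> <jobid> <message>".
failed_runs() {
    sed -n 's/^\([0-9-]* [0-9:]*\) \([0-9]*-[0-9]*\): end replication job with error: \(.*\)$/\1 \2 \3/p'
}
```

Feeding it the log excerpt from this post prints a single line for job 102-0 with the "import failed: exit code 29" message; successful runs produce no output.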
 
Hi,

that path only returns the configuration of the jobs.
The status is available under this path:
Code:
pvesh get /nodes/<nodename>/replication/<JOBID>/status
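
To cover several jobs, the status path can be queried once per job ID. The following is a rough sketch, not an official tool: the function names `status_path` and `print_job_status` are invented here, and the node name and job IDs have to be supplied by the caller. Depending on the PVE version, pvesh may need an extra option to print JSON instead of formatted text.

```shell
#!/bin/bash
# Sketch: print the status of several replication jobs on one node.
# status_path and print_job_status are illustrative names, not part of PVE.

# Build the API path for one node and one job ID.
status_path() {
    local node="$1" job="$2"
    printf '/nodes/%s/replication/%s/status\n' "$node" "$job"
}

# First argument: node name. Remaining arguments: job IDs such as 102-0.
print_job_status() {
    local node="$1"
    shift
    local job
    for job in "$@"; do
        echo "== $job =="
        pvesh get "$(status_path "$node" "$job")"
    done
}
```

Example call, with placeholder names: print_job_status pve01 102-0 105-0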
 
Hi Wolfgang,
thanks, that will help with monitoring.

BTW, do you know what I can do about exit code 29?
Code:
   {
      "duration" : 1.772466,
      "error" : "import failed: exit code 29",
      "fail_count" : 1,
      "guest" : "105",
      "id" : "105-0",
      "jobnum" : "0",
      "last_sync" : 1520521143,
      "last_try" : 1520521209,
      "next_sync" : 1520521509,
      "schedule" : "*/18",
      "target" : "pve02-xxxx",
      "type" : "local",
      "vmtype" : "qemu"
   }
Udo
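
For an Icinga or Nagios check, the interesting fields in a status object like the one above are "fail_count" and "error". Here is a minimal sketch that reads one such object on stdin and prints a check result. The helper names `json_field` and `check_status` are made up, the parsing with sed relies on the pretty-printed one-key-per-line layout shown here (a real JSON tool such as jq would be more robust), and treating any fail_count above 0 as a warning is an arbitrary choice.

```shell
#!/bin/bash
# Sketch: turn one pretty-printed status object into a check result.
# json_field and check_status are illustrative names.

# Print the value of a top-level key. Works only for one key per line and
# for values that contain no comma or double quote.
json_field() {
    local key="$1"
    sed -n 's/^[[:space:]]*"'"$key"'"[[:space:]]*:[[:space:]]*"\{0,1\}\([^",]*\)"\{0,1\},\{0,1\}[[:space:]]*$/\1/p'
}

# Read the status object from stdin; return 1 (warning) if fail_count > 0.
check_status() {
    local json id fails err
    json="$(cat)"
    id="$(printf '%s\n' "$json" | json_field id)"
    fails="$(printf '%s\n' "$json" | json_field fail_count)"
    err="$(printf '%s\n' "$json" | json_field error)"
    if [ "${fails:-0}" -gt 0 ]; then
        echo "WARNING: job $id fail_count=$fails error=$err"
        return 1
    fi
    echo "OK: job $id"
    return 0
}
```

It would be used by piping the output of the pvesh status call for one job into check_status.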
 
Hi Udo,

You can use pve-zsync, which is more reliable compared with replication. The main advantages:
- is rock solid
- it will send a mail if the task is not successful
 
- it will send a mail if the task is not successful
I also get email notifications for failed pvesr jobs, e.g.:
Code:
Replication Job: 100-0 failed

 command 'zfs snapshot rpool/data/vm-100-disk-1@__replicate_100-0_1515895201__' failed: got timeout
 
BTW, do you know what I can do about exit code 29?
I'm not sure and will look into it next week, but I think the cause is a slow or overloaded target zpool that is not responding.
 
Here's a simple script I wrote for our Nagios that monitors the replication jobs for errors:

Code:
#!/bin/bash
# Nagios/Icinga check for Proxmox storage replication
# Exit codes:
# 0 = OK (also used when no replication jobs are configured)
# 1 = Warning (highest fail count is between 1 and 5)
# 2 = Critical (at least one job has a fail count above 5)

# Field 7 of "pvesr status" is taken to be the fail count; NR>1 skips the header.
RESULTS=($(/usr/bin/pvesr status | awk 'NR>1 {print $7}'))
EXITCODE=0

if [ "${#RESULTS[@]}" -eq 0 ]
then
    echo "OK: No replication jobs configured"
    exit 0
fi

for i in "${RESULTS[@]}"
do
    if [ "$i" -gt 5 ]
    then
        EXITCODE=2
        break
    elif [ "$i" -gt 0 ]
    then
        # No break here: a later job may still be critical
        EXITCODE=1
    fi
done

if [ "$EXITCODE" -eq 2 ]
then
    echo "CRITICAL: Some replication jobs failed!"
    exit 2
elif [ "$EXITCODE" -eq 1 ]
then
    echo "WARNING: There are errors with some replication jobs"
    exit 1
else
    echo "OK: All replication jobs working as intended"
    exit 0
fi
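
If the thresholds should be adjustable, the classification can be split off into a function that takes the fail counts as arguments, which also makes it testable without a cluster. This is just a sketch under my own assumptions: `worst_state` is a made-up name, and it reports the worst state over all jobs rather than stopping at the first failing one.

```shell
#!/bin/bash
# Sketch: classify a list of fail counts with adjustable thresholds.
# worst_state is an illustrative name.
# Arguments: <warn> <crit> <fail counts...>
# Prints 0 (OK), 1 (WARNING) or 2 (CRITICAL).
worst_state() {
    local warn="$1" crit="$2"
    shift 2
    local state=0 n
    for n in "$@"; do
        if [ "$n" -gt "$crit" ]; then
            state=2
        elif [ "$n" -gt "$warn" ] && [ "$state" -lt 1 ]; then
            state=1
        fi
    done
    echo "$state"
}
```

With the thresholds from the script above it could be called like this: exit "$(worst_state 0 5 $(/usr/bin/pvesr status | awk 'NR>1 {print $7}'))"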
 