Replication job error

Maksimus

We get a lot of replication error messages, mostly at night outside working hours. Below is one of the hundreds of errors that arrive by email.

Please tell me what to do about this; we are afraid that at some point we will end up with broken copies on the server that receives the replication.


Replication job 132-1 with target 'Host807' and schedule '*/5' failed!

Last successful sync: 2024-04-17 05:06:29

Next sync try: 2024-04-17 05:32:39

Failure count: 2


Error:

command 'zfs snapshot disk2/vm-132-disk-0@__replicate_132-1_1713320590__' failed: got timeout
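
For context, the same kind of snapshot can be taken by hand to see whether it is slow on the pool while the jobs run (a sketch using the pool/dataset from the error above; the snapshot name is just an example):

pvesr status                                        # current state and last sync of all replication jobs
zpool iostat -v disk2 5                             # watch pool load in 5-second intervals
time zfs snapshot disk2/vm-132-disk-0@manual-test   # see how long a snapshot really takes
zfs destroy disk2/vm-132-disk-0@manual-test         # remove the test snapshot again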
 
I see this is an ongoing problem for you, as per your older post.

Do you have a dedicated migration network separate from the cluster network, as the docs advise?

Assuming it's not a network issue (as your older post suggests), it's probably load on the ZFS pool at that time.

How much replication or other activity is going on at the time of the error?

You may want to somehow ensure that two or more replication jobs are not running at the same time.
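
For example, the jobs could be spread across the 5-minute window instead of all firing on the same raster. A sketch only, assuming the standard pvesr CLI; the job IDs are examples from this thread, and the range/step schedule syntax should be checked against your PVE version:

pvesr list                                # see the existing job IDs and schedules
pvesr update 132-1 --schedule '*/5'       # node A keeps the default 0,5,10,... raster
pvesr update 115-0 --schedule '1..59/5'   # node B runs at minute 1,6,11,...
pvesr update 703-0 --schedule '2..59/5'   # node C runs at minute 2,7,12,...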

You may also want to look into the replication target timeout. I don't know much about it, but there was once a discussion on it.
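
If you want to see where those timeouts currently sit, a rough sketch (I'm not sure this is the right knob, and the file paths are from a standard install and may differ between versions):

# example paths; the exact modules differ between PVE versions
grep -rn -i 'timeout' /usr/share/perl5/PVE/Replication.pm /usr/share/perl5/PVE/Storage/ZFSPoolPlugin.pm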
 
Yes, the problem happens very often. We are already starting to worry that at some bad moment we will end up with a broken copy on the server receiving the replication.

Yes, there is a second dedicated 10 Gb/s network.

We replicate 4 nodes with 2 disks each (3+9+7+15 VMs) to 1 node with 2 disks, once every 5 minutes. There are enterprise SSDs everywhere.

If it is somehow possible to configure things so that replication from several nodes does not happen at the same time, we will happily reconfigure it.

Where can I see these timeouts?
 
As I mentioned, I don't know much about it. On searching, I found this. Maybe reach out to @fiona to find out the current state of the timeouts.

One other thought: when does the ZFS scrub take place on the pool?
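
You can check that quickly (a sketch; the cron file path is the Debian default and may differ on your install):

zpool status                       # the "scan:" line shows when the last scrub ran and how long it took
cat /etc/cron.d/zfsutils-linux     # default monthly scrub schedule shipped with zfsutils-linux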
 
The timeouts were set to 3600 on advice from another thread; I increased them, but the errors keep coming. Based on the description in the error, I thought a backup was running at the time, but the backup was at 20:00.

The error:
Replication job 115-0 with target 'Host807' and schedule '*/5' failed!

Last successful sync: 2024-04-18 01:35:05

Next sync try: 2024-04-18 01:45:00

Failure count: 1


Error:

command 'set -o pipefail && pvesm export local-zfs:vm-115-disk-4 zfs - -with-snapshots 1 -snapshot __replicate_115-0_1713393605__ -base __replicate_115-0_1713393305__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host807' root@192.168.200.7 -- pvesm import local-zfs:vm-115-disk-4 zfs - -with-snapshots 1 -snapshot __replicate_115-0_1713393605__' failed: exit code 255


Here's a new error.
We checked the receiving host: there were no backups or migrations to it at that time, only replications from other hosts.
The sending host also wasn't doing anything unusual, and nothing happened to the VM itself; it was running normally.
2024-04-18 12:41:58 703-0: start replication job
2024-04-18 12:41:58 703-0: guest => VM 703, running => 3653200
2024-04-18 12:41:58 703-0: volumes => local-zfs:vm-703-disk-1,local-zfs:vm-703-disk-2,local-zfs:vm-703-disk-3,local-zfs:vm-703-disk-4
2024-04-18 12:41:58 703-0: end replication job with error: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host807' root@82.202.177.220 pvecm mtunnel -migration_network 192.168.200.1/24 -get_migration_ip' failed: exit code 255

The only thing that was recorded is a jump to around 1,680 IOPS, but that is logical since a write to the disk was in progress. That is far from the performance limit: the disk easily handles 10,000 IOPS, with peaks up to 22,500 IOPS.
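
For reference, the failing command from the log can be re-run by hand to see whether it is the SSH connection or the remote pvecm call that fails (a sketch built from the values in the log above):

/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host807' root@82.202.177.220 \
    pvecm mtunnel -migration_network 192.168.200.1/24 -get_migration_ip
echo $?    # ssh itself returns 255 when the connection fails; otherwise this is the remote command's exit code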
 

Attachments

  • Screenshot_155.png
  • Screenshot_157.png
We checked the receiving host: there were no backups or migrations to it at that time, only replications from other hosts
So those replications can still put stress on your zpool/network.

the disk easily handles 10,000 IOPS, with peaks up to 22,500 IOPS
Those, I believe, are the raw disk capabilities, not what the zpool on your system actually delivers.

Maybe try stressing your ZFS as a test and see what happens. (Not sure if that's recommended in a production environment.)
How do the scrubs go on the zpool?
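
One relatively safe way to generate load, as a sketch: fio only writes to its own test files in a directory you create on the pool. The path, size and runtime below are just examples, so adjust them for your environment:

mkdir -p /disk2/fio-test
fio --name=repl-stress --directory=/disk2/fio-test --rw=randwrite --bs=4k \
    --size=2G --numjobs=4 --iodepth=16 --ioengine=libaio --runtime=60 \
    --time_based --group_reporting
rm -rf /disk2/fio-test    # clean up the test files afterwards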
 
What should we use to test the zpool, and how do we do it without breaking anything?

The scrubs are in the attached screenshot.
I also attach I/O statistics during the scrubs.
 

Attachments

  • Screenshot_158.png
  • Screenshot_159.png
I/O during scrubs looks OK, though we don't know what the CPU/RAM usage was for that period.
If the target server doesn't generally suffer from any overload, it's probably going to be a network overload. Do you have monitoring for this?
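
If there is no proper monitoring, even simple interface statistics taken during a replication window would help, e.g. (the interface name is just an example):

ip -s link show eno1     # cumulative RX/TX bytes, errors and drops on the replication interface
sar -n DEV 5             # per-interface throughput every 5 seconds (sysstat package)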
 
NW = network?
There is a 10 Gbps network.
 

Attachments

  • Screenshot_160.png
  • Screenshot_161.png
Have you tried correlating high network activity (in the graphs above) with the times at which you receive the error messages?
 
There is no direct correlation between receiving the errors and the load. The load lasted until 11:57, and the error message came at 12:01. But at 12:00 there was a sentry backup running on the servers for which the 12:01 message was received.
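
One way to pin that down further is to pull the journal for those few minutes on the sending and receiving nodes and grep for the affected jobs (a sketch; the date, time window and job IDs are examples taken from the messages above):

journalctl --since "2024-04-18 11:55" --until "2024-04-18 12:05" | grep -i -E 'replicat|115-0|703-0'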
 
