Replication job error

Maksimus

Member
May 16, 2022
We get a lot of synchronization error messages, mostly during non-working hours at night. One of the hundreds of errors that arrive by mail is below.

Please tell me what to do about this; we are afraid that at some point we will end up with broken copies on the server receiving the replication.


Replication job 132-1 with target 'Host807' and schedule '*/5' failed!

Last successful sync: 2024-04-17 05:06:29

Next sync try: 2024-04-17 05:32:39

Failure count: 2


Error:

command 'zfs snapshot disk2/vm-132-disk-0@__replicate_132-1_1713320590__' failed: got timeout
 
I see this is an ongoing problem you have, as per your older post.

Do you have a dedicated migration network, separate from the cluster network, as the docs advise?

Assuming it's not a network issue (as your older post seems to suggest), it's probably load on the ZFS pool at that time.

How much replication/other activity is going on at the time of the error?

You may want to ensure that two replication jobs are not running at the same time, for example by staggering their schedules (sketch below).
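A rough sketch of what I mean, using the job ID from your error plus two made-up IDs for jobs on other nodes (I'm assuming pvesr update accepts the same calendar syntax as the schedule field in the GUI):

Code:
# stagger the schedules so jobs from different nodes fire on different minutes
pvesr update 132-1 --schedule '*:0/15'   # job ID from your error message
pvesr update 201-0 --schedule '*:5/15'   # example IDs for jobs on other nodes
pvesr update 202-0 --schedule '*:10/15'
pvesr list                               # verify the resulting schedules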

You may also want to look into the replication target timeout; I don't know much about it, but there was a discussion on it at some point.
 
Yes, the problem happens very often. We are already starting to worry that at some bad moment we will end up with a broken copy on the server receiving the replication.

Yes, there is a second, dedicated 10 Gb/s network.

We replicate from 4 nodes with 2 disks each (3+9+7+15 VMs) to 1 node with 2 disks, once every 5 minutes. There are enterprise SSDs everywhere.

If it is somehow possible to configure this so that replication from several nodes does not run at the same time, we will happily reconfigure it.

Where can I see these timeouts?
 
As I mentioned, I don't know much about it. On searching I found this. Maybe reach out to @fiona to find out the current situation with timeouts.

One other thought: when does the ZFS scrub take place on the pool?
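A quick way to check, as a sketch (pool name taken from your error; the cron path is the stock Debian zfsutils-linux location and may differ on your setup):

Code:
# the "scan:" line shows when the last scrub ran and how long it took
zpool status disk2

# the periodic scrub is usually scheduled here on Debian-based installs
cat /etc/cron.d/zfsutils-linux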
 
The timeouts were set to 3600 on advice from another thread; I increased them, but the errors keep coming. Based on the description in the error, I thought a backup was running at the time, but the backup was at 20:00.

Error:
Replication job 115-0 with target 'Host807' and schedule '*/5' failed!

Last successful sync: 2024-04-18 01:35:05

Next sync try: 2024-04-18 01:45:00

Failure count: 1


Error:

command 'set -o pipefail && pvesm export local-zfs:vm-115-disk-4 zfs - -with-snapshots 1 -snapshot __replicate_115-0_1713393605__ -base __replicate_115-0_1713393305__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host807' root@192.168.200.7 -- pvesm import local-zfs:vm-115-disk-4 zfs - -with-snapshots 1 -snapshot __replicate_115-0_1713393605__' failed: exit code 255


Here's a new error.
We looked at the receiver host: there were no backups or migrations to it at that time, only replications from other hosts.
The sender host also wasn't doing anything unusual; nothing happened to the VM itself, it was running normally.
2024-04-18 12:41:58 703-0: start replication job
2024-04-18 12:41:58 703-0: guest => VM 703, running => 3653200
2024-04-18 12:41:58 703-0: volumes => local-zfs:vm-703-disk-1,local-zfs:vm-703-disk-2,local-zfs:vm-703-disk-3,local-zfs:vm-703-disk-4
2024-04-18 12:41:58 703-0: end replication job with error: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host807' root@82.202.177.220 pvecm mtunnel -migration_network 192.168.200.1/24 -get_migration_ip' failed: exit code 255

The only thing recorded is a jump to about 1,680 IOPS, which is logical since writes to the disk were in progress. But that is far from the performance limit; the disk easily handles 10,000 IOPS, with peaks up to 22,500 IOPS.
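Since ssh exits with 255 when the connection itself fails, the plan is to re-run the same call by hand with verbose output and see where it breaks (host and arguments copied from the error above, only -vvv added):

Code:
/usr/bin/ssh -vvv -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host807' \
    root@82.202.177.220 -- pvecm mtunnel -migration_network 192.168.200.1/24 -get_migration_ip
echo "exit code: $?"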
 

Attachments

  • Screenshot_155.png (9.7 KB)
  • Screenshot_157.png (50.8 KB)
We looked at the receiver host: there were no backups or migrations to it at that time, only replications from other hosts
So those replications can cause stress on your Zpool/NW.

the disk easily handles 10,000 IOPS, with peaks up to 22,500 IOPS
Those I believe are the raw disk capabilities. Not the Zpool on your system.

Maybe try stressing your ZFS pool as a test and see what happens (not sure if that's recommended in a production environment); a rough sketch below.
How do the scrubs go on the Zpool?
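If you do try it, something along these lines might be a contained way to do it (the dataset name is made up; fio writes real data, so run it only in a quiet window and clean up afterwards):

Code:
# throwaway dataset on the pool used for replication
zfs create disk2/fio-test

# 60 seconds of random writes to load the pool
fio --name=repl-stress --directory=/disk2/fio-test --ioengine=libaio \
    --rw=randwrite --bs=16k --iodepth=32 --numjobs=4 --size=2G \
    --runtime=60 --time_based --group_reporting

# remove the test dataset again
zfs destroy disk2/fio-test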
 
What and how should we test the zpool so as not to break it?

The scrubs are in the screenshot.
I am also attaching I/O statistics during the scrubs.
 

Attachments

  • Screenshot_158.png (20.6 KB)
  • Screenshot_159.png (93.5 KB)
I/O during the scrubs looks OK, though we don't know what the CPU/RAM usage was for that period.
If the target server doesn't generally suffer any overloading, it's probably going to be a NW overload. Do you have monitoring for this?
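If there is no monitoring yet, even a quick iperf3 run over the replication network would show what the link actually delivers while replication is busy (IP taken from your earlier error; both nodes need the iperf3 package):

Code:
# on the replication target
iperf3 -s

# on one of the source nodes, over the dedicated network
iperf3 -c 192.168.200.7 -t 30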
 
NW = network?
There is a 10 Gbps network.
 

Attachments

  • Screenshot_160.png (73.3 KB)
  • Screenshot_161.png (127.6 KB)
Have you tried correlating high NW activity (the graphs above) with the times you receive the error messages?
 
There is no direct connection between receiving the errors and the load. The load lasted until 11:57, and the error message came at 12:01. But at 12:00 there was a sentry backup on the servers for which the message was received at 12:01.
 
I'm having a very similar problem. Once every one or two days, I see the same thing:

Code:
Replication job '192-0' with target 'cf-pve3' and schedule '*/15' failed!

       Last successful sync: 2025-06-04 06:30:12
       Next sync try: ERROR
       Failure count: 1

       Error:
     
command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=cf-pve3' -o 'UserKnownHostsFile=/etc/pve/nodes/cf-pve3/ssh_known_hosts' 
                 -o 'GlobalKnownHostsFile=none' root@172.20.10.42 -- pvesr prepare-local-job 192-0 --scan local-zfs local-zfs:subvol-192-disk-0 
                 --last_sync 1749007812 --parent_snapname mspmo25' failed: exit code 255

As per the ssh man page: ssh exits with the exit status of the remote command, or with 255 if an error occurred. Now the question is, can we add debugging output here somehow? Where is this command defined, so I can add -v to it?
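For what it's worth, the call can at least be reproduced by hand with -vvv added (everything else copied verbatim from the error message), though it would be nicer if the scheduled job could log this itself:

Code:
/usr/bin/ssh -vvv -e none -o 'BatchMode=yes' -o 'HostKeyAlias=cf-pve3' \
    -o 'UserKnownHostsFile=/etc/pve/nodes/cf-pve3/ssh_known_hosts' \
    -o 'GlobalKnownHostsFile=none' root@172.20.10.42 -- \
    pvesr prepare-local-job 192-0 --scan local-zfs local-zfs:subvol-192-disk-0 \
    --last_sync 1749007812 --parent_snapname mspmo25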


Also, is it possible to filter these messages out so they stop spamming my mailbox?
 
Bumping this issue because I have the exact same problem.

I set up replication months ago and have ignored the errors until now, because most replications seem to work. But every day I get a few of these:
Code:
Replication job '100-0' with target 'pve2' and schedule '*:0/15' failed!


       Last successful sync: 2025-08-11 02:45:01

       Next sync try: ERROR

       Failure count: 1

              

       Error:

      
command 'zfs snapshot rpool/vm-100-disk-0@__replicate_100-0_1754874002__' failed: got timeout

I understand there might be some kind of overload (either disk I/O or network), and that is possible in my case, although every replication runs on its own schedule (one at *:03/15, one at *:03/20, etc.).
This all makes replication a very fiddly feature to work with.
The docs could mention its limits more clearly, or link to the ZFS docs if those cover the limits better.

In other words, is this a hard limit of ZFS, or is there an implementation issue on the Proxmox side?
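One check that might separate the two: timing a manual snapshot of the same dataset outside of pvesr (the snapshot name below is made up):

Code:
# if this regularly takes more than a few seconds, the delay is in ZFS itself
time zfs snapshot rpool/vm-100-disk-0@manual_test
zfs destroy rpool/vm-100-disk-0@manual_test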
 
I am having the same issue.
Thinly provisioned.
10G Dedicated Replication network. MTU 8970
Sometimes it replicates and sometimes it does not.
Migrations over the same network work perfectly. Never a timeout.
2025-09-20 20:37:19 200-0: create snapshot '__replicate_200-0_1758422222__' on RAIDZ2:vm-200-disk-0
2025-09-20 20:37:41 200-0: end replication job with error: command 'zfs snapshot RAIDZ2/vm-200-disk-0@__replicate_200-0_1758422222__' failed: got timeout

I am very new to Proxmox and would welcome any advice.
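Would something like this be a sensible way to watch pool latency while a replication runs (pool name taken from the log above)?

Code:
# per-vdev I/O and latency statistics, refreshed every 5 seconds
zpool iostat -v -l RAIDZ2 5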