Replication job error

Maksimus · Apr 17, 2024

A lot of synchronization error messages come, mostly during non-working hours at night. One of hundreds of errors that come in the mail below.

Tell me what to do about this, we are afraid that at some point we will have broken copies on the server receiving replication.

Replication job 132-1 with target 'Host807' and schedule '*/5' failed!

Last successful sync: 2024-04-17 05:06:29

Next sync try: 2024-04-17 05:32:39

Failure count: 2

Error:

command 'zfs snapshot disk2/vm-132-disk-0@__replicate_132-1_1713320590__' failed: got timeout

gfngfn256 · Apr 17, 2024

I see this is an ongoing problem you have, as per your older post.

Do you have a dedicated migration network vs the cluster network as per the doc's advice.

Assuming its not a network issue (as your older post tries to suggest), it's probably a load on the ZFS pool at that time.

How much replication/other activity is going on at the time of the error.

You may want to somehow ensure 2 job replications etc. are not going on at the same time.

You may also want to look into the replication target timeout, IDK much about it, but there was once a discussion on it.

Maksimus · Apr 17, 2024

Yes, the problem is very common. We are already starting to worry that at one bad moment we will get a broken copy on the replication receiver server.

Yes, there is a second dedicated network 10Gb/s

Replication of 4 nodes with 2 disks each (VM 3+9+7+15) into 1 node with 2 disks. Once every 5 minutes. There are enterprise ssds everywhere

If it is somehow possible to configure so that replication does not occur from several nodes at the same time, then we will happily reconfigure it.

Where can I see these timeout?

gfngfn256 · Apr 17, 2024

As I mentioned IDK much about it. On searching I found this. Maybe reach out to @fiona to find out the current situation of timeouts.

One other thought. When is the ZFS scrub taking place on the pool.

Maksimus · Apr 18, 2024

Timeouts were set to 3600 on advice from another topic. I increased the timeouts, but the errors keep coming. Based on the description in the error, I thought that there was a backup, but the backup was at 20:00

Ошибка
Replication job 115-0 with target 'Host807' and schedule '*/5' failed!

Last successful sync: 2024-04-18 01:35:05

Next sync try: 2024-04-18 01:45:00

Failure count: 1

Error:

command 'set -o pipefail && pvesm export local-zfs:vm-115-disk-4 zfs - -with-snapshots 1 -snapshot __replicate_115-0_1713393605__ -base __replicate_115-0_1713393305__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host807' root@192.168.200.7 -- pvesm import local-zfs:vm-115-disk-4 zfs - -with-snapshots 1 -snapshot __replicate_115-0_1713393605__' failed: exit code 255

Here's a new error
We looked at the receiver host, there were no backups or migrations to it at that time, only replications from other hosts.
The sender host also didn’t do anything supernatural, nothing happened to the VM itself, it worked normally.
2024-04-18 12:41:58 703-0: start replication job
2024-04-18 12:41:58 703-0: guest => VM 703, running => 3653200
2024-04-18 12:41:58 703-0: volumes => local-zfs:vm-703-disk-1,local-zfs:vm-703-disk-2,local-zfs:vm-703-disk-3,local-zfs:vm-703-disk-4
2024-04-18 12:41:58 703-0: end replication job with error: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host807' root@82.202.177.220 pvecm mtunnel -migration_network 192.168.200.1/24 -get_migration_ip' failed: exit code 255

The only thing that was recorded is that it jumps on the 1680 i/o, but this is logical since the recording to the disk was in progress. But the performance limit is far from the limit; the disk can easily handle 10000 i/o, in peaks up to 22500 i\o

gfngfn256 · Apr 18, 2024

Maksimus said:
We looked at the receiver host, there were no backups or migrations to it at that time, only replications from other hosts

So those replications can cause stress on your Zpool/NW.

Maksimus said:
the disk can easily handle 10000 i/o, in peaks up to 22500 i\o

Those I believe are the raw disk capabilities. Not the Zpool on your system.

Maybe try stressing your ZFS as a test & see what happens. (Not sure if its recommended in a production environment).
How do the scrubs go on the Zpool?

Maksimus · Apr 18, 2024

gfngfn256 said:
So those replications can cause stress on your Zpool/NW.

Those I believe are the raw disk capabilities. Not the Zpool on your system.

Maybe try stressing your ZFS as a test & see what happens. (Not sure if its recommended in a production environment).
How do the scrubs go on the Zpool?

What or how to test zpool so as not to break it.

scrubs on the screen
I also attach statistics on i\o during scrubs

gfngfn256 · Apr 18, 2024

I/O during scrubs looks ok. Though we don't know what CPU/RAM usage there is for that period.
If the target server doesn't generally suffer any overloading, its probably going to be a NW overload. Do you have monitoring for this?

Maksimus · Apr 18, 2024

gfngfn256 said:
I/O during scrubs looks ok. Though we don't know what CPU/RAM usage there is for that period.
If the target server doesn't generally suffer any overloading, its probably going to be a NW overload. Do you have monitoring for this?

NW=network ?
There is a 10Gbps network

gfngfn256 · Apr 18, 2024

Have you tried correlating high NW activity (above graph/s) to time of your error receiving messages?

Maksimus · Apr 22, 2024

There is no direct connection with receiving errors and the load. The load was up until 11:57, the error message came at 12:01. But at 12:00 there was a sentry backup on those servers about which a message was received at 12:01.

pva · Jun 5, 2025

I'm having a very similar problem. Once in one-two days, I'm seeing the same problem:

Code:

Replication job '192-0' with target 'cf-pve3' and schedule '*/15' failed!

       Last successful sync: 2025-06-04 06:30:12
       Next sync try: ERROR
       Failure count: 1

       Error:
     
command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=cf-pve3' -o 'UserKnownHostsFile=/etc/pve/nodes/cf-pve3/ssh_known_hosts' 
                 -o 'GlobalKnownHostsFile=none' root@172.20.10.42 -- pvesr prepare-local-job 192-0 --scan local-zfs local-zfs:subvol-192-disk-0 
                 --last_sync 1749007812 --parent_snapname mspmo25' failed: exit code 255

As of ssh mange page: ssh exits with the exit status of the remote command or with 255 if an error occurred. Now the question is, can we add debugging output here somehow? Where is this command to add -v there?

Also, is it possible to filter out these messages from spamming my mailbox?

alc · Aug 11, 2025

Bumping this issue because I have the exact same problem.

Have set up replication months ago, and ignored the error for now, because most replications seem to work. But everyday, I'll have a few of these :

Code:

Replication job '100-0' with target 'pve2' and schedule '*:0/15' failed!


       Last successful sync: 2025-08-11 02:45:01

       Next sync try: ERROR

       Failure count: 1

              

       Error:

      
command 'zfs snapshot rpool/vm-100-disk-0@__replicate_100-0_1754874002__' failed: got timeout

I understand there might be some kind of overload (either disk I/O or network), it is possible in my case, although every replication happens on it's own schedule (one at *:03/15, one at *:03/20, etc.).
This all makes replication a very fiddly feature to work with.
The docs could mention it's limits more clearly, or link to ZFS docs if these mention the limits more clearly.

In other words, is this a hard limit from ZFS, or is there an implementation issue on the Proxmox part ?

mrt-mt · Sep 21, 2025

I am having the same issue.
Thinly provisioned.
10G Dedicated Replication network. MTU 8970
Sometimes it will replicated and sometimes it will not.
Migrations over the same network work perfectly. Never a timeout.
2025-09-20 20:37:19 200-0: create snapshot '__replicate_200-0_1758422222__' on RAIDZ2:vm-200-disk-0
2025-09-20 20:37:41 200-0: end replication job with error: command 'zfs snapshot RAIDZ2/vm-200-disk-0@__replicate_200-0_1758422222__' failed: got timeout

I am very new at ProxMox and would welcome any advice.

alc · Oct 30, 2025

alc said:
Bumping this issue because I have the exact same problem.

Have set up replication months ago, and ignored the error for now, because most replications seem to work. But everyday, I'll have a few of these :

Code:

Replication job '100-0' with target 'pve2' and schedule '*:0/15' failed! Last successful sync: 2025-08-11 02:45:01 Next sync try: ERROR Failure count: 1 Error: command 'zfs snapshot rpool/vm-100-disk-0@__replicate_100-0_1754874002__' failed: got timeout

I understand there might be some kind of overload (either disk I/O or network), it is possible in my case, although every replication happens on it's own schedule (one at *:03/15, one at *:03/20, etc.).
This all makes replication a very fiddly feature to work with.
The docs could mention it's limits more clearly, or link to ZFS docs if these mention the limits more clearly.

In other words, is this a hard limit from ZFS, or is there an implementation issue on the Proxmox part ?

I seem to have solved the issues with ZFS replication by solving an "invisible" ARP problem on the network (as in, it was there and I knew about it for years, but my network had no other issues than ZFS replication).

My OpenBSD router stated ARP overwrites sometimes involving one Proxmox node or another (3-node cluster), but most of the time involving other servers.
I do not know how I solved it, appart from the fact that I deactivated a CARP setup that was unused, but my limited understanding doesn't allow me to make sense of how an unused - but functionnal, when it was used - CARP setup would have created ARP overwrites on other hosts.

Anyways, ZFS replications now work flawlessly with my ~10VM's, some of them replicating every 2 minutes to both other nodes.
I can conclude my issues were not due to network overload although my network is only 1GBE.

mrt-mt · Nov 4, 2025

I was not able to leave my MTU size to anything other than the default settings. No matter what ProxMox or Debian would never account for the overhead encapsulation. This is a real bummer as I have a 12TB Hyper-V vm that I have yet to convert to ProxMox andhaving that incease MTU size would really help on the dedicated replication network. In my case replication on 2TB vms works perfect with the default MTU. Hope this helps someone else. Thx

Search

Search

Replication job error

Maksimus

Member

gfngfn256

Distinguished Member

Maksimus

Member

gfngfn256

Distinguished Member

Maksimus

Member

Attachments

gfngfn256

Distinguished Member

Maksimus

Member

Attachments

gfngfn256

Distinguished Member

Maksimus

Member

Attachments

gfngfn256

Distinguished Member

Maksimus

Member

pva

New Member

alc

Active Member

mrt-mt

New Member

alc

Active Member

mrt-mt

New Member

We value your privacy