Replication Error

Byphone

Hello everyone,

I have a 4-host Proxmox cluster. Each host has a ZFS pool named "rpool".
I set up replication every 15 minutes on some VMs. About 10 times per day I get this kind of message:


Replication job 604-0 with target 'hostgra2' and schedule '8..21:0/15' failed! Last successful sync: 2022-10-07 11:15:01 Next sync try: 2022-10-07 11:35:00 Failure count: 1 Error: command 'set -o pipefail && pvesm export local-zfs:vm-604-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_604-0_1665135001__ -base __replicate_604-0_1665134101__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=hostgra2' root@XXX.XXX.XXX.XXX -- pvesm import local-zfs:vm-604-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_604-0_1665135001__ -allow-rename 0 -base __replicate_604-0_1665134101__' failed: exit code 255


Thanks for your help :)

Regards
 
Hi,
to see the exact error, you'd need to check the replication log after the error (before the next replication runs). If those failures are one-off events and replication recovers by itself, then most likely they stem from timeouts when the target ZFS is under load. The timeout handling on the replication target could be improved, but nobody has gotten around to implementing it yet.
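For example, something along these lines from the CLI (the log path is an assumption based on the default setup; the guest's Replication panel in the GUI shows the same log):

Code:
# overview of all replication jobs on this node (last sync, fail count, ...)
pvesr status
# per-guest replication log, assuming the default location (adjust the guest ID)
cat /var/log/pve/replicate/604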
 
Thanks for your answer,

Here is the replication log

Code:
2022-10-12 11:30:31 289-0: start replication job
2022-10-12 11:30:31 289-0: guest => VM 289, running => 37406
2022-10-12 11:30:31 289-0: volumes => db1-pool5:vm-289-disk-0,db1-pool5:vm-289-disk-1
2022-10-12 11:30:32 289-0: freeze guest filesystem
2022-10-12 11:30:37 289-0: create snapshot '__replicate_289-0_1665567031__' on db1-pool5:vm-289-disk-0
2022-10-12 11:30:38 289-0: create snapshot '__replicate_289-0_1665567031__' on db1-pool5:vm-289-disk-1
2022-10-12 11:30:38 289-0: thaw guest filesystem
2022-10-12 11:30:39 289-0: using secure transmission, rate limit: none
2022-10-12 11:30:39 289-0: incremental sync 'db1-pool5:vm-289-disk-0' (__replicate_289-0_1665566231__ => __replicate_289-0_1665567031__)
2022-10-12 11:30:39 289-0: ssh_exchange_identification: Connection closed by remote host
2022-10-12 11:30:39 289-0: warning: cannot send 'db1-pool5/vm-289-disk-0@__replicate_289-0_1665567031__': Broken pipe
2022-10-12 11:30:39 289-0: command 'zfs send -Rpv -I __replicate_289-0_1665566231__ -- db1-pool5/vm-289-disk-0@__replicate_289-0_1665567031__' failed: exit code 1
2022-10-12 11:30:39 289-0: delete previous replication snapshot '__replicate_289-0_1665567031__' on db1-pool5:vm-289-disk-0
2022-10-12 11:30:40 289-0: delete previous replication snapshot '__replicate_289-0_1665567031__' on db1-pool5:vm-289-disk-1
2022-10-12 11:30:40 289-0: end replication job with error: command 'set -o pipefail && pvesm export db1-pool5:vm-289-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_289-0_1665567031__ -base __replicate_289-0_1665566231__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=hostdb2' root@XXX.XXX.XXX.XXX -- pvesm import db1-pool5:vm-289-disk-0 zfs - -with-snapshots 1 -base __replicate_289-0_1665566231__' failed: exit code 255

I see an SSH connection being closed...

Regards
 
Then it might actually be an issue related to the network or to the ssh configuration rather than load on the ZFS pool.
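If you want to narrow that down, a rough check would be to open a batch-mode SSH connection to the target in a loop (host key alias and address are placeholders taken from your log) and see whether it occasionally fails, and to look at the SSH daemon log on the target node around the failure time:

Code:
# repeatedly open a non-interactive SSH connection to the replication target
# (replace hostdb2 / XXX.XXX.XXX.XXX with your target's host key alias and address)
for i in $(seq 1 50); do
    ssh -e none -o BatchMode=yes -o HostKeyAlias=hostdb2 root@XXX.XXX.XXX.XXX true || echo "attempt $i failed"
done

# on the target node: check the SSH daemon log around the failure time
journalctl -u ssh --since "1 hour ago"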
 
Maybe, but if it were an SSH misconfiguration, the issue would be permanent.
Do you have an idea where I should investigate?
 
Maybe, but if it were an SSH misconfiguration, the issue would be permanent.
Not if it's related to a timeout or something.

Is there an option to force the network interface used by replication?
The migration network set in /etc/pve/datacenter.cfg is also used for replication.
 
Hi, I see the same error, but only 1-2 times per day, and when it happens, it happens for all VMs and LXC containers.

@fiona, in my datacenter.cfg only the keyboard setting is set. Is there something wrong with my installation?

The timeout for SSH connections can be set in /etc/ssh/sshd_config. On both servers I have now set

ClientAliveInterval 3600

for testing purposes. The sshd service needs to be restarted after the change.
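For completeness, this is roughly what the change looks like on both nodes:

Code:
# /etc/ssh/sshd_config
# server only sends a keep-alive probe after 3600s without data from the client
ClientAliveInterval 3600

# apply the change
systemctl restart ssh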
 
Hi,
Hi, I see the same error, but only 1-2 times per day, and when it happens, it happens for all VMs and LXC containers.
then in your case, I'd guess it's timeouts when the pool is under load. As said, our timeout handling on the replication target is not that great at the moment, unfortunately. If you can get a hold of the replication log after the error, it should tell you more about the actual error.

@fiona, in my datacenter.cfg only the keyboard setting is set. Is there something wrong with my installation?
No, it just means that the default (i.e. cluster) network will be used. See here for more information. We do recommend using a dedicated network, so that cluster traffic won't be affected by other network load.
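For example, a dedicated migration (and replication) network can be set in /etc/pve/datacenter.cfg roughly like this (the subnet is a placeholder for your dedicated network):

Code:
# /etc/pve/datacenter.cfg
# route migration and storage replication traffic over a dedicated network
migration: type=secure,network=10.10.10.0/24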
 
Hi,
@fiona
I have a similar problem. Maybe some new tools have appeared in the 2 years since. Following the advice from this topic, I increased the timeout to 3600.
Link to my post with the problem https://forum.proxmox.com/threads/replication-job-error.145256/
in your case it's not SSH that is timing out, but rather the ZFS command. Unfortunately, the timeout for such commands is currently only 10 seconds, which might not be enough if the pool is under heavy load, and there is no mechanism implemented right now to detect whether a ZFS command happens in the context of a replication or in the context of another operation where it needs to complete quickly.
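To see whether the pool on the replication target is actually under heavy load when a job fails, something like this on the target node can help (the pool name is a placeholder for your target pool):

Code:
# watch per-vdev I/O and bandwidth of the pool in 5-second intervals
zpool iostat -v rpool 5
# check for scrubs/resilvers or errors that could add load
zpool status rpool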
 
