Replication Error

Byphone

Hello everyone,

I have a 4-host Proxmox cluster. Each host has a ZFS pool named "rpool".
I set up replication every 15 minutes on some VMs. About 10 times per day I get this kind of message:


Replication job 604-0 with target 'hostgra2' and schedule '8..21:0/15' failed! Last successful sync: 2022-10-07 11:15:01 Next sync try: 2022-10-07 11:35:00 Failure count: 1 Error: command 'set -o pipefail && pvesm export local-zfs:vm-604-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_604-0_1665135001__ -base __replicate_604-0_1665134101__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=hostgra2' root@XXX.XXX.XXX.XXX -- pvesm import local-zfs:vm-604-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_604-0_1665135001__ -allow-rename 0 -base __replicate_604-0_1665134101__' failed: exit code 255


Thanks for your help :)

Regards
 
Hi,
to see the exact error, you'd need to check the replication log after the error (before the next replication runs). If those failures are one-off events and replication recovers by itself, then most likely they stem from timeouts when the target ZFS is under load. The timeout handling on the replication target could be improved, but nobody has gotten around to implementing it yet.
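For example, something along these lines from the CLI (the log path is an assumption based on the default setup; the guest's Replication panel in the GUI shows the same log):

Code:
# overview of all replication jobs on this node (last sync, fail count, ...)
pvesr status
# per-guest replication log, assuming the default location (adjust the guest ID)
cat /var/log/pve/replicate/604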
 
Thanks for your answer,

Here is the replication log

Code:
2022-10-12 11:30:31 289-0: start replication job
2022-10-12 11:30:31 289-0: guest => VM 289, running => 37406
2022-10-12 11:30:31 289-0: volumes => db1-pool5:vm-289-disk-0,db1-pool5:vm-289-disk-1
2022-10-12 11:30:32 289-0: freeze guest filesystem
2022-10-12 11:30:37 289-0: create snapshot '__replicate_289-0_1665567031__' on db1-pool5:vm-289-disk-0
2022-10-12 11:30:38 289-0: create snapshot '__replicate_289-0_1665567031__' on db1-pool5:vm-289-disk-1
2022-10-12 11:30:38 289-0: thaw guest filesystem
2022-10-12 11:30:39 289-0: using secure transmission, rate limit: none
2022-10-12 11:30:39 289-0: incremental sync 'db1-pool5:vm-289-disk-0' (__replicate_289-0_1665566231__ => __replicate_289-0_1665567031__)
2022-10-12 11:30:39 289-0: ssh_exchange_identification: Connection closed by remote host
2022-10-12 11:30:39 289-0: warning: cannot send 'db1-pool5/vm-289-disk-0@__replicate_289-0_1665567031__': Broken pipe
2022-10-12 11:30:39 289-0: command 'zfs send -Rpv -I __replicate_289-0_1665566231__ -- db1-pool5/vm-289-disk-0@__replicate_289-0_1665567031__' failed: exit code 1
2022-10-12 11:30:39 289-0: delete previous replication snapshot '__replicate_289-0_1665567031__' on db1-pool5:vm-289-disk-0
2022-10-12 11:30:40 289-0: delete previous replication snapshot '__replicate_289-0_1665567031__' on db1-pool5:vm-289-disk-1
2022-10-12 11:30:40 289-0: end replication job with error: command 'set -o pipefail && pvesm export db1-pool5:vm-289-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_289-0_1665567031__ -base __replicate_289-0_1665566231__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=hostdb2' root@XXX.XXX.XXX.XXX -- pvesm import db1-pool5:vm-289-disk-0 zfs - -with-snapshots 1 -base __replicate_289-0_1665566231__' failed: exit code 255

I see an SSH connection being closed...

Regards
 
Then it might actually be an issue related to the network or to the ssh configuration rather than load on the ZFS pool.
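If you want to narrow that down, a rough check would be to open a batch-mode SSH connection to the target in a loop (host key alias and address are placeholders taken from your log) and see whether it occasionally fails, and to look at the SSH daemon log on the target node around the failure time:

Code:
# repeatedly open a non-interactive SSH connection to the replication target
# (replace hostdb2 / XXX.XXX.XXX.XXX with your target's host key alias and address)
for i in $(seq 1 50); do
    ssh -e none -o BatchMode=yes -o HostKeyAlias=hostdb2 root@XXX.XXX.XXX.XXX true || echo "attempt $i failed"
done

# on the target node: check the SSH daemon log around the failure time
journalctl -u ssh --since "1 hour ago"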
 
Maybe, but if it were an SSH misconfiguration, the issue would be permanent.
Do you have an idea where I should investigate?
 
Maybe, but if it were an SSH misconfiguration, the issue would be permanent.
Not if it's related to a timeout or something.

Is there an option to force the network interface used by replication?
The migration network set in /etc/pve/datacenter.cfg is also used for replication.
 
Hi, I see the same error, but only 1-2 times per day, and when it happens, it happens for all VMs and LXC containers.

@fiona, in my datacenter.cfg only the keyboard setting is set. Is there something wrong with my installation?

The timeout for SSH connections can be set in /etc/ssh/sshd_config. On both servers I have now set

ClientAliveInterval 3600

for testing purposes. The sshd service needs to be restarted after the change.
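For completeness, this is roughly what the change looks like on both nodes:

Code:
# /etc/ssh/sshd_config
# server only sends a keep-alive probe after 3600s without data from the client
ClientAliveInterval 3600

# apply the change
systemctl restart ssh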
 
Hi,
Hi, I see the same error, but only 1-2 times per day, and when it happens, it happens for all VMs and LXC containers.
then in your case, I'd guess it's timeouts when the pool is under load. As said, our timeout handling on the replication target is not that great at the moment, unfortunately. If you can get a hold of the replication log after the error, it should tell you more about the actual error.

@fiona, in my datacenter.cfg only the keyboard setting is set. Is there something wrong with my installation?
No, it just means that the default (i.e. cluster) network will be used. See here for more information. We do recommend using a dedicated network, so that cluster traffic won't be affected by other network load.
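For example, a dedicated migration (and replication) network can be set in /etc/pve/datacenter.cfg roughly like this (the subnet is a placeholder for your dedicated network):

Code:
# /etc/pve/datacenter.cfg
# route migration and storage replication traffic over a dedicated network
migration: type=secure,network=10.10.10.0/24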
 
Hi,
@fiona
I have a similar problem. Maybe some new tools have appeared in the 2 years since. Following the advice from this topic, I increased the timeout to 3600.
Link to my post with the problem https://forum.proxmox.com/threads/replication-job-error.145256/
in your case it's not SSH that is timing out, but rather the ZFS command. Unfortunately, the timeout for such commands is currently only 10 seconds, which might not be enough if the pool is under heavy load, and there is no mechanism implemented right now to detect whether a ZFS command happens in the context of a replication or in the context of another operation where it needs to complete quickly.
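To see whether the pool on the replication target is actually under heavy load when a job fails, something like this on the target node can help (the pool name is a placeholder for your target pool):

Code:
# watch per-vdev I/O and bandwidth of the pool in 5-second intervals
zpool iostat -v rpool 5
# check for scrubs/resilvers or errors that could add load
zpool status rpool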
 
