Replication job error

Maksimus

We get a lot of replication error messages, mostly at night outside working hours. Below is one of the hundreds of errors that arrive by email.

Please tell me what to do about this; we are afraid that at some point we will end up with broken copies on the server that receives the replication.


Replication job 132-1 with target 'Host807' and schedule '*/5' failed!

Last successful sync: 2024-04-17 05:06:29

Next sync try: 2024-04-17 05:32:39

Failure count: 2


Error:

command 'zfs snapshot disk2/vm-132-disk-0@__replicate_132-1_1713320590__' failed: got timeout
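
For context, the same kind of snapshot can be taken by hand to see whether it is slow on the pool while the jobs run (a sketch using the pool/dataset from the error above; the snapshot name is just an example):

pvesr status                                        # current state and last sync of all replication jobs
zpool iostat -v disk2 5                             # watch pool load in 5-second intervals
time zfs snapshot disk2/vm-132-disk-0@manual-test   # see how long a snapshot really takes
zfs destroy disk2/vm-132-disk-0@manual-test         # remove the test snapshot again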
 
I see this is an ongoing problem for you, as per your older post.

Do you have a dedicated migration network separate from the cluster network, as the docs advise?

Assuming it's not a network issue (as your older post suggests), it's probably load on the ZFS pool at that time.

How much replication or other activity is going on at the time of the error?

You may want to somehow ensure that two or more replication jobs are not running at the same time.
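
For example, the jobs could be spread across the 5-minute window instead of all firing on the same raster. A sketch only, assuming the standard pvesr CLI; the job IDs are examples from this thread, and the range/step schedule syntax should be checked against your PVE version:

pvesr list                                # see the existing job IDs and schedules
pvesr update 132-1 --schedule '*/5'       # node A keeps the default 0,5,10,... raster
pvesr update 115-0 --schedule '1..59/5'   # node B runs at minute 1,6,11,...
pvesr update 703-0 --schedule '2..59/5'   # node C runs at minute 2,7,12,...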

You may also want to look into the replication target timeout. I don't know much about it, but there was once a discussion on it.
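
If you want to see where those timeouts currently sit, a rough sketch (I'm not sure this is the right knob, and the file paths are from a standard install and may differ between versions):

# example paths; the exact modules differ between PVE versions
grep -rn -i 'timeout' /usr/share/perl5/PVE/Replication.pm /usr/share/perl5/PVE/Storage/ZFSPoolPlugin.pm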
 
Yes, the problem happens very often. We are already starting to worry that at some bad moment we will end up with a broken copy on the server receiving the replication.

Yes, there is a second dedicated 10 Gb/s network.

We replicate 4 nodes with 2 disks each (3+9+7+15 VMs) to 1 node with 2 disks, once every 5 minutes. There are enterprise SSDs everywhere.

If it is somehow possible to configure things so that replication from several nodes does not happen at the same time, we will happily reconfigure it.

Where can I see these timeouts?
 
As I mentioned, I don't know much about it. On searching, I found this. Maybe reach out to @fiona to find out the current state of the timeouts.

One other thought: when does the ZFS scrub take place on the pool?
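
You can check that quickly (a sketch; the cron file path is the Debian default and may differ on your install):

zpool status                       # the "scan:" line shows when the last scrub ran and how long it took
cat /etc/cron.d/zfsutils-linux     # default monthly scrub schedule shipped with zfsutils-linux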
 
The timeouts were set to 3600 on advice from another thread; I increased them, but the errors keep coming. Based on the description in the error, I thought a backup was running at the time, but the backup was at 20:00.

The error:
Replication job 115-0 with target 'Host807' and schedule '*/5' failed!

Last successful sync: 2024-04-18 01:35:05

Next sync try: 2024-04-18 01:45:00

Failure count: 1


Error:

command 'set -o pipefail && pvesm export local-zfs:vm-115-disk-4 zfs - -with-snapshots 1 -snapshot __replicate_115-0_1713393605__ -base __replicate_115-0_1713393305__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host807' root@192.168.200.7 -- pvesm import local-zfs:vm-115-disk-4 zfs - -with-snapshots 1 -snapshot __replicate_115-0_1713393605__' failed: exit code 255


Here's a new error.
We checked the receiving host: there were no backups or migrations to it at that time, only replications from other hosts.
The sending host also wasn't doing anything unusual, and nothing happened to the VM itself; it was running normally.
2024-04-18 12:41:58 703-0: start replication job
2024-04-18 12:41:58 703-0: guest => VM 703, running => 3653200
2024-04-18 12:41:58 703-0: volumes => local-zfs:vm-703-disk-1,local-zfs:vm-703-disk-2,local-zfs:vm-703-disk-3,local-zfs:vm-703-disk-4
2024-04-18 12:41:58 703-0: end replication job with error: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host807' root@82.202.177.220 pvecm mtunnel -migration_network 192.168.200.1/24 -get_migration_ip' failed: exit code 255

The only thing that was recorded is a jump to around 1,680 IOPS, but that is logical since a write to the disk was in progress. That is far from the performance limit: the disk easily handles 10,000 IOPS, with peaks up to 22,500 IOPS.
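
For reference, the failing command from the log can be re-run by hand to see whether it is the SSH connection or the remote pvecm call that fails (a sketch built from the values in the log above):

/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=Host807' root@82.202.177.220 \
    pvecm mtunnel -migration_network 192.168.200.1/24 -get_migration_ip
echo $?    # ssh itself returns 255 when the connection fails; otherwise this is the remote command's exit code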
 

Attachments

  • Screenshot_155.png
  • Screenshot_157.png
We checked the receiving host: there were no backups or migrations to it at that time, only replications from other hosts
So those replications can still put stress on your zpool/network.

the disk easily handles 10,000 IOPS, with peaks up to 22,500 IOPS
Those, I believe, are the raw disk capabilities, not what the zpool on your system actually delivers.

Maybe try stressing your ZFS as a test and see what happens. (Not sure if that's recommended in a production environment.)
How do the scrubs go on the zpool?
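
One relatively safe way to generate load, as a sketch: fio only writes to its own test files in a directory you create on the pool. The path, size and runtime below are just examples, so adjust them for your environment:

mkdir -p /disk2/fio-test
fio --name=repl-stress --directory=/disk2/fio-test --rw=randwrite --bs=4k \
    --size=2G --numjobs=4 --iodepth=16 --ioengine=libaio --runtime=60 \
    --time_based --group_reporting
rm -rf /disk2/fio-test    # clean up the test files afterwards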
 
What should we use to test the zpool, and how do we do it without breaking anything?

The scrubs are in the attached screenshot.
I also attach I/O statistics during the scrubs.
 

Attachments

  • Screenshot_158.png
  • Screenshot_159.png
I/O during scrubs looks OK, though we don't know what the CPU/RAM usage was for that period.
If the target server doesn't generally suffer from any overload, it's probably going to be a network overload. Do you have monitoring for this?
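
If there is no proper monitoring, even simple interface statistics taken during a replication window would help, e.g. (the interface name is just an example):

ip -s link show eno1     # cumulative RX/TX bytes, errors and drops on the replication interface
sar -n DEV 5             # per-interface throughput every 5 seconds (sysstat package)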
 
NW = network?
There is a 10 Gbps network.
 

Attachments

  • Screenshot_160.png
  • Screenshot_161.png
Have you tried correlating high network activity (in the graphs above) with the times at which you receive the error messages?
 
There is no direct correlation between receiving the errors and the load. The load lasted until 11:57, and the error message came at 12:01. But at 12:00 there was a sentry backup running on the servers for which the 12:01 message was received.
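
One way to pin that down further is to pull the journal for those few minutes on the sending and receiving nodes and grep for the affected jobs (a sketch; the date, time window and job IDs are examples taken from the messages above):

journalctl --since "2024-04-18 11:55" --until "2024-04-18 12:05" | grep -i -E 'replicat|115-0|703-0'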
 
