Manually initiating zfs replication?

greg

Renowned Member
Apr 6, 2011
Greetings
I have a Proxmox 6.3 ZFS-based cluster.
Last week I added a new node and set up a few replication jobs from the old nodes to this new node. It works fine for the small ones (a few gigs), but for a moderately sized one it seems to loop: yesterday the destination zvol was 20G, today it's only 16G. The throughput between the nodes varies a lot, from almost nothing up to 100Mbps.
However, the sending node is slowed down to the point where it's almost unusable.

Is it possible to manually send an initial snapshot and have Proxmox use that snapshot as a base? That way I could decide when the system is not replicating and can be used for its intended purpose.

Note that my question is about the Proxmox replication mechanism, not about how to send the snapshot itself.

Thanks in advance

Regards
 
Hi,
the initial snapshot is sent right after the replication job is created. So you might want to create the job when there is not much else going on. You can also set a bandwidth limit to reduce the load on the sending side.
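For example, a limit can be set when creating the job on the CLI (just a sketch; the job ID 100-0, the target node name and the 50 MB/s value are placeholders to adapt to your setup):
Code:
# create the replication job with a rate limit (megabytes per second)
pvesr create-local-job 100-0 targetnode --schedule '*/15' --rate 50
# or add/lower the limit on an existing job
pvesr update 100-0 --rate 50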
 
Thanks for your answer. In fact I did as you said: I created the job on Friday night. Unfortunately, one week later, it's still "looping"; my guess is that if it takes more than 24h, the destination is erased and re-transferred. So limiting the bandwidth will probably make things worse.
I had to shut down the destination node to keep the whole cluster from becoming unavailable, which is paradoxical for a safety measure...
 
Thanks for your answer. In fact I did as you said: I created the job on Friday night. Unfortunately, one week later, it's still "looping"; my guess is that if it takes more than 24h, the destination is erased and re-transferred.
AFAICT, there should be no removal unless the job is aborted or fails for another reason. Are you sure the "looping" happened before the reboot? Are there any errors in the replication log? Did the job still make progress at that time?

So limiting the bandwidth will probably make things worse.
Not if the nodes can consistently handle the limited amount of data instead of being under too much load (e.g. when the real bottleneck is disk I/O).

Please also provide the output of pveversion -v.
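Both can be gathered from the CLI on the sending node, for example (just a sketch):
Code:
pveversion -v    # the package versions requested above
pvesr status     # state, last sync time and fail count of all local replication jobs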
 
After a lot of investigation, I think I found the problem: the network interface had an MTU of 1500; after I changed the MTU to 1437 on all nodes, it seems to work a lot better.
Maybe this could help someone someday :)
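Roughly what I mean, for anyone hitting the same issue (a sketch; the interface name eno1 is only an example):
Code:
# non-persistent test on one node
ip link set dev eno1 mtu 1437

# check that packets of that size get through unfragmented
# (1437 MTU - 20 IP header - 8 ICMP header = 1409 bytes payload)
ping -M do -s 1409 10.10.3.10

# persistent setting: add the mtu line to the stanza in /etc/network/interfaces
iface eno1 inet manual
        mtu 1437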
 
Well I was too optimistic... the process stopped after 7 hours and 73.6G:
Code:
send/receive failed, cleaning up snapshot(s)..
command 'set -o pipefail && pvesm export ct_A:subvol-134-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_134-1_1622564262__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=scwv10' root@10.10.3.10 -- pvesm import ct_A:subvol-134-disk-0 zfs - -with-snapshots 1 -allow-rename 0' failed: exit code 255

What does it mean? How can I get a more meaningful error?
 
At the exact time of the failure, I see this in the syslog:
Code:
pvesr[26445]:  OK
 
Well I was too optimistic... the process stopped after 7 hours and 73.6G:
Code:
send/receive failed, cleaning up snapshot(s)..
command 'set -o pipefail && pvesm export ct_A:subvol-134-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_134-1_1622564262__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=scwv10' root@10.10.3.10 -- pvesm import ct_A:subvol-134-disk-0 zfs - -with-snapshots 1 -allow-rename 0' failed: exit code 255

What does it mean? How can I get a more meaningful error?
Is there an error/warning earlier in the log?

At the exact time of the failure, I see this in the syslog:
Code:
pvesr[26445]:  OK
This just means that the replication service finished running all replications (but not that all of them were successful).

I have the feeling it's correlated to the bug I referred to here: https://forum.proxmox.com/threads/pct-listsnapshot-strange-output.86675/
because at some point I see this:
Deep recursion on anonymous subroutine at /usr/share/perl5/PVE/GuestHelpers.pm line 165.
Where exactly do you see this? AFAICT that is only called when listing snapshots in a guest's configuration. The storage replication itself looks at the snapshots on the ZFS volumes directly.
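One more thing that might help narrow down the exit code 255 in the log you posted: that is also the code ssh itself returns when the connection fails, so it can be worth testing the SSH leg of the pipeline on its own (a sketch reusing the host and alias from your log):
Code:
# should print "ok" and exit with 0 if the SSH transport to the target node is fine
/usr/bin/ssh -e none -o BatchMode=yes -o HostKeyAlias=scwv10 root@10.10.3.10 -- echo ok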
 
