ZFS Replication does full instead of incremental

I updated the nodes to 7.1-6, but the problem is the same:
2021-11-25 11:43:00 692013-1: start replication job
2021-11-25 11:43:00 692013-1: guest => VM 692013, running => 11312
2021-11-25 11:43:00 692013-1: volumes => local-zfs:vm-692013-disk-0
2021-11-25 11:43:01 692013-1: delete stale replication snapshot '__replicate_692013-1_1637836747__' on local-zfs:vm-692013-disk-0
2021-11-25 11:43:01 692013-1: delete stale replication snapshot error: zfs error: cannot destroy snapshot rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__: dataset is busy

2021-11-25 11:43:02 692013-1: freeze guest filesystem
2021-11-25 11:43:02 692013-1: create snapshot '__replicate_692013-1_1637836980__' on local-zfs:vm-692013-disk-0
2021-11-25 11:43:02 692013-1: thaw guest filesystem
2021-11-25 11:43:02 692013-1: using insecure transmission, rate limit: 10 MByte/s
2021-11-25 11:43:02 692013-1: incremental sync 'local-zfs:vm-692013-disk-0' (__replicate_692013-1_1637709304__ => __replicate_692013-1_1637836980__)
2021-11-25 11:43:02 692013-1: using a bandwidth limit of 10000000 bps for transferring 'local-zfs:vm-692013-disk-0'
2021-11-25 11:43:05 692013-1: send from @__replicate_692013-1_1637709304__ to rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__ estimated size is 27.3G
2021-11-25 11:43:05 692013-1: send from @__replicate_692013-1_1637836747__ to rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836980__ estimated size is 859K
2021-11-25 11:43:05 692013-1: total estimated size is 27.3G
2021-11-25 11:43:06 692013-1: TIME SENT SNAPSHOT rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:06 692013-1: 11:43:06 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:07 692013-1: 11:43:07 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:08 692013-1: 11:43:08 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:09 692013-1: 11:43:09 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:10 692013-1: 11:43:10 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:11 692013-1: 11:43:11 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:12 692013-1: 11:43:12 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:12 692013-1: 586936 B 573.2 KB 7.97 s 73612 B/s 71.89 KB/s
2021-11-25 11:43:12 692013-1: write: Broken pipe
2021-11-25 11:43:12 692013-1: warning: cannot send 'rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__': signal received
2021-11-25 11:43:12 692013-1: warning: cannot send 'rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836980__': Broken pipe
2021-11-25 11:43:12 692013-1: cannot send 'rpool/data/vm-692013-disk-0': I/O error
2021-11-25 11:43:12 692013-1: command 'zfs send -Rpv -I __replicate_692013-1_1637709304__ -- rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836980__' failed: exit code 1
2021-11-25 11:43:12 692013-1: [ve-ara-22] cannot receive incremental stream: dataset is busy
2021-11-25 11:43:12 692013-1: [ve-ara-22] command 'zfs recv -F -- rpool/data/vm-692013-disk-0' failed: exit code 1
2021-11-25 11:43:12 692013-1: delete previous replication snapshot '__replicate_692013-1_1637836980__' on local-zfs:vm-692013-disk-0
2021-11-25 11:43:13 692013-1: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-692013-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_692013-1_1637836980__ -base __replicate_692013-1_1637709304__ | /usr/bin/cstream -t 10000000' failed: exit code 2
Please try to wait a bit until all the old replication jobs have terminated (again check with ps).
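For example (just one way to check, adjust the names to your setup), on both the source and the target node:

Code:
# any replication process still left over from the previous run?
ps aux | grep -E 'zfs (send|recv)|pvesm export' | grep -v grep
# does anything still hold the stale snapshot? (here for VM 692013)
zfs holds rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__

Once nothing is left, the stale snapshot should be removable again and the next run should go back to an incremental sync.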

And for the second VM:

2021-11-25 11:47:07 692008-0: start replication job
2021-11-25 11:47:07 692008-0: guest => VM 692008, running => 7037
2021-11-25 11:47:07 692008-0: volumes => local-zfs:vm-692008-disk-0
2021-11-25 11:47:09 692008-0: freeze guest filesystem
2021-11-25 11:47:09 692008-0: create snapshot '__replicate_692008-0_1637837227__' on local-zfs:vm-692008-disk-0
2021-11-25 11:47:09 692008-0: thaw guest filesystem
2021-11-25 11:47:09 692008-0: using insecure transmission, rate limit: 10 MByte/s
2021-11-25 11:47:09 692008-0: full sync 'local-zfs:vm-692008-disk-0' (__replicate_692008-0_1637837227__)
2021-11-25 11:47:09 692008-0: using a bandwidth limit of 10000000 bps for transferring 'local-zfs:vm-692008-disk-0'
2021-11-25 11:47:12 692008-0: full send of rpool/data/vm-692008-disk-0@__replicate_692008-0_1637837227__ estimated size is 86.7G
2021-11-25 11:47:12 692008-0: total estimated size is 86.7G
2021-11-25 11:47:12 692008-0: 1180 B 1.2 KB 0.71 s 1651 B/s 1.61 KB/s
2021-11-25 11:47:12 692008-0: write: Broken pipe
2021-11-25 11:47:12 692008-0: warning: cannot send 'rpool/data/vm-692008-disk-0@__replicate_692008-0_1637837227__': signal received
2021-11-25 11:47:12 692008-0: cannot send 'rpool/data/vm-692008-disk-0': I/O error
2021-11-25 11:47:12 692008-0: command 'zfs send -Rpv -- rpool/data/vm-692008-disk-0@__replicate_692008-0_1637837227__' failed: exit code 1
2021-11-25 11:47:12 692008-0: [ve-ara-22] volume 'rpool/data/vm-692008-disk-0' already exists
2021-11-25 11:47:12 692008-0: delete previous replication snapshot '__replicate_692008-0_1637837227__' on local-zfs:vm-692008-disk-0
2021-11-25 11:47:12 692008-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-692008-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_692008-0_1637837227__ | /usr/bin/cstream -t 10000000' failed: exit code 2
Hmm, it seems that one had an invalid replication state and tried to do a full sync. I'm afraid you'll have to remove and re-create that job (or remove the volume on the target).
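If you choose to remove the volume on the target, a rough sketch (double-check the dataset name first; this destroys the copy on the target, including all its snapshots):

Code:
# on the target node (ve-ara-22)
zfs list -r -t all rpool/data/vm-692008-disk-0
zfs destroy -r rpool/data/vm-692008-disk-0

Removing and re-creating the replication job itself can be done in the GUI under Datacenter > Replication.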
 
Many thanks to all of you for providing all these logs!

There was indeed a corner case that could lead to an inconsistent state (should be very difficult to trigger with pve-manager >= 7.1-6) and should be fixed by this proposed patch.
 
Hello Fabian,
Thank you for your answers.
This morning (PVE 7.1-6), one VM still has a problem:
[screenshot of the replication status attached]

It seems to affect some of the */2:00 replications.
For this VM, the last replication worked yesterday at 18:00.

2021-11-29 08:48:10 692010-0: start replication job
2021-11-29 08:48:10 692010-0: guest => VM 692010, running => 3436
2021-11-29 08:48:10 692010-0: volumes => local-zfs:vm-692010-disk-0
2021-11-29 08:48:12 692010-0: freeze guest filesystem
2021-11-29 08:48:13 692010-0: create snapshot '__replicate_692010-0_1638172090__' on local-zfs:vm-692010-disk-0
2021-11-29 08:48:13 692010-0: thaw guest filesystem
2021-11-29 08:48:13 692010-0: using insecure transmission, rate limit: 10 MByte/s
2021-11-29 08:48:13 692010-0: full sync 'local-zfs:vm-692010-disk-0' (__replicate_692010-0_1638172090__)
2021-11-29 08:48:13 692010-0: using a bandwidth limit of 10000000 bps for transferring 'local-zfs:vm-692010-disk-0'
2021-11-29 08:48:16 692010-0: full send of rpool/data/vm-692010-disk-0@__replicate_692010-0_1638172090__ estimated size is 44.1G
2021-11-29 08:48:16 692010-0: total estimated size is 44.1G
2021-11-29 08:48:16 692010-0: 1180 B 1.2 KB 0.57 s 2084 B/s 2.04 KB/s
2021-11-29 08:48:16 692010-0: write: Broken pipe
2021-11-29 08:48:16 692010-0: warning: cannot send 'rpool/data/vm-692010-disk-0@__replicate_692010-0_1638172090__': signal received
2021-11-29 08:48:16 692010-0: cannot send 'rpool/data/vm-692010-disk-0': I/O error
2021-11-29 08:48:16 692010-0: command 'zfs send -Rpv -- rpool/data/vm-692010-disk-0@__replicate_692010-0_1638172090__' failed: exit code 1
2021-11-29 08:48:16 692010-0: [ve-ara-22] volume 'rpool/data/vm-692010-disk-0' already exists
2021-11-29 08:48:16 692010-0: delete previous replication snapshot '__replicate_692010-0_1638172090__' on local-zfs:vm-692010-disk-0
2021-11-29 08:48:16 692010-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-692010-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_692010-0_1638172090__ | /usr/bin/cstream -t 10000000' failed: exit code 2

Can you please reply here when the patch is available for testing?

best regards.
 
PS: a suggestion: send an email to the root address given during installation when a replication error occurs, because it's a problem not to know when replication isn't working.
 
Could you please share the log from zpool history | grep vm-692010 (at least for the last day) from both source and target node and the syslog on the source node (since a bit before the last successful replication)?
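For example, something along these lines on each node (the timestamp is only an illustration, pick one a bit before the last successful run):

Code:
zpool history | grep vm-692010
journalctl --since "2021-11-28 17:00" > syslog.txt
# also useful: the snapshots currently present on the volume
zfs list -r -t all -o name,creation rpool/data/vm-692010-disk-0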

Can you please reply here when the patch is available for testing?

best regards.
Usually a few days after it has been reviewed and applied. But it should be very difficult to trigger that issue with pve-manager >= 7.1-6...

PS: a suggestion: send an email to the root address given during installation when a replication error occurs, because it's a problem not to know when replication isn't working.
That should actually already happen. Please check the configured email address for root@pam (in the UI: Datacenter > Users > Edit).
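If you prefer the CLI, the configured address should also be visible there, e.g.:

Code:
pveum user list

(check the email column for root@pam).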
 
Could you please share the log from zpool history | grep vm-692010 (at least for the last day) from both source and target node and the syslog on the source node (since a bit before the last successful replication)?
Logs are in the attached file.
 

Attachments
  • logs.zip (16.6 KB)
Logs are in the attached file.

Nov 29 00:17:06 ve-ara-23 pvescheduler[261209]: ERROR: can't lock file '/var/lock/pvesr.lck' - got timeout
Nov 29 00:17:06 ve-ara-23 pvescheduler[261209]: got shutdown request, signal running jobs to stop
Nov 29 00:17:06 ve-ara-23 pvescheduler[261209]: server stopped
It seems like the old version was still running there. Quoting myself:
Note that it is still possible to run into the issue if there are active replications while upgrading, because the buggy version is still running during the upgrade.

To avoid spam, the error mail is only sent the first time a job fails until it has succeeded again, and it does seem like mails are sent:
Code:
Nov 29 00:06:48 ve-ara-23 pvescheduler[3429192]: 692008-0: got unexpected replication job error - command 'zfs snapshot rpool/data/vm-692008-disk-0@__replicate_692008-0_1638140710__' failed: got timeout
Nov 29 00:06:48 ve-ara-23 postfix/pickup[2931815]: C8036102E16: uid=0 from=<root>
Nov 29 00:06:48 ve-ara-23 postfix/cleanup[4040362]: C8036102E16: message-id=<20211128230648.C8036102E16@ve-ara-23.dreal-ara.i2>
Nov 29 00:06:48 ve-ara-23 postfix/qmgr[3349]: C8036102E16: from=<"root@ve-ara-23"@dreal-ara.i2>, size=538, nrcpt=1 (queue active)
Nov 29 00:06:49 ve-ara-23 postfix/smtp[4040476]: C8036102E16: to=<"root@ve-ara-23"@dreal-ara.i2>, orig_to=<root>, relay=smtp.m2.e2.rie.gouv.fr[10.167.128.245]:25, delay=0.21, delays=0.03/0.01/0.11/0.06, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as EC7AB10F823)
Nov 29 00:06:49 ve-ara-23 postfix/qmgr[3349]: C8036102E16: removed
 
Strange, I did not receive it, even though I do receive the backup emails.
I found the difference:
- backup mail (received):
Nov 27 01:33:11 ve-ara-23 postfix/smtp[2300353]: B634C100820: to=<reseaux-serveurs.dreal-ara@developpement-durable.gouv.fr>, relay=smtp.m2.e2.rie.gouv.fr[10.167.128.246]:25, delay=0.22, delays=0.04/0.03/0.07/0.07, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as D671EA3DF0)
- replication error mail:
Nov 29 00:06:49 ve-ara-23 postfix/smtp[4040476]: C8036102E16: to=<"root@ve-ara-23"@dreal-ara.i2>, orig_to=<root>, relay=smtp.m2.e2.rie.gouv.fr[10.167.128.245]:25, delay=0.21, delays=0.03/0.01/0.11/0.06, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as EC7AB10F823)

to=<"root@ve-ara-23"@dreal-ara.i2> doesn't work, but to=<reseaux-serveurs.dreal-ara@developpement-durable.gouv.fr> is OK.
The configured email address for root@pam (in the UI: Datacenter > Users > Edit) is "reseaux-serveurs.dreal-ara@developpement-durable.gouv.fr", but the error mail is sent to "root@ve-ara-23"@dreal-ara.i2, which doesn't exist.
 
On a usual Proxmox VE setup, sending to root will forward to the address configured for root@pam. What are the contents of /root/.forward? Please also check your postfix configuration.
 
On a usual Proxmox VE setup, sending to root will forward to the address configured for root@pam. What are the contents of /root/.forward? Please also check your postfix configuration.
cat /root/.forward
|/usr/bin/pvemailforward

cat /etc/postfix/main.cf
myhostname = ve-ara-23.dreal-ara.i2
smtpd_banner = $myhostname ESMTP $mail_name (Debian/GNU)
biff = no
append_dot_mydomain = no
alias_maps = hash:/etc/aliases
alias_database = hash:/etc/aliases
mydestination = $myhostname, localhost.$mydomain, localhost
relayhost = smtp.m2.e2.rie.gouv.fr
mynetworks = 127.0.0.0/8
inet_interfaces = loopback-only
recipient_delimiter = +
myorigin = ve-ara-23@dreal-ara.i2

Standard config, with "myorigin" added, because without it I didn't receive any mail.

It seems mail is not being forwarded to the address configured for root@pam.
 
Not a mail expert, but I think it might be because you set a relayhost, so the mail doesn't get processed locally anymore and so pvemailforward doesn't trigger.
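Not a definitive diagnosis either, but a neutral way to see which destinations postfix still delivers locally (and what it appends to a bare "root") is to check the effective values:

Code:
postconf myorigin mydestination relayhost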
 
I think at least the @ part of your myorigin value is wrong here - see http://www.postfix.org/postconf.5.html#myorigin

How does the mail have to look so that it's accepted by your relay host?
What's the error if you remove the myorigin line (and restart postfix)?
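If it helps with testing, a quick way to send a test mail down the same path by hand (sendmail ships with postfix; the subject and body are arbitrary):

Code:
systemctl restart postfix
printf 'Subject: forward test\n\ntest body\n' | sendmail root
tail -n 50 /var/log/mail.log    # or check the journal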
First of all, thanks for looking into my problem :)

Without "myorigin" the forward is ok but " Access denied" by smtp serveur
With "myorigin, no forward and no message because unknown recipient
So I get in touch with my messaging support.
Thanks a lot.
 
