ZFS Replication does full instead of incremental

I updated the nodes to 7.1-6, but the problem is the same:
2021-11-25 11:43:00 692013-1: start replication job
2021-11-25 11:43:00 692013-1: guest => VM 692013, running => 11312
2021-11-25 11:43:00 692013-1: volumes => local-zfs:vm-692013-disk-0
2021-11-25 11:43:01 692013-1: delete stale replication snapshot '__replicate_692013-1_1637836747__' on local-zfs:vm-692013-disk-0
2021-11-25 11:43:01 692013-1: delete stale replication snapshot error: zfs error: cannot destroy snapshot rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__: dataset is busy

2021-11-25 11:43:02 692013-1: freeze guest filesystem
2021-11-25 11:43:02 692013-1: create snapshot '__replicate_692013-1_1637836980__' on local-zfs:vm-692013-disk-0
2021-11-25 11:43:02 692013-1: thaw guest filesystem
2021-11-25 11:43:02 692013-1: using insecure transmission, rate limit: 10 MByte/s
2021-11-25 11:43:02 692013-1: incremental sync 'local-zfs:vm-692013-disk-0' (__replicate_692013-1_1637709304__ => __replicate_692013-1_1637836980__)
2021-11-25 11:43:02 692013-1: using a bandwidth limit of 10000000 bps for transferring 'local-zfs:vm-692013-disk-0'
2021-11-25 11:43:05 692013-1: send from @__replicate_692013-1_1637709304__ to rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__ estimated size is 27.3G
2021-11-25 11:43:05 692013-1: send from @__replicate_692013-1_1637836747__ to rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836980__ estimated size is 859K
2021-11-25 11:43:05 692013-1: total estimated size is 27.3G
2021-11-25 11:43:06 692013-1: TIME SENT SNAPSHOT rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:06 692013-1: 11:43:06 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:07 692013-1: 11:43:07 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:08 692013-1: 11:43:08 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:09 692013-1: 11:43:09 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:10 692013-1: 11:43:10 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:11 692013-1: 11:43:11 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:12 692013-1: 11:43:12 619K rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__
2021-11-25 11:43:12 692013-1: 586936 B 573.2 KB 7.97 s 73612 B/s 71.89 KB/s
2021-11-25 11:43:12 692013-1: write: Broken pipe
2021-11-25 11:43:12 692013-1: warning: cannot send 'rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__': signal received
2021-11-25 11:43:12 692013-1: warning: cannot send 'rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836980__': Broken pipe
2021-11-25 11:43:12 692013-1: cannot send 'rpool/data/vm-692013-disk-0': I/O error
2021-11-25 11:43:12 692013-1: command 'zfs send -Rpv -I __replicate_692013-1_1637709304__ -- rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836980__' failed: exit code 1
2021-11-25 11:43:12 692013-1: [ve-ara-22] cannot receive incremental stream: dataset is busy
2021-11-25 11:43:12 692013-1: [ve-ara-22] command 'zfs recv -F -- rpool/data/vm-692013-disk-0' failed: exit code 1
2021-11-25 11:43:12 692013-1: delete previous replication snapshot '__replicate_692013-1_1637836980__' on local-zfs:vm-692013-disk-0
2021-11-25 11:43:13 692013-1: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-692013-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_692013-1_1637836980__ -base __replicate_692013-1_1637709304__ | /usr/bin/cstream -t 10000000' failed: exit code 2
Please try to wait a bit until all the old replication jobs have terminated (again check with ps).
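For example (just one way to check, adjust the names to your setup), on both the source and the target node:

Code:
# any replication process still left over from the previous run?
ps aux | grep -E 'zfs (send|recv)|pvesm export' | grep -v grep
# does anything still hold the stale snapshot? (here for VM 692013)
zfs holds rpool/data/vm-692013-disk-0@__replicate_692013-1_1637836747__

Once nothing is left, the stale snapshot should be removable again and the next run should go back to an incremental sync.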

And for the second VM:

2021-11-25 11:47:07 692008-0: start replication job
2021-11-25 11:47:07 692008-0: guest => VM 692008, running => 7037
2021-11-25 11:47:07 692008-0: volumes => local-zfs:vm-692008-disk-0
2021-11-25 11:47:09 692008-0: freeze guest filesystem
2021-11-25 11:47:09 692008-0: create snapshot '__replicate_692008-0_1637837227__' on local-zfs:vm-692008-disk-0
2021-11-25 11:47:09 692008-0: thaw guest filesystem
2021-11-25 11:47:09 692008-0: using insecure transmission, rate limit: 10 MByte/s
2021-11-25 11:47:09 692008-0: full sync 'local-zfs:vm-692008-disk-0' (__replicate_692008-0_1637837227__)
2021-11-25 11:47:09 692008-0: using a bandwidth limit of 10000000 bps for transferring 'local-zfs:vm-692008-disk-0'
2021-11-25 11:47:12 692008-0: full send of rpool/data/vm-692008-disk-0@__replicate_692008-0_1637837227__ estimated size is 86.7G
2021-11-25 11:47:12 692008-0: total estimated size is 86.7G
2021-11-25 11:47:12 692008-0: 1180 B 1.2 KB 0.71 s 1651 B/s 1.61 KB/s
2021-11-25 11:47:12 692008-0: write: Broken pipe
2021-11-25 11:47:12 692008-0: warning: cannot send 'rpool/data/vm-692008-disk-0@__replicate_692008-0_1637837227__': signal received
2021-11-25 11:47:12 692008-0: cannot send 'rpool/data/vm-692008-disk-0': I/O error
2021-11-25 11:47:12 692008-0: command 'zfs send -Rpv -- rpool/data/vm-692008-disk-0@__replicate_692008-0_1637837227__' failed: exit code 1
2021-11-25 11:47:12 692008-0: [ve-ara-22] volume 'rpool/data/vm-692008-disk-0' already exists
2021-11-25 11:47:12 692008-0: delete previous replication snapshot '__replicate_692008-0_1637837227__' on local-zfs:vm-692008-disk-0
2021-11-25 11:47:12 692008-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-692008-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_692008-0_1637837227__ | /usr/bin/cstream -t 10000000' failed: exit code 2
Hmm, it seems that one had an invalid replication state and tried to do a full sync. I'm afraid you'll have to remove and re-create that job (or remove the volume on the target).
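If you choose to remove the volume on the target, a rough sketch (double-check the dataset name first; this destroys the copy on the target, including all its snapshots):

Code:
# on the target node (ve-ara-22)
zfs list -r -t all rpool/data/vm-692008-disk-0
zfs destroy -r rpool/data/vm-692008-disk-0

Removing and re-creating the replication job itself can be done in the GUI under Datacenter > Replication.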
 
Many thanks to all of you for providing all these logs!

There was indeed a corner case that could lead to an inconsistent state (should be very difficult to trigger with pve-manager >= 7.1-6) and should be fixed by this proposed patch.
 
Hello Fabian,
Thank you for your answers.
This morning (PVE 7.1-6), one VM still has a problem:
[screenshot of the replication status attached]

It seems to affect some of the */2:00 replications.
For this VM, the last replication worked yesterday at 18:00.

2021-11-29 08:48:10 692010-0: start replication job
2021-11-29 08:48:10 692010-0: guest => VM 692010, running => 3436
2021-11-29 08:48:10 692010-0: volumes => local-zfs:vm-692010-disk-0
2021-11-29 08:48:12 692010-0: freeze guest filesystem
2021-11-29 08:48:13 692010-0: create snapshot '__replicate_692010-0_1638172090__' on local-zfs:vm-692010-disk-0
2021-11-29 08:48:13 692010-0: thaw guest filesystem
2021-11-29 08:48:13 692010-0: using insecure transmission, rate limit: 10 MByte/s
2021-11-29 08:48:13 692010-0: full sync 'local-zfs:vm-692010-disk-0' (__replicate_692010-0_1638172090__)
2021-11-29 08:48:13 692010-0: using a bandwidth limit of 10000000 bps for transferring 'local-zfs:vm-692010-disk-0'
2021-11-29 08:48:16 692010-0: full send of rpool/data/vm-692010-disk-0@__replicate_692010-0_1638172090__ estimated size is 44.1G
2021-11-29 08:48:16 692010-0: total estimated size is 44.1G
2021-11-29 08:48:16 692010-0: 1180 B 1.2 KB 0.57 s 2084 B/s 2.04 KB/s
2021-11-29 08:48:16 692010-0: write: Broken pipe
2021-11-29 08:48:16 692010-0: warning: cannot send 'rpool/data/vm-692010-disk-0@__replicate_692010-0_1638172090__': signal received
2021-11-29 08:48:16 692010-0: cannot send 'rpool/data/vm-692010-disk-0': I/O error
2021-11-29 08:48:16 692010-0: command 'zfs send -Rpv -- rpool/data/vm-692010-disk-0@__replicate_692010-0_1638172090__' failed: exit code 1
2021-11-29 08:48:16 692010-0: [ve-ara-22] volume 'rpool/data/vm-692010-disk-0' already exists
2021-11-29 08:48:16 692010-0: delete previous replication snapshot '__replicate_692010-0_1638172090__' on local-zfs:vm-692010-disk-0
2021-11-29 08:48:16 692010-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs:vm-692010-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_692010-0_1638172090__ | /usr/bin/cstream -t 10000000' failed: exit code 2

Can you please reply here when the patch is available for testing?

best regards.
 
PS: a suggestion: send an email to the root address given during installation when a replication error occurs, because it's a problem not to know when replication isn't working.
 
Could you please share the log from zpool history | grep vm-692010 (at least for the last day) from both source and target node and the syslog on the source node (since a bit before the last successful replication)?
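For example, something along these lines on each node (the timestamp is only an illustration, pick one a bit before the last successful run):

Code:
zpool history | grep vm-692010
journalctl --since "2021-11-28 17:00" > syslog.txt
# also useful: the snapshots currently present on the volume
zfs list -r -t all -o name,creation rpool/data/vm-692010-disk-0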

Can you please reply here when the patch is available for testing?

best regards.
Usually a few days after it has been reviewed and applied. But it should be very difficult to trigger that issue with pve-manager >= 7.1-6...

PS: a suggestion: send an email to the root address given during installation when a replication error occurs, because it's a problem not to know when replication isn't working.
That should actually already happen. Please check the configured email address for root@pam (in the UI: Datacenter > Users > Edit).
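If you prefer the CLI, the configured address should also be visible there, e.g.:

Code:
pveum user list

(check the email column for root@pam).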
 
Could you please share the log from zpool history | grep vm-692010 (at least for the last day) from both source and target node and the syslog on the source node (since a bit before the last successful replication)?
Logs are in the attached file.
 

Attachments
  • logs.zip (16.6 KB)
Logs are in the attached file.

Nov 29 00:17:06 ve-ara-23 pvescheduler[261209]: ERROR: can't lock file '/var/lock/pvesr.lck' - got timeout
Nov 29 00:17:06 ve-ara-23 pvescheduler[261209]: got shutdown request, signal running jobs to stop
Nov 29 00:17:06 ve-ara-23 pvescheduler[261209]: server stopped
It seems like the old version was still running there. Quoting myself:
Note that it is still possible to run into the issue if there are active replications while upgrading, because the buggy version is still running during the upgrade.

To avoid spam, the error mail is only sent the first time a job fails until it has succeeded again, and it does seem like mails are sent:
Code:
Nov 29 00:06:48 ve-ara-23 pvescheduler[3429192]: 692008-0: got unexpected replication job error - command 'zfs snapshot rpool/data/vm-692008-disk-0@__replicate_692008-0_1638140710__' failed: got timeout
Nov 29 00:06:48 ve-ara-23 postfix/pickup[2931815]: C8036102E16: uid=0 from=<root>
Nov 29 00:06:48 ve-ara-23 postfix/cleanup[4040362]: C8036102E16: message-id=<20211128230648.C8036102E16@ve-ara-23.dreal-ara.i2>
Nov 29 00:06:48 ve-ara-23 postfix/qmgr[3349]: C8036102E16: from=<"root@ve-ara-23"@dreal-ara.i2>, size=538, nrcpt=1 (queue active)
Nov 29 00:06:49 ve-ara-23 postfix/smtp[4040476]: C8036102E16: to=<"root@ve-ara-23"@dreal-ara.i2>, orig_to=<root>, relay=smtp.m2.e2.rie.gouv.fr[10.167.128.245]:25, delay=0.21, delays=0.03/0.01/0.11/0.06, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as EC7AB10F823)
Nov 29 00:06:49 ve-ara-23 postfix/qmgr[3349]: C8036102E16: removed
 
Strange, I did not receive it, even though I do receive the backup emails.
I found the difference:
- backup mail (received):
Nov 27 01:33:11 ve-ara-23 postfix/smtp[2300353]: B634C100820: to=<reseaux-serveurs.dreal-ara@developpement-durable.gouv.fr>, relay=smtp.m2.e2.rie.gouv.fr[10.167.128.246]:25, delay=0.22, delays=0.04/0.03/0.07/0.07, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as D671EA3DF0)
- replication error mail:
Nov 29 00:06:49 ve-ara-23 postfix/smtp[4040476]: C8036102E16: to=<"root@ve-ara-23"@dreal-ara.i2>, orig_to=<root>, relay=smtp.m2.e2.rie.gouv.fr[10.167.128.245]:25, delay=0.21, delays=0.03/0.01/0.11/0.06, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as EC7AB10F823)

to=<"root@ve-ara-23"@dreal-ara.i2> doesn't work, but to=<reseaux-serveurs.dreal-ara@developpement-durable.gouv.fr> is OK.
The configured email address for root@pam (in the UI: Datacenter > Users > Edit) is "reseaux-serveurs.dreal-ara@developpement-durable.gouv.fr", but the error mail is sent to "root@ve-ara-23"@dreal-ara.i2, which doesn't exist.
 
On a usual Proxmox VE setup, sending to root will forward to the address configured for root@pam. What are the contents of /root/.forward? Please also check your postfix configuration.
 
On a usual Proxmox VE setup, sending to root will forward to the address configured for root@pam. What are the contents of /root/.forward? Please also check your postfix configuration.
cat /root/.forward
|/usr/bin/pvemailforward

cat /etc/postfix/main.cf
myhostname = ve-ara-23.dreal-ara.i2
smtpd_banner = $myhostname ESMTP $mail_name (Debian/GNU)
biff = no
append_dot_mydomain = no
alias_maps = hash:/etc/aliases
alias_database = hash:/etc/aliases
mydestination = $myhostname, localhost.$mydomain, localhost
relayhost = smtp.m2.e2.rie.gouv.fr
mynetworks = 127.0.0.0/8
inet_interfaces = loopback-only
recipient_delimiter = +
myorigin = ve-ara-23@dreal-ara.i2

Standard config, with "myorigin" added, because without it I didn't receive any mail.

It seems mail is not being forwarded to the address configured for root@pam.
 
Not a mail expert, but I think it might be because you set a relayhost, so the mail doesn't get processed locally anymore and so pvemailforward doesn't trigger.
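Not a definitive diagnosis either, but a neutral way to see which destinations postfix still delivers locally (and what it appends to a bare "root") is to check the effective values:

Code:
postconf myorigin mydestination relayhost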
 
I think at least the @ part of your myorigin value is wrong here - see http://www.postfix.org/postconf.5.html#myorigin

How does the mail have to look so that it's accepted by your relay host?
What's the error if you remove the myorigin line (and restart postfix)?
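If it helps with testing, a quick way to send a test mail down the same path by hand (sendmail ships with postfix; the subject and body are arbitrary):

Code:
systemctl restart postfix
printf 'Subject: forward test\n\ntest body\n' | sendmail root
tail -n 50 /var/log/mail.log    # or check the journal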
First of all, thanks for looking into my problem :)

Without "myorigin" the forward is ok but " Access denied" by smtp serveur
With "myorigin, no forward and no message because unknown recipient
So I get in touch with my messaging support.
Thanks a lot.
 
