Manually fixing ZFS replication?

Dec 16, 2018
Often, especially when migrating VMs back and forth (A -> B and then back, B -> A) during maintenance, my ZFS replication gets into a state where it fails with errors like:

"volume 'ssdtank/vmdata/vm-117-disk-0' already exists"
or claims target and source don't have a common ancestor version.

"Already exists" is of course true, but there most definitely is a common ancestor -- the volume was just migrated from A->B, but migrating it back (B->A) now fails. Proxmox has clearly lost some metadata in the process, since it consider the volume existing on A a surprise ("already exists").

The problem is, vm-117-disk-0 is 10.5 TB, so I really can't always delete it and re-sync from scratch.

What might this lost metadata be, and is there a way to fix this state manually so that replication can resume?
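
(For anyone answering: my rough understanding is that Proxmox keeps its own record of the last successfully replicated snapshot per job, outside of ZFS itself. This is how I've been poking at that state so far -- pvesr is the replication CLI, and the JSON path is just what I believe it to be on my install, so treat it as an assumption:)

Code:
# Replication job status as Proxmox sees it (the 117-3 style job IDs come from here)
pvesr status

# As far as I can tell, the per-job "last synced snapshot" state is kept in this
# JSON file on the source node -- the path may differ between versions:
cat /var/lib/pve-manager/pve-replication-state.json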
 
Does anyone know what exactly is complaining "already exists" here, btw?
It seems nonsensical -- of course vm-117-disk-0 already exists; otherwise an incremental sync wouldn't be possible at all, right? Or what am I missing here?
 
Hi,

I have never noticed this behavior.
When you say you migrate back and forth, do you wait until the sync of the first job is done?
And are all states OK before you migrate back?
 
Yes, I've had it without forcing a migration. I'm not sure exactly when this happens; it seems a bit arbitrary so far. Maybe only with VMs that have multiple disks?

Anyway, I'd like to know how to recover manually when this occurs. Any idea what that "already exists" actually refers to? Is there perhaps something I could remove or rename to bypass that error?

Or maybe "already exists" isn't the (or even an) error at all? Maybe "cannot send ... signal received" is caused by something else? The string "I/O error" in the log also looks bad, but the zpools are clean and dmesg doesn't report any errors either.

Here's a full example:

Code:
2019-04-01 22:59:00 117-3: volumes => local-zfs-ssd-images:vm-117-disk-0,zvol-moxa-optimized:vm-117-disk-1
2019-04-01 22:59:01 117-3: freeze guest filesystem
2019-04-01 22:59:01 117-3: create snapshot '__replicate_117-3_1554148740__' on local-zfs-ssd-images:vm-117-disk-0
2019-04-01 22:59:01 117-3: create snapshot '__replicate_117-3_1554148740__' on zvol-moxa-optimized:vm-117-disk-1
2019-04-01 22:59:01 117-3: thaw guest filesystem
2019-04-01 22:59:01 117-3: full sync 'local-zfs-ssd-images:vm-117-disk-0' (__replicate_117-3_1554148740__)
2019-04-01 22:59:02 117-3: full send of ssdtank/vmdata/vm-117-disk-0@__replicate_117-4_1553966052__ estimated size is 15.5G
2019-04-01 22:59:02 117-3: send from @__replicate_117-4_1553966052__ to ssdtank/vmdata/vm-117-disk-0@__replicate_117-2_1554142200__ estimated size is 1.30G
2019-04-01 22:59:02 117-3: send from @__replicate_117-2_1554142200__ to ssdtank/vmdata/vm-117-disk-0@__replicate_117-3_1554148740__ estimated size is 2.43G
2019-04-01 22:59:02 117-3: total estimated size is 19.2G
2019-04-01 22:59:02 117-3: TIME        SENT   SNAPSHOT
2019-04-01 22:59:02 117-3: ssdtank/vmdata/vm-117-disk-0    name    ssdtank/vmdata/vm-117-disk-0    -
2019-04-01 22:59:02 117-3: volume 'ssdtank/vmdata/vm-117-disk-0' already exists
2019-04-01 22:59:02 117-3: warning: cannot send 'ssdtank/vmdata/vm-117-disk-0@__replicate_117-4_1553966052__': signal received
2019-04-01 22:59:02 117-3: warning: cannot send 'ssdtank/vmdata/vm-117-disk-0@__replicate_117-2_1554142200__': Broken pipe
2019-04-01 22:59:02 117-3: TIME        SENT   SNAPSHOT
2019-04-01 22:59:02 117-3: warning: cannot send 'ssdtank/vmdata/vm-117-disk-0@__replicate_117-3_1554148740__': Broken pipe
2019-04-01 22:59:02 117-3: cannot send 'ssdtank/vmdata/vm-117-disk-0': I/O error
2019-04-01 22:59:02 117-3: command 'zfs send -Rpv -- ssdtank/vmdata/vm-117-disk-0@__replicate_117-3_1554148740__' failed: exit code 1
2019-04-01 22:59:02 117-3: delete previous replication snapshot '__replicate_117-3_1554148740__' on local-zfs-ssd-images:vm-117-disk-0
2019-04-01 22:59:02 117-3: delete previous replication snapshot '__replicate_117-3_1554148740__' on zvol-moxa-optimized:vm-117-disk-1
2019-04-01 22:59:02 117-3: end replication job with error: command 'set -o pipefail && pvesm export local-zfs-ssd-images:vm-117-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_117-3_1554148740__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=mox-d' root@10.68.68.6 -- pvesm import local-zfs-ssd-images:vm-117-disk-0 zfs - -with-snapshots 1' failed: exit code 255

All the while, replicating vm-117 to a different node works, and syncing other VMs to the node that errors out on vm-117 works too.
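
In case it helps, this is the kind of manual, ZFS-level recovery I have in mind (the target address and dataset name are taken from the log above; <common-snap> and <newer-snap> are placeholders, and I honestly don't know whether Proxmox's replication state would accept a volume that was caught up behind its back):

Code:
# On the source node: which replication snapshots do we still have?
zfs list -r -t snapshot -o name,creation ssdtank/vmdata/vm-117-disk-0

# On the target node (10.68.68.6 in the log above): what does it hold?
ssh root@10.68.68.6 zfs list -r -t snapshot -o name,creation ssdtank/vmdata/vm-117-disk-0

# If some snapshot still exists on both sides, an incremental send from it
# should avoid pushing all 10.5 TB again:
zfs send -i ssdtank/vmdata/vm-117-disk-0@<common-snap> \
    ssdtank/vmdata/vm-117-disk-0@<newer-snap> \
  | ssh root@10.68.68.6 zfs recv -F ssdtank/vmdata/vm-117-disk-0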
 
I'm having the same issue: when I move a VM back to the original host, replication fails and I have to manually delete the snapshots of the affected VM to get replication working again. Here are the logs from a failure:

Code:
2019-04-03 08:28:03 100-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs-hdd:vm-100-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_100-0_1554272880__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=porthos' root@10.1.0.10 -- pvesm import local-zfs-hdd:vm-100-disk-1 zfs - -with-snapshots 1' failed: exit code 255

Then I have to go to the node that all the VMs replicate to and delete the disk manually... I can reproduce this every time, for all VMs.
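
Concretely, the cleanup I do on the target node looks roughly like this -- acceptable for me only because the disks are small enough to do a full resync afterwards (make very sure you are on the replication target, not the source):

Code:
# On the replication target only: check what is actually there first ...
zfs list -r -t all rpool-hdd/vm-100-disk-1

# ... then drop the stale copy together with its old __replicate_*__ snapshots,
# so the next scheduled run starts over with a full sync.
zfs destroy -r rpool-hdd/vm-100-disk-1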
 
Do you have any special ssh or bash settings?
 
@saphirblanc
The complete log output is needed to prove that you have the same error.
Your output only tells us that it does not work, but not where or why it stops.
 
We connect to the host using a different SSH port; however, the nodes can talk to each other.

Sorry, here we go:
Code:
2019-04-03 09:43:00 100-0: start replication job
2019-04-03 09:43:00 100-0: guest => VM 100, running => 0
2019-04-03 09:43:00 100-0: volumes => local-zfs-hdd:vm-100-disk-1
2019-04-03 09:43:01 100-0: create snapshot '__replicate_100-0_1554277380__' on local-zfs-hdd:vm-100-disk-1
2019-04-03 09:43:01 100-0: full sync 'local-zfs-hdd:vm-100-disk-1' (__replicate_100-0_1554277380__)
2019-04-03 09:43:02 100-0: full send of rpool-hdd/vm-100-disk-1@__replicate_100-0_1554272400__ estimated size is 2.39G
2019-04-03 09:43:02 100-0: send from @__replicate_100-0_1554272400__ to rpool-hdd/vm-100-disk-1@__replicate_100-0_1554277380__ estimated size is 624B
2019-04-03 09:43:02 100-0: total estimated size is 2.39G
2019-04-03 09:43:02 100-0: rpool-hdd/vm-100-disk-1    name    rpool-hdd/vm-100-disk-1    -
2019-04-03 09:43:02 100-0: volume 'rpool-hdd/vm-100-disk-1' already exists
2019-04-03 09:43:02 100-0: TIME        SENT   SNAPSHOT
2019-04-03 09:43:02 100-0: warning: cannot send 'rpool-hdd/vm-100-disk-1@__replicate_100-0_1554272400__': Broken pipe
2019-04-03 09:43:02 100-0: TIME        SENT   SNAPSHOT
2019-04-03 09:43:02 100-0: warning: cannot send 'rpool-hdd/vm-100-disk-1@__replicate_100-0_1554277380__': Broken pipe
2019-04-03 09:43:03 100-0: cannot send 'rpool-hdd/vm-100-disk-1': I/O error
2019-04-03 09:43:03 100-0: command 'zfs send -Rpv -- rpool-hdd/vm-100-disk-1@__replicate_100-0_1554277380__' failed: exit code 1
2019-04-03 09:43:03 100-0: delete previous replication snapshot '__replicate_100-0_1554277380__' on local-zfs-hdd:vm-100-disk-1
2019-04-03 09:43:03 100-0: end replication job with error: command 'set -o pipefail && pvesm export local-zfs-hdd:vm-100-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_100-0_1554277380__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=porthos' root@10.1.0.10 -- pvesm import local-zfs-hdd:vm-100-disk-1 zfs - -with-snapshots 1' failed: exit code 255

And there's nothing really special about the SSH setup. Just to mention it, all other replication tasks are running without any problem.
 
