Step-by-step ZFS storage replication

Mar 26, 2019
Hi all,

I've been trying to set up two-node storage replication on the latest 5.3 version for a while, to no avail :(

I have 2 identical hosts with Proxmox installed on a 1TB virtual disk (split into local and local-LVM).
In addition each server has another 4TB virtual disk with no data yet.

I was thinking of running all VMs, CTs, and the file server on the bigger disk, utilising ZFS and replication, so that node1 hosts all live data, which is periodically replicated to node2.
If node1 fails completely, node2 should be able to take over within minutes, losing only data that has changed since the last successful replication.
I may also want to fail over on purpose, e.g. for testing, updates, or hardware maintenance.
In that case the replication should be easily reversible, i.e. node2 -> node1.

I started with having node1 and node2 join a Datacenter cluster.

What are the next steps?
Node1 -> Disks -> ZFS ?
Node2 -> Disks -> ZFS ?
Datacenter -> Storage -> Add -> ZFS ?
Datacenter -> Permissions -> Pools ?

I think at some point I need to give the pools identical names, which doesn't seem possible for ZFS disks.

Replication jobs appear to be documented in detail and seem quite intuitive, but I couldn't find a similar guide on the appropriate ZFS setup.
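For example, the docs suggest a replication job can be created from the CLI along these lines (the job ID 100-0 and target node2 are illustrative):

root@node1:~# pvesr create-local-job 100-0 node2 --schedule '*/15'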

Or maybe I should look into using pve-zsync or something else?

Please advise.

Regards,
Adam
 
To be more specific this is my current set up:

----------------------------------------------------------------------------------------------------------------------------------

root@node1:~# zpool status
  pool: local-zfs-node1
 state: ONLINE
  scan: none requested
config:

        NAME               STATE     READ WRITE CKSUM
        local-zfs-node1    ONLINE       0     0     0
          sdb              ONLINE       0     0     0

errors: No known data errors

root@node1:~# zpool list
NAME              SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
local-zfs-node1  3.62T   408K  3.62T         -     0%     0%  1.00x  ONLINE  -

----------------------------------------------------------------------------------------------------------------------------------

root@node2:~# zpool status
  pool: local-zfs-node2
 state: ONLINE
  scan: none requested
config:

        NAME               STATE     READ WRITE CKSUM
        local-zfs-node2    ONLINE       0     0     0
          sdb              ONLINE       0     0     0

errors: No known data errors

root@node2:~# zpool list
NAME              SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
local-zfs-node2  3.62T   456K  3.62T         -     0%     0%  1.00x  ONLINE  -

----------------------------------------------------------------------------------------------------------------------------------

In the GUI, under node1, I also have:
zfs-storage (node1)

Node2:
zfs-storage (node2), shown with a question-mark status icon and the error:
could not activate storage 'zfs-storage', zfs error: cannot import 'local-zfs-node1': no such pool available (500)
 
I would rename the pools on both nodes to the same name, then create one storage entry with exactly that name; replication should then work out of the box.

ZFS pool renaming works with zpool export and import; see the manpages for more info.
Note that the disks in existing VM configs may reference the old pool names; these would have to be changed manually.
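A minimal sketch of the rename on node1, assuming the shared name should be mypool (make sure no guests are using the pool at the time):

root@node1:~# zpool export local-zfs-node1
root@node1:~# zpool import local-zfs-node1 mypool
root@node1:~# zpool list

The same commands apply on node2 with local-zfs-node2.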
 
If you want to restart from scratch:

Node1 -> Disks -> ZFS -> Create: ZFS
Make sure that "Add Storage" is ticked.
It should appear in Disks -> ZFS and in the Resource Tree on the left.

Node2 -> Disks -> ZFS -> Create: ZFS
Make sure that
  1. the name equals the one on Node1
  2. "Add Storage" is not ticked.
It should appear in Disks -> ZFS only.

Additionally, an entry was created in Datacenter -> Storage. Edit this entry and add Node2. The ZFS pool should now appear as storage for Node2 in the Resource Tree as well.



root@Node1:~# zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
mypool  31.8G   408K  31.7G         -     0%     0%  1.00x  ONLINE  -

root@Node2:~# zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
mypool  31.8G  5.71M  31.7G         -     0%     0%  1.00x  ONLINE  -
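For reference, a rough CLI equivalent of the GUI steps above, with mypool and /dev/sdb purely as examples:

root@Node1:~# zpool create mypool /dev/sdb
root@Node2:~# zpool create mypool /dev/sdb
root@Node1:~# pvesm add zfspool mypool --pool mypool --nodes Node1,Node2 --content images,rootdir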
 
Thank you, that's exactly what I was missing.

I unplugged the network cables from node1 and can now run the same VM (101) on node2 (state from the last successful replication).

I've achieved it as below:
- pvecm e 1
- created new VM with the same name on node2 with ID 201
- edited /etc/pve/nodes/node2/qemu-server/201.conf and repointed the disk to 101
- started VM

Is there any better way?

Clean up appears to be a bit of an issue:

Cannot remove image, a guest with VMID '201' exists!
You can delete the image from the guest's hardware pane

Is it advisable to delete it manually?

Where do I find it? My ZFS pool reports as empty:

my-zfs-pool zfs 3.5T 128K 3.5T 1% /my-zfs-pool
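I suspect df simply doesn't show zvol-backed disks; presumably zfs list would reveal any leftover image, which could then be destroyed (the dataset name below is only a guess):

root@node2:~# zfs list
root@node2:~# zfs destroy my-zfs-pool/vm-201-disk-0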

Another concern is re-enabling node1.

As soon as I do, it will start replicating from outdated disk images, quickly overwriting newer data, right?

What's the best way around it?
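I'm guessing the stale job should first be disabled (or deleted) before reconnecting, along these lines (job ID illustrative):

root@node2:~# pvesr disable 101-0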
 
Thanks again Dominic.

Again I've tried a failed-primary-host scenario by unplugging the network cable.
I followed the steps in the error handling section: forced quorum, copied the config file, started the VM. Looks OK.
I can now see that the replication has been reversed (target changed from node2 to node1).
What's the next step before bringing node1 back online?

BTW, in my previous tests, when I reinserted the network cable into node1 and allowed the reversed replication to run, I quickly found myself in a split-brain situation.
After a couple of back-and-forth failovers, the replication jobs reported no errors while both nodes were online, yet the VM continued with two separate data sets living their independent lives.
It was as if I had two disk images, even though I can see no hard evidence of that.
Have you seen anything like that before?
 
What's the next step before bringing node1 back online?

There should be no need for additional steps before bringing node1 back online.

Have you seen anything like that before?

This is strange. You should be able to move your VM around by moving its config file between the two nodes with a single mv. Making a completely fresh start might save you some trouble.
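For example, in a sketch matching the setup above (VM 102 failing over from node1 to node2):

root@node2:~# mv /etc/pve/nodes/node1/qemu-server/102.conf /etc/pve/nodes/node2/qemu-server/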
 
Hello again Dominic,

Right, so I've decided to give it a fresh start:

1. Created Debian 9 VM3 on node1.
A single disk on my-zfs-pool created earlier.

2. Logged into VM3 and made a change I can easily track (touch `date "+%Y%m%d%H%M%S"` in /var/tmp/)

3. Set up a replication job to node2 with the default 15 mins interval.
The initial sync went ok, I waited and allowed it to make one more scheduled sync.

4. Simulated node1 abrupt failure by pulling the network cable.

5. Logged into node2 and run:
pvecm e 1
cd /etc/pve/nodes/
mv node1/qemu-server/102.conf node2/qemu-server/
qm start 102

6. Logged into VM3, now running on node2:
MAC address preserved and my change from step 2 still there. Great!
I've noticed under VM3's menu that the replication job has been reversed with target changing from node2 to node1.
The job had a failed status since node1 was still offline.
I made a change on VM3 (touch `date "+%Y%m%d%H%M%S"` in /var/tmp/).

7. I plugged the network cable back to node1.
The (still reversed) replication job for VM3 kicked off and completed without errors.
I left the cluster happily running for about an hour.
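For completeness, the job state can also be checked from the CLI on the active node:

root@node2:~# pvesr status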

8. After that time I attempted to migrate VM3 back to node1.

First unsuccessfully in running (online) mode:

2019-03-29 12:30:55 starting migration of VM 102 to node 'node1' (192.168.8.107)
2019-03-29 12:30:55 found local disk 'my-zfs-pool:vm-102-disk-0' (in current VM config)
2019-03-29 12:30:55 can't migrate local disk 'my-zfs-pool:vm-102-disk-0': can't live migrate attached local disks without with-local-disks option
2019-03-29 12:30:55 ERROR: Failed to sync data - can't migrate VM - check log
2019-03-29 12:30:55 aborting phase 1 - cleanup resources
2019-03-29 12:30:55 ERROR: migration aborted (duration 00:00:00): Failed to sync data - can't migrate VM - check log
TASK ERROR: migration aborted

Then successfully in poweroff (offline) mode:

2019-03-29 12:35:28 starting migration of VM 102 to node 'node1' (192.168.8.107)
2019-03-29 12:35:29 found local disk 'my-zfs-pool:vm-102-disk-0' (in current VM config)
2019-03-29 12:35:29 copying disk images
2019-03-29 12:35:29 start replication job
2019-03-29 12:35:29 guest => VM 102, running => 0
2019-03-29 12:35:29 volumes => my-zfs-pool:vm-102-disk-0
2019-03-29 12:35:29 create snapshot '__replicate_102-0_1553862929__' on my-zfs-pool:vm-102-disk-0
2019-03-29 12:35:29 incremental sync 'my-zfs-pool:vm-102-disk-0' (__replicate_102-0_1553862601__ => __replicate_102-0_1553862929__)
2019-03-29 12:35:30 send from @__replicate_102-0_1553862601__ to my-zfs-pool/vm-102-disk-0@__replicate_102-0_1553862929__ estimated size is 772K
2019-03-29 12:35:30 total estimated size is 772K
2019-03-29 12:35:30 TIME SENT SNAPSHOT
2019-03-29 12:35:30 my-zfs-pool/vm-102-disk-0@__replicate_102-0_1553862601__ name my-zfs-pool/vm-102-disk-0@__replicate_102-0_1553862601__ -
2019-03-29 12:35:35 delete previous replication snapshot '__replicate_102-0_1553862601__' on my-zfs-pool:vm-102-disk-0
2019-03-29 12:35:36 (remote_finalize_local_job) delete stale replication snapshot '__replicate_102-0_1553862601__' on my-zfs-pool:vm-102-disk-0
2019-03-29 12:35:37 end replication job
2019-03-29 12:35:37 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=node1' root@192.168.8.107 pvesr set-state 102 \''{"local/node2":{"last_try":1553862929,"last_iteration":1553862929,"storeid_list":["my-zfs-pool"],"fail_count":0,"duration":8.172205,"last_node":"node2","last_sync":1553862929}}'\'
2019-03-29 12:35:39 migration finished successfully (duration 00:00:11)
TASK OK
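In hindsight, the online attempt would presumably have worked with the flag named in the error message, something like (untested here):

root@node2:~# qm migrate 102 node1 --online --with-local-disks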

9. VM3 appeared running back on node1.
I logged into it to discover that my change from step 6 is gone!

10. I repeated steps 1-5, logged into VM3 again and the change from step 6 has re-appeared!
BTW, initially I was getting:
ls -al /var/tmp
ls: cannot access '/var/tmp/20190329115443': Structure needs cleaning
-????????? ? ? ? ? ? 20190329115443
but several seconds later it was listed as expected.

Why am I in this split-brain situation?
Have I missed a vital step somewhere on the way?

I need this kind of a process to both protect the data from unexpected failures and to be able to remove and re-add one of the nodes for any sort of maintenance.

Please advise.
Adam
 
It's also worth mentioning that, as a result of the above procedure, VM3 ended up in a defunct state.

The root partition is mounted read-only; mount/mtab report: /dev/sda1 on / type ext4 (ro,relatime,errors=remount-ro,data=ordered)
Nothing can be written to /var/log etc.

It has also been assigned a duplicate IP address by our DHCP server:

ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 12:1e:66:96:c9:24 brd ff:ff:ff:ff:ff:ff
inet 192.168.10.253/22 brd 192.168.11.255 scope global ens18
valid_lft forever preferred_lft forever
inet 192.168.10.216/22 brd 192.168.11.255 scope global secondary ens18
valid_lft forever preferred_lft forever
(...49 more entries here...)
inet 192.168.10.230/22 brd 192.168.11.255 scope global secondary ens18 ---> the IP it had before problems started
valid_lft forever preferred_lft forever
inet 192.168.10.196/22 brd 192.168.11.255 scope global secondary ens18 ---> the duplicate IP it's currently assigned
valid_lft forever preferred_lft forever
inet6 fe80::101e:66ff:fe96:c924/64 scope link
valid_lft forever preferred_lft forever

The duplicate IP happens to be simultaneously used by my Ubuntu 16 desktop workstation.
The MAC addresses of both interfaces (VM3 and my desktop) are different and have never been manually tampered with.
 
The defunct VM is still up, preventing DHCP clients from getting any address assigned:

Client side (Win10 Pro):
Your computer was not assigned an address from the network (by the DHCP Server) for the Network Card with the network address 0x0800272ECE35. The following error occurred: 0x79. Your computer will continue to try and obtain an address on its own from the network address (DHCP) server.

Server side:
Apr 2 10:35:53 dhcp dhcpd: Abandoning IP address 192.168.10.194: declined.
Apr 2 10:35:53 dhcp dhcpd: DHCPDECLINE of 192.168.10.194 from 08:00:27:2e:ce:35 (test3VM) via br0: abandoned

The MAC address above belongs to the failing Win10 client, not test3VM (Debian 9 on Proxmox).
 
I now have 3 VMs and 1 container set up as per messages #4 and #9 above.
All VMs suffer from split brain: they happily boot on either node but run separate data sets which, for some reason, fail to merge.
Interestingly, my only container maintains data consistency, even though it has been moved back and forth several times following exactly the same steps as the VMs.
 
