Hi,
I'm testing some DR scenarios and ran into replication errors.
There is no HA configuration for the single VM I'm using for this test.
The scenario is:
- 3 nodes running Proxmox VE 6.0-5, configured as a cluster
- Nodes are named pve-t01 / pve-t02 / pve-t03
- Just one VM, which stays stopped during all tests
I can migrate this VM as many times as I want and let replication run for hours without any error.
To simulate a node crash, I shut down the node where this (stopped) VM is.
After the node is no longer responding, I moved the VM to one of the other 2 nodes:
mv /etc/pve/nodes/pve-t01/qemu-server/111.conf /etc/pve/nodes/pve-t03/qemu-server/111.conf
I can see the VM on the target node, and the replication configuration changes nodes as expected.
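For reference, a quick way to check this (just a sketch using standard Proxmox/ZFS CLI tools, not necessarily the exact commands I ran):
# on the target node pve-t03: confirm the guest config arrived and the VM is listed
ls /etc/pve/nodes/pve-t03/qemu-server/111.conf
qm list
# confirm the replication jobs were re-pointed and check their state
cat /etc/pve/replication.cfg
pvesr status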
But most of the times I do this procedure, one node no longer replicates and I must delete the VM from that node so the replication process can continue.
I found this old thread, but in my case there is only one small, stopped VM.
Here are some logs from the replication.
Test #1 - The VM was on node T01 and I shut down that node
- I moved the VM from node T01 to node T03
Replication to node T02 fails with the following log:
======================
2019-08-03 16:28:03 111-0: start replication job
2019-08-03 16:28:03 111-0: guest => VM 111, running => 0
2019-08-03 16:28:03 111-0: volumes => L-Stor:vm-111-disk-1
2019-08-03 16:28:03 111-0: delete stale replication snapshot '__replicate_111-0_1564860301__' on L-Stor:vm-111-disk-1
2019-08-03 16:28:04 111-0: (remote_prepare_local_job) delete stale replication snapshot '__replicate_111-0_1564860301__' on L-Stor:vm-111-disk-1
2019-08-03 16:28:04 111-0: create snapshot '__replicate_111-0_1564860483__' on L-Stor:vm-111-disk-1
2019-08-03 16:28:04 111-0: full sync 'L-Stor:vm-111-disk-1' (__replicate_111-0_1564860483__)
2019-08-03 16:28:04 111-0: full send of L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__ estimated size is 1.46G
2019-08-03 16:28:04 111-0: send from @__replicate_111-1_1564860304__ to L-Stor/vm-111-disk-1@__replicate_111-0_1564860483__ estimated size is 624B
2019-08-03 16:28:04 111-0: total estimated size is 1.46G
2019-08-03 16:28:04 111-0: TIME SENT SNAPSHOT L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:28:04 111-0: L-Stor/vm-111-disk-1 name L-Stor/vm-111-disk-1 -
2019-08-03 16:28:04 111-0: volume 'L-Stor/vm-111-disk-1' already exists
2019-08-03 16:28:04 111-0: 140160 B 136.9 KB 0.77 s 182784 B/s 178.50 KB/s
2019-08-03 16:28:04 111-0: write: Broken pipe
2019-08-03 16:28:04 111-0: warning: cannot send 'L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__': signal received
2019-08-03 16:28:04 111-0: warning: cannot send 'L-Stor/vm-111-disk-1@__replicate_111-0_1564860483__': Broken pipe
2019-08-03 16:28:05 111-0: cannot send 'L-Stor/vm-111-disk-1': I/O error
2019-08-03 16:28:05 111-0: command 'zfs send -Rpv -- L-Stor/vm-111-disk-1@__replicate_111-0_1564860483__' failed: exit code 1
2019-08-03 16:28:05 111-0: delete previous replication snapshot '__replicate_111-0_1564860483__' on L-Stor:vm-111-disk-1
2019-08-03 16:28:05 111-0: end replication job with error: command 'set -o pipefail && pvesm export L-Stor:vm-111-disk-1 zfs - -with-snapshots 1 -snapshot __replicate_111-0_1564860483__ | /usr/bin/cstream -t 50000000 | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve-t02' root@192.168.0.228 -- pvesm import L-Stor:vm-111-disk-1 zfs - -with-snapshots 1' failed: exit code 255
======================
After that I deleted the VM on node T02 and replication continued to work.
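For clarity, the cleanup I mean is roughly the following (a sketch, assuming the only leftover on the failing target node is the replicated ZFS dataset, so the dataset name is illustrative):
# on the node that refuses to replicate (here pve-t02):
# list the leftover dataset and its replication snapshots
zfs list -t all -r L-Stor | grep vm-111
# destroy the stale dataset so the next replication run can start a full sync again
zfs destroy -r L-Stor/vm-111-disk-1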
======================
2019-08-03 16:39:00 111-0: start replication job
2019-08-03 16:39:00 111-0: guest => VM 111, running => 12213
2019-08-03 16:39:00 111-0: volumes => L-Stor:vm-111-disk-1
2019-08-03 16:39:01 111-0: freeze guest filesystem
2019-08-03 16:39:02 111-0: create snapshot '__replicate_111-0_1564861140__' on L-Stor:vm-111-disk-1
2019-08-03 16:39:02 111-0: thaw guest filesystem
2019-08-03 16:39:02 111-0: full sync 'L-Stor:vm-111-disk-1' (__replicate_111-0_1564861140__)
2019-08-03 16:39:02 111-0: full send of L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__ estimated size is 1.46G
2019-08-03 16:39:02 111-0: send from @__replicate_111-1_1564860304__ to L-Stor/vm-111-disk-1@__replicate_111-1_1564860900__ estimated size is 1.34M
2019-08-03 16:39:02 111-0: send from @__replicate_111-1_1564860900__ to L-Stor/vm-111-disk-1@__replicate_111-0_1564861140__ estimated size is 361K
2019-08-03 16:39:02 111-0: total estimated size is 1.46G
2019-08-03 16:39:03 111-0: TIME SENT SNAPSHOT L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:04 111-0: 16:39:04 40.4M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:05 111-0: 16:39:05 83.7M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:06 111-0: 16:39:06 126M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:07 111-0: 16:39:07 167M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:08 111-0: 16:39:08 205M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:09 111-0: 16:39:09 250M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:10 111-0: 16:39:10 285M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:11 111-0: 16:39:11 323M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:12 111-0: 16:39:12 355M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:13 111-0: 16:39:13 395M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:14 111-0: 16:39:14 436M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:15 111-0: 16:39:15 467M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:16 111-0: 16:39:16 512M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:17 111-0: 16:39:17 549M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:18 111-0: 16:39:18 588M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:19 111-0: 16:39:19 635M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:20 111-0: 16:39:20 667M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:21 111-0: 16:39:21 705M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:22 111-0: 16:39:22 744M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:23 111-0: 16:39:23 777M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:24 111-0: 16:39:24 825M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:25 111-0: 16:39:25 860M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:26 111-0: 16:39:26 892M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:27 111-0: 16:39:27 924M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:28 111-0: 16:39:28 973M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:29 111-0: 16:39:29 1005M L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:30 111-0: 16:39:30 1.03G L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:31 111-0: 16:39:31 1.06G L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:32 111-0: 16:39:32 1.09G L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:33 111-0: 16:39:33 1.13G L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:34 111-0: 16:39:34 1.16G L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:35 111-0: 16:39:35 1.21G L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:36 111-0: 16:39:36 1.23G L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:37 111-0: 16:39:37 1.27G L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:38 111-0: 16:39:38 1.31G L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:39 111-0: 16:39:39 1.34G L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:40 111-0: 16:39:40 1.38G L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:41 111-0: 16:39:41 1.41G L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:42 111-0: 16:39:42 1.45G L-Stor/vm-111-disk-1@__replicate_111-1_1564860304__
2019-08-03 16:39:42 111-0: TIME SENT SNAPSHOT L-Stor/vm-111-disk-1@__replicate_111-1_1564860900__
2019-08-03 16:39:42 111-0: TIME SENT SNAPSHOT L-Stor/vm-111-disk-1@__replicate_111-0_1564861140__
2019-08-03 16:39:45 111-0: end replication job
======================
Can anyone confirm this behaviour?
Regards,
Ricardo Jorge