First of all, I'd like to thank the devs for another great update. PVE 6.2 has brought some really nice features.
During my testing with 6.2 I was pleased to see that live migration with ZFS replication now works; this is a feature I've been waiting for.
I use HA with ZFS replication as a failover, but live migration has always been a bit of a hassle:
HA had to be disabled on the VM, as well as the ZFS replication, and both had to be re-enabled afterwards.
Now live migration works with ZFS replication, but only after disabling HA.
The migration fails under HA because the HA resource agent doesn't add the "--with-local-disks" flag.
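For reference, my current workaround looks roughly like this on the source node (just a sketch; the VM ID, resource name and target node are the ones from the logs below, and taking the VM out of HA can of course also be done in the GUI):

# take the VM out of HA so the migration isn't handled by the HA resource agent
ha-manager remove vm:102

# live migrate, explicitly allowing the local (replicated) disk to be mirrored
qm migrate 102 pve2 --online --with-local-disks

# add the VM back to HA afterwards
ha-manager add vm:102 --state started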
Task viewer: HA 102 - Migrate
Requesting HA migration for VM 102 to node pve2
TASK OK
Task viewer: VM 102 - Migrate
task started by HA resource agent
2020-05-13 11:38:06 use dedicated network address for sending migration traffic (192.168.100.232)
2020-05-13 11:38:07 starting migration of VM 102 to node 'pve2' (192.168.100.232)
2020-05-13 11:38:07 found local, replicated disk 'local-zfs:vm-102-disk-0' (in current VM config)
2020-05-13 11:38:07 can't migrate local disk 'local-zfs:vm-102-disk-0': can't live migrate attached local disks without with-local-disks option
2020-05-13 11:38:07 ERROR: Failed to sync data - can't migrate VM - check log
2020-05-13 11:38:07 aborting phase 1 - cleanup resources
2020-05-13 11:38:07 ERROR: migration aborted (duration 00:00:01): Failed to sync data - can't migrate VM - check log
TASK ERROR: migration aborted
Here's the log for a migration using ZFS replication with HA disabled.
Task viewer: VM 102 - Migrate
2020-05-12 22:45:16 use dedicated network address for sending migration traffic (192.168.100.233)
2020-05-12 22:45:16 starting migration of VM 102 to node 'pve3' (192.168.100.233)
2020-05-12 22:45:16 found local, replicated disk 'local-zfs:vm-102-disk-0' (in current VM config)
2020-05-12 22:45:16 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2020-05-12 22:45:16 replicating disk images
2020-05-12 22:45:16 start replication job
2020-05-12 22:45:16 guest => VM 102, running => 76265
2020-05-12 22:45:16 volumes => local-zfs:vm-102-disk-0
2020-05-12 22:45:17 create snapshot '__replicate_102-0_1589316316__' on local-zfs:vm-102-disk-0
2020-05-12 22:45:17 using secure transmission, rate limit: none
2020-05-12 22:45:17 incremental sync 'local-zfs:vm-102-disk-0' (__replicate_102-0_1589316300__ => __replicate_102-0_1589316316__)
2020-05-12 22:45:18 send from @__replicate_102-0_1589316300__ to rpool/data/vm-102-disk-0@__replicate_102-0_1589316316__ estimated size is 70.0K
2020-05-12 22:45:18 total estimated size is 70.0K
2020-05-12 22:45:18 TIME SENT SNAPSHOT rpool/data/vm-102-disk-0@__replicate_102-0_1589316316__
2020-05-12 22:45:18 rpool/data/vm-102-disk-0@__replicate_102-0_1589316300__ name rpool/data/vm-102-disk-0@__replicate_102-0_1589316300__ -
2020-05-12 22:45:18 successfully imported 'local-zfs:vm-102-disk-0'
2020-05-12 22:45:18 delete previous replication snapshot '__replicate_102-0_1589316300__' on local-zfs:vm-102-disk-0
2020-05-12 22:45:19 (remote_finalize_local_job) delete stale replication snapshot '__replicate_102-0_1589316300__' on local-zfs:vm-102-disk-0
2020-05-12 22:45:19 end replication job
2020-05-12 22:45:19 copying local disk images
2020-05-12 22:45:19 starting VM 102 on remote node 'pve3'
2020-05-12 22:45:20 start remote tunnel
2020-05-12 22:45:21 ssh tunnel ver 1
2020-05-12 22:45:21 starting storage migration
2020-05-12 22:45:21 scsi0: start migration to nbd:unix:/run/qemu-server/102_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
all mirroring jobs are ready
2020-05-12 22:45:21 volume 'local-zfs:vm-102-disk-0' is 'local-zfs:vm-102-disk-0' on the target
2020-05-12 22:45:21 starting online/live migration on unix:/run/qemu-server/102.migrate
2020-05-12 22:45:21 set migration_caps
2020-05-12 22:45:21 migration speed limit: 8589934592 B/s
2020-05-12 22:45:21 migration downtime limit: 100 ms
2020-05-12 22:45:21 migration cachesize: 268435456 B
2020-05-12 22:45:21 set migration parameters
2020-05-12 22:45:21 start migrate command to unix:/run/qemu-server/102.migrate
2020-05-12 22:45:22 migration status: active (transferred 428878266, remaining 42119168), total 2165121024)
2020-05-12 22:45:22 migration xbzrle cachesize: 268435456 transferred 0 pages 0 cachemiss 0 overflow 0
2020-05-12 22:45:23 migration speed: 1024.00 MB/s - downtime 57 ms
2020-05-12 22:45:23 migration status: completed
all mirroring jobs are ready
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi0 : finished
2020-05-12 22:45:24 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve3' root@192.168.100.233 pvesr set-state 102 \''{"local/pve2":{"last_try":1589316316,"fail_count":0,"last_sync":1589316316,"duration":2.748475,"storeid_list":["local-zfs"],"last_node":"pve2","last_iteration":1589316316}}'\'
2020-05-12 22:45:25 stopping NBD storage migration server on target.
2020-05-12 22:45:28 migration finished successfully (duration 00:00:12)
TASK OK
Would it be possible to add this flag to the migration command used by the HA resource agent?
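As far as I can tell the flag is already exposed on the migrate API endpoint, so it would mostly be a matter of the HA resource agent requesting it. For illustration, this is roughly the equivalent call on the API level (a sketch based on my reading of the qm/API documentation; <source-node> is a placeholder, VM ID and target are taken from the failed task above):

# what the HA-triggered migration would effectively need to request
pvesh create /nodes/<source-node>/qemu/102/migrate --target pve2 --online 1 --with-local-disks 1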
Live migration of a Ceph-backed VM works flawlessly; here is the log for that migration as a reference.
Task viewer: HA 100 - Migrate
Requesting HA migration for VM 100 to node pve2
TASK OK
Task viewer: VM 100 - Migrate
task started by HA resource agent
2020-05-13 11:46:26 use dedicated network address for sending migration traffic (192.168.100.232)
2020-05-13 11:46:26 starting migration of VM 100 to node 'pve2' (192.168.100.232)
2020-05-13 11:46:27 starting VM 100 on remote node 'pve2'
2020-05-13 11:46:28 start remote tunnel
2020-05-13 11:46:29 ssh tunnel ver 1
2020-05-13 11:46:29 starting online/live migration on unix:/run/qemu-server/100.migrate
2020-05-13 11:46:29 set migration_caps
2020-05-13 11:46:29 migration speed limit: 8589934592 B/s
2020-05-13 11:46:29 migration downtime limit: 100 ms
2020-05-13 11:46:29 migration cachesize: 268435456 B
2020-05-13 11:46:29 set migration parameters
2020-05-13 11:46:29 start migrate command to unix:/run/qemu-server/100.migrate
2020-05-13 11:46:30 migration speed: 2048.00 MB/s - downtime 58 ms
2020-05-13 11:46:30 migration status: completed
2020-05-13 11:46:33 migration finished successfully (duration 00:00:07)
TASK OK