Trying to figure out if this is a bug or just a misunderstanding of how live migrations are supposed to work with the latest enhancements. I'm on the 7.1-10 release and I'm trying to live-migrate a VM from one node to another. Live migration gives you the option to use a different ZFS pool on the target host, but in my testing that never works. Consider the log below.
My source pool on cloud10 is zfs-nvme-pool-1 and my destination pool is zfs-local-pool. zfs-nvme-pool-1 does not exist on cloud11, but that shouldn't matter: I wasn't trying to move to the same pool, I was trying to move to a different one.
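For reference, here is roughly the CLI equivalent of what I'm doing in the GUI (just a sketch using my pool and node names; `qm migrate` with `--targetstorage` is what I mean by picking a different pool):
Code:
# live-migrate VM 408 to cloud11, mapping its local disks onto zfs-local-pool
qm migrate 408 cloud11 --online --with-local-disks --targetstorage zfs-local-pool
And here is the log from that attempt: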
Code:
2022-01-31 07:54:29 use dedicated network address for sending migration traffic (10.0.4.21)
2022-01-31 07:54:29 starting migration of VM 408 to node 'cloud11' (10.0.4.21)
2022-01-31 07:54:29 found generated disk 'zfs-nvme-pool-1:vm-408-cloudinit' (in current VM config)
2022-01-31 07:54:29 found local disk 'zfs-nvme-pool-1:vm-408-disk-0' (in current VM config)
2022-01-31 07:54:30 copying local disk images
2022-01-31 07:54:32 full send of zfs-nvme-pool-1/vm-408-cloudinit@__migration__ estimated size is 57.9K
2022-01-31 07:54:32 total estimated size is 57.9K
2022-01-31 07:54:33 successfully imported 'zfs-local-pool:vm-408-cloudinit'
2022-01-31 07:54:33 volume 'zfs-nvme-pool-1:vm-408-cloudinit' is 'zfs-local-pool:vm-408-cloudinit' on the target
2022-01-31 07:54:33 starting VM 408 on remote node 'cloud11'
2022-01-31 07:54:35 [cloud11] storage 'zfs-nvme-pool-1' is not available on node 'cloud11'
2022-01-31 07:54:36 ERROR: online migrate failure - remote command failed with exit code 255
2022-01-31 07:54:36 aborting phase 2 - cleanup resources
2022-01-31 07:54:36 migrate_cancel
2022-01-31 07:54:38 ERROR: migration finished with problems (duration 00:00:09)
TASK ERROR: migration problems
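In case it matters, this is how the storage layout on each node can be double-checked (standard commands, nothing exotic):
Code:
# list the storages the target node can see and their status
ssh cloud11 pvesm status
# the storage definitions (including any node restrictions) are cluster-wide
cat /etc/pve/storage.cfg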
If I move VM 408 to zfs-local-pool on cloud10 and then try to migrate it to zfs-nvme-pool on cloud12, I get a very different error. zfs-local-pool exists on both cloud10 and cloud12, while zfs-nvme-pool-1 only exists on cloud10 and zfs-nvme-pool only exists on cloud12. This time you get a timeout after 300 seconds complaining about the zvol. However, I've gone out there and looked, and the zvol does exist:
Code:
root@cloud12:~# ls -lh -R /dev/zvol/
/dev/zvol/:
total 0
drwxr-xr-x 2 root root 80 Jan 31 09:18 zfs-nvme-pool
/dev/zvol/zfs-nvme-pool:
total 0
lrwxrwxrwx 1 root root 9 Jan 31 09:18 vm-408-cloudinit -> ../../zd0
lrwxrwxrwx 1 root root 10 Jan 31 09:18 vm-408-disk-0 -> ../../zd16
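For completeness, the zvol can also be confirmed from the ZFS side (standard zfs/udev commands; output omitted since the device links above already show it):
Code:
# list all zvols under the pool, with sizes
zfs list -r -t volume -o name,volsize zfs-nvme-pool
# make sure udev has finished creating device links
udevadm settle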
Code:
2022-01-31 09:18:01 use dedicated network address for sending migration traffic (10.0.4.22)
2022-01-31 09:18:01 starting migration of VM 408 to node 'cloud12' (10.0.4.22)
2022-01-31 09:18:02 found generated disk 'zfs-local-pool:vm-408-cloudinit' (in current VM config)
2022-01-31 09:18:02 found local disk 'zfs-local-pool:vm-408-disk-0' (in current VM config)
2022-01-31 09:18:02 drive 'ide2': size of disk 'zfs-local-pool:vm-408-cloudinit' updated from 0T to 4M
2022-01-31 09:18:02 copying local disk images
2022-01-31 09:18:04 full send of zfs-local-pool/vm-408-cloudinit@__migration__ estimated size is 66.0K
2022-01-31 09:18:04 total estimated size is 66.0K
2022-01-31 09:18:04 successfully imported 'zfs-nvme-pool:vm-408-cloudinit'
2022-01-31 09:18:05 volume 'zfs-local-pool:vm-408-cloudinit' is 'zfs-nvme-pool:vm-408-cloudinit' on the target
2022-01-31 09:18:05 starting VM 408 on remote node 'cloud12'
2022-01-31 09:23:06 [cloud12] timeout: no zvol device link for 'vm-408-cloudinit' found after 300 sec found.
2022-01-31 09:23:06 ERROR: online migrate failure - remote command failed with exit code 255
2022-01-31 09:23:06 aborting phase 2 - cleanup resources
2022-01-31 09:23:06 migrate_cancel
2022-01-31 09:23:09 ERROR: migration finished with problems (duration 00:05:08)
TASK ERROR: migration problems
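My reading of that timeout (a rough approximation of the behavior as I understand it, not the actual PVE code) is that the target node polls for the /dev/zvol link and gives up after 300 seconds, something like:
Code:
# sketch of the wait-for-zvol behavior, polling once a second for up to 300s
ZVOL=/dev/zvol/zfs-nvme-pool/vm-408-cloudinit
for i in $(seq 1 300); do
    [ -e "$ZVOL" ] && break   # link showed up, carry on
    sleep 1
done
[ -e "$ZVOL" ] || echo "timeout: no zvol device link found" >&2
Which makes it even stranger that the link clearly exists when I look by hand.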
And finally, if I add a zfs-nvme-pool-1 to cloud12 so that both cloud10 and cloud12 have identically named pools, I get the error shown above about the VM never starting.
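For reference, this is roughly what the relevant part of /etc/pve/storage.cfg looks like after that last change (paraphrased from memory, options trimmed):
Code:
zfspool: zfs-nvme-pool-1
        pool zfs-nvme-pool-1
        content images,rootdir
        nodes cloud10,cloud12

zfspool: zfs-local-pool
        pool zfs-local-pool
        content images,rootdir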