tl;dr
If you start a migration of a VM whose disks reside on a ZFS volume and, for whatever reason, cancel it mid-progress, Proxmox will not clean up after itself properly, and subsequent attempts to migrate that same VM will fail. The solution is to manually remove the ZFS snapshots Proxmox left behind on the ORIGINATING host.
Detailed version...
Running PVE 4.3 on a cluster of 10 nodes: if you cancel an offline migration of a VM on local ZFS storage, Proxmox fails to clean up the volume snapshot(s) it created for the migration on the originating host. The next time you try to migrate that VM, you'll get an error like this:
Code:
Dec 25 17:19:38 ERROR: found stale volume copy 'zfsLocalStorage:vm-103-disk-2' on node 'prx019'
In this case I migrated vmid 103 from node prx027 to prx019 and purposely canceled the migration. I have confirmed the behavior is repeatable on any pair of nodes.
The error above is also misleading: the "stale volume" actually resides on the originating node (prx027 in this case), not on the receiver (prx019) as the error indicates.
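You can confirm where the leftovers actually live by listing `__migration__` snapshots on both nodes. This is just a sketch; the hostnames match my example, so adjust them to your own cluster:

```shell
# List leftover migration snapshots on both the originating node and the
# receiver. In my case, only the originating node (prx027) had any.
ssh root@prx027 "zfs list -t snapshot -o name | grep '@__migration__'"
ssh root@prx019 "zfs list -t snapshot -o name | grep '@__migration__'"
```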
Migration and migration cancellation
Code:
Dec 25 17:16:21 starting migration of VM 103 to node 'prx019' (x.x.x.x)
Dec 25 17:16:21 copying disk images
Dec 25 17:16:21 found local disk 'zfsLocalStorage:vm-103-disk-1' (in current VM config)
Dec 25 17:16:21 found local disk 'zfsLocalStorage:vm-103-disk-2' (in current VM config)
Dec 25 17:16:21 found local disk 'zfsLocalStorage:vm-103-disk-3' (in current VM config)
Dec 25 17:16:21 found local disk 'zfsLocalStorage:vm-103-disk-4' (in current VM config)
send from @ to zfsStoragePool001/vm-103-disk-2@__migration__ estimated size is 17.0M
total estimated size is 17.0M
TIME SENT SNAPSHOT
send from @ to zfsStoragePool001/vm-103-disk-4@__migration__ estimated size is 19.3M
total estimated size is 19.3M
TIME SENT SNAPSHOT
send from @ to zfsStoragePool001/vm-103-disk-1@__migration__ estimated size is 4.00G
total estimated size is 4.00G
TIME SENT SNAPSHOT
17:16:23 109M zfsStoragePool001/vm-103-disk-1@__migration__
17:16:24 218M zfsStoragePool001/vm-103-disk-1@__migration__
17:16:25 330M zfsStoragePool001/vm-103-disk-1@__migration__
17:16:26 441M zfsStoragePool001/vm-103-disk-1@__migration__
17:16:27 553M zfsStoragePool001/vm-103-disk-1@__migration__
17:16:28 661M zfsStoragePool001/vm-103-disk-1@__migration__
Dec 25 17:16:28 ERROR: Failed to sync data - command 'set -o pipefail && zfs send -Rpv zfsStoragePool001/vm-103-disk-1@__migration__ | ssh root@x.x.x.x zfs recv zfsStoragePool001/vm-103-disk-1' failed: interrupted by signal
Dec 25 17:16:28 aborting phase 1 - cleanup resources
Dec 25 17:16:28 ERROR: found stale volume copy 'zfsLocalStorage:vm-103-disk-2' on node 'prx019'
Dec 25 17:16:28 ERROR: found stale volume copy 'zfsLocalStorage:vm-103-disk-4' on node 'prx019'
Dec 25 17:16:28 ERROR: found stale volume copy 'zfsLocalStorage:vm-103-disk-1' on node 'prx019'
Dec 25 17:16:28 ERROR: migration aborted (duration 00:00:07): Failed to sync data - command 'set -o pipefail && zfs send -Rpv zfsStoragePool001/vm-103-disk-1@__migration__ | ssh root@x.x.x.x zfs recv zfsStoragePool001/vm-103-disk-1' failed: interrupted by signal
TASK ERROR: migration aborted
So here's what happens when you go to migrate that same VM again:
Code:
Dec 25 17:19:38 starting migration of VM 103 to node 'prx019' (x.x.x.x)
Dec 25 17:19:38 copying disk images
Dec 25 17:19:38 found local disk 'zfsLocalStorage:vm-103-disk-1' (in current VM config)
Dec 25 17:19:38 found local disk 'zfsLocalStorage:vm-103-disk-2' (in current VM config)
Dec 25 17:19:38 found local disk 'zfsLocalStorage:vm-103-disk-3' (in current VM config)
Dec 25 17:19:38 found local disk 'zfsLocalStorage:vm-103-disk-4' (in current VM config)
cannot create snapshot 'zfsStoragePool001/vm-103-disk-2@__migration__': dataset already exists
Dec 25 17:19:38 ERROR: Failed to sync data - command 'zfs snapshot zfsStoragePool001/vm-103-disk-2@__migration__' failed: exit code 1
Dec 25 17:19:38 aborting phase 1 - cleanup resources
Dec 25 17:19:38 ERROR: found stale volume copy 'zfsLocalStorage:vm-103-disk-2' on node 'prx019'
Dec 25 17:19:38 ERROR: migration aborted (duration 00:00:00): Failed to sync data - command 'zfs snapshot zfsStoragePool001/vm-103-disk-2@__migration__' failed: exit code 1
TASK ERROR: migration aborted
Note that there are two conflicting messages in the output above.
1) This one is accurate:
Dec 25 17:19:38 found local disk 'zfsLocalStorage:vm-103-disk-4' (in current VM config)
cannot create snapshot 'zfsStoragePool001/vm-103-disk-2@__migration__': dataset already exists
Dec 25 17:19:38 ERROR: Failed to sync data - command 'zfs snapshot zfsStoragePool001/vm-103-disk-2@__migration__' failed: exit code 1
2) This is the one people will likely read, since it is the most explicit and tries to point to where the failure is, but it is wrong:
Dec 25 17:19:38 ERROR: found stale volume copy 'zfsLocalStorage:vm-103-disk-2' on node 'prx019'
Here are the snapshots the error is warning about; note again that they reside on the originating node, not the receiver.
Code:
root@prx027:/home# zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
zfsStoragePool001/vm-103-disk-2@__migration__ 0 - 17.3M -
zfsStoragePool001/vm-103-disk-4@__migration__ 0 - 19.6M -
The solution is simply to remove these snapshots from a shell on the originating host.
Code:
root@prx027:/home# zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
zfsStoragePool001/vm-103-disk-2@__migration__ 0 - 17.3M -
zfsStoragePool001/vm-103-disk-4@__migration__ 0 - 19.6M -
root@prx027:/home# zfs destroy zfsStoragePool001/vm-103-disk-2@__migration__
root@prx027:/home# zfs destroy zfsStoragePool001/vm-103-disk-4@__migration__
root@prx027:/home# zfs list -t snapshot
no datasets available
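If the VM has many disks, the leftover `__migration__` snapshots can also be cleaned up in one pass instead of one `zfs destroy` per snapshot. A sketch (run the first part on its own and review the list before piping it into destroy):

```shell
# On the originating host: find every snapshot named @__migration__ and
# destroy it. -H -o name prints one bare snapshot name per line;
# xargs -r skips the destroy entirely when the list is empty.
zfs list -H -t snapshot -o name | grep '@__migration__$' | xargs -r -n1 zfs destroy
```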
Might want to fix that in later releases.