Live migration with local storage (ZFS) problem

SDuch

New Member
Jan 27, 2017
Hello,

We plan to upgrade our Proxmox nodes to ZFS instead of LVM so we can use live migration even with only local storage. To test this before touching the production servers, we installed two servers with Proxmox 5 and a zpool for VM disks.
On node proxmox5: one pool named "zfs_vm_proxmox5" with storage ID "zfs_vm_proxmox5"
Code:
root@proxmox5:~# zfs list
NAME                            USED  AVAIL  REFER  MOUNTPOINT
zfs_vm_proxmox5                10.3G  27.5G    19K  /zfs_vm_proxmox5
zfs_vm_proxmox5/vm-100-disk-1  10.3G  27.8G  10.0G  -
On node proxmox5bis: one pool named "zfs_vm_proxmox5" with storage ID "zfs_vm_proxmox5bis"
Code:
root@proxmox5bis:~# zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
zfs_vm_proxmox5  1.54M  37.8G    19K  /zfs_vm_proxmox5
We created a simple VM with a fresh Debian install in it.
Here is the log of the online migration of the VM from node "proxmox5" to "proxmox5bis":
Code:
root@proxmox5:~# qm migrate 100 proxmox5bis --online --with-local-disks
2018-01-29 17:26:58 starting migration of VM 100 to node 'proxmox5bis' (10.80.32.43)
2018-01-29 17:26:58 found local disk 'zfs_vm_proxmox5:vm-100-disk-1' (in current VM config)
2018-01-29 17:26:58 found local disk 'zfs_vm_proxmox5bis:vm-100-disk-1' (via storage)
2018-01-29 17:26:58 copying disk images
send from @ to zfs_vm_proxmox5/vm-100-disk-1@__migration__ estimated size is 10.0G
total estimated size is 10.0G
TIME        SENT   SNAPSHOT
17:26:59   93.3M   zfs_vm_proxmox5/vm-100-disk-1@__migration__
17:27:00    202M   zfs_vm_proxmox5/vm-100-disk-1@__migration__
17:27:01    311M   zfs_vm_proxmox5/vm-100-disk-1@__migration__
17:27:02    419M   zfs_vm_proxmox5/vm-100-disk-1@__migration__
[...]
17:28:33   10.2G   zfs_vm_proxmox5/vm-100-disk-1@__migration__
17:28:34   10.4G   zfs_vm_proxmox5/vm-100-disk-1@__migration__
2018-01-29 17:28:35 starting VM 100 on remote node 'proxmox5bis'
2018-01-29 17:28:36 start remote tunnel
2018-01-29 17:28:37 ssh tunnel ver 1
2018-01-29 17:28:37 starting storage migration
2018-01-29 17:28:37 scsi0: start migration to to nbd:10.80.32.43:60000:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 0 bytes remaining: 10737418240 bytes total: 10737418240 bytes progression: 0.00 % busy: 1 ready: 0
drive-scsi0: transferred: 114294784 bytes remaining: 10623123456 bytes total: 10737418240 bytes progression: 1.06 % busy: 1 ready: 0
drive-scsi0: transferred: 228589568 bytes remaining: 10508828672 bytes total: 10737418240 bytes progression: 2.13 % busy: 1 ready: 0
drive-scsi0: transferred: 342884352 bytes remaining: 10394533888 bytes total: 10737418240 bytes progression: 3.19 % busy: 1 ready: 0
[...]
drive-scsi0: transferred: 10628366336 bytes remaining: 109051904 bytes total: 10737418240 bytes progression: 98.98 % busy: 1 ready: 0
drive-scsi0: transferred: 10737418240 bytes remaining: 0 bytes total: 10737418240 bytes progression: 100.00 % busy: 1 ready: 0
drive-scsi0: transferred: 10737418240 bytes remaining: 0 bytes total: 10737418240 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
2018-01-29 17:30:11 starting online/live migration on unix:/run/qemu-server/100.migrate
2018-01-29 17:30:11 migrate_set_speed: 8589934592
2018-01-29 17:30:11 migrate_set_downtime: 0.1
2018-01-29 17:30:11 set migration_caps
2018-01-29 17:30:11 set cachesize: 53687091
2018-01-29 17:30:11 start migrate command to unix:/run/qemu-server/100.migrate
2018-01-29 17:30:12 migration status: active (transferred 118607216, remaining 23334912), total 554246144)
2018-01-29 17:30:12 migration xbzrle cachesize: 33554432 transferred 0 pages 0 cachemiss 0 overflow 0
2018-01-29 17:30:13 migration speed: 5.33 MB/s - downtime 20 ms
2018-01-29 17:30:13 migration status: completed
2018-01-29 17:30:19 ERROR: removing local copy of 'zfs_vm_proxmox5bis:vm-100-disk-1' failed - zfs error: cannot destroy 'zfs_vm_proxmox5/vm-100-disk-1': dataset is busy
drive-scsi0: transferred: 10737418240 bytes remaining: 0 bytes total: 10737418240 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi0 : finished
2018-01-29 17:30:32 ERROR: migration finished with problems (duration 00:03:34)
migration problems


So the migration works, but we end up with two disks for the VM on the new node, plus this error message in the migration log: 2018-01-29 17:30:19 ERROR: removing local copy of 'zfs_vm_proxmox5bis:vm-100-disk-1' failed - zfs error: cannot destroy 'zfs_vm_proxmox5/vm-100-disk-1': dataset is busy

So the disk gets renamed (*-2 instead of *-1 on the original node...), but the original disk stays.
After the migration, on proxmox5bis:
Code:
root@proxmox5bis:~# zfs list
NAME                            USED  AVAIL  REFER  MOUNTPOINT
zfs_vm_proxmox5                20.6G  17.1G    19K  /zfs_vm_proxmox5
zfs_vm_proxmox5/vm-100-disk-1  10.3G  17.4G  10.0G  -
zfs_vm_proxmox5/vm-100-disk-2  10.3G  17.4G  10.0G  -

We can delete zfs_vm_proxmox5/vm-100-disk-1 with zfs destroy without any problem, but shouldn't that be automatic?
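In the meantime we clean it up by hand. A minimal sketch, based on the pool and disk names from the log above (the @__migration__ snapshot may or may not still exist on your system):

```shell
# List leftover snapshots: the migration job sends from an
# @__migration__ snapshot of the source zvol
zfs list -t snapshot -r zfs_vm_proxmox5

# Destroy the leftover source disk together with its snapshots (-r);
# only do this once you are sure the VM now runs from vm-100-disk-2
zfs destroy -r zfs_vm_proxmox5/vm-100-disk-1
```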

Edit:
The VM config:
Code:
root@proxmox5bis:~# qm config 100
bootdisk: scsi0
cores: 1
ide2: none,media=cdrom
memory: 512
name: test
numa: 0
ostype: l26
scsi0: zfs_vm_proxmox5:vm-100-disk-2,format=raw,replicate=0,size=10G
scsihw: virtio-scsi-pci
smbios1: uuid=0fa2b3f7-ff76-405b-a59c-b2aa12d38207
sockets: 1
And pveversion:
Code:
root@proxmox5bis:~# pveversion -v
proxmox-ve: 5.0-19 (running kernel: 4.10.17-2-pve)
pve-manager: 5.0-30 (running version: 5.0-30/5ab26bc)
pve-kernel-4.10.17-2-pve: 4.10.17-19
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-3
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.9-pve16~bpo90

Any idea?
Thanks
 
Hi,

you are on an outdated version, and local storage migration was very new in that version.
Please update to the current 5.1 and try again.
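On a standard install that is just the usual upgrade path, roughly (assuming a pve-enterprise or pve-no-subscription repository is already configured under /etc/apt/):

```shell
# Refresh package lists and pull in the current 5.x packages
apt-get update
apt-get dist-upgrade

# Reboot so the node runs the new pve-kernel
reboot
```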
 
Hi,
Current 5.1 gives exactly the same result. Either we have a problem with ZFS pool naming or there's a bug somewhere.
 
Did you clean up the unused vdisks?
Did you update both nodes and reboot them?
 
OK, we found a way to make it work:
- Install the two nodes
- Create ZFS pools on both nodes with the same name and add them to their nodes with the same storage ID
- Create the cluster

Then the GUI shows only one ZFS storage, which has the same name on both nodes.

If we create the zpools after joining the cluster, even if we set the same zpool name on both nodes, the storage ID in PVE cannot be the same, and that gives two ZFS storages in the GUI.

I have to admit it's a strange way... but it works. I still can't understand why it doesn't work otherwise.
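The steps above look roughly like this on each node before the cluster join (disk path and storage ID here are placeholders for this sketch, not our actual values):

```shell
# Same pool name on both nodes (/dev/sdb is a placeholder device)
zpool create zfs_vm /dev/sdb

# Same storage ID on both nodes, pointing at that pool
pvesm add zfspool zfs_vm --pool zfs_vm --content images,rootdir
```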
 