Live migration with ZFS failed after latest QEMU updates

bofh1337

Member
Nov 3, 2020
I use 2 nodes with the latest updates installed. I live migrated my VMs to my second node and upgraded the following packages on the first:

Upgrade: pve-qemu-kvm:amd64 (5.1.0-3, 5.1.0-4), pve-manager:amd64 (6.2-14, 6.2-15), qemu-server:amd64 (6.2-17, 6.2-18), libproxmox-backup-qemu0:amd64 (0.7.0-1, 0.7.1-1)


After the update I was still able to migrate the VMs back to the first node. I then did the same updates on the second node and was no longer able to migrate the VMs back to it. (I also shut down a non-essential VM and did an offline migration successfully; only online migrations are failing.)
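
For reference, the offline migration that worked was essentially the same command without the --online flag (placeholder VM ID, exact invocation from memory):

Bash:
# shut the non-essential VM down first, then migrate it offline
qm shutdown <vmid>
qm migrate <vmid> pve-02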
The online migration fails with the following error:


Bash:
root@pve-01:~# /usr/sbin/qm migrate 109 pve-02 --online --with-local-disks
2020-11-03 08:05:27 starting migration of VM 109 to node 'pve-02' (10.0.0.6)
2020-11-03 08:05:27 found local, replicated disk 'zfspool:vm-109-disk-0' (in current VM config)
2020-11-03 08:05:27 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2020-11-03 08:05:27 replicating disk images
2020-11-03 08:05:27 start replication job
2020-11-03 08:05:27 guest => VM 109, running => 3058
2020-11-03 08:05:27 volumes => zfspool:vm-109-disk-0
2020-11-03 08:05:28 freeze guest filesystem
2020-11-03 08:05:29 create snapshot '__replicate_109-0_1604387127__' on zfspool:vm-109-disk-0
2020-11-03 08:05:29 thaw guest filesystem
2020-11-03 08:05:29 using secure transmission, rate limit: 50 MByte/s
2020-11-03 08:05:29 incremental sync 'zfspool:vm-109-disk-0' (__replicate_109-0_1604386832__ => __replicate_109-0_1604387127__)
2020-11-03 08:05:29 using a bandwidth limit of 50000000 bps for transferring 'zfspool:vm-109-disk-0'
2020-11-03 08:05:30 send from @__replicate_109-0_1604386832__ to zfspool/vm-109-disk-0@__replicate_109-0_1604387127__ estimated size is 36.3M
2020-11-03 08:05:30 total estimated size is 36.3M
2020-11-03 08:05:30 TIME        SENT   SNAPSHOT zfspool/vm-109-disk-0@__replicate_109-0_1604387127__
2020-11-03 08:05:30 zfspool/vm-109-disk-0@__replicate_109-0_1604386832__        name    zfspool/vm-109-disk-0@__replicate_109-0_1604386832__ -
2020-11-03 08:05:32 successfully imported 'zfspool:vm-109-disk-0'
2020-11-03 08:05:32 delete previous replication snapshot '__replicate_109-0_1604386832__' on zfspool:vm-109-disk-0
2020-11-03 08:05:33 (remote_finalize_local_job) delete stale replication snapshot '__replicate_109-0_1604386832__' on zfspool:vm-109-disk-0
2020-11-03 08:05:33 end replication job
2020-11-03 08:05:33 copying local disk images
2020-11-03 08:05:33 starting VM 109 on remote node 'pve-02'
2020-11-03 08:05:35 start remote tunnel
2020-11-03 08:05:36 ssh tunnel ver 1
2020-11-03 08:05:36 starting storage migration
2020-11-03 08:05:36 scsi0: start migration to nbd:unix:/run/qemu-server/109_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 131072 bytes remaining: 5046272 bytes total: 5177344 bytes progression: 2.53 % busy: 1 ready: 0
drive-scsi0: transferred: 5177344 bytes remaining: 0 bytes total: 5177344 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
2020-11-03 08:05:37 volume 'zfspool:vm-109-disk-0' is 'zfspool:vm-109-disk-0' on the target
2020-11-03 08:05:37 starting online/live migration on unix:/run/qemu-server/109.migrate
2020-11-03 08:05:37 set migration_caps
2020-11-03 08:05:37 migration speed limit: 8589934592 B/s
2020-11-03 08:05:37 migration downtime limit: 100 ms
2020-11-03 08:05:37 migration cachesize: 536870912 B
2020-11-03 08:05:37 set migration parameters
2020-11-03 08:05:37 spice client_migrate_info
2020-11-03 08:05:37 start migrate command to unix:/run/qemu-server/109.migrate
channel 4: open failed: connect failed: open failed
2020-11-03 08:05:38 migration status error: failed
2020-11-03 08:05:38 ERROR: online migrate failure - aborting
2020-11-03 08:05:38 aborting phase 2 - cleanup resources
2020-11-03 08:05:38 migrate_cancel
drive-scsi0: Cancelling block job
channel 3: open failed: connect failed: open failed
drive-scsi0: Done.
2020-11-03 08:05:38 scsi0: removing block-dirty-bitmap 'repl_scsi0'
2020-11-03 08:05:40 ERROR: migration finished with problems (duration 00:00:13)
migration problems





Bash:
root@pve-01:~# pveversion -v
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-9
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.4-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-6
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-4
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-18
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2
Do you see any errors on the target node, e.g. in the journal?
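
For example, on the target node, either follow the journal live while retrying the migration, or filter for error-priority messages from the current boot:

Bash:
# follow the journal while re-running the migration
journalctl -f
# or: show only error-priority messages since the current boot
journalctl -b -p err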
 
Yes, I do:

Nov 03 08:48:41 pve-01 systemd[1]: Started Session 5982 of user root.
Nov 03 08:48:43 pve-01 QEMU[24542]: kvm: Unable to read node name string
Nov 03 08:48:43 pve-01 QEMU[24542]: kvm: error while loading state for instance 0x0 of device 'dirty-bitmap'
Nov 03 08:48:43 pve-01 QEMU[24542]: kvm: load of migration failed: Invalid argument
Nov 03 08:48:43 pve-01 kernel: vmbr0v40: port 2(tap103i1) entered disabled state
Nov 03 08:48:43 pve-01 kernel: vmbr0: port 2(tap103i0) entered disabled state
Nov 03 08:48:43 pve-01 sshd[24600]: error: connect to /run/qemu-server/103_nbd.migrate port -2 failed: Connection refused
Nov 03 08:48:43 pve-01 systemd[1]: 103.scope: Succeeded.
Nov 03 08:48:44 pve-01 sshd[24600]: error: connect to /run/qemu-server/103_nbd.migrate port -2 failed: Connection refused
Nov 03 08:48:44 pve-01 sshd[24717]: Accepted publickey for root from 10.0.0.6 port 36656 ssh2: RSA SHA256:XXXXXX
Nov 03 08:48:44 pve-01 sshd[24717]: pam_unix(sshd:session): session opened for user root by (uid=0)
Nov 03 08:48:44 pve-01 systemd-logind[1507]: New session 5983 of user root.
 
So you are migrating from an updated to a not-yet-updated host. This is not guaranteed to work, and in this case there is an incompatibility that prevents it from working.
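
To rule out a version mismatch, comparing the relevant package versions on both nodes should help, e.g.:

Bash:
# run on both nodes and compare, especially pve-qemu-kvm and qemu-server
pveversion -v | grep -E 'pve-qemu-kvm|qemu-server'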
 
@Stefan_R can reproduce a bug in the new bitmap/state migration handling, irrespective of the new->old direction, when replicated volumes are involved. Hopefully a fix will be available soon!
 
Both nodes have the latest updates installed. I just offline-migrated my VMs to the second, already-rebooted host and rebooted the other host to be sure that both are using the same versions of the libraries. Sadly, live migration is still failing with the latest upgrades.
 
Yeah, see my last post: the bug is already reproducible and Stefan is working on a fix.
 
Problem persists with the latest updates today:

Start-Date: 2020-11-06 09:37:23
Commandline: apt-get dist-upgrade
Install: libyaml-libyaml-perl:amd64 (0.76+repack-1, automatic), libyaml-0-2:amd64 (0.2.1-1, automatic)
Upgrade: proxmox-widget-toolkit:amd64 (2.3-6, 2.3-8), pve-qemu-kvm:amd64 (5.1.0-4, 5.1.0-6), proxmox-backup-client:amd64 (0.9.4-1, 0.9.6-1), libpve-common-perl:amd64 (6.2-2, 6.2-4), qemu-server:amd64 (6.2-18, 6.2-19)
End-Date: 2020-11-06 09:37:32
 
Did you stop and start the VMs to pick up the new QEMU binary?
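
A reboot from inside the guest is not enough, since the QEMU process itself keeps running with the old binary; the VM has to be stopped and started again, e.g.:

Bash:
# stop the VM completely, then start it so the new pve-qemu-kvm binary is used
qm stop 109
qm start 109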
 
