Live migration with ZFS failed after latest QEMU updates

bofh1337

Member
Nov 3, 2020
I use 2 nodes with the latest updates installed. I live migrated my VMs to my second node and upgraded the following packages on the first:

Upgrade: pve-qemu-kvm:amd64 (5.1.0-3, 5.1.0-4), pve-manager:amd64 (6.2-14, 6.2-15), qemu-server:amd64 (6.2-17, 6.2-18), libproxmox-backup-qemu0:amd64 (0.7.0-1, 0.7.1-1)


After the update I was still able to migrate the VMs back to the first node. I then did the same updates on the second node and was no longer able to migrate the VMs back to it. (I also shut down a non-essential VM and did an offline migration successfully; only online migrations are failing.)
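
For reference, the offline migration that worked was essentially the same command without the --online flag (placeholder VM ID, exact invocation from memory):

Bash:
# shut the non-essential VM down first, then migrate it offline
qm shutdown <vmid>
qm migrate <vmid> pve-02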
The online migration fails with the following error:


Bash:
root@pve-01:~# /usr/sbin/qm migrate 109 pve-02 --online --with-local-disks
2020-11-03 08:05:27 starting migration of VM 109 to node 'pve-02' (10.0.0.6)
2020-11-03 08:05:27 found local, replicated disk 'zfspool:vm-109-disk-0' (in current VM config)
2020-11-03 08:05:27 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2020-11-03 08:05:27 replicating disk images
2020-11-03 08:05:27 start replication job
2020-11-03 08:05:27 guest => VM 109, running => 3058
2020-11-03 08:05:27 volumes => zfspool:vm-109-disk-0
2020-11-03 08:05:28 freeze guest filesystem
2020-11-03 08:05:29 create snapshot '__replicate_109-0_1604387127__' on zfspool:vm-109-disk-0
2020-11-03 08:05:29 thaw guest filesystem
2020-11-03 08:05:29 using secure transmission, rate limit: 50 MByte/s
2020-11-03 08:05:29 incremental sync 'zfspool:vm-109-disk-0' (__replicate_109-0_1604386832__ => __replicate_109-0_1604387127__)
2020-11-03 08:05:29 using a bandwidth limit of 50000000 bps for transferring 'zfspool:vm-109-disk-0'
2020-11-03 08:05:30 send from @__replicate_109-0_1604386832__ to zfspool/vm-109-disk-0@__replicate_109-0_1604387127__ estimated size is 36.3M
2020-11-03 08:05:30 total estimated size is 36.3M
2020-11-03 08:05:30 TIME        SENT   SNAPSHOT zfspool/vm-109-disk-0@__replicate_109-0_1604387127__
2020-11-03 08:05:30 zfspool/vm-109-disk-0@__replicate_109-0_1604386832__        name    zfspool/vm-109-disk-0@__replicate_109-0_1604386832__ -
2020-11-03 08:05:32 successfully imported 'zfspool:vm-109-disk-0'
2020-11-03 08:05:32 delete previous replication snapshot '__replicate_109-0_1604386832__' on zfspool:vm-109-disk-0
2020-11-03 08:05:33 (remote_finalize_local_job) delete stale replication snapshot '__replicate_109-0_1604386832__' on zfspool:vm-109-disk-0
2020-11-03 08:05:33 end replication job
2020-11-03 08:05:33 copying local disk images
2020-11-03 08:05:33 starting VM 109 on remote node 'pve-02'
2020-11-03 08:05:35 start remote tunnel
2020-11-03 08:05:36 ssh tunnel ver 1
2020-11-03 08:05:36 starting storage migration
2020-11-03 08:05:36 scsi0: start migration to nbd:unix:/run/qemu-server/109_nbd.migrate:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 131072 bytes remaining: 5046272 bytes total: 5177344 bytes progression: 2.53 % busy: 1 ready: 0
drive-scsi0: transferred: 5177344 bytes remaining: 0 bytes total: 5177344 bytes progression: 100.00 % busy: 0 ready: 1
all mirroring jobs are ready
2020-11-03 08:05:37 volume 'zfspool:vm-109-disk-0' is 'zfspool:vm-109-disk-0' on the target
2020-11-03 08:05:37 starting online/live migration on unix:/run/qemu-server/109.migrate
2020-11-03 08:05:37 set migration_caps
2020-11-03 08:05:37 migration speed limit: 8589934592 B/s
2020-11-03 08:05:37 migration downtime limit: 100 ms
2020-11-03 08:05:37 migration cachesize: 536870912 B
2020-11-03 08:05:37 set migration parameters
2020-11-03 08:05:37 spice client_migrate_info
2020-11-03 08:05:37 start migrate command to unix:/run/qemu-server/109.migrate
channel 4: open failed: connect failed: open failed
2020-11-03 08:05:38 migration status error: failed
2020-11-03 08:05:38 ERROR: online migrate failure - aborting
2020-11-03 08:05:38 aborting phase 2 - cleanup resources
2020-11-03 08:05:38 migrate_cancel
drive-scsi0: Cancelling block job
channel 3: open failed: connect failed: open failed
drive-scsi0: Done.
2020-11-03 08:05:38 scsi0: removing block-dirty-bitmap 'repl_scsi0'
2020-11-03 08:05:40 ERROR: migration finished with problems (duration 00:00:13)
migration problems





Bash:
root@pve-01:~# pveversion -v
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-9
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.4-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-6
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-4
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-18
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2
Do you see any errors on the target node, e.g. in the journal?
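
For example, on the target node, either follow the journal live while retrying the migration, or filter for error-priority messages from the current boot:

Bash:
# follow the journal while re-running the migration
journalctl -f
# or: show only error-priority messages since the current boot
journalctl -b -p err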
 
Yes, I do:

Nov 03 08:48:41 pve-01 systemd[1]: Started Session 5982 of user root.
Nov 03 08:48:43 pve-01 QEMU[24542]: kvm: Unable to read node name string
Nov 03 08:48:43 pve-01 QEMU[24542]: kvm: error while loading state for instance 0x0 of device 'dirty-bitmap'
Nov 03 08:48:43 pve-01 QEMU[24542]: kvm: load of migration failed: Invalid argument
Nov 03 08:48:43 pve-01 kernel: vmbr0v40: port 2(tap103i1) entered disabled state
Nov 03 08:48:43 pve-01 kernel: vmbr0: port 2(tap103i0) entered disabled state
Nov 03 08:48:43 pve-01 sshd[24600]: error: connect to /run/qemu-server/103_nbd.migrate port -2 failed: Connection refused
Nov 03 08:48:43 pve-01 systemd[1]: 103.scope: Succeeded.
Nov 03 08:48:44 pve-01 sshd[24600]: error: connect to /run/qemu-server/103_nbd.migrate port -2 failed: Connection refused
Nov 03 08:48:44 pve-01 sshd[24717]: Accepted publickey for root from 10.0.0.6 port 36656 ssh2: RSA SHA256:XXXXXX
Nov 03 08:48:44 pve-01 sshd[24717]: pam_unix(sshd:session): session opened for user root by (uid=0)
Nov 03 08:48:44 pve-01 systemd-logind[1507]: New session 5983 of user root.
 
So you are migrating from an updated to a not-yet-updated host. This is not guaranteed to work, and in this case there is an incompatibility that prevents it from working.
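
To rule out a version mismatch, comparing the relevant package versions on both nodes should help, e.g.:

Bash:
# run on both nodes and compare, especially pve-qemu-kvm and qemu-server
pveversion -v | grep -E 'pve-qemu-kvm|qemu-server'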
 
@Stefan_R can reproduce a bug in the new bitmap/state migration handling, irrespective of the new->old direction, when replicated volumes are involved. Hopefully a fix will be available soon!
 
Both nodes have the latest updates installed. I just offline-migrated my VMs to the second, already-rebooted host and rebooted the other host to be sure that both are using the same versions of the libraries. Sadly, live migration is still failing with the latest upgrades.
 
Yeah, see my last post: the bug is already reproducible and Stefan is working on a fix.
 
Problem persists with the latest updates today:

Start-Date: 2020-11-06 09:37:23
Commandline: apt-get dist-upgrade
Install: libyaml-libyaml-perl:amd64 (0.76+repack-1, automatic), libyaml-0-2:amd64 (0.2.1-1, automatic)
Upgrade: proxmox-widget-toolkit:amd64 (2.3-6, 2.3-8), pve-qemu-kvm:amd64 (5.1.0-4, 5.1.0-6), proxmox-backup-client:amd64 (0.9.4-1, 0.9.6-1), libpve-common-perl:amd64 (6.2-2, 6.2-4), qemu-server:amd64 (6.2-18, 6.2-19)
End-Date: 2020-11-06 09:37:32
 
Did you stop and start the VMs to pick up the new QEMU binary?
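
A reboot from inside the guest is not enough, since the QEMU process itself keeps running with the old binary; the VM has to be stopped and started again, e.g.:

Bash:
# stop the VM completely, then start it so the new pve-qemu-kvm binary is used
qm stop 109
qm start 109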
 
