[SOLVED] Live migration fails, but works after first offline migration

mrmanuel

New Member
Jul 3, 2024
Hello,

I have a very strange problem. I have two PVE 8.3.0 nodes, each with an AMD Ryzen CPU, 64 GB RAM, a 1 TB NVMe drive with ZFS, and 2x 2.5 Gbit NICs. All VMs replicate from node1 to node2 every 15 minutes. Node1 is the primary node; node2 was only recently added. When I try to live migrate a VM (no matter whether Linux or Windows, with or without an EFI disk) from node1 to node2, I get this error:

Code:
2024-12-20 19:40:45 starting migration of VM 111 to node 'pve02' (10.4.4.241)
2024-12-20 19:40:45 found local, replicated disk 'local-zfs:vm-111-disk-0' (attached)
2024-12-20 19:40:45 virtio0: start tracking writes using block-dirty-bitmap 'repl_virtio0'
2024-12-20 19:40:45 replicating disk images
2024-12-20 19:40:45 start replication job
2024-12-20 19:40:45 guest => VM 111, running => 21415
2024-12-20 19:40:45 volumes => local-zfs:vm-111-disk-0
2024-12-20 19:40:46 freeze guest filesystem
2024-12-20 19:40:46 create snapshot '__replicate_111-0_1734720045__' on local-zfs:vm-111-disk-0
2024-12-20 19:40:46 thaw guest filesystem
2024-12-20 19:40:46 using secure transmission, rate limit: 200 MByte/s
2024-12-20 19:40:46 incremental sync 'local-zfs:vm-111-disk-0' (__replicate_111-0_1734719425__ => __replicate_111-0_1734720045__)
2024-12-20 19:40:46 using a bandwidth limit of 200000000 bytes per second for transferring 'local-zfs:vm-111-disk-0'
2024-12-20 19:40:46 send from @__replicate_111-0_1734719425__ to rpool/data/vm-111-disk-0@__replicate_111-0_1734720045__ estimated size is 2.28M
2024-12-20 19:40:46 total estimated size is 2.28M
2024-12-20 19:40:46 TIME        SENT   SNAPSHOT rpool/data/vm-111-disk-0@__replicate_111-0_1734720045__
2024-12-20 19:40:47 successfully imported 'local-zfs:vm-111-disk-0'
2024-12-20 19:40:47 delete previous replication snapshot '__replicate_111-0_1734719425__' on local-zfs:vm-111-disk-0
2024-12-20 19:40:48 (remote_finalize_local_job) delete stale replication snapshot '__replicate_111-0_1734719425__' on local-zfs:vm-111-disk-0
2024-12-20 19:40:48 end replication job
2024-12-20 19:40:48 starting VM 111 on remote node 'pve02'
2024-12-20 19:40:49 volume 'local-zfs:vm-111-disk-0' is 'local-zfs:vm-111-disk-0' on the target
2024-12-20 19:40:49 start remote tunnel
2024-12-20 19:40:50 ssh tunnel ver 1
2024-12-20 19:40:50 starting storage migration
2024-12-20 19:40:50 virtio0: start migration to nbd:unix:/run/qemu-server/111_nbd.migrate:exportname=drive-virtio0
drive mirror re-using dirty bitmap 'repl_virtio0'
drive mirror is starting for drive-virtio0
drive-virtio0: transferred 0.0 B of 1.4 MiB (0.00%) in 0s
drive-virtio0: transferred 1.4 MiB of 1.4 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2024-12-20 19:40:51 switching mirror jobs to actively synced mode
drive-virtio0: switching to actively synced mode
drive-virtio0: successfully switched to actively synced mode
2024-12-20 19:40:52 starting online/live migration on unix:/run/qemu-server/111.migrate
2024-12-20 19:40:52 set migration capabilities
2024-12-20 19:40:52 migration downtime limit: 100 ms
2024-12-20 19:40:52 migration cachesize: 128.0 MiB
2024-12-20 19:40:52 set migration parameters
2024-12-20 19:40:52 start migrate command to unix:/run/qemu-server/111.migrate
2024-12-20 19:40:53 migration active, transferred 280.3 MiB of 1.0 GiB VM-state, 283.7 MiB/s
2024-12-20 19:40:54 average migration speed: 520.4 MiB/s - downtime 39 ms
2024-12-20 19:40:54 migration status: completed
all 'mirror' jobs are ready
drive-virtio0: Completing block job...
drive-virtio0: Completed successfully.
drive-virtio0: Cancelling block job
drive-virtio0: Done.
2024-12-20 19:40:55 ERROR: online migrate failure - Failed to complete storage migration: block job (mirror) error: drive-virtio0: Input/output error (io-status: ok)
2024-12-20 19:40:55 aborting phase 2 - cleanup resources
2024-12-20 19:40:55 migrate_cancel
2024-12-20 19:40:55 virtio0: removing block-dirty-bitmap 'repl_virtio0'
2024-12-20 19:40:57 ERROR: migration finished with problems (duration 00:00:12)
TASK ERROR: migration problems
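
As the log above shows, the replication job itself completes without errors. For anyone comparing a similar setup: the configured replication jobs and their last sync state can also be inspected on the source node with the pvesr CLI, roughly as sketched below (the job ID 111-0 matches the snapshot names in the log):

Code:
# sketch: inspect storage replication jobs on the source node
pvesr list            # configured jobs and their schedule
pvesr status          # last sync time, duration and state per job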

If I offline migrate the VM from node1 to node2 and then start it on node2, I am able to live migrate the VM from node2 to node1 and back from node1 to node2. But when I shut down the VM and start it again on node1, live migration throws the error above again.
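
For completeness, these are roughly the CLI equivalents of the two migration types I tested (VM 111 and target node pve02 as in the log; whether the extra flag is needed depends on the setup):

Code:
# offline migration (VM shut down first); the replicated local-zfs volume is reused on the target
qm migrate 111 pve02
# live migration of the running VM; --with-local-disks may be required when the VM uses local storage
qm migrate 111 pve02 --online --with-local-disks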

In the system logs I see:

Code:
Dec 20 19:40:45 pve01 pvedaemon[1778]: <root@pam> starting task UPID:pve01:00006C6A:00064488:6765BA2D:qmigrate:111:root@pam:
Dec 20 19:40:48 pve01 pmxcfs[1617]: [status] notice: received log
Dec 20 19:40:49 pve01 pmxcfs[1617]: [status] notice: received log
Dec 20 19:40:53 pve01 kernel:  zd64: p1 p2 < p5 >
Dec 20 19:40:55 pve01 pvedaemon[27754]: VM 111 qmp command failed - VM 111 qmp command 'block-job-cancel' failed - Block job 'drive-virtio0' not found
Dec 20 19:40:56 pve01 pmxcfs[1617]: [status] notice: received log
Dec 20 19:40:56 pve01 pmxcfs[1617]: [status] notice: received log
Dec 20 19:40:57 pve01 pvedaemon[27754]: migration problems
Dec 20 19:40:57 pve01 pvedaemon[1778]: <root@pam> end task UPID:pve01:00006C6A:00064488:6765BA2D:qmigrate:111:root@pam: migration problems

Does anyone have an idea what I could check, or what the problem could be?
 
I finally found the problem. The error message is very misleading.

The CPU type for all VMs was set to host. After I changed it back to the default CPU type x86-64-v2-AES, live migration worked normally.
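
For anyone hitting the same thing: the CPU type can be changed in the GUI under Hardware -> Processors, or from the CLI as sketched below (VM 111 as in my log; as far as I know the change only takes effect after the VM has been fully stopped and started again):

Code:
# change the CPU type of VM 111 back to the default model
qm set 111 --cpu x86-64-v2-AES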

Now it makes sense that the migration worked when the VM was started on the node with the older CPU and fewer ISA features, but not when it was started on the node with the newer CPU, which exposes additional ISA features that the older CPU does not have.
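
The difference in ISA features between the two nodes can be made visible by comparing the CPU flags, for example like this (hostnames pve01/pve02 as in the logs, temp file paths just for illustration):

Code:
# compare the CPU feature flags of both nodes
ssh pve01 'lscpu | grep -i "^Flags"' > /tmp/pve01-flags
ssh pve02 'lscpu | grep -i "^Flags"' > /tmp/pve02-flags
diff /tmp/pve01-flags /tmp/pve02-flags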
 