Migration from one host to another fails

Hi!

We are testing Proxmox before we plan to migrate our VMware cluster to Proxmox. As a first step we set up two hosts as a small cluster and are testing basic things. We are also testing live migration between the two hosts without shared storage.
We were able to migrate two Linux VMs just fine while they were running.
Now we tried to migrate a Windows VM with two larger hard disks, UEFI and TPM. But the migration always fails after the first disk. What are we doing wrong? Or could someone point me to something we should check or do?

Here is the output of the migration job. Are there any other logs or config files that would be useful?
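If it helps, I can also post the VM configuration and package versions from the source node; I am assuming these are the usual commands to collect that (102 is the VM ID in our case):

Code:
# VM configuration (disks, UEFI/TPM devices, etc.)
qm config 102

# Installed Proxmox VE package versions
pveversion -v

# Journal entries from around the time of the failed migration
journalctl --since "2024-03-18 18:00" --until "2024-03-18 18:20"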

Regards
Dennis

Code:
2024-03-18 18:04:09 starting migration of VM 102 to node 'link' (192.168.11.220)
2024-03-18 18:04:09 found local disk 'VM-Storage:102/vm-102-disk-0.qcow2' (attached)
2024-03-18 18:04:09 found local disk 'VM-Storage:102/vm-102-disk-1.qcow2' (attached)
2024-03-18 18:04:09 found local disk 'VM-Storage:102/vm-102-disk-2.qcow2' (attached)
2024-03-18 18:04:09 found generated disk 'local-lvm:vm-102-disk-0' (in current VM config)
2024-03-18 18:04:09 copying local disk images
2024-03-18 18:04:12   Logical volume "vm-102-disk-0" created.
2024-03-18 18:04:12 64+0 records in
2024-03-18 18:04:12 64+0 records out
2024-03-18 18:04:12 4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.60556 s, 6.9 MB/s
2024-03-18 18:04:12 34+60 records in
2024-03-18 18:04:12 34+60 records out
2024-03-18 18:04:12 4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0232518 s, 180 MB/s
2024-03-18 18:04:12 successfully imported 'local-lvm:vm-102-disk-0'
2024-03-18 18:04:12 volume 'local-lvm:vm-102-disk-0' is 'local-lvm:vm-102-disk-0' on the target
2024-03-18 18:04:12 starting VM 102 on remote node 'link'
2024-03-18 18:04:16 volume 'VM-Storage:102/vm-102-disk-2.qcow2' is 'local-lvm:vm-102-disk-1' on the target
2024-03-18 18:04:16 volume 'VM-Storage:102/vm-102-disk-0.qcow2' is 'local-lvm:vm-102-disk-3' on the target
2024-03-18 18:04:16 volume 'VM-Storage:102/vm-102-disk-1.qcow2' is 'local-lvm:vm-102-disk-4' on the target
2024-03-18 18:04:16 [link] Task finished with 2 warning(s)!
2024-03-18 18:04:16 start remote tunnel
2024-03-18 18:04:17 ssh tunnel ver 1
2024-03-18 18:04:17 starting storage migration
2024-03-18 18:04:17 scsi0: start migration to nbd:unix:/run/qemu-server/102_nbd.migrate:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 135.9 MiB of 60.0 GiB (0.22%) in 4m 3s
drive-scsi0: transferred 268.4 MiB of 60.0 GiB (0.44%) in 4m 4s
........
drive-scsi0: transferred 60.2 GiB of 60.2 GiB (100.00%) in 10m 8s
drive-scsi0: transferred 60.2 GiB of 60.2 GiB (100.00%) in 10m 9s
drive-scsi0: transferred 60.2 GiB of 60.2 GiB (100.00%) in 10m 10s
drive-scsi0: transferred 60.2 GiB of 60.2 GiB (100.00%) in 10m 11s
drive-scsi0: transferred 60.2 GiB of 60.2 GiB (100.00%) in 10m 12s
drive-scsi0: transferred 60.2 GiB of 60.2 GiB (100.00%) in 10m 13s
drive-scsi0: transferred 60.2 GiB of 60.2 GiB (100.00%) in 10m 14s
drive-scsi0: transferred 60.3 GiB of 60.4 GiB (99.87%) in 10m 15s, still busy
all 'mirror' jobs are ready
2024-03-18 18:14:32 scsi1: start migration to nbd:unix:/run/qemu-server/102_nbd.migrate:exportname=drive-scsi1
drive mirror is starting for drive-scsi1
drive-scsi1: Cancelling block job
drive-scsi0: Cancelling block job
drive-scsi1: Done.
drive-scsi0: Done.
2024-03-18 18:14:33 ERROR: online migrate failure - block job (mirror) error: drive-scsi1: 'mirror' has been cancelled
2024-03-18 18:14:33 aborting phase 2 - cleanup resources
2024-03-18 18:14:33 migrate_cancel
2024-03-18 18:14:47 ERROR: migration finished with problems (duration 00:10:38)
TASK ERROR: migration problems
 
At first glance it would appear that the environment you have is not well suited for live migration with local disks.
I am guessing it's not an apples-to-apples comparison to VMware, as VMware is probably using shared storage?

Keep in mind that live migration with local storage has to copy the entire disk contents from one node to the other. We can see that it's taking quite a bit of time. One guess is that the failure at the end is due to a disk flush happening. But for an exact answer there would have to be a comprehensive system and log review.

I would suggest starting with something smaller: build a quick small VM - it doesn't even need to be able to boot an OS - and see if migration succeeds.
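As a rough sketch (VM ID 999 and the names here are just placeholders - adjust the bridge and storage names to your setup), something like this should be enough for the test:

Code:
# Create a tiny test VM with a single 4 GB disk on local storage (no OS needed)
qm create 999 --name migtest --memory 512 --net0 virtio,bridge=vmbr0 --scsi0 local-lvm:4

# Start it so the migration is a live one
qm start 999

# Live-migrate it, copying the local disk to the other node's local-lvm
qm migrate 999 link --online --with-local-disks --targetstorage local-lvm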

Good luck.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
It is just for some tests, so yes, the network and hosts aren't suited for real workloads. It's just to compare the key functionality between VMware and Proxmox and get an idea of what we can expect: what will still work, what will not work, and what will be new features we couldn't use with VMware.
I just wanted to test live migration between hosts that do not share storage, which isn't possible in VMware, and I was curious how this works in Proxmox.
So, I didn't get this running for that Windows VM, and I don't know why. I'm now trying to migrate with shared SMB storage, which seems to work. But if there is functionality like live migration with no shared storage, why not use it? Perhaps someone else can point me to the right place to look to find out why my first try didn't work.
 