Live migration failed - There's a migration process in progress

Currently we are trying to live migrate a VM to another server within the same cluster.
The first migration successfully migrated all the attached disks but then hung at the "VM-state" migration step.
After 15 minutes of no progress, I pressed the "Stop" button to abort the migration.
Code:
2023-12-18 12:01:38 starting online/live migration on tcp:10.40.4.72:60000
2023-12-18 12:01:38 set migration capabilities
2023-12-18 12:01:38 migration downtime limit: 100 ms
2023-12-18 12:01:38 migration cachesize: 2.0 GiB
2023-12-18 12:01:38 set migration parameters
2023-12-18 12:01:38 start migrate command to tcp:10.40.4.72:60000
2023-12-18 12:19:07 ERROR: online migrate failure - interrupted by signal
2023-12-18 12:19:07 aborting phase 2 - cleanup resources
2023-12-18 12:19:07 migrate_cancel
drive-scsi0: Cancelling block job
drive-scsi5: Cancelling block job
drive-scsi4: Cancelling block job
drive-scsi2: Cancelling block job
drive-scsi6: Cancelling block job
drive-scsi1: Cancelling block job
drive-scsi3: Cancelling block job
drive-scsi0: Done.
drive-scsi5: Done.
drive-scsi4: Done.
drive-scsi2: Done.
drive-scsi6: Done.
drive-scsi1: Done.
drive-scsi3: Done.

Now when we try to start a new migration, the disks again migrate successfully to the new hypervisor, but it still fails at the "VM-state" migration step.
However, the error now states that there is already a migration process in progress.
Code:
2023-12-18 12:30:49 starting online/live migration on tcp:10.40.4.72:60000
2023-12-18 12:30:49 set migration capabilities
VM 162 qmp command 'migrate-set-capabilities' failed - There's a migration process in progress
2023-12-18 12:30:49 migration downtime limit: 100 ms
2023-12-18 12:30:49 migration cachesize: 2.0 GiB
2023-12-18 12:30:49 set migration parameters
2023-12-18 12:30:49 start migrate command to tcp:10.40.4.72:60000
2023-12-18 12:30:49 migrate uri => tcp:10.40.4.72:60000 failed: VM 162 qmp command 'migrate' failed - There's a migration process in progress
2023-12-18 12:30:50 ERROR: online migrate failure - VM 162 qmp command 'migrate' failed - There's a migration process in progress
2023-12-18 12:30:50 aborting phase 2 - cleanup resources
2023-12-18 12:30:50 migrate_cancel
drive-scsi0: Cancelling block job
drive-scsi4: Cancelling block job
drive-scsi2: Cancelling block job
drive-scsi5: Cancelling block job
drive-scsi6: Cancelling block job
drive-scsi1: Cancelling block job
drive-scsi3: Cancelling block job
drive-scsi0: Done.
drive-scsi4: Done.
drive-scsi2: Done.
drive-scsi5: Done.
drive-scsi6: Done.
drive-scsi1: Done.
drive-scsi3: Done.
2023-12-18 12:31:00 ERROR: migration finished with problems (duration 00:03:50)
TASK ERROR: migration problems
I have already checked the "ps" output on the hypervisors, but I cannot find any running process that references the VM migration.
Source hypervisor
Code:
proxmox-ve: 8.0.2 (running kernel: 6.2.16-15-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
pve-kernel-6.2: 8.0.5
proxmox-kernel-helper: 8.0.3
proxmox-kernel-6.2.16-15-pve: 6.2.16-15
proxmox-kernel-6.2: 6.2.16-15
proxmox-kernel-6.2.16-12-pve: 6.2.16-12
proxmox-kernel-6.2.16-6-pve: 6.2.16-7
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.4
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.7
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.4
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.2-1
proxmox-backup-file-restore: 3.0.2-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.9
pve-cluster: 8.0.2
pve-container: 5.0.4
pve-docs: 8.0.5
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.8-2
pve-ha-manager: 4.0.2
pve-i18n: 3.0.7
pve-qemu-kvm: 8.0.2-6
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.7
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.13-pve1
Destination hypervisor
Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.11-7-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-7
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx7
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.2-1
proxmox-backup-file-restore: 3.1.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.3
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-2
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.4
pve-qemu-kvm: 8.1.2-4
pve-xtermjs: 5.3.0-2
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
 
Hi,
Currently we are trying to live migrate a VM to another server within the same cluster.
The first migration successfully migrated all the attached disks but then hung at the "VM-state" migration step.
After 15 minutes of no progress, I pressed the "Stop" button to abort the migration.
Code:
2023-12-18 12:01:38 starting online/live migration on tcp:10.40.4.72:60000
2023-12-18 12:01:38 set migration capabilities
2023-12-18 12:01:38 migration downtime limit: 100 ms
2023-12-18 12:01:38 migration cachesize: 2.0 GiB
2023-12-18 12:01:38 set migration parameters
2023-12-18 12:01:38 start migrate command to tcp:10.40.4.72:60000
Should it happen again, please check whether a kvm process was successfully started on the target side and is listening on that port.

2023-12-18 12:19:07 ERROR: online migrate failure - interrupted by signal
2023-12-18 12:19:07 aborting phase 2 - cleanup resources
2023-12-18 12:19:07 migrate_cancel
Now when we try to start a new migration, the disks again migrate successfully to the new hypervisor, but it still fails at the "VM-state" migration step.
However, the error now states that there is already a migration process in progress.
That likely means that the migrate_cancel command above didn't clean up properly for some reason. You can try using qm monitor 162 and re-issuing the command there. But if that doesn't help, I'm afraid it might be necessary to shutdown+start the VM (or reboot in the web interface, not inside the guest).
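For reference, such a session might look roughly like this (just a sketch, with 162 being the VM ID from your logs):
Code:
# on the source node, open the QEMU human monitor for VM 162
qm monitor 162
qm> migrate_cancel
qm> info migrate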

I have already checked the "ps" output on the hypervisors, but I cannot find any running process that references the VM migration.
The migration is done (with a thread) in the QEMU process itself and there is another QEMU instance on the target.

Please share the VM configuration. You could also check whether the migration works with a (more minimal) test VM to see if the problem is specific to this VM or more general.
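For example, the configuration could be dumped with something like this (run on the node where the VM currently resides):
Code:
# print the current configuration of VM 162
qm config 162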
 
Hey Fiona,

Issuing another migrate_cancel command using `qm monitor 162` does not seem to do anything.
There is no command output or syslog entry indicating that it did anything.

I just tried another migration of the VM after issuing migrate_cancel, and it failed with the same error as before:
Code:
2023-12-19 15:56:09 use dedicated network address for sending migration traffic (10.40.4.72)
2023-12-19 15:56:09 starting migration of VM 162 to node '<target node>' (10.40.4.72)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-0' (attached)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-1' (attached)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-2' (attached)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-3' (attached)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-4' (attached)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-5' (attached)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-6' (attached)
2023-12-19 15:56:10 starting VM 162 on remote node '<target node>'
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-0' is 'local-nvme:vm-162-disk-3' on the target
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-1' is 'local-nvme:vm-162-disk-4' on the target
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-2' is 'local-nvme:vm-162-disk-5' on the target
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-3' is 'local-nvme:vm-162-disk-7' on the target
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-4' is 'local-nvme:vm-162-disk-8' on the target
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-5' is 'local-nvme:vm-162-disk-9' on the target
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-6' is 'local-nvme:vm-162-disk-10' on the target
2023-12-19 15:56:16 start remote tunnel
2023-12-19 15:56:17 ssh tunnel ver 1
2023-12-19 15:56:17 starting storage migration
2023-12-19 15:56:17 scsi5: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi5
drive mirror is starting for drive-scsi5
drive-scsi5: transferred 0.0 B of 100.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi5: transferred 100.0 GiB of 100.0 GiB (100.00%) in 1m 11s, ready
all 'mirror' jobs are ready
2023-12-19 15:57:28 scsi6: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi6
drive mirror is starting for drive-scsi6
drive-scsi6: transferred 0.0 B of 100.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi6: transferred 100.0 GiB of 100.0 GiB (100.00%) in 1m 9s, ready
all 'mirror' jobs are ready
2023-12-19 15:58:37 scsi4: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi4
drive mirror is starting for drive-scsi4
drive-scsi4: transferred 0.0 B of 50.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi4: transferred 50.0 GiB of 50.0 GiB (100.00%) in 36s, ready
all 'mirror' jobs are ready
2023-12-19 15:59:13 scsi1: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi1
drive mirror is starting for drive-scsi1
drive-scsi1: transferred 0.0 B of 10.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi1: transferred 10.0 GiB of 10.0 GiB (100.00%) in 7s, ready
all 'mirror' jobs are ready
2023-12-19 15:59:20 scsi0: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 0.0 B of 22.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi0: transferred 22.0 GiB of 22.0 GiB (100.00%) in 15s, ready
all 'mirror' jobs are ready
2023-12-19 15:59:35 scsi3: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi3
drive mirror is starting for drive-scsi3
drive-scsi3: transferred 0.0 B of 10.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi3: transferred 10.0 GiB of 10.0 GiB (100.00%) in 7s, ready
all 'mirror' jobs are ready
2023-12-19 15:59:42 scsi2: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi2
drive mirror is starting for drive-scsi2
drive-scsi2: transferred 0.0 B of 10.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi2: transferred 10.0 GiB of 10.0 GiB (100.00%) in 7s, ready
all 'mirror' jobs are ready
2023-12-19 15:59:49 starting online/live migration on tcp:10.40.4.72:60000
2023-12-19 15:59:49 set migration capabilities
VM 162 qmp command 'migrate-set-capabilities' failed - There's a migration process in progress
2023-12-19 15:59:49 migration downtime limit: 100 ms
2023-12-19 15:59:49 migration cachesize: 2.0 GiB
2023-12-19 15:59:49 set migration parameters
2023-12-19 15:59:49 start migrate command to tcp:10.40.4.72:60000
2023-12-19 15:59:49 migrate uri => tcp:10.40.4.72:60000 failed: VM 162 qmp command 'migrate' failed - There's a migration process in progress
2023-12-19 15:59:50 ERROR: online migrate failure - VM 162 qmp command 'migrate' failed - There's a migration process in progress
2023-12-19 15:59:50 aborting phase 2 - cleanup resources
2023-12-19 15:59:50 migrate_cancel
drive-scsi0: Cancelling block job
drive-scsi5: Cancelling block job
drive-scsi2: Cancelling block job
drive-scsi4: Cancelling block job
drive-scsi6: Cancelling block job
drive-scsi1: Cancelling block job
drive-scsi3: Cancelling block job
drive-scsi0: Done.
drive-scsi5: Done.
drive-scsi2: Done.
drive-scsi4: Done.
drive-scsi6: Done.
drive-scsi1: Done.
drive-scsi3: Done.
2023-12-19 16:00:00 ERROR: migration finished with problems (duration 00:03:51)
TASK ERROR: migration problems
Code:
agent: enabled=1,freeze-fs-on-backup=1,fstrim_cloned_disks=1
balloon: 0
boot: order=ide0;scsi0
cores: 2
cpu: cputype=EPYC-v3
ide0: none,media=cdrom
memory: 16384
meta: creation-qemu=8.0.2,ctime=1697495165
name: <redacted>
net0: virtio=<redacted>,bridge=vmbr0,firewall=0,mtu=1500,tag=<redacted>
numa: 1
ostype: l26
scsi0: local-nvme:vm-162-disk-0,discard=ignore,format=raw,iothread=1,size=22G,ssd=1
scsi1: local-nvme:vm-162-disk-1,discard=ignore,format=raw,iothread=1,size=10G,ssd=1
scsi2: local-nvme:vm-162-disk-2,discard=ignore,format=raw,iothread=1,size=10G,ssd=1
scsi3: local-nvme:vm-162-disk-3,discard=ignore,format=raw,iothread=1,size=10G,ssd=1
scsi4: local-nvme:vm-162-disk-4,discard=ignore,format=raw,iothread=1,size=50G,ssd=1
scsi5: local-nvme:vm-162-disk-5,discard=ignore,format=raw,iothread=1,size=100G,ssd=1
scsi6: local-nvme:vm-162-disk-6,discard=ignore,format=raw,iothread=1,size=100G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=<redacted>
sockets: 2
vmgenid: <redacted>
If there is no other way to fix it, I can restart the VM as a last resort, but this is not really preferred as it is a customer's VM.
A way to reset the state without restarting would therefore be much preferred.
 
Hi @fiona ,

I work with Thaillie and have some additional information.

In qm monitor, this is what we see for the migration:
```
qm> info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
Migration status: cancelling
total time: 363778293 ms
expected downtime: 0 ms
setup: 0 ms
transferred ram: 0 kbytes
throughput: 0.00 mbps
remaining ram: 0 kbytes
total ram: 16794440 kbytes
duplicate: 0 pages
skipped: 0 pages
normal: 0 pages
normal bytes: 0 kbytes
dirty sync count: 0
page size: 4 kbytes
multifd bytes: 0 kbytes
pages-per-second: 0
cache size: 2147483648 bytes
xbzrle transferred: 0 kbytes
xbzrle pages: 0 pages
xbzrle cache miss: 0 pages
xbzrle cache miss rate: 0.00
xbzrle encoding rate: 0.00
xbzrle overflow: 0
```

It seems to be stuck on cancelling.
I also noticed that on the destination hypervisor we tried to migrate the VM to, some disks with this VM ID are visible in the LVM-thin storage.

Perhaps we need to remove those manually? Could you advise on the best course of action?
 
As an update from our side, it appears there is a TCP connection "stuck". I guess this is why it is in the cancelling state?

Code:
kvm     28475 root  219u     IPv4         1008176839      0t0    TCP 10.40.4.71:51308->10.40.4.72:60000 (CLOSE_WAIT)
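For completeness, the socket state can be re-checked with something along these lines (PID 28475 and port 60000 taken from the output above; the exact commands are just a sketch):
Code:
# list the TCP sockets held open by the kvm process
lsof -nP -iTCP -a -p 28475
# or show sockets stuck in CLOSE_WAIT towards the migration port
ss -tnp state close-wait '( dport = :60000 )'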
 
Issuing another migrate_cancel command using `qm monitor 162` does not seem to do anything.
That's unfortunate.
If there is no other way to fix it, I can restart the VM as a last resort, but this is not really preferred as it is a customer's VM.
A way to reset the state without restarting would therefore be much preferred.
I understand, but migrate_cancel would be the command for that. It seems there's a bug in there or a deeper issue.

It seems to be stuck on cancelling.
I also noticed that on the destination hypervisor we tried to migrate the VM to, some disks with this VM ID are visible in the LVM-thin storage.
The disks are migrated first and they are usually cleaned up when migration is aborted. But likely there was an issue with that too.
Perhaps we need to remove those manually? Could you advise on the best course of action?
Yes, I'd guess that those disks are leftovers from a failed migration. You can check the sizes (and, to be really sure, the contents) to verify.
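For example, something along these lines on the target node should list them (the storage name local-nvme is taken from your VM configuration; just a sketch):
Code:
# list volumes on the LVM-thin storage and filter for VM 162
pvesm list local-nvme | grep vm-162
# or look at the thin LVs and their sizes directly
lvs | grep vm-162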

As an update from our side, it appears there is a TCP connection "stuck". I guess this is why it is in the cancelling state?

Code:
kvm     28475 root  219u     IPv4         1008176839      0t0    TCP 10.40.4.71:51308->10.40.4.72:60000 (CLOSE_WAIT)
Is there still a kvm process for the corresponding VM ID running on the target? If yes, you might want to try and terminate that one. Otherwise, you'd need to find out why the connection is stuck.
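A rough sketch of how that could look on the target node (the -id 162 match is an assumption about how the Proxmox-started kvm process can usually be identified; replace <PID> with the actual number):
Code:
# on the target node, look for a leftover kvm process for VM 162
pgrep -af 'kvm.*-id 162'
# if only the stale incoming-migration instance shows up, terminate it
kill <PID>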
 
