Live migration failed - There's a migration process in progress

Currently we are trying to live migrate a VM to another server within the same cluster.
The first migration successfully migrated all the attached disks but then hung at the "VM-state" migration step.
After 15 minutes of no progress, I pressed the "Stop" button to abort the migration.
Code:
2023-12-18 12:01:38 starting online/live migration on tcp:10.40.4.72:60000
2023-12-18 12:01:38 set migration capabilities
2023-12-18 12:01:38 migration downtime limit: 100 ms
2023-12-18 12:01:38 migration cachesize: 2.0 GiB
2023-12-18 12:01:38 set migration parameters
2023-12-18 12:01:38 start migrate command to tcp:10.40.4.72:60000
2023-12-18 12:19:07 ERROR: online migrate failure - interrupted by signal
2023-12-18 12:19:07 aborting phase 2 - cleanup resources
2023-12-18 12:19:07 migrate_cancel
drive-scsi0: Cancelling block job
drive-scsi5: Cancelling block job
drive-scsi4: Cancelling block job
drive-scsi2: Cancelling block job
drive-scsi6: Cancelling block job
drive-scsi1: Cancelling block job
drive-scsi3: Cancelling block job
drive-scsi0: Done.
drive-scsi5: Done.
drive-scsi4: Done.
drive-scsi2: Done.
drive-scsi6: Done.
drive-scsi1: Done.
drive-scsi3: Done.

Now when we try to start a new migration, the disks again migrate successfully to the new hypervisor, but it still fails at the "VM-state" migration step.
However, the error now states that there is already a migration process in progress.
Code:
2023-12-18 12:30:49 starting online/live migration on tcp:10.40.4.72:60000
2023-12-18 12:30:49 set migration capabilities
VM 162 qmp command 'migrate-set-capabilities' failed - There's a migration process in progress
2023-12-18 12:30:49 migration downtime limit: 100 ms
2023-12-18 12:30:49 migration cachesize: 2.0 GiB
2023-12-18 12:30:49 set migration parameters
2023-12-18 12:30:49 start migrate command to tcp:10.40.4.72:60000
2023-12-18 12:30:49 migrate uri => tcp:10.40.4.72:60000 failed: VM 162 qmp command 'migrate' failed - There's a migration process in progress
2023-12-18 12:30:50 ERROR: online migrate failure - VM 162 qmp command 'migrate' failed - There's a migration process in progress
2023-12-18 12:30:50 aborting phase 2 - cleanup resources
2023-12-18 12:30:50 migrate_cancel
drive-scsi0: Cancelling block job
drive-scsi4: Cancelling block job
drive-scsi2: Cancelling block job
drive-scsi5: Cancelling block job
drive-scsi6: Cancelling block job
drive-scsi1: Cancelling block job
drive-scsi3: Cancelling block job
drive-scsi0: Done.
drive-scsi4: Done.
drive-scsi2: Done.
drive-scsi5: Done.
drive-scsi6: Done.
drive-scsi1: Done.
drive-scsi3: Done.
2023-12-18 12:31:00 ERROR: migration finished with problems (duration 00:03:50)
TASK ERROR: migration problems
I have already checked the "ps" output on the hypervisors, but I cannot find any running process that references the VM migration.
Source hypervisor
Code:
proxmox-ve: 8.0.2 (running kernel: 6.2.16-15-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
pve-kernel-6.2: 8.0.5
proxmox-kernel-helper: 8.0.3
proxmox-kernel-6.2.16-15-pve: 6.2.16-15
proxmox-kernel-6.2: 6.2.16-15
proxmox-kernel-6.2.16-12-pve: 6.2.16-12
proxmox-kernel-6.2.16-6-pve: 6.2.16-7
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.4
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.7
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.4
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.2-1
proxmox-backup-file-restore: 3.0.2-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.9
pve-cluster: 8.0.2
pve-container: 5.0.4
pve-docs: 8.0.5
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.8-2
pve-ha-manager: 4.0.2
pve-i18n: 3.0.7
pve-qemu-kvm: 8.0.2-6
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.7
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.13-pve1
Destination hypervisor
Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.11-7-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5: 6.5.11-7
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph-fuse: 17.2.7-pve1
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx7
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.2-1
proxmox-backup-file-restore: 3.1.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.3
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-2
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.4
pve-qemu-kvm: 8.1.2-4
pve-xtermjs: 5.3.0-2
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve1
 
Hi,
Currently we are trying to live migrate a VM to another server within the same cluster.
The first migration successfully migrated all the attached disks but then hung at the "VM-state" migration step.
After 15 minutes of no progress, I pressed the "Stop" button to abort the migration.
Code:
2023-12-18 12:01:38 starting online/live migration on tcp:10.40.4.72:60000
2023-12-18 12:01:38 set migration capabilities
2023-12-18 12:01:38 migration downtime limit: 100 ms
2023-12-18 12:01:38 migration cachesize: 2.0 GiB
2023-12-18 12:01:38 set migration parameters
2023-12-18 12:01:38 start migrate command to tcp:10.40.4.72:60000
Should it happen again, please check whether a kvm process was successfully started on the target side and is listening on that port.

2023-12-18 12:19:07 ERROR: online migrate failure - interrupted by signal
2023-12-18 12:19:07 aborting phase 2 - cleanup resources
2023-12-18 12:19:07 migrate_cancel
Now when we try to start a new migration, the disks again migrate successfully to the new hypervisor, but it still fails at the "VM-state" migration step.
However, the error now states that there is already a migration process in progress.
That likely means that the migrate_cancel command above didn't clean up properly for some reason. You can try using qm monitor 162 and re-issuing the command there. But if that doesn't help, I'm afraid it might be necessary to shutdown+start the VM (or reboot in the web interface, not inside the guest).
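For reference, such a session might look roughly like this (just a sketch, with 162 being the VM ID from your logs):
Code:
# on the source node, open the QEMU human monitor for VM 162
qm monitor 162
qm> migrate_cancel
qm> info migrate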

I have already checked the "ps" output on the hypervisors, but I cannot find any running process that references the VM migration.
The migration is done (with a thread) in the QEMU process itself and there is another QEMU instance on the target.

Please share the VM configuration. You could also check whether the migration works with a (more minimal) test VM to see if the problem is specific to this VM or more general.
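For example, the configuration could be dumped with something like this (run on the node where the VM currently resides):
Code:
# print the current configuration of VM 162
qm config 162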
 
Hey Fiona,

Issuing another migrate_cancel command using `qm monitor 162` does not seem to do anything.
There is no command output or syslog entry indicating that it did anything.

I just tried another migration of the VM after issuing migrate_cancel, and it failed with the same error as before:
Code:
2023-12-19 15:56:09 use dedicated network address for sending migration traffic (10.40.4.72)
2023-12-19 15:56:09 starting migration of VM 162 to node '<target node>' (10.40.4.72)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-0' (attached)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-1' (attached)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-2' (attached)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-3' (attached)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-4' (attached)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-5' (attached)
2023-12-19 15:56:10 found local disk 'local-nvme:vm-162-disk-6' (attached)
2023-12-19 15:56:10 starting VM 162 on remote node '<target node>'
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-0' is 'local-nvme:vm-162-disk-3' on the target
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-1' is 'local-nvme:vm-162-disk-4' on the target
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-2' is 'local-nvme:vm-162-disk-5' on the target
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-3' is 'local-nvme:vm-162-disk-7' on the target
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-4' is 'local-nvme:vm-162-disk-8' on the target
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-5' is 'local-nvme:vm-162-disk-9' on the target
2023-12-19 15:56:16 volume 'local-nvme:vm-162-disk-6' is 'local-nvme:vm-162-disk-10' on the target
2023-12-19 15:56:16 start remote tunnel
2023-12-19 15:56:17 ssh tunnel ver 1
2023-12-19 15:56:17 starting storage migration
2023-12-19 15:56:17 scsi5: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi5
drive mirror is starting for drive-scsi5
drive-scsi5: transferred 0.0 B of 100.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi5: transferred 100.0 GiB of 100.0 GiB (100.00%) in 1m 11s, ready
all 'mirror' jobs are ready
2023-12-19 15:57:28 scsi6: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi6
drive mirror is starting for drive-scsi6
drive-scsi6: transferred 0.0 B of 100.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi6: transferred 100.0 GiB of 100.0 GiB (100.00%) in 1m 9s, ready
all 'mirror' jobs are ready
2023-12-19 15:58:37 scsi4: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi4
drive mirror is starting for drive-scsi4
drive-scsi4: transferred 0.0 B of 50.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi4: transferred 50.0 GiB of 50.0 GiB (100.00%) in 36s, ready
all 'mirror' jobs are ready
2023-12-19 15:59:13 scsi1: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi1
drive mirror is starting for drive-scsi1
drive-scsi1: transferred 0.0 B of 10.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi1: transferred 10.0 GiB of 10.0 GiB (100.00%) in 7s, ready
all 'mirror' jobs are ready
2023-12-19 15:59:20 scsi0: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 0.0 B of 22.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi0: transferred 22.0 GiB of 22.0 GiB (100.00%) in 15s, ready
all 'mirror' jobs are ready
2023-12-19 15:59:35 scsi3: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi3
drive mirror is starting for drive-scsi3
drive-scsi3: transferred 0.0 B of 10.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi3: transferred 10.0 GiB of 10.0 GiB (100.00%) in 7s, ready
all 'mirror' jobs are ready
2023-12-19 15:59:42 scsi2: start migration to nbd:10.40.4.72:60001:exportname=drive-scsi2
drive mirror is starting for drive-scsi2
drive-scsi2: transferred 0.0 B of 10.0 GiB (0.00%) in 0s
[... snip ...]
drive-scsi2: transferred 10.0 GiB of 10.0 GiB (100.00%) in 7s, ready
all 'mirror' jobs are ready
2023-12-19 15:59:49 starting online/live migration on tcp:10.40.4.72:60000
2023-12-19 15:59:49 set migration capabilities
VM 162 qmp command 'migrate-set-capabilities' failed - There's a migration process in progress
2023-12-19 15:59:49 migration downtime limit: 100 ms
2023-12-19 15:59:49 migration cachesize: 2.0 GiB
2023-12-19 15:59:49 set migration parameters
2023-12-19 15:59:49 start migrate command to tcp:10.40.4.72:60000
2023-12-19 15:59:49 migrate uri => tcp:10.40.4.72:60000 failed: VM 162 qmp command 'migrate' failed - There's a migration process in progress
2023-12-19 15:59:50 ERROR: online migrate failure - VM 162 qmp command 'migrate' failed - There's a migration process in progress
2023-12-19 15:59:50 aborting phase 2 - cleanup resources
2023-12-19 15:59:50 migrate_cancel
drive-scsi0: Cancelling block job
drive-scsi5: Cancelling block job
drive-scsi2: Cancelling block job
drive-scsi4: Cancelling block job
drive-scsi6: Cancelling block job
drive-scsi1: Cancelling block job
drive-scsi3: Cancelling block job
drive-scsi0: Done.
drive-scsi5: Done.
drive-scsi2: Done.
drive-scsi4: Done.
drive-scsi6: Done.
drive-scsi1: Done.
drive-scsi3: Done.
2023-12-19 16:00:00 ERROR: migration finished with problems (duration 00:03:51)
TASK ERROR: migration problems
Code:
agent: enabled=1,freeze-fs-on-backup=1,fstrim_cloned_disks=1
balloon: 0
boot: order=ide0;scsi0
cores: 2
cpu: cputype=EPYC-v3
ide0: none,media=cdrom
memory: 16384
meta: creation-qemu=8.0.2,ctime=1697495165
name: <redacted>
net0: virtio=<redacted>,bridge=vmbr0,firewall=0,mtu=1500,tag=<redacted>
numa: 1
ostype: l26
scsi0: local-nvme:vm-162-disk-0,discard=ignore,format=raw,iothread=1,size=22G,ssd=1
scsi1: local-nvme:vm-162-disk-1,discard=ignore,format=raw,iothread=1,size=10G,ssd=1
scsi2: local-nvme:vm-162-disk-2,discard=ignore,format=raw,iothread=1,size=10G,ssd=1
scsi3: local-nvme:vm-162-disk-3,discard=ignore,format=raw,iothread=1,size=10G,ssd=1
scsi4: local-nvme:vm-162-disk-4,discard=ignore,format=raw,iothread=1,size=50G,ssd=1
scsi5: local-nvme:vm-162-disk-5,discard=ignore,format=raw,iothread=1,size=100G,ssd=1
scsi6: local-nvme:vm-162-disk-6,discard=ignore,format=raw,iothread=1,size=100G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=<redacted>
sockets: 2
vmgenid: <redacted>
If there is no other way to fix it, I can restart the VM as a last resort, but this is not really preferred as it is a customer's VM.
A way to reset the state without restarting would therefore be much preferred.
 
Hi @fiona ,

I work with Thaillie and have some additional information.

In qm monitor, this is what we see for the migration:
```
qm> info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
Migration status: cancelling
total time: 363778293 ms
expected downtime: 0 ms
setup: 0 ms
transferred ram: 0 kbytes
throughput: 0.00 mbps
remaining ram: 0 kbytes
total ram: 16794440 kbytes
duplicate: 0 pages
skipped: 0 pages
normal: 0 pages
normal bytes: 0 kbytes
dirty sync count: 0
page size: 4 kbytes
multifd bytes: 0 kbytes
pages-per-second: 0
cache size: 2147483648 bytes
xbzrle transferred: 0 kbytes
xbzrle pages: 0 pages
xbzrle cache miss: 0 pages
xbzrle cache miss rate: 0.00
xbzrle encoding rate: 0.00
xbzrle overflow: 0
```

It seems to be stuck on cancelling.
I also noticed that on the destination hypervisor we tried to migrate the VM to, some disks with this VM ID are visible in the LVM-thin storage.

Perhaps we need to remove those manually? Could you advise on the best course of action?
 
As an update from our side, it appears there is a TCP connection "stuck". I guess this is why it is in the cancelling state?

Code:
kvm     28475 root  219u     IPv4         1008176839      0t0    TCP 10.40.4.71:51308->10.40.4.72:60000 (CLOSE_WAIT)
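For completeness, the socket state can be re-checked with something along these lines (PID 28475 and port 60000 taken from the output above; the exact commands are just a sketch):
Code:
# list the TCP sockets held open by the kvm process
lsof -nP -iTCP -a -p 28475
# or show sockets stuck in CLOSE_WAIT towards the migration port
ss -tnp state close-wait '( dport = :60000 )'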
 
Issuing another migrate_cancel command using `qm monitor 162` does not seem to do anything.
That's unfortunate.
If there is no other way to fix it, I can restart the VM as a last resort, but this is not really preferred as it is a customer's VM.
A way to reset the state without restarting would therefore be much preferred.
I understand, but migrate_cancel would be the command for that. It seems there's a bug in there or a deeper issue.

It seems to be stuck on cancelling.
I also noticed that on the destination hypervisor we tried to migrate the VM to, some disks with this VM ID are visible in the LVM-thin storage.
The disks are migrated first and they are usually cleaned up when migration is aborted. But likely there was an issue with that too.
Perhaps we need to remove those manually? Could you advise on the best course of action?
Yes, I'd guess that those disks are leftovers from a failed migration. You can check the sizes (and, to be really sure, the contents) to verify.
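For example, something along these lines on the target node should list them (the storage name local-nvme is taken from your VM configuration; just a sketch):
Code:
# list volumes on the LVM-thin storage and filter for VM 162
pvesm list local-nvme | grep vm-162
# or look at the thin LVs and their sizes directly
lvs | grep vm-162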

As an update from our side, it appears there is a TCP connection "stuck". I guess this is why it is in the cancelling state?

Code:
kvm     28475 root  219u     IPv4         1008176839      0t0    TCP 10.40.4.71:51308->10.40.4.72:60000 (CLOSE_WAIT)
Is there still a kvm process for the corresponding VM ID running on the target? If yes, you might want to try and terminate that one. Otherwise, you'd need to find out why the connection is stuck.
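A rough sketch of how that could look on the target node (the -id 162 match is an assumption about how the Proxmox-started kvm process can usually be identified; replace <PID> with the actual number):
Code:
# on the target node, look for a leftover kvm process for VM 162
pgrep -af 'kvm.*-id 162'
# if only the stale incoming-migration instance shows up, terminate it
kill <PID>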
 
