Live migration issue after upgrade to PVE 7.3

jeinwag · Nov 24, 2022

Hello there,

We're experiencing some issues with live migration after upgrading to PVE 7.3.
Our cluster is a nine-node cluster with external Ceph storage and HA configured.

When rebooting a host from the web UI, virtual machines are migrated off the host just fine and the host is rebooted.
After the reboot, when the VMs are moved back to the previously rebooted machine, migrations sometimes fail and the affected VMs end up in a "paused" state. Manually resuming them works in most cases. Here's the log of a failed migration; proxmox-07 was the rebooted host on which VM 136 originally resided. Only one VM out of 7 was affected.

Code:

task started by HA resource agent
2022-11-24 14:18:38 starting migration of VM 136 to node 'proxmox-07' (192.168.11.21)
2022-11-24 14:18:38 starting VM 136 on remote node 'proxmox-07'
2022-11-24 14:18:39 start remote tunnel
2022-11-24 14:18:40 ssh tunnel ver 1
2022-11-24 14:18:40 starting online/live migration on unix:/run/qemu-server/136.migrate
2022-11-24 14:18:40 set migration capabilities
2022-11-24 14:18:40 migration downtime limit: 100 ms
2022-11-24 14:18:40 migration cachesize: 512.0 MiB
2022-11-24 14:18:40 set migration parameters
2022-11-24 14:18:40 start migrate command to unix:/run/qemu-server/136.migrate
2022-11-24 14:18:41 migration active, transferred 275.6 MiB of 4.0 GiB VM-state, 306.7 MiB/s
2022-11-24 14:18:42 migration active, transferred 507.0 MiB of 4.0 GiB VM-state, 266.7 MiB/s
2022-11-24 14:18:43 migration active, transferred 713.4 MiB of 4.0 GiB VM-state, 266.3 MiB/s
2022-11-24 14:18:44 migration active, transferred 940.9 MiB of 4.0 GiB VM-state, 316.7 MiB/s
2022-11-24 14:18:45 migration active, transferred 1.2 GiB of 4.0 GiB VM-state, 531.6 MiB/s
2022-11-24 14:18:46 migration active, transferred 1.5 GiB of 4.0 GiB VM-state, 478.6 MiB/s
2022-11-24 14:18:47 migration active, transferred 1.8 GiB of 4.0 GiB VM-state, 310.2 MiB/s
2022-11-24 14:18:48 migration active, transferred 2.1 GiB of 4.0 GiB VM-state, 311.3 MiB/s
2022-11-24 14:18:49 migration active, transferred 2.3 GiB of 4.0 GiB VM-state, 360.6 MiB/s
2022-11-24 14:18:50 migration active, transferred 2.6 GiB of 4.0 GiB VM-state, 500.1 MiB/s
2022-11-24 14:18:52 average migration speed: 342.7 MiB/s - downtime 127 ms
2022-11-24 14:18:52 migration status: completed
2022-11-24 14:18:52 ERROR: tunnel replied 'ERR: resume failed - Configuration file 'nodes/proxmox-04/qemu-server/136.conf' does not exist' to command 'resume 136'
2022-11-24 14:18:55 ERROR: migration finished with problems (duration 00:00:17)
TASK ERROR: migration problems

To me it looks like PVE is trying to resume the VM on proxmox-04, even though it's no longer there.
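
To find affected VMs in saved task logs, the error line can be matched directly. This is a minimal sketch, assuming the task log text has been saved to a file; the function name `find_resume_failures` is hypothetical, and the pattern is based only on the error message shown in the log above.

```shell
#!/bin/sh
# find_resume_failures (hypothetical helper): read a migration task log on
# stdin and print "<vmid> <node named in the missing config path>" for each
# "resume failed" error of the kind shown above.
find_resume_failures() {
    sed -n "s|.*resume failed - Configuration file 'nodes/\([^/]*\)/qemu-server/\([0-9][0-9]*\)\.conf' does not exist.*|\2 \1|p"
}

# Example: find_resume_failures < task.log
```

Each VM it reports can then be checked with `qm status <vmid>` on the node it now runs on and, if it is paused, resumed with `qm resume <vmid>`.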

I hope somebody can share some insights on this.

Kind regards,
Julian

href · Dec 2, 2022

We see the same problem. Out of 130 VMs, four failed with the following error; only the VM IDs and the nodes they started on differed each time:

Code:

task started by HA resource agent
2022-12-02 16:08:23 use dedicated network address for sending migration traffic (172.22.0.13)
2022-12-02 16:08:23 starting migration of VM 577 to node 'lab-proxmox3-l-lpg1' (172.22.0.13)
2022-12-02 16:08:23 starting VM 577 on remote node 'lab-proxmox3-l-lpg1'
2022-12-02 16:08:26 start remote tunnel
2022-12-02 16:08:27 ssh tunnel ver 1
2022-12-02 16:08:27 starting online/live migration on tcp:172.22.0.13:60001
2022-12-02 16:08:27 set migration capabilities
2022-12-02 16:08:27 migration downtime limit: 100 ms
2022-12-02 16:08:27 migration cachesize: 128.0 MiB
2022-12-02 16:08:27 set migration parameters
2022-12-02 16:08:27 spice client_migrate_info
2022-12-02 16:08:27 start migrate command to tcp:172.22.0.13:60001
2022-12-02 16:08:28 average migration speed: 1.1 GiB/s - downtime 148 ms
2022-12-02 16:08:28 migration status: completed
2022-12-02 16:08:28 ERROR: tunnel replied 'ERR: resume failed - Configuration file 'nodes/lab-proxmox2-l-lpg1/qemu-server/577.conf' does not exist' to command 'resume 577'
2022-12-02 16:08:29 Waiting for spice server migration
2022-12-02 16:08:30 ERROR: migration finished with problems (duration 00:00:08)
TASK ERROR: migration problems

bbgeek17 · Dec 2, 2022

We've run into this issue in our automated testing environment and reported it via Bugzilla:
https://bugzilla.proxmox.com/show_bug.cgi?id=4372

Proxmox staff proposed a fix, which we tested and confirmed resolves the issue. You can ask in the bug report when the fix will be released as a package. In the meantime, you can also apply it manually in your environment:

Code:

patch -b /usr/share/perl5/PVE/QemuServer.pm pve.patch
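
Before changing an installed file, it may be worth checking that the patch applies cleanly. This is a sketch using GNU patch's `--dry-run` option; the wrapper name `apply_checked` is hypothetical, and `pve.patch` is assumed to be the patch saved from the bug report.

```shell
#!/bin/sh
# apply_checked (hypothetical helper): apply a patch to a single file only if
# a dry run succeeds. -N skips patches that look already applied, and -b keeps
# the original file as "<target>.orig".
apply_checked() {
    target=$1
    patchfile=$2
    if patch --dry-run -N "$target" "$patchfile" >/dev/null; then
        patch -N -b "$target" "$patchfile"
    else
        echo "patch does not apply cleanly to $target; nothing changed" >&2
        return 1
    fi
}

# Example (as root, on each node):
# apply_checked /usr/share/perl5/PVE/QemuServer.pm pve.patch
```

Because of `-b`, the unpatched file stays next to the patched one as `QemuServer.pm.orig`, so the change can be undone by moving that file back.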

jeinwag · Dec 15, 2022

I applied the patch and can confirm that it fixes the issue. Let's hope it will be released soon!

href · Jan 25, 2023

qemu-server 7.3-2 has been released, which includes this fix. It seems to work for us.
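
To check whether a node already has the fixed package, the installed version can be compared against 7.3-2. This is a sketch assuming a Debian-based PVE node where `dpkg` is available; the function name `has_migration_fix` is hypothetical.

```shell
#!/bin/sh
# has_migration_fix (hypothetical helper): succeed if the given qemu-server
# package version is at least 7.3-2, the release named above.
has_migration_fix() {
    dpkg --compare-versions "$1" ge 7.3-2
}

# On a PVE node, read the installed version from the package database:
# installed=$(dpkg-query -W -f='${Version}' qemu-server)
# has_migration_fix "$installed" && echo "fix included" || echo "upgrade needed"
```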
