Live migration issue after upgrade to PVE 7.3

Nov 24, 2022
Hello there,

We're experiencing some issues with live migration after upgrading to PVE 7.3.
Our cluster has 9 nodes, with external Ceph storage and HA configured.

When rebooting a host from the web UI, virtual machines are migrated off the host just fine and the host is rebooted.
After the reboot, when the VMs are moved back to the previously rebooted machine, migrations sometimes fail and the affected VMs end up in a "paused" state. Manually resuming them works in most cases. Here's the log of a failed migration; proxmox-07 was the rebooted host on which VM 136 originally resided. Only one VM out of 7 was affected.

Code:
task started by HA resource agent
2022-11-24 14:18:38 starting migration of VM 136 to node 'proxmox-07' (192.168.11.21)
2022-11-24 14:18:38 starting VM 136 on remote node 'proxmox-07'
2022-11-24 14:18:39 start remote tunnel
2022-11-24 14:18:40 ssh tunnel ver 1
2022-11-24 14:18:40 starting online/live migration on unix:/run/qemu-server/136.migrate
2022-11-24 14:18:40 set migration capabilities
2022-11-24 14:18:40 migration downtime limit: 100 ms
2022-11-24 14:18:40 migration cachesize: 512.0 MiB
2022-11-24 14:18:40 set migration parameters
2022-11-24 14:18:40 start migrate command to unix:/run/qemu-server/136.migrate
2022-11-24 14:18:41 migration active, transferred 275.6 MiB of 4.0 GiB VM-state, 306.7 MiB/s
2022-11-24 14:18:42 migration active, transferred 507.0 MiB of 4.0 GiB VM-state, 266.7 MiB/s
2022-11-24 14:18:43 migration active, transferred 713.4 MiB of 4.0 GiB VM-state, 266.3 MiB/s
2022-11-24 14:18:44 migration active, transferred 940.9 MiB of 4.0 GiB VM-state, 316.7 MiB/s
2022-11-24 14:18:45 migration active, transferred 1.2 GiB of 4.0 GiB VM-state, 531.6 MiB/s
2022-11-24 14:18:46 migration active, transferred 1.5 GiB of 4.0 GiB VM-state, 478.6 MiB/s
2022-11-24 14:18:47 migration active, transferred 1.8 GiB of 4.0 GiB VM-state, 310.2 MiB/s
2022-11-24 14:18:48 migration active, transferred 2.1 GiB of 4.0 GiB VM-state, 311.3 MiB/s
2022-11-24 14:18:49 migration active, transferred 2.3 GiB of 4.0 GiB VM-state, 360.6 MiB/s
2022-11-24 14:18:50 migration active, transferred 2.6 GiB of 4.0 GiB VM-state, 500.1 MiB/s
2022-11-24 14:18:52 average migration speed: 342.7 MiB/s - downtime 127 ms
2022-11-24 14:18:52 migration status: completed
2022-11-24 14:18:52 ERROR: tunnel replied 'ERR: resume failed - Configuration file 'nodes/proxmox-04/qemu-server/136.conf' does not exist' to command 'resume 136'
2022-11-24 14:18:55 ERROR: migration finished with problems (duration 00:00:17)
TASK ERROR: migration problems

To me it looks like PVE is trying to resume the machine on proxmox-04, even though it's no longer there.
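
For anyone who wants to check their own task logs for the same pattern, here is a minimal sketch in Python. It only looks at the log text as posted above: it pulls out the migration target from the "starting migration" line and the node named in the "resume failed" config path, so the two can be compared. The names `ResumeFailure` and `find_resume_failure` are made up for this example and are not part of any Proxmox tooling, and the regular expressions assume the log wording stays exactly as shown in this thread.

```python
import re
from typing import NamedTuple, Optional


class ResumeFailure(NamedTuple):
    """Hypothetical container for the details of one failed resume."""
    vmid: int
    target_node: str   # node the migration was heading to
    config_node: str   # node named in the missing config path


# Patterns assume the exact log wording shown in this thread.
TARGET_RE = re.compile(r"starting migration of VM (\d+) to node '([^']+)'")
RESUME_RE = re.compile(
    r"resume failed - Configuration file "
    r"'nodes/([^/]+)/qemu-server/(\d+)\.conf' does not exist"
)


def find_resume_failure(log_text: str) -> Optional[ResumeFailure]:
    """Return details if the log contains the 'resume failed' error, else None."""
    target = TARGET_RE.search(log_text)
    failure = RESUME_RE.search(log_text)
    if not (target and failure):
        return None
    return ResumeFailure(
        vmid=int(failure.group(2)),
        target_node=target.group(2),
        config_node=failure.group(1),
    )


# Three lines taken from the task log above.
sample = """\
2022-11-24 14:18:38 starting migration of VM 136 to node 'proxmox-07' (192.168.11.21)
2022-11-24 14:18:52 migration status: completed
2022-11-24 14:18:52 ERROR: tunnel replied 'ERR: resume failed - Configuration file 'nodes/proxmox-04/qemu-server/136.conf' does not exist' to command 'resume 136'
"""

result = find_resume_failure(sample)
print(result)
```

For the log above, `config_node` (proxmox-04) differs from `target_node` (proxmox-07), which is the mismatch described here. The sketch only flags the symptom; it says nothing about the cause.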

I hope somebody can share some insights on this.

Kind regards,
Julian
 
We see the same problem. Out of 130 VMs, four failed as shown below; only the VM IDs and the nodes they started on differed each time:

Code:
task started by HA resource agent
2022-12-02 16:08:23 use dedicated network address for sending migration traffic (172.22.0.13)
2022-12-02 16:08:23 starting migration of VM 577 to node 'lab-proxmox3-l-lpg1' (172.22.0.13)
2022-12-02 16:08:23 starting VM 577 on remote node 'lab-proxmox3-l-lpg1'
2022-12-02 16:08:26 start remote tunnel
2022-12-02 16:08:27 ssh tunnel ver 1
2022-12-02 16:08:27 starting online/live migration on tcp:172.22.0.13:60001
2022-12-02 16:08:27 set migration capabilities
2022-12-02 16:08:27 migration downtime limit: 100 ms
2022-12-02 16:08:27 migration cachesize: 128.0 MiB
2022-12-02 16:08:27 set migration parameters
2022-12-02 16:08:27 spice client_migrate_info
2022-12-02 16:08:27 start migrate command to tcp:172.22.0.13:60001
2022-12-02 16:08:28 average migration speed: 1.1 GiB/s - downtime 148 ms
2022-12-02 16:08:28 migration status: completed
2022-12-02 16:08:28 ERROR: tunnel replied 'ERR: resume failed - Configuration file 'nodes/lab-proxmox2-l-lpg1/qemu-server/577.conf' does not exist' to command 'resume 577'
2022-12-02 16:08:29 Waiting for spice server migration
2022-12-02 16:08:30 ERROR: migration finished with problems (duration 00:00:08)
TASK ERROR: migration problems
 
We've run into this issue in our Automated Testing environment and reported it via bugzilla:
https://bugzilla.proxmox.com/show_bug.cgi?id=4372

Proxmox staff proposed a fix, which we fully tested and confirmed resolves the issue. You can ask in the bug report when the fix will be released as a package. In the meantime, you can also apply it manually in your environment:

Code:
patch -b /usr/share/perl5/PVE/QemuServer.pm pve.patch

