Hello there,
we're experiencing some issues with live migration after upgrading to PVE 7.3.
Our cluster has 9 nodes, with external Ceph storage and HA configured.
When rebooting a host from the web UI, virtual machines are migrated off the host just fine and the host is rebooted.
After the reboot, when the VMs are moved back to the previously rebooted host, migrations sometimes fail and the affected VMs end up in a "paused" state. Manually resuming them works in most cases. Below is the log of a failed migration; proxmox-07 was the rebooted host on which VM 136 originally resided. Only one VM out of 7 was affected.
Code:
task started by HA resource agent
2022-11-24 14:18:38 starting migration of VM 136 to node 'proxmox-07' (192.168.11.21)
2022-11-24 14:18:38 starting VM 136 on remote node 'proxmox-07'
2022-11-24 14:18:39 start remote tunnel
2022-11-24 14:18:40 ssh tunnel ver 1
2022-11-24 14:18:40 starting online/live migration on unix:/run/qemu-server/136.migrate
2022-11-24 14:18:40 set migration capabilities
2022-11-24 14:18:40 migration downtime limit: 100 ms
2022-11-24 14:18:40 migration cachesize: 512.0 MiB
2022-11-24 14:18:40 set migration parameters
2022-11-24 14:18:40 start migrate command to unix:/run/qemu-server/136.migrate
2022-11-24 14:18:41 migration active, transferred 275.6 MiB of 4.0 GiB VM-state, 306.7 MiB/s
2022-11-24 14:18:42 migration active, transferred 507.0 MiB of 4.0 GiB VM-state, 266.7 MiB/s
2022-11-24 14:18:43 migration active, transferred 713.4 MiB of 4.0 GiB VM-state, 266.3 MiB/s
2022-11-24 14:18:44 migration active, transferred 940.9 MiB of 4.0 GiB VM-state, 316.7 MiB/s
2022-11-24 14:18:45 migration active, transferred 1.2 GiB of 4.0 GiB VM-state, 531.6 MiB/s
2022-11-24 14:18:46 migration active, transferred 1.5 GiB of 4.0 GiB VM-state, 478.6 MiB/s
2022-11-24 14:18:47 migration active, transferred 1.8 GiB of 4.0 GiB VM-state, 310.2 MiB/s
2022-11-24 14:18:48 migration active, transferred 2.1 GiB of 4.0 GiB VM-state, 311.3 MiB/s
2022-11-24 14:18:49 migration active, transferred 2.3 GiB of 4.0 GiB VM-state, 360.6 MiB/s
2022-11-24 14:18:50 migration active, transferred 2.6 GiB of 4.0 GiB VM-state, 500.1 MiB/s
2022-11-24 14:18:52 average migration speed: 342.7 MiB/s - downtime 127 ms
2022-11-24 14:18:52 migration status: completed
2022-11-24 14:18:52 ERROR: tunnel replied 'ERR: resume failed - Configuration file 'nodes/proxmox-04/qemu-server/136.conf' does not exist' to command 'resume 136'
2022-11-24 14:18:55 ERROR: migration finished with problems (duration 00:00:17)
TASK ERROR: migration problems
To me it looks like PVE is trying to resume the machine on proxmox-04, even though it's no longer there.
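In case anyone wants to check their own task logs for the same pattern, here is a minimal Python sketch that compares the migration target node with the node named in the resume error. The regular expressions are only modelled on the two log lines quoted above, and the function name `find_resume_mismatch` is made up for illustration; other PVE versions may word these messages differently.

```python
import re

# Patterns modelled on the log lines in this post (an assumption, not an official log format).
TARGET_RE = re.compile(r"starting migration of VM (\d+) to node '([^']+)'")
RESUME_ERR_RE = re.compile(
    r"resume failed - Configuration file "
    r"'nodes/([^/]+)/qemu-server/(\d+)\.conf' does not exist"
)


def find_resume_mismatch(log_text):
    """Return (vmid, target_node, node_in_error) when the resume error
    names a different node than the migration target, otherwise None."""
    target = TARGET_RE.search(log_text)
    error = RESUME_ERR_RE.search(log_text)
    if not target or not error:
        return None
    vmid, target_node = target.group(1), target.group(2)
    error_node = error.group(1)
    if error_node != target_node:
        return vmid, target_node, error_node
    return None


# Two lines copied from the task log above.
sample = """\
2022-11-24 14:18:38 starting migration of VM 136 to node 'proxmox-07' (192.168.11.21)
2022-11-24 14:18:52 ERROR: tunnel replied 'ERR: resume failed - Configuration file 'nodes/proxmox-04/qemu-server/136.conf' does not exist' to command 'resume 136'
"""

print(find_resume_mismatch(sample))
```

For the log above, this reports VM 136 with target proxmox-07 and proxmox-04 as the node named in the error, which matches what I'm seeing.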
I hope somebody can share some insights on this.
Kind regards,
Julian