If reboot is triggered, PVE node goes away too fast before HA migration is finished

With PVE 9.0.5, when configuring a VM as an HA resource with state "started" and then rebooting the underlying hypervisor, I noticed the following error message:

Code:
task started by HA resource agent
2025-08-25 15:18:18 conntrack state migration not supported or disabled, active connections might get dropped
2025-08-25 15:18:18 starting migration of VM 101 to node 'sm01a' (10.27.33.1)
2025-08-25 15:18:18 starting VM 101 on remote node 'sm01a'
2025-08-25 15:18:21 start remote tunnel
2025-08-25 15:18:21 ssh tunnel ver 1
2025-08-25 15:18:21 starting online/live migration on unix:/run/qemu-server/101.migrate
2025-08-25 15:18:21 set migration capabilities
2025-08-25 15:18:21 migration downtime limit: 100 ms
2025-08-25 15:18:21 migration cachesize: 1.0 GiB
2025-08-25 15:18:21 set migration parameters
2025-08-25 15:18:21 start migrate command to unix:/run/qemu-server/101.migrate
2025-08-25 15:18:22 average migration speed: 8.0 GiB/s - downtime 6 ms
2025-08-25 15:18:22 migration completed, transferred 20.7 MiB VM-state
2025-08-25 15:18:22 migration status: completed
2025-08-25 15:18:24 ERROR: Cleanup after stopping VM failed - org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2025-08-25 15:18:25 ERROR: migration finished with problems (duration 00:00:07)
TASK ERROR: migration problems

At first I thought it had to do with the nf_conntrack kernel module, which others have reported, but I am already on qemu-server version 9.0.18.

When I do a manual migration, it works fine. The migration also works as expected and without errors when I put the node into maintenance mode via the CLI command (ha-manager crm-command node-maintenance enable nodename). So I suspect that the node goes away too fast to complete everything that would normally be done during a migration.
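
For reference, a minimal sketch of the setup and workaround described above, using the standard ha-manager CLI (VM 101 is taken from the log above; replace 'nodename' with the node you are about to reboot):

Code:
# register the VM as an HA resource with requested state "started"
ha-manager add vm:101 --state started
# verify that the HA stack picked it up
ha-manager status
# workaround that avoids the error: drain the node before rebooting it
ha-manager crm-command node-maintenance enable nodename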

Cheers,
Timo
 
I got the same error messages when updating from 9.0.5 to 9.0.6.
The migration status is "completed", but I see the same "Cleanup after stopping VM failed" errors as above.
 
I experienced the same issue. I applied patches and ran a reboot from the command line. Usually the VMs migrate off and the system reboots; this time there were all sorts of problems, with VM migrations failing and VMs getting rebooted.

Seems like we'll have to put hosts into maintenance mode before applying patches in the future.
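
A rough sketch of that patching workflow, assuming a node named 'pve1' and the standard Proxmox/Debian CLI tools (not an official procedure, just how it could look):

Code:
# drain the node first so HA migrates the VMs away cleanly
ha-manager crm-command node-maintenance enable pve1
# once no HA services are left on pve1, apply the patches and reboot
apt update && apt full-upgrade
reboot
# after the node is back online, let HA move the VMs back
ha-manager crm-command node-maintenance disable pve1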
 
Same issue on a fresh install of PVE 9.0.10. I initiated a node shutdown and the VM migrated successfully, but with an error.

Code:
task started by HA resource agent
2025-09-27 15:45:30 conntrack state migration not supported or disabled, active connections might get dropped
2025-09-27 15:45:30 use dedicated network address for sending migration traffic (10.0.2.74)
2025-09-27 15:45:30 starting migration of VM 2191 to node 'node04' (10.0.2.74)
2025-09-27 15:45:30 starting VM 2191 on remote node 'node04'
2025-09-27 15:45:31 start remote tunnel
2025-09-27 15:45:31 ssh tunnel ver 1
2025-09-27 15:45:31 starting online/live migration on unix:/run/qemu-server/2191.migrate
2025-09-27 15:45:31 set migration capabilities
2025-09-27 15:45:31 migration downtime limit: 100 ms
2025-09-27 15:45:31 migration cachesize: 1.0 GiB
2025-09-27 15:45:31 set migration parameters
2025-09-27 15:45:31 start migrate command to unix:/run/qemu-server/2191.migrate
2025-09-27 15:45:58 average migration speed: 304.2 MiB/s - downtime 45 ms
2025-09-27 15:45:58 migration completed, transferred 7.4 GiB VM-state
2025-09-27 15:45:58 migration status: completed
2025-09-27 15:46:01 ERROR: Cleanup after stopping VM failed - org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2025-09-27 15:46:01 ERROR: migration finished with problems (duration 00:00:31)
TASK ERROR: migration problems

Detailed logs are attached.
 

Same issue on a fresh 9.0.10 install.
It's pretty annoying - this used to be seamless and non-disruptive - now it's not.

My test case was accidentally touching the hypervisor's power button in the DC and triggering a reboot of tens of VMs ... lol
 
We use Proxmox version 9.0.18 and unfortunately cannot compare it with previous versions.

Here is the task log from rebooting the node, when the VM is moved to another node:

Code:
task started by HA resource agent
2025-11-19 12:37:22 conntrack state migration not supported or disabled, active connections might get dropped
2025-11-19 12:37:22 use dedicated network address for sending migration traffic (10.10.50.224)
2025-11-19 12:37:22 starting migration of VM 125 to node 'PVE224' (10.10.50.224)
2025-11-19 12:37:22 starting VM 125 on remote node 'PVE224'
2025-11-19 12:37:26 start remote tunnel
2025-11-19 12:37:27 ssh tunnel ver 1
2025-11-19 12:37:27 starting online/live migration on unix:/run/qemu-server/125.migrate
2025-11-19 12:37:27 set migration capabilities
2025-11-19 12:37:27 migration downtime limit: 100 ms
2025-11-19 12:37:27 migration cachesize: 1.0 GiB
2025-11-19 12:37:27 set migration parameters
2025-11-19 12:37:27 start migrate command to unix:/run/qemu-server/125.migrate
2025-11-19 12:37:28 migration active, transferred 807.3 MiB of 10.0 GiB VM-state, 4.0 GiB/s
2025-11-19 12:37:29 average migration speed: 5.0 GiB/s - downtime 39 ms
2025-11-19 12:37:29 migration completed, transferred 1.0 GiB VM-state
2025-11-19 12:37:29 migration status: completed
2025-11-19 12:37:31 ERROR: Cleanup after stopping VM failed - org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2025-11-19 12:37:32 ERROR: migration finished with problems (duration 00:00:10)
TASK ERROR: migration problems

Here is the task log when the VM is moved back to the original node after the node reboot:

Code:
task started by HA resource agent
2025-11-19 12:41:37 conntrack state migration not supported or disabled, active connections might get dropped
2025-11-19 12:41:37 use dedicated network address for sending migration traffic (10.10.50.223)
2025-11-19 12:41:37 starting migration of VM 125 to node 'PVE223' (10.10.50.223)
2025-11-19 12:41:37 starting VM 125 on remote node 'PVE223'
2025-11-19 12:41:40 start remote tunnel
2025-11-19 12:41:40 ssh tunnel ver 1
2025-11-19 12:41:40 starting online/live migration on unix:/run/qemu-server/125.migrate
2025-11-19 12:41:40 set migration capabilities
2025-11-19 12:41:40 migration downtime limit: 100 ms
2025-11-19 12:41:40 migration cachesize: 1.0 GiB
2025-11-19 12:41:40 set migration parameters
2025-11-19 12:41:40 start migrate command to unix:/run/qemu-server/125.migrate
2025-11-19 12:41:41 migration active, transferred 757.4 MiB of 10.0 GiB VM-state, 1.9 GiB/s
2025-11-19 12:41:42 migration active, transferred 1.0 GiB of 10.0 GiB VM-state, 8.9 GiB/s
2025-11-19 12:41:43 average migration speed: 3.3 GiB/s - downtime 38 ms
2025-11-19 12:41:43 migration completed, transferred 1.0 GiB VM-state
2025-11-19 12:41:43 migration status: completed
2025-11-19 12:41:47 migration finished successfully (duration 00:00:11)
TASK OK

It appears that during the node shutdown, something had already been terminated that is clearly still available when the VM is migrated back afterwards.
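
One way to see what had already been stopped at that point (just a debugging idea, not something from this thread) is to look at the previous boot's journal around the time of the failed cleanup; the D-Bus NoReply error suggests a service the cleanup step talks to was already gone:

Code:
# shutdown sequence of the previous boot around the failing cleanup (times from the log above)
journalctl -b -1 --since "2025-11-19 12:37:20" --until "2025-11-19 12:37:35"
# focus on the HA local resource manager and D-Bus
journalctl -b -1 -u pve-ha-lrm -u dbus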

Best regards

Bjoern
 
Exact same problem here with the latest release, 9.1.1. By the way, I got the exact same error after launching qm migrate from an old node on version 8.1 to migrate a VM to a new cluster (running 9.1.1)... I initially thought it was due to the EFI disk and TPM disk, but I was wrong; I also got it on a VM with SeaBIOS... I don't understand it, but it's pretty disappointing.
Regards
Andrea
 
I am facing the same issue with PVE version 9.1.2 in an HA cluster whenever I want to shut down or reboot a host for maintenance. Therefore I would like to share a workaround for when the problem happens, and a way to avoid the problem. Both worked fine for me.
  • During a host shutdown with the HA shutdown policy "migrate": in that scenario, the VMs with an HA setup will fail migration, and because those VMs are not migrated to a different host, the host does not shut down or reboot (you will see a lot of "Migration failed" entries in the task log).
    Workaround: in "Datacenter -> HA" change the HA state to "stopped"; the VM will gracefully shut down, followed by the host shutdown/reboot. After the host is up again, change the HA state to "started" and the VM starts (see the CLI sketch below).
  • To avoid the problem, this preparation before the host reboot does the trick: change "Datacenter -> HA -> Affinity rules" to force the migration of the VMs to another host. When the migration has completed, shutdown/restart of the host works fine.
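
The same stop/start workaround can also be done from the shell; a rough sketch (vm:100 is a placeholder HA service ID, adjust to your own):

Code:
# before the reboot: ask HA to shut the VM down gracefully
ha-manager set vm:100 --state stopped
# ... reboot the host ...
# after the host is back up: start the VM again via HA
ha-manager set vm:100 --state started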
I hope this helps :)
 
Another way to migrate all VMs off a node (and have them migrate back to the same server afterwards) is to enable maintenance mode. On any server in the cluster run:

Code:
ha-manager crm-command node-maintenance enable pve1
# wait for the migrations to finish, then reboot the node
ha-manager crm-command node-maintenance disable pve1
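
To see when the node is actually drained, one can watch the HA status until no services are reported on it any more (standard ha-manager output; details may differ between versions):

Code:
# list HA services and their current nodes; repeat until none remain on pve1
ha-manager status
# or refresh automatically every few seconds
watch -n 5 ha-manager status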
 
qemu-server 9.1.3, available on pve-test now, should fix this issue. Affected VMs do need to be stopped and started (or live-migrated *outside* of a node reboot/shutdown!) for the fix to take effect.
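
A minimal sketch of applying the fix, assuming the pve-test repository is already enabled on the node (VM 101 and node 'sm01a' are just the examples from the first log; substitute your own):

Code:
apt update && apt install qemu-server
# then stop/start each affected VM once so it picks up the fixed code ...
qm shutdown 101 && qm start 101
# ... or live-migrate it once, outside of a node reboot/shutdown
qm migrate 101 sm01a --online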
 
I can confirm that with qemu-server 9.1.3 installed, the problem is solved. Tested with PVE 9.1.2 and the updated qemu-server package. Many thanks for the fix :)
 