If a reboot is triggered, the PVE node goes away too fast, before HA migration is finished

tvtue

Nov 22, 2024
With PVE 9.0.5, when I configure a VM as an HA resource with state "started" and then reboot the underlying hypervisor, I get the following error message:

Code:
task started by HA resource agent
2025-08-25 15:18:18 conntrack state migration not supported or disabled, active connections might get dropped
2025-08-25 15:18:18 starting migration of VM 101 to node 'sm01a' (10.27.33.1)
2025-08-25 15:18:18 starting VM 101 on remote node 'sm01a'
2025-08-25 15:18:21 start remote tunnel
2025-08-25 15:18:21 ssh tunnel ver 1
2025-08-25 15:18:21 starting online/live migration on unix:/run/qemu-server/101.migrate
2025-08-25 15:18:21 set migration capabilities
2025-08-25 15:18:21 migration downtime limit: 100 ms
2025-08-25 15:18:21 migration cachesize: 1.0 GiB
2025-08-25 15:18:21 set migration parameters
2025-08-25 15:18:21 start migrate command to unix:/run/qemu-server/101.migrate
2025-08-25 15:18:22 average migration speed: 8.0 GiB/s - downtime 6 ms
2025-08-25 15:18:22 migration completed, transferred 20.7 MiB VM-state
2025-08-25 15:18:22 migration status: completed
2025-08-25 15:18:24 ERROR: Cleanup after stopping VM failed - org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2025-08-25 15:18:25 ERROR: migration finished with problems (duration 00:00:07)
TASK ERROR: migration problems

At first I thought it had to do with the nf_conntrack kernel module, which others have reported problems with, but I am already on qemu-server version 9.0.18.

A manual migration works fine. Putting the node into maintenance mode via the CLI (ha-manager crm-command node-maintenance enable nodename) also migrates the VMs as expected, without any error. So I suspect that, on reboot, the node goes away too fast to complete all the steps that a normal migration would perform.
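For anyone who wants to script the maintenance-mode workaround instead of a plain reboot, here is a minimal Python sketch. It enables maintenance mode with the same ha-manager command as above, polls ha-manager status until no active HA resources remain on the node, and only then reboots. The output format the regex expects from ha-manager status (lines like 'service vm:101 (sm01a, started)'), the set of states treated as idle, and the helper names services_on_node and drain_and_reboot are my own assumptions, so check them against your PVE version before relying on this.

```python
"""Hypothetical helper: drain a PVE node via HA maintenance mode, then reboot.

Assumes `ha-manager status` prints one line per HA resource in the form
'service vm:101 (sm01a, started)'. Verify this against your PVE version.
"""
import re
import subprocess
import sys
import time

# Matches e.g. "service vm:101 (sm01a, started)" (assumed output format).
SERVICE_RE = re.compile(r"^service\s+(\S+)\s+\(([^,]+),\s*([^)]+)\)")

# States assumed not to need moving before the node reboots.
IDLE_STATES = {"stopped", "disabled", "ignored"}


def services_on_node(status_text: str, node: str) -> list[str]:
    """Return the IDs of HA resources that are still active on `node`."""
    busy = []
    for line in status_text.splitlines():
        match = SERVICE_RE.match(line.strip())
        if not match:
            continue  # quorum/master/lrm lines and anything unexpected
        sid, svc_node, state = match.groups()
        if svc_node.strip() == node and state.strip() not in IDLE_STATES:
            busy.append(sid)
    return busy


def drain_and_reboot(node: str, timeout_s: int = 600, poll_s: int = 5) -> None:
    """Enable maintenance mode, wait for HA resources to leave, then reboot."""
    # Same command that migrated the VMs cleanly in the original post.
    subprocess.run(
        ["ha-manager", "crm-command", "node-maintenance", "enable", node],
        check=True,
    )
    deadline = time.monotonic() + timeout_s
    while True:
        status = subprocess.run(
            ["ha-manager", "status"], check=True, capture_output=True, text=True
        ).stdout
        busy = services_on_node(status, node)
        if not busy:
            break
        if time.monotonic() > deadline:
            # Maintenance mode stays enabled; no reboot is issued.
            sys.exit(f"timed out, still on {node}: {', '.join(busy)}")
        time.sleep(poll_s)
    # Only reboot once no active HA resources remain on this node.
    subprocess.run(["systemctl", "reboot"], check=True)
```

Run it on the node itself, e.g. drain_and_reboot("sm01a"). As far as I know, maintenance mode persists across the reboot, so disable it afterwards with ha-manager crm-command node-maintenance disable sm01a so the node can take HA resources again.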

Cheers,
Timo
 
I got the same error messages after updating from 9.0.5 to 9.0.6.
The migration status is "completed", but the "Cleanup after stopping VM failed" error appears as above.
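Going by the logs in this thread, the VM state transfer reports "migration status: completed" and only the cleanup step afterwards fails, yet the task still ends with TASK ERROR. When checking many task logs after a node reboot, a small classifier like this hypothetical Python sketch can separate that case from migrations that really failed. It only does substring matching on the log lines quoted here; the function name and the category labels are made up for illustration.

```python
"""Hypothetical helper: triage PVE migration task logs by substring matching."""


def classify_migration_log(log: str) -> str:
    """Return 'completed_with_cleanup_error', 'failed', 'ok' or 'unknown'."""
    completed = "migration status: completed" in log
    cleanup_failed = "Cleanup after stopping VM failed" in log
    task_error = "TASK ERROR" in log
    if completed and cleanup_failed:
        # VM state reached the target; only the cleanup step afterwards failed.
        return "completed_with_cleanup_error"
    if task_error:
        # Any other task error is treated as a real migration failure.
        return "failed"
    if completed:
        return "ok"
    # No completion line and no error line, e.g. a truncated log.
    return "unknown"
```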
 
Same problem after upgrading from 8.x to 9.0.6.
It occurs only on cluster node shutdown or reboot.
 
I experienced the same issue. I applied patches and ran reboot from the command line. Usually the VMs migrate off and the system reboots, but this time there were all sorts of problems, with VM migrations failing and VMs getting rebooted.

It seems we'll have to put hosts into maintenance mode before applying patches in the future.
 
Same issue on a fresh install of PVE 9.0.10. I initiated a node shutdown, and the VM migrated successfully, but with an error:

task started by HA resource agent
2025-09-27 15:45:30 conntrack state migration not supported or disabled, active connections might get dropped
2025-09-27 15:45:30 use dedicated network address for sending migration traffic (10.0.2.74)
2025-09-27 15:45:30 starting migration of VM 2191 to node 'node04' (10.0.2.74)
2025-09-27 15:45:30 starting VM 2191 on remote node 'node04'
2025-09-27 15:45:31 start remote tunnel
2025-09-27 15:45:31 ssh tunnel ver 1
2025-09-27 15:45:31 starting online/live migration on unix:/run/qemu-server/2191.migrate
2025-09-27 15:45:31 set migration capabilities
2025-09-27 15:45:31 migration downtime limit: 100 ms
2025-09-27 15:45:31 migration cachesize: 1.0 GiB
2025-09-27 15:45:31 set migration parameters
2025-09-27 15:45:31 start migrate command to unix:/run/qemu-server/2191.migrate
2025-09-27 15:45:58 average migration speed: 304.2 MiB/s - downtime 45 ms
2025-09-27 15:45:58 migration completed, transferred 7.4 GiB VM-state
2025-09-27 15:45:58 migration status: completed
2025-09-27 15:46:01 ERROR: Cleanup after stopping VM failed - org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2025-09-27 15:46:01 ERROR: migration finished with problems (duration 00:00:31)
TASK ERROR: migration problems

Detailed logs are attached.
 
