Configure migration settings network to share the same network as Ceph

Mar 4, 2026
Hello everyone!

I was wondering if someone could give me a couple of tips on configuring the network used by Migration Settings to migrate VMs from one node to another. We have a cluster with 15 nodes, each configured with 2 networks:
  • Ceph uses a bond of 2 10G physical interfaces in balance-rr
  • Management and VMs use another bond of 2 1G physical interfaces in active-backup (a bridge pointing to the bond, actually)
Right now the Migration Settings use the "Default" network which, from my understanding, is the 1G connection. I was wondering if I could simply switch this to the 10G network, sharing it with Ceph traffic, to have better migration speed available.

The reason I am asking is that, many times, when we try to migrate a VM from one node to another, it fails with something like:

Code:
2026-03-04 16:29:27 xbzrle: send updates to 4310672 pages in 4.5 GiB encoded memory, cache-miss 27.43%, overflow 548054
2026-03-04 16:29:28 migration active, transferred 49.1 GiB of 32.0 GiB VM-state, 222.9 MiB/s
2026-03-04 16:29:28 xbzrle: send updates to 4369514 pages in 4.6 GiB encoded memory, cache-miss 24.91%, overflow 553321
2026-03-04 16:29:28 average migration speed: 73.0 MiB/s - downtime 188 ms
2026-03-04 16:29:28 migration status: completed
2026-03-04 16:29:28 ERROR: tunnel replied 'ERR: resume failed - VM 163 qmp command 'query-status' failed - client closed connection' to command 'resume 163'
VM quit/powerdown failed - terminating now with SIGTERM
2026-03-04 16:29:42 ERROR: migration finished with problems (duration 00:07:46)
TASK ERROR: migration problems

I was thinking this could be improved if a faster network were available during migrations, so the memory gets moved more quickly and with fewer cache misses.

Is the configuration I was thinking of a good idea? Are there any possible issues that could come from this change?
 
Yeah, you can switch the network in the Datacenter menu under Options. Just keep in mind that 2x10G isn't a lot for Ceph; if a migration starts running, you can hit bottlenecks pretty fast. I think there's also an option to limit the migration bandwidth so Ceph gets some air to breathe.
 
The migration failure and the slow migration speed are two separate issues — worth separating them before changing anything.

On the "resume failed / client closed connection" error:

This error occurs after "migration status: completed" in the log — meaning QEMU on the destination received the full VM state successfully. The failure happens when PVE sends the `resume` command to the destination QEMU and the QMP socket is already closed. That means the destination QEMU process exited right at the moment it was supposed to unsuspend the VM.

Switching to a faster network won't prevent this. To find out what actually happened, check the destination node around the migration timestamp:

Bash:
journalctl --since "2026-03-04 16:28:00" --until "2026-03-04 16:31:00" \
  | grep -i "oom\|out of memory\|killed\|segfault\|qemu"
dmesg | grep -i "oom\|killed\|segfault\|qemu"

The crash happened during post-migration device activation, the point where QEMU re-initializes all emulated devices after receiving the migrated state. It could be an OOM kill, a QEMU segfault, or a storage activation failure (e.g. the RBD volume not opening on the destination); the logs will show which. If there's no OOM event, look for a segfault or assertion failure in the QEMU process logs, and please share what you find.
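If the logs point at storage rather than memory, it's worth confirming the destination can actually open the guest's disk. A sketch — the config line and pool/image names below are stand-ins, read the real ones from your VM's config:

```shell
# On a real node you'd start from: qm config 163
# The line below is a stand-in for one of its disk entries.
line='scsi0: ceph-pool:vm-163-disk-0,size=32G'

# Pull out the image name that follows the storage ID.
image=$(echo "$line" | sed -E 's/^[^:]+: [^:]+:([^,]+).*/\1/')
echo "$image"    # vm-163-disk-0

# Then, on the destination node, `rbd info <pool>/<image>` fails fast
# if the image is unreachable from there.
```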

On switching migration to the 10G network:

Yes, this is worth doing independently. With a 32 GiB VM under active load, your migration log shows the VM was dirtying pages faster than the 1G link could flush them ("transferred 49.1 GiB of 32.0 GiB VM-state" — more than the full RAM had to be retransmitted due to re-dirtied pages). A 10G link will substantially reduce the time the VM spends in the final convergence phase.
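To put rough numbers on that (shell arithmetic, ignoring protocol overhead): even at full line rate, the re-dirtied pages alone add minutes on a 1G link.

```shell
# 49.1 GiB transferred for a 32.0 GiB guest => ~17.1 GiB retransmitted
retransmit_mib=$(( 171 * 1024 / 10 ))                   # 17.1 GiB expressed in MiB
link_mib_s=$(( 1000 * 1000 * 1000 / 8 / 1024 / 1024 ))  # 1 Gbit/s in MiB/s (~119)
echo "extra seconds: $(( retransmit_mib / link_mib_s ))"   # extra seconds: 147
```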

In the GUI: Datacenter → Options → Migration Settings → Network. The field takes a CIDR — enter the IP range of your Ceph network (e.g. `10.0.2.0/24`). PVE resolves which local IP on each node belongs to that subnet and uses it as the migration endpoint. Or directly in `/etc/pve/datacenter.cfg`:

INI:
migration: network=10.0.2.0/24

(replace with the actual CIDR of your Ceph network).
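Once set, each node needs exactly one address inside that CIDR for PVE to resolve the migration endpoint. A quick sanity check — the `ip` output below is a stand-in sample; on a real node pipe `ip -4 -o addr show` directly:

```shell
# Stand-in for `ip -4 -o addr show` output on one node
sample='1: lo    inet 127.0.0.1/8 scope host lo
4: bond1    inet 10.0.2.15/24 brd 10.0.2.255 scope global bond1'

# Count addresses inside the assumed Ceph subnet 10.0.2.0/24 -- should be 1
count=$(echo "$sample" | awk '{print $4}' | grep -c '^10\.0\.2\.')
echo "$count"
```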

On bandwidth sharing with Ceph:

@lucavornheder is right to flag this. With 15 nodes actively replicating data, unthrottled migration on the same bond can spike latency for Ceph I/O. Set a cluster-wide default in Datacenter → Options → Bandwidth Limits → Migration, or pass `--bwlimit` per migration (unit is KiB/s):

Bash:
qm migrate <vmid> <target-node> --bwlimit 610351   # ~5 Gbit/s (bwlimit is in KiB/s)
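In case it helps, the KiB/s value for a given link share can be derived like this (pick whatever Gbit/s share you want to leave headroom for Ceph):

```shell
gbps=5                                                # share to allow for migration
bwlimit_kib=$(( gbps * 1000 * 1000 * 1000 / 8 / 1024 ))
echo "$bwlimit_kib"                                   # 610351 for 5 Gbit/s
```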