I am experiencing a recurring issue with online (live) migration of a VM in a Proxmox VE cluster.
The migration starts normally and progresses as expected for most of the process, but it consistently fails during the final completion phase, even after Proxmox automatically increases the allowed downtime.
Below is a summary of the behavior and the relevant logs.
- Environment:
Ceph shared storage
Online / live migration
VM with 16 GiB RAM
Dedicated migration network (high throughput observed)
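For reference, the dedicated migration network is selected via the migration setting in /etc/pve/datacenter.cfg, along these lines (the subnet below is only a placeholder, not the real one):
Code:
# /etc/pve/datacenter.cfg (excerpt, placeholder subnet)
migration: secure,network=10.10.10.0/24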
- Observed behavior:
Live migration starts correctly.
Memory state transfer progresses normally up to about 14.9 GiB / 16.0 GiB.
Transfer rates are variable but generally high (peaks close to 900 MiB/s).
Near the end of the migration, the transfer stalls at 14.9 GiB with 0.0 B/s throughput.
Proxmox automatically increases the allowed downtime multiple times:
100 ms → 200 ms → 400 ms → … → up to 204800 ms
Despite this, the migration never completes.
- Final error:
Code:
migration status error: failed - Error in migration completion: Bad address
ERROR: online migrate failure - aborting
ERROR: migration finished with problems
- Additional message seen at the beginning:
Code:
conntrack state migration not supported or disabled, active connections might get dropped
(The migration continues despite this warning.)
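In case it matters for the diagnosis, this is roughly how I would verify that connection tracking is actually in use on the source node (assuming the conntrack-tools package is installed):
Code:
# Number of connections currently tracked by the kernel
sysctl net.netfilter.nf_conntrack_count
# Sample of tracked entries (from conntrack-tools)
conntrack -L | head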
- Result:
Cleanup is triggered.
The VM remains on the source node.
The issue is reproducible for this VM.
- Open questions:
Could this be related to Ceph, the migration network, or kernel-level networking (conntrack)?
Are there recommended tunings or workarounds (migration cache size, downtime limits, precopy/postcopy, disabling conntrack, etc.)?
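If it helps, I can attach to the QEMU monitor of the VM on the source node while the transfer is stalled and capture the migration state; the tunings listed below are the ones I had in mind, but I am not sure which of them are supported or sensible when the migration is driven by Proxmox (the VMID and values are only examples):
Code:
# On the source node: open the QEMU monitor of the VM (example VMID)
qm monitor 100

# Inside the monitor: current migration status, parameters and capabilities
info migrate
info migrate_parameters
info migrate_capabilities

# Tunings I was considering (not applied yet):
migrate_set_parameter downtime-limit 2000    # allowed downtime in ms
migrate_set_capability postcopy-ram on       # followed by migrate_start_postcopy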
Thanks in advance for any hints.