Proxmox 9.1.2 - live migration crashes

pchecinski

New Member
Dec 12, 2025
Hello, after updating my Proxmox cluster to the latest version, live migration has become less stable and crashes randomly. Here are the logs from a crash that happened while migrating between two servers with identical Intel CPUs:

Proxmox log:
Code:
mirror-scsi0: transferred 56.0 GiB of 56.0 GiB (100.00%) in 6m 11s, ready
all 'mirror' jobs are ready
2025-12-13 10:24:43 switching mirror jobs to actively synced mode
mirror-scsi0: switching to actively synced mode
mirror-scsi0: successfully switched to actively synced mode
2025-12-13 10:24:44 starting online/live migration on unix:/run/qemu-server/21069.migrate
2025-12-13 10:24:44 set migration capabilities
2025-12-13 10:24:44 migration downtime limit: 100 ms
2025-12-13 10:24:44 migration cachesize: 512.0 MiB
2025-12-13 10:24:44 set migration parameters
2025-12-13 10:24:44 start migrate command to unix:/run/qemu-server/21069.migrate
2025-12-13 10:24:45 migration active, transferred 112.9 MiB of 4.0 GiB VM-state, 118.1 MiB/s
2025-12-13 10:24:46 migration active, transferred 225.2 MiB of 4.0 GiB VM-state, 137.5 MiB/s
2025-12-13 10:24:47 migration active, transferred 336.9 MiB of 4.0 GiB VM-state, 122.8 MiB/s
2025-12-13 10:24:48 migration active, transferred 437.7 MiB of 4.0 GiB VM-state, 120.4 MiB/s
2025-12-13 10:24:49 migration active, transferred 549.8 MiB of 4.0 GiB VM-state, 114.8 MiB/s
2025-12-13 10:24:50 migration active, transferred 662.3 MiB of 4.0 GiB VM-state, 113.7 MiB/s
2025-12-13 10:24:51 migration active, transferred 773.5 MiB of 4.0 GiB VM-state, 132.0 MiB/s
2025-12-13 10:24:52 migration active, transferred 884.8 MiB of 4.0 GiB VM-state, 111.2 MiB/s
2025-12-13 10:24:53 migration active, transferred 996.4 MiB of 4.0 GiB VM-state, 109.9 MiB/s
2025-12-13 10:24:54 migration active, transferred 1.1 GiB of 4.0 GiB VM-state, 97.3 MiB/s
2025-12-13 10:24:55 migration active, transferred 1.2 GiB of 4.0 GiB VM-state, 118.4 MiB/s
2025-12-13 10:24:56 migration active, transferred 1.3 GiB of 4.0 GiB VM-state, 117.0 MiB/s
2025-12-13 10:24:57 migration active, transferred 1.4 GiB of 4.0 GiB VM-state, 118.2 MiB/s
2025-12-13 10:24:58 migration active, transferred 1.5 GiB of 4.0 GiB VM-state, 122.7 MiB/s
2025-12-13 10:24:59 migration active, transferred 1.6 GiB of 4.0 GiB VM-state, 131.9 MiB/s
2025-12-13 10:25:00 migration active, transferred 1.7 GiB of 4.0 GiB VM-state, 130.6 MiB/s
2025-12-13 10:25:01 migration active, transferred 1.8 GiB of 4.0 GiB VM-state, 112.0 MiB/s
2025-12-13 10:25:02 migration active, transferred 1.9 GiB of 4.0 GiB VM-state, 120.6 MiB/s
query migrate failed: VM 21069 not running

2025-12-13 10:25:03 query migrate failed: VM 21069 not running
query migrate failed: VM 21069 not running

2025-12-13 10:25:05 query migrate failed: VM 21069 not running
query migrate failed: VM 21069 not running

2025-12-13 10:25:07 query migrate failed: VM 21069 not running
query migrate failed: VM 21069 not running

2025-12-13 10:25:09 query migrate failed: VM 21069 not running
query migrate failed: VM 21069 not running

2025-12-13 10:25:11 query migrate failed: VM 21069 not running
query migrate failed: VM 21069 not running

2025-12-13 10:25:13 query migrate failed: VM 21069 not running
2025-12-13 10:25:13 ERROR: online migrate failure - too many query migrate failures - aborting
2025-12-13 10:25:13 aborting phase 2 - cleanup resources
2025-12-13 10:25:13 migrate_cancel
2025-12-13 10:25:13 migrate_cancel error: VM 21069 not running
2025-12-13 10:25:13 ERROR: query-status error: VM 21069 not running
mirror-scsi0: Cancelling block job
2025-12-13 10:25:13 ERROR: VM 21069 not running
2025-12-13 10:25:17 ERROR: migration finished with problems (duration 00:06:52)
TASK ERROR: migration problems

Journal on source:
Code:
Dec 13 10:25:02 HOST QEMU[4706]: kvm: ../util/bitmap.c:167: bitmap_set: Assertion `start >= 0 && nr >= 0' failed.
Dec 13 10:25:03 HOST pvedaemon[1576635]: VM 21069 qmp command failed - VM 21069 not running
Dec 13 10:25:03 HOST pvedaemon[1576635]: query migrate failed: VM 21069 not running
Dec 13 10:25:03 HOST kernel: vmbr0: port 17(tap21069i0) entered disabled state
Dec 13 10:25:03 HOST kernel: tap21069i0 (unregistering): left allmulticast mode
Dec 13 10:25:03 HOST kernel: vmbr0: port 17(tap21069i0) entered disabled state
Dec 13 10:25:03 HOST kernel:  zd48: p1
Dec 13 10:25:03 HOST kernel: vmbr1: port 17(tap21069i1) entered disabled state
Dec 13 10:25:03 HOST kernel: tap21069i1 (unregistering): left allmulticast mode
Dec 13 10:25:03 HOST kernel: vmbr1: port 17(tap21069i1) entered disabled state

Journal on target:
Code:
Dec 13 10:25:02 HOST2 QEMU[1595041]: kvm: error while loading state section id 1(ram)
Dec 13 10:25:02 HOST2 QEMU[1595041]: kvm: load of migration failed: Input/output error

The migrated virtual machine is running AlmaLinux 8 with hotpluggable memory; the migration is initiated normally via the GUI (bulk action). When migrating a large number of VMs, the crash happens in about 10% of cases; the other VMs migrate normally.
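
For single-VM tests, I assume the CLI equivalent of the GUI online migration would be roughly the following (the disk is on local-zfs, so --with-local-disks is needed; DEST_NODE is a placeholder):
Code:
# hypothetical single-VM test, equivalent to the GUI online migration with local disks
qm migrate 21069 DEST_NODE --online --with-local-disks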

Is this a known issue with this Proxmox version? Should I consider downgrading, or using an older kernel or QEMU? I can provide more logs if that helps resolve the issue.
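
If downgrading turns out to be the way to go, I assume I could pin the previously installed 6.8 kernel and check which older QEMU builds are still available with something like this (a sketch, not tested yet):
Code:
# list installed kernels and pin the older 6.8 one for the next boots, then reboot the node
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.8.12-15-pve
proxmox-boot-tool refresh
# check which pve-qemu-kvm versions are still available before trying a downgrade
apt-cache policy pve-qemu-kvm
# apt install pve-qemu-kvm=<older version>   (placeholder; exact version to be decided)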
 
For more context, which may help pinpoint the issue:

Code:
proxmox-ve: 9.1.0 (running kernel: 6.17.2-2-pve)
pve-manager: 9.1.2 (running version: 9.1.2/9d436f37a0ac4172)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17.4-1-pve-signed: 6.17.4-1
proxmox-kernel-6.17: 6.17.4-1
proxmox-kernel-6.17.2-2-pve-signed: 6.17.2-2
proxmox-kernel-6.8: 6.8.12-15
proxmox-kernel-6.8.12-15-pve-signed: 6.8.12-15
proxmox-kernel-6.8.12-13-pve-signed: 6.8.12-13
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph-fuse: 19.2.3-pve1
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.4.1-1+pve1
ifupdown2: 3.3.0-1+pmx11
intel-microcode: 3.20250812.1~deb13u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.5
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.0.7
libpve-cluster-perl: 9.0.7
libpve-common-perl: 9.1.1
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.2.4
libpve-rs-perl: 0.11.4
libpve-storage-perl: 9.1.0
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-3
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.1.0-1
proxmox-backup-file-restore: 4.1.0-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.1
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.5
pve-cluster: 9.0.7
pve-container: 6.0.18
pve-docs: 9.1.1
pve-edk2-firmware: 4.2025.05-2
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.17-2
pve-ha-manager: 5.0.8
pve-i18n: 3.6.6
pve-qemu-kvm: 10.1.2-4
pve-xtermjs: 5.5.0-3
qemu-server: 9.1.2
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve3
vncterm: 1.9.1
zfsutils-linux: 2.3.4-pve1

Here is the config of a VM that always crashes during live migration:
Code:
agent: 1,freeze-fs-on-backup=0
boot: order=scsi0;net0
cores: 20
cpu: x86-64-v2
hotplug: disk,network,usb,memory,cpu
memory: 16384
meta: creation-qemu=9.2.0,ctime=1750068908
name: [redacted]
net0: virtio=BC:24:11:A8:DE:08,bridge=vmbr0
net1: virtio=BC:24:11:3D:96:0E,bridge=vmbr1
numa: 1
onboot: 1
ostype: l26
scsi0: local-zfs:vm-12882-disk-0,discard=on,format=raw,iothread=1,size=196G
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=[redacted]
sockets: 2
vcpus: 8
vmgenid: [redacted]

Memory-related details: CPU/memory hot-plug is supported by the guest, so that shouldn't be the issue:
Code:
root@[redacted] ~ # cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-4.18.0-553.el8_10.x86_64 root=UUID=[redacted] ro console=tty0 console=ttyS0,115200 memhp_default_state=online memory_hotplug.online_policy=auto-movable movable_node crashkernel=auto rhgb quiet
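
To double-check from inside the guest that the hot-plugged memory actually comes up online, I would look at something like this (lsmem is part of util-linux):
Code:
# overview of the guest's memory ranges and how much of them is online
lsmem
# per-block state; hot-plugged blocks should read "online" or "online_movable"
grep -H . /sys/devices/system/memory/memory*/state | head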

The problem occurred with qemu-server 9.1.1 and still happens after the update to qemu-server 9.1.2.
 
Another attempt with the latest kernel and updates:
Upgrade: qemu-server:amd64 (9.1.2, 9.1.3), libpve-common-perl:amd64 (9.1.1, 9.1.3), libcares2:amd64 (1.34.5-1, 1.34.5-1+deb13u1)

Still having the same issue:
Code:
proxmox log:
2025-12-19 22:14:22 migration active, transferred 5.0 GiB of 16.0 GiB VM-state, 148.7 MiB/s
query migrate failed: VM 12882 not running

journal:
QEMU[3125]: kvm: ../util/bitmap.c:167: bitmap_set: Assertion `start >= 0 && nr >= 0' failed.
pvedaemon[4521]: VM 12882 qmp command failed - VM 12882 not running

I'm still looking for suggestions on how to get more detailed logs.
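
One idea for getting more detail than the single assertion line is capturing a core dump of the crashing kvm process on the source node; I assume something along these lines is the right approach (debug symbols for pve-qemu-kvm would be needed for a readable backtrace):
Code:
# install core dump handling (not installed by default here) plus gdb
apt install systemd-coredump gdb
# reproduce the crash, then inspect what was captured for the kvm binary
coredumpctl list kvm
coredumpctl info kvm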
 
Migrate the VM offline. Does it start? It sounds like the destination server might not have VT-x/AMD-V enabled.
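
For reference, a quick check for that on the destination node would be something like this (a non-zero count means the virtualization extensions are visible to the OS, and the kvm modules should be loaded):
Code:
grep -c -E 'vmx|svm' /proc/cpuinfo
lsmod | grep kvm
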
I tried to live migrate a clone of the VM that has issues, and this time the live migration worked. Here are the logs:
Code:
2025-12-20 10:25:03 migration active, transferred 5.0 GiB of 16.0 GiB VM-state, 4.7 GiB/s
2025-12-20 10:25:05 migration active, transferred 5.2 GiB of 16.0 GiB VM-state, 114.4 MiB/s
2025-12-20 10:25:05 average migration speed: 334.7 MiB/s - downtime 242 ms
2025-12-20 10:25:05 migration completed, transferred 5.2 GiB VM-state
2025-12-20 10:25:05 migration status: completed
all 'mirror' jobs are ready
mirror-scsi0: Completing block job...
mirror-scsi0: Completed successfully.
mirror-scsi0: mirror-job finished
2025-12-20 10:25:08 stopping NBD storage migration server on target.
2025-12-20 10:25:09 stopping migration dbus-vmstate helpers
2025-12-20 10:25:09 migrated 0 conntrack state entries
2025-12-20 10:25:11 flushing conntrack state for guest on source node
2025-12-20 10:25:17 migration finished successfully (duration 00:30:40)

The source VM has higher memory usage because production services are running on it; other than that, it was identical.
VM 12882 (one of the VMs that fails to live migrate): "Host memory usage 15.78 GiB"; the clone: "Host memory usage 5.50 GiB".

It seems the issue is more likely to occur on VMs with high memory usage.
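
One thing I plan to test on a clone is whether the crash still happens with memory/CPU hotplug removed from the hotplug list; I assume something like this, applied before starting the clone, would do it:
Code:
# hypothetical test on a clone (VMID 12883 is a placeholder): drop memory/cpu hotplug
qm set 12883 --hotplug disk,network,usb
# then start the clone and retry the live migration
qm migrate 12883 DEST_NODE --online --with-local-disks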
 
"QEMU[3125]: kvm: ../util/bitmap.c:167: bitmap_set: Assertion `start >= 0 && nr >= 0' failed." --> could be related to backup bitmap tracking. Maybe try to do a backup, then live migrate, and see if the crash is occuring
 
"QEMU[3125]: kvm: ../util/bitmap.c:167: bitmap_set: Assertion `start >= 0 && nr >= 0' failed." --> could be related to backup bitmap tracking. Maybe try to do a backup, then live migrate, and see if the crash is occuring

Backup to a PBS server works correctly:
Code:
INFO: starting new backup job: vzdump 12882 --notification-mode notification-system --storage STORAGE --mode snapshot --node SRCNODE --notes-template '{{guestname}}' --remove 0
INFO: Starting Backup of VM 12882 (qemu)
INFO: Backup started at 2025-12-20 17:29:11
INFO: status = running
INFO: VM Name: HOSTNAME
INFO: include disk 'scsi0' 'local-zfs:vm-12882-disk-0' 196G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/12882/2025-12-20T16:29:11Z'
INFO: skipping guest-agent 'fs-freeze', disabled in VM options
INFO: started backup task '2e5c2edd-b7db-4080-b489-05b504370f18'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (37.7 GiB of 196.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 37.7 GiB dirty of 196.0 GiB total
INFO:   0% (256.0 MiB of 37.7 GiB) in 3s, read: 85.3 MiB/s, write: 85.3 MiB/s
...
INFO: 100% (37.7 GiB of 37.7 GiB) in 7m 32s, read: 70.4 MiB/s, write: 66.4 MiB/s
INFO: Waiting for server to finish backup validation...
INFO: backup was done incrementally, reused 159.70 GiB (81%)
INFO: transferred 37.66 GiB in 454 seconds (85.0 MiB/s)
INFO: adding notes to backup
INFO: Finished Backup of VM 12882 (00:07:34)
INFO: Backup finished at 2025-12-20 17:36:45
INFO: Backup job finished successfully
INFO: notified via target `mail-to-root`
TASK OK

Migration still crashed:
Code:
2025-12-20 17:41:12 starting migration of VM 12882 to node 'DEST_NODE' (IP)
2025-12-20 17:41:12 found local disk 'local-zfs:vm-12882-disk-0' (attached)
2025-12-20 17:41:12 starting VM 12882 on remote node 'DEST_NODE'
2025-12-20 17:41:16 volume 'local-zfs:vm-12882-disk-0' is 'local-zfs:vm-12882-disk-0' on the target
2025-12-20 17:41:16 start remote tunnel
2025-12-20 17:41:17 ssh tunnel ver 1
2025-12-20 17:41:17 starting storage migration
2025-12-20 17:41:17 scsi0: start migration to nbd:unix:/run/qemu-server/12882_nbd.migrate:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
mirror-scsi0: transferred 0.0 B of 196.0 GiB (0.00%) in 0s
...
mirror-scsi0: transferred 196.3 GiB of 196.3 GiB (100.00%) in 29m 33s, ready
all 'mirror' jobs are ready
2025-12-20 18:10:50 switching mirror jobs to actively synced mode
mirror-scsi0: switching to actively synced mode
mirror-scsi0: successfully switched to actively synced mode
2025-12-20 18:10:51 starting online/live migration on unix:/run/qemu-server/12882.migrate
2025-12-20 18:10:51 set migration capabilities
2025-12-20 18:10:51 migration downtime limit: 100 ms
2025-12-20 18:10:51 migration cachesize: 2.0 GiB
2025-12-20 18:10:51 set migration parameters
2025-12-20 18:10:51 start migrate command to unix:/run/qemu-server/12882.migrate
2025-12-20 18:10:52 migration active, transferred 112.5 MiB of 16.0 GiB VM-state, 112.2 MiB/s
...
2025-12-20 18:12:46 migration active, transferred 12.5 GiB of 16.0 GiB VM-state, 113.7 MiB/s
query migrate failed: VM 12882 not running

2025-12-20 18:12:47 query migrate failed: VM 12882 not running
query migrate failed: VM 12882 not running

2025-12-20 18:12:49 query migrate failed: VM 12882 not running
query migrate failed: VM 12882 not running

2025-12-20 18:12:51 query migrate failed: VM 12882 not running
query migrate failed: VM 12882 not running

2025-12-20 18:12:53 query migrate failed: VM 12882 not running
query migrate failed: VM 12882 not running

2025-12-20 18:12:55 query migrate failed: VM 12882 not running
query migrate failed: VM 12882 not running

2025-12-20 18:12:57 query migrate failed: VM 12882 not running
2025-12-20 18:12:57 ERROR: online migrate failure - too many query migrate failures - aborting
2025-12-20 18:12:57 aborting phase 2 - cleanup resources
2025-12-20 18:12:57 migrate_cancel
2025-12-20 18:12:57 migrate_cancel error: VM 12882 not running
2025-12-20 18:12:57 ERROR: query-status error: VM 12882 not running
mirror-scsi0: Cancelling block job
2025-12-20 18:12:57 ERROR: VM 12882 not running
2025-12-20 18:13:01 ERROR: migration finished with problems (duration 00:31:51)
TASK ERROR: migration problems