Live migration occasionally fails with QEMU assertion failure

stuartthebruce

New Member
Dec 6, 2025
Testing a new PVE 9 cluster with high-speed (2x400G LACP) network connections has shown that live migration of large VMs occasionally fails with:

QEMU[949247]: kvm: ../util/bitmap.c:167: bitmap_set: Assertion `start >= 0 && nr >= 0' failed.

This was initially thought to be related to the PBS network problem with recent kernels, since the frequency of this failure does appear to depend on the kernel version: no successful migrations were obtained with 6.14.0-{1,2}-pve, and there was a one-out-of-a-few failure rate with older or newer kernels. However, as @fiona indicated in https://forum.proxmox.com/threads/s...o-pve-9-1-1-and-pbs-4-0-20.176444/post-823442, this **QEMU** assertion failure should not be triggered by those kernel TCP receive buffer problems. Here is additional information from an example VM (1TB memory/2TB local storage) failure between hosts hov1 and hov2. Note: in all of the tests I have run so far there has never been a failure while migrating the 2TB local storage, only during the 1TB memory migration, so I believe the network is stable.

VM configuration:
Code:
root@hov1:~# qm config 102
allow-ksm: 0
balloon: 0
boot: order=scsi0;ide2;net0
cores: 96
cpu: host
hotplug: disk,network,usb,cpu
ide2: none,media=cdrom
memory: 1048576
meta: creation-qemu=10.0.2,ctime=1761354940
name: node2412.cluster.ldas.cit
net0: virtio=BC:24:11:D3:10:A8,bridge=vmbr0,queues=32
numa: 1
ostype: l26
rng0: source=/dev/urandom
scsi0: local-zfs:vm-102-disk-0,format=raw,iothread=1,size=2T
scsihw: virtio-scsi-single
smbios1: uuid=20721900-0449-43a2-aec7-41c44ce7a68d
sockets: 1
vcpus: 96
vmgenid: 6b1b9d99-097c-4b3f-a290-f7d79b89160e

Code:
root@hov1:~# pveversion -v
proxmox-ve: 9.1.0 (running kernel: 6.17.11-2-test-pve)
pve-manager: 9.1.2 (running version: 9.1.2/9d436f37a0ac4172)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17.11-2-test-pve: 6.17.11-2
proxmox-kernel-6.17.11-1-test-pve: 6.17.11-1
proxmox-kernel-6.17.4-1-pve-signed: 6.17.4-1
proxmox-kernel-6.17: 6.17.4-1
proxmox-kernel-6.17.2-2-pve-signed: 6.17.2-2
proxmox-kernel-6.17.2-1-pve-signed: 6.17.2-1
proxmox-kernel-6.14: 6.14.11-4
proxmox-kernel-6.14.11-4-pve: 6.14.11-4
proxmox-kernel-6.14.8-2-pve-signed: 6.14.8-2
amd64-microcode: 3.20250311.1
ceph: 19.2.3-pve2
ceph-fuse: 19.2.3-pve2
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.4.1-1+pve1
ifupdown2: 3.3.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: not correctly installed
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.4
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.0.7
libpve-cluster-perl: 9.0.7
libpve-common-perl: 9.1.0
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.2.3
libpve-rs-perl: 0.11.3
libpve-storage-perl: 9.1.0
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-3
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.1.0-1
proxmox-backup-file-restore: 4.1.0-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.1
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.2
pve-cluster: 9.0.7
pve-container: 6.0.18
pve-docs: 9.1.1
pve-edk2-firmware: 4.2025.05-2
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.17-2
pve-ha-manager: 5.0.8
pve-i18n: 3.6.5
pve-qemu-kvm: 10.1.2-4
pve-xtermjs: 5.5.0-3
qemu-server: 9.1.1
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve3
vncterm: 1.9.1
zfsutils-linux: 2.3.4-pve1

Code:
root@hov2:~# pveversion -v
proxmox-ve: 9.1.0 (running kernel: 6.17.11-2-test-pve)
pve-manager: 9.1.2 (running version: 9.1.2/9d436f37a0ac4172)
proxmox-kernel-helper: 9.0.4
proxmox-kernel-6.17.11-2-test-pve: 6.17.11-2
proxmox-kernel-6.17.11-1-test-pve: 6.17.11-1
proxmox-kernel-6.17.4-1-pve-signed: 6.17.4-1
proxmox-kernel-6.17: 6.17.4-1
proxmox-kernel-6.17.2-2-pve-signed: 6.17.2-2
proxmox-kernel-6.17.2-1-pve-signed: 6.17.2-1
proxmox-kernel-6.14.11-4-pve-signed: 6.14.11-4
proxmox-kernel-6.14: 6.14.11-4
proxmox-kernel-6.14.8-2-pve-signed: 6.14.8-2
amd64-microcode: 3.20250311.1
ceph: 19.2.3-pve2
ceph-fuse: 19.2.3-pve2
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.4.1-1+pve1
ifupdown2: 3.3.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: not correctly installed
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.4
libpve-apiclient-perl: 3.4.2
libpve-cluster-api-perl: 9.0.7
libpve-cluster-perl: 9.0.7
libpve-common-perl: 9.1.0
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.5
libpve-network-perl: 1.2.3
libpve-rs-perl: 0.11.3
libpve-storage-perl: 9.1.0
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2+pmx1
lxc-pve: 6.0.5-3
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.1.0-1
proxmox-backup-file-restore: 4.1.0-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.2.1
proxmox-kernel-helper: 9.0.4
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.3
proxmox-widget-toolkit: 5.1.2
pve-cluster: 9.0.7
pve-container: 6.0.18
pve-docs: 9.1.1
pve-edk2-firmware: 4.2025.05-2
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.4
pve-firmware: 3.17-2
pve-ha-manager: 5.0.8
pve-i18n: 3.6.5
pve-qemu-kvm: 10.1.2-4
pve-xtermjs: 5.5.0-3
qemu-server: 9.1.1
smartmontools: 7.4-pve1
spiceterm: 3.4.1
swtpm: 0.8.0+pve3
vncterm: 1.9.1
zfsutils-linux: 2.3.4-pve1

Failed migration task log and journalctl logs from both nodes around the time of the failure are attached.
 

Attachments

I have isolated this QEMU assertion failure to live migration of Linux VMs that dynamically provision filesystems with LVM on a local HV zpool during migration (a sketch of the kind of in-guest LVM churn involved is below). Is this a known limitation of QEMU? Note: if I move the LVM to shared Ceph storage then there is no problem; and I have also not seen any failures for Linux VMs that are not actively creating/destroying LVs on local HV storage during migration.
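For illustration, the workload inside the guest amounts to a loop that keeps creating and tearing down LVs and filesystems while the migration is running. A minimal sketch of that kind of churn (the VG name vg0, the LV size, and the mount point are placeholders, not my exact setup):
Code:
#!/bin/bash
# Hypothetical in-guest churn loop: repeatedly create an LV, put a filesystem
# on it, mount it briefly, then tear everything down again.
# VG name, LV size and mount point are illustrative only.
set -e
mkdir -p /mnt/scratch
while true; do
    lvcreate -y -L 10G -n scratch vg0
    mkfs.ext4 -q /dev/vg0/scratch
    mount /dev/vg0/scratch /mnt/scratch
    umount /mnt/scratch
    lvremove -y /dev/vg0/scratch
done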
 
Hi,
this bug is not known and I haven't seen other people reporting it yet. Just to make sure I understand your setup correctly: the virtual disk for the VM was residing on ZFS/Ceph storage on the host and inside the VM/guest, LVM is used and frequently creates and destroys LVs? Can you also share the ZFS storage part of /etc/pve/storage.cfg?

Does the failure always occur during the migration of the VM state/RAM? I.e. after the initial disk mirror has already finished and all 'mirror' jobs are ready?

Do you have a dummy VM to reproduce the issue (e.g. a clone of an existing VM)? The error message is unfortunately not really telling. It would be possible to get a full backtrace with GDB. Run apt install pve-qemu-kvm-dbgsym gdb to install the debugger and debug symbols, and attach GDB before the migration with
Code:
gdb --ex 'set pagination off' --ex 'handle SIGUSR1 noprint nostop' --ex 'handle SIGPIPE noprint nostop' --ex 'c' -p $(cat /run/qemu-server/<ID>.pid)
replacing <ID> with the ID of the VM. Should it crash, you will see the error in the GDB interface and can run t a a bt (thread apply all bt) there to get the backtrace.
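If it is easier to capture the output in a file, the backtrace can also be logged from within GDB, e.g. (a rough sketch; newer GDB versions prefer set logging enabled on over set logging on):
Code:
(gdb) set logging file /tmp/vm-backtrace.txt
(gdb) set logging on
(gdb) thread apply all bt
(gdb) set logging off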

And just to be sure (since you are currently the only reporter), could you do a memtest during a future maintenance window?

EDIT: forgot to add GDB command for obtaining the backtrace
 
Hi,
this bug is not known and I haven't seen other people reporting it yet.

Good to know. Do you also know if upstream QEMU claims to support dynamic LVM creation/deletion during live migration?

Just to make sure I understand your setup correctly: the virtual disk for the VM was residing on ZFS/Ceph storage on the host and inside the VM/guest, LVM is used and frequently creates and destroys LVs?

The VM/guest disk image is either stored on the HV as a local ZFS zvol (exhibits the problem) or on Ceph RBD (avoids the problem); and the problem with ZFS only occurs if the VM/guest is actively creating/destroying LVs during live migration.

Can you also share the ZFS storage part of /etc/pve/storage.cfg?
Code:
root@hov1:~# cat /etc/pve/storage.cfg
dir: local
    path /var/lib/vz
    content iso,vztmpl,backup

zfspool: local-zfs
    pool rpool/data
    content rootdir,images
    sparse 1

pbs: pbs-hov
    datastore hov
    server pbs.ldas.cit
    content backup
    fingerprint e2:9b:2b:27:cc:2d:63:41:20:c8:03:0e:9b:48:ec:6d:15:b9:d8:fd:e2:7d:6e:a0:a5:de:c5:25:e9:af:2b:29
    prune-backups keep-all=1
    username root@pam

zfspool: optane-zfs
    pool optane-zfs
    content images,rootdir
    mountpoint /optane-zfs
    sparse 0

rbd: ceph-rbd
    content images,rootdir
    krbd 1
    pool ceph-rbd

rbd: optane-rbd
    content images,rootdir
    krbd 1
    pool optane-rbd

zfspool: micron-zfs
    pool micron-zfs
    content rootdir,images
    mountpoint /micron-zfs
    nodes hov4,hov1,hov5,hov3,hov2

Does the failure always occur during the migration of the VM state/RAM?

Yes.

I.e. after the initial disk mirror has already finished and all 'mirror' jobs are ready?

Yes.

Do you have a dummy VM to reproduce the issue (e.g. a clone of an existing VM)?

Yes. I can reproduce this at will with any one of the 5 different VMs that have shown the problem so far. And I am willing to reconfigure those VMs by changing BIOS/Machine/SCSI Controller/... to help further isolate this problem as needed.

The error message is unfortunately not really telling. It would be possible to get a full backtrace with GDB. Run apt install pve-qemu-kvm-dbgsym gdb to install the debugger and debug symbols, and attach GDB before the migration with
Code:
gdb --ex 'set pagination off' --ex 'handle SIGUSR1 noprint nostop' --ex 'handle SIGPIPE noprint nostop' --ex 'c' -p $(cat /run/qemu-server/<ID>.pid)
replacing <ID> with the ID of the VM. Should it crash, you will see the error in the GDB interface and can run t a a bt (thread apply all bt) there to get the backtrace.

I will attempt this later today.

And just to be sure (since you are currently the only reporter), could you do a memtest during a future maintenance window?

I have reproduced this problem on 5 out of 5 new enterprise-class servers, all of which have ECC memory, without any single-bit flips being reported as detected/corrected under heavy memory stress testing using /usr/bin/stress (an example invocation is below).
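For reference, the stress testing was of this general form; worker counts, per-worker allocation size, and duration here are illustrative, not the exact parameters I used:
Code:
# Illustrative invocation: CPU and memory workers for an hour.
# Worker counts and --vm-bytes are placeholders, not the real values.
stress --cpu 32 --vm 16 --vm-bytes 16g --timeout 1h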

EDIT: forgot to add GDB command for obtaining the backtrace

Thank you for taking the time to follow up on this.
 
Do you have a dummy VM to reproduce the issue (e.g. a clone of an existing VM)? The error message is unfortunately not really telling. It would be possible to get a full backtrace with GDB. Run apt install pve-qemu-kvm-dbgsym gdb to install the debugger and debug symbols, and attach GDB before the migration with
Code:
gdb --ex 'set pagination off' --ex 'handle SIGUSR1 noprint nostop' --ex 'handle SIGPIPE noprint nostop' --ex 'c' -p $(cat /run/qemu-server/<ID>.pid)
replacing <ID> with the ID of the VM. Should it crash, you will see the error in the GDB interface and can run t a a bt (thread apply all bt) there to get the backtrace.
The first attempt under GDB reproduced the failure with the following job output; the GDB backtrace is attached.
Code:
2026-01-05 13:52:02 migration active, transferred 163.9 GiB of 1.0 TiB VM-state, 1.8 GiB/s
2026-01-05 13:52:03 migration active, transferred 165.7 GiB of 1.0 TiB VM-state, 1.8 GiB/s
2026-01-05 13:52:04 migration active, transferred 167.4 GiB of 1.0 TiB VM-state, 1.7 GiB/s
query migrate failed: VM 104 qmp command 'query-migrate' failed - got timeout

2026-01-05 14:52:05 query migrate failed: VM 104 qmp command 'query-migrate' failed - got timeout
 

Attachments