While I still wasn't able to reproduce the issue myself, I think I found the reason for the crash now: https://lore.kernel.org/qemu-devel/20250108124649.333668-1-f.ebner@proxmox.com/T/#u
2025-01-14 14:31:45 set migration parameters
2025-01-14 14:31:45 start migrate command to unix:/run/qemu-server/102.migrate
2025-01-14 14:31:47 migration active, transferred 107.2 MiB of 1.5 GiB VM-state, 161.4 MiB/s
2025-01-14 14:31:48 migration active, transferred 234.3 MiB of 1.5 GiB VM-state, 695.3 MiB/s
2025-01-14 14:31:49 migration active, transferred 329.5 MiB of 1.5 GiB VM-state, 878.6 MiB/s
2025-01-14 14:31:50 migration active, transferred 500.7 MiB of 1.5 GiB VM-state, 34.5 MiB/s
query migrate failed: VM 102 qmp command 'query-migrate' failed - client closed connection
2025-01-14 14:32:04 query migrate failed: VM 102 qmp command 'query-migrate' failed - client closed connection
query migrate failed: VM 102 not running
2025-01-14 14:32:05 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running
2025-01-14 14:32:07 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running
2025-01-14 14:32:08 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running
2025-01-14 14:32:09 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running
2025-01-14 14:32:10 query migrate failed: VM 102 not running
2025-01-14 14:32:10 ERROR: online migrate failure - too many query migrate failures - aborting
2025-01-14 14:32:10 aborting phase 2 - cleanup resources
2025-01-14 14:32:10 migrate_cancel
2025-01-14 14:32:10 migrate_cancel error: VM 102 not running
2025-01-14 14:32:10 ERROR: query-status error: VM 102 not running
2025-01-14 14:32:13 ERROR: migration finished with problems (duration 00:00:34)
TASK ERROR: migration problems
Please check the system logs and coredumps, it still looks like the VM is crashing.

Thanks. I tested but still have problems, and the error is now different (see the task log above). The system journal shows:
2025-01-14T14:31:51.920161+01:00 test-pve-01 kernel: [ 3214.087254] kvm[18733]: segfault at 41b8 ip 00005d1ea425fb00 sp 00007b07488fef38 error 4 in qemu-system-x86_64[5d1ea3d7c000+6a4000] likely on CPU 1 (core 1, socket 0)
2025-01-14T14:32:04.378628+01:00 test-pve-01 systemd-coredump[20239]: Process 18733 (kvm) of user 0 dumped core.

Module libsystemd.so.0 from deb systemd-252.31-1~deb12u1.amd64
Module libudev.so.1 from deb systemd-252.31-1~deb12u1.amd64
Stack trace of thread 18733:
#0  0x00005d1ea425fb00 bdrv_primary_child (qemu-system-x86_64 + 0x80eb00)
#1  0x00005d1ea42886d6 bdrv_co_flush (qemu-system-x86_64 + 0x8376d6)
#2  0x00005d1ea424a882 bdrv_co_flush_entry (qemu-system-x86_64 + 0x7f9882)
#3  0x00005d1ea43cef7b coroutine_trampoline (qemu-system-x86_64 + 0x97df7b)
#4  0x00007b074e0a69c0 n/a (libc.so.6 + 0x519c0)
ELF object binary architecture: AMD x86-64
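For anyone else hitting this: the journal lines and the core dump above can be pulled up after the fact with standard tooling on the source node, for example:

# journal around the crash window (timestamps taken from the task log above)
journalctl --since "2025-01-14 14:31:00" --until "2025-01-14 14:33:00"
# list recent core dumps from the kvm binary
coredumpctl list kvm
# show metadata for the crashed process from the coredump entry above
coredumpctl info 18733
# open the most recent dump in GDB (symbols come from pve-qemu-kvm-dbgsym)
coredumpctl -1 gdb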
2025-01-15T13:49:23.118482+01:00 test-pve-01 kernel: [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
2025-01-15T13:49:23.118483+01:00 test-pve-01 kernel: [ 0.000001] kvm-clock: using sched offset of 16430787326 cycles
2025-01-15T13:49:23.118484+01:00 test-pve-01 kernel: [ 0.000003] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
2025-01-15T13:49:23.118656+01:00 test-pve-01 kernel: [ 0.057410] kvm-guest: APIC: eoi() replaced with kvm_guest_apic_eoi_write()
2025-01-15T13:49:23.118657+01:00 test-pve-01 kernel: [ 0.057428] kvm-guest: KVM setup pv remote TLB flush
2025-01-15T13:49:23.118658+01:00 test-pve-01 kernel: [ 0.057432] kvm-guest: setup PV sched yield
2025-01-15T13:49:23.118715+01:00 test-pve-01 kernel: [ 0.058259] kvm-guest: PV spinlocks enabled
2025-01-15T13:49:23.119040+01:00 test-pve-01 kernel: [ 0.183520] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
2025-01-15T13:49:23.119085+01:00 test-pve-01 kernel: [ 0.183527] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
2025-01-15T13:49:23.119095+01:00 test-pve-01 kernel: [ 0.183530] kvm-guest: setup PV IPIs
2025-01-15T13:49:23.121082+01:00 test-pve-01 kernel: [ 0.377328] clocksource: Switched to clocksource kvm-clock
2025-01-15T14:13:17.746968+01:00 test-pve-01 kernel: [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
2025-01-15T14:13:17.746970+01:00 test-pve-01 kernel: [ 0.000001] kvm-clock: using sched offset of 22372401189 cycles
2025-01-15T14:13:17.746972+01:00 test-pve-01 kernel: [ 0.000004] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
2025-01-15T14:13:17.747239+01:00 test-pve-01 kernel: [ 0.058603] kvm-guest: APIC: eoi() replaced with kvm_guest_apic_eoi_write()
2025-01-15T14:13:17.747241+01:00 test-pve-01 kernel: [ 0.058621] kvm-guest: KVM setup pv remote TLB flush
2025-01-15T14:13:17.747243+01:00 test-pve-01 kernel: [ 0.058626] kvm-guest: setup PV sched yield
2025-01-15T14:13:17.747365+01:00 test-pve-01 kernel: [ 0.059445] kvm-guest: PV spinlocks enabled
2025-01-15T14:13:17.747927+01:00 test-pve-01 kernel: [ 0.185408] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
2025-01-15T14:13:17.747942+01:00 test-pve-01 kernel: [ 0.185415] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
2025-01-15T14:13:17.747946+01:00 test-pve-01 kernel: [ 0.185418] kvm-guest: setup PV IPIs
2025-01-15T14:13:17.753817+01:00 test-pve-01 kernel: [ 0.527399] clocksource: Switched to clocksource kvm-clock
2025-01-15T14:23:58.077113+01:00 test-pve-01 QEMU[5351]: kvm: check_section_footer: Read section footer failed: -5
2025-01-15T14:23:58.085319+01:00 test-pve-01 QEMU[5351]: kvm: load of migration failed: Invalid argument
2025-01-15T14:44:08.717865+01:00 test-pve-01 QEMU[13133]: kvm: check_section_footer: Read section footer failed: -5
2025-01-15T14:44:08.718721+01:00 test-pve-01 QEMU[13133]: kvm: load of migration failed: Invalid argument
2025-01-15T14:48:20.594322+01:00 test-pve-01 systemd-coredump[15088]: Process 13660 (kvm) of user 0 dumped core.

Module libsystemd.so.0 from deb systemd-252.33-1~deb12u1.amd64
Module libudev.so.1 from deb systemd-252.33-1~deb12u1.amd64
Stack trace of thread 13660:
#0  0x0000593961935b00 bdrv_primary_child (qemu-system-x86_64 + 0x80eb00)
#1  0x000059396195e6d6 bdrv_co_flush (qemu-system-x86_64 + 0x8376d6)
#2  0x0000593961920882 bdrv_co_flush_entry (qemu-system-x86_64 + 0x7f9882)
#3  0x0000593961aa4f7b coroutine_trampoline (qemu-system-x86_64 + 0x97df7b)
#4  0x00007cb0fb6a69c0 n/a (libc.so.6 + 0x519c0)
ELF object binary architecture: AMD x86-64
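The GDB script below sets breakpoints on the block-layer functions that appear in the backtraces above, prints a backtrace each time one of them is hit, and then continues automatically, so the VM keeps running while the information is collected: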
# don't stop on signals QEMU uses during normal operation
handle SIGUSR1 noprint nostop
handle SIGPIPE noprint nostop
# break on the block-layer functions seen in the crash backtraces
# (blk_aio_flush is listed twice; the duplicate breakpoint is harmless)
break blk_aio_flush
break bdrv_flush
break bdrv_co_flush
break blk_co_do_flush
break blk_aio_flush
break blk_remove_bs
break bdrv_graph_wrlock
break bdrv_graph_wrunlock
break bdrv_root_unref_child
break blk_drain
break bdrv_replace_child_noperm
# for each of breakpoints 1-11: print a backtrace, then continue automatically
commands 1-11
bt
c
end
# resume the VM after attaching
c
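A sketch of how one might attach this to the source VM before starting the migration; the script file name is made up, and the PID file location assumes the usual Proxmox layout (/run/qemu-server/<vmid>.pid):

# attach GDB to VM 102's QEMU process and load the breakpoint script above
# (/tmp/bdrv-debug.gdb is a hypothetical file name)
gdb -x /tmp/bdrv-debug.gdb -p "$(cat /run/qemu-server/102.pid)"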
I tested the patched 9.0.2 QEMU package and it does seem to resolve my issue. I'm using Proxmox in a homelab, so I am adventurous with the testing repo, and the 9.2.0 QEMU package might have been causing this behavior.
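For narrowing down which QEMU build a given VM is actually running (as opposed to merely installed), something like the following should work; note that the running-qemu field name is from memory, so treat it as an assumption and verify on your system:

# installed pve-qemu-kvm package version
pveversion -v | grep pve-qemu
# QEMU version the running VM was started with (field name assumed)
qm status 102 --verbose | grep running-qemu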
I'm having a similar issue on a three-node cluster running the 6.11 kernel with replicated ZFS storage. I'm just starting to dig into troubleshooting and will look through some of the logging suggestions above. The problem only appeared in the last few days, so it's a bit odd.

Hi, can you please post the exact errors you got (an excerpt from the system logs/journal) as well as the VM configuration for an affected VM? Was the failure also happening during live migration?
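For reference, both of those can be gathered on the source node roughly like this (VM ID and time window need adjusting to your case):

# journal from around the failed migration
journalctl --since "1 hour ago" > /tmp/journal-excerpt.txt
# configuration of the affected VM
qm config 201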
Hi Fiona,

Apologies, here is the VM config from the referenced VM 201 as well:
Yes, my apologies for confusing the original topic... when I have some time I can attempt to recreate the issue and provide logging as requested. The behavior is very similar to what the previous poster described, although I am indeed using replicated ZFS storage.

Okay, so this is ZFS, not Ceph/RBD. Please share the system journal from the source node of the migration around the time the error happened; there should be more details in there.

Should you get around to test with the newer QEMU again, please use
apt install pve-qemu-kvm-dbgsym gdb systemd-coredump
(it doesn't really hurt to have these packages installed in any case, but APT will not like it if the version of the dbgsym package doesn't match the version of the main package, so you'd need to downgrade that too). Then, the next time a crash happens, you can run
coredumpctl -1 gdb
and then, at the GDB prompt,
thread apply all backtrace
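Putting those steps together, the whole flow looks roughly like this (the log file path is only an example):

# install debug symbols and core dump tooling (dbgsym version must match pve-qemu-kvm)
apt install pve-qemu-kvm-dbgsym gdb systemd-coredump
# after the next crash, open the newest core dump in GDB
coredumpctl -1 gdb
# then at the (gdb) prompt:
#   set logging file /tmp/kvm-backtrace.txt
#   set logging enabled on    (use "set logging on" on older GDB versions)
#   thread apply all backtrace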