Proxmox VE 8.3: live migration problems

Thanks.

I tested, but I still have problems. The error is now different, though:
Code:
2025-01-14 14:31:45 set migration parameters
2025-01-14 14:31:45 start migrate command to unix:/run/qemu-server/102.migrate
2025-01-14 14:31:47 migration active, transferred 107.2 MiB of 1.5 GiB VM-state, 161.4 MiB/s
2025-01-14 14:31:48 migration active, transferred 234.3 MiB of 1.5 GiB VM-state, 695.3 MiB/s
2025-01-14 14:31:49 migration active, transferred 329.5 MiB of 1.5 GiB VM-state, 878.6 MiB/s
2025-01-14 14:31:50 migration active, transferred 500.7 MiB of 1.5 GiB VM-state, 34.5 MiB/s
query migrate failed: VM 102 qmp command 'query-migrate' failed - client closed connection

2025-01-14 14:32:04 query migrate failed: VM 102 qmp command 'query-migrate' failed - client closed connection
query migrate failed: VM 102 not running

2025-01-14 14:32:05 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running

2025-01-14 14:32:07 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running

2025-01-14 14:32:08 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running

2025-01-14 14:32:09 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running

2025-01-14 14:32:10 query migrate failed: VM 102 not running
2025-01-14 14:32:10 ERROR: online migrate failure - too many query migrate failures - aborting
2025-01-14 14:32:10 aborting phase 2 - cleanup resources
2025-01-14 14:32:10 migrate_cancel
2025-01-14 14:32:10 migrate_cancel error: VM 102 not running
2025-01-14 14:32:10 ERROR: query-status error: VM 102 not running
2025-01-14 14:32:13 ERROR: migration finished with problems (duration 00:00:34)

TASK ERROR: migration problems
 
Please check the system logs and coredumps; it still looks like the VM is crashing.
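For reference, commands along these lines can be used on the source node to check the logs and coredumps (the time window and grep pattern below are only examples):

Code:
# host journal around the time of the failed migration
journalctl --since "2025-01-14 14:30" --until "2025-01-14 14:35" | grep -iE 'kvm|segfault|coredump'

# list recent KVM coredumps and show details of the most recent one
coredumpctl list kvm
coredumpctl -1 info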
 
Here are the system logs:

Code:
2025-01-14T14:31:51.920161+01:00 test-pve-01 kernel: [ 3214.087254] kvm[18733]: segfault at 41b8 ip 00005d1ea425fb00 sp 00007b07488fef38 error 4 in qemu-system-x86_64[5d1ea3d7c000+6a4000] likely on CPU 1 (core 1, socket 0)
2025-01-14T14:32:04.378628+01:00 test-pve-01 systemd-coredump[20239]: Process 18733 (kvm) of user 0 dumped core.#012#012Module libsystemd.so.0 from deb systemd-252.31-1~deb12u1.amd64#012Module libudev.so.1 from deb systemd-252.31-1~deb12u1.amd64#012Stack trace of thread 18733:#012#0  0x00005d1ea425fb00 bdrv_primary_child (qemu-system-x86_64 + 0x80eb00)#012#1  0x00005d1ea42886d6 bdrv_co_flush (qemu-system-x86_64 + 0x8376d6)#012#2  0x00005d1ea424a882 bdrv_co_flush_entry (qemu-system-x86_64 + 0x7f9882)#012#3  0x00005d1ea43cef7b coroutine_trampoline (qemu-system-x86_64 + 0x97df7b)#012#4  0x00007b074e0a69c0 n/a (libc.so.6 + 0x519c0)#012ELF object binary architecture: AMD x86-64
2025-01-15T13:49:23.118482+01:00 test-pve-01 kernel: [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
2025-01-15T13:49:23.118483+01:00 test-pve-01 kernel: [    0.000001] kvm-clock: using sched offset of 16430787326 cycles
2025-01-15T13:49:23.118484+01:00 test-pve-01 kernel: [    0.000003] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
2025-01-15T13:49:23.118656+01:00 test-pve-01 kernel: [    0.057410] kvm-guest: APIC: eoi() replaced with kvm_guest_apic_eoi_write()
2025-01-15T13:49:23.118657+01:00 test-pve-01 kernel: [    0.057428] kvm-guest: KVM setup pv remote TLB flush
2025-01-15T13:49:23.118658+01:00 test-pve-01 kernel: [    0.057432] kvm-guest: setup PV sched yield
2025-01-15T13:49:23.118715+01:00 test-pve-01 kernel: [    0.058259] kvm-guest: PV spinlocks enabled
2025-01-15T13:49:23.119040+01:00 test-pve-01 kernel: [    0.183520] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
2025-01-15T13:49:23.119085+01:00 test-pve-01 kernel: [    0.183527] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
2025-01-15T13:49:23.119095+01:00 test-pve-01 kernel: [    0.183530] kvm-guest: setup PV IPIs
2025-01-15T13:49:23.121082+01:00 test-pve-01 kernel: [    0.377328] clocksource: Switched to clocksource kvm-clock
2025-01-15T14:13:17.746968+01:00 test-pve-01 kernel: [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
2025-01-15T14:13:17.746970+01:00 test-pve-01 kernel: [    0.000001] kvm-clock: using sched offset of 22372401189 cycles
2025-01-15T14:13:17.746972+01:00 test-pve-01 kernel: [    0.000004] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
2025-01-15T14:13:17.747239+01:00 test-pve-01 kernel: [    0.058603] kvm-guest: APIC: eoi() replaced with kvm_guest_apic_eoi_write()
2025-01-15T14:13:17.747241+01:00 test-pve-01 kernel: [    0.058621] kvm-guest: KVM setup pv remote TLB flush
2025-01-15T14:13:17.747243+01:00 test-pve-01 kernel: [    0.058626] kvm-guest: setup PV sched yield
2025-01-15T14:13:17.747365+01:00 test-pve-01 kernel: [    0.059445] kvm-guest: PV spinlocks enabled
2025-01-15T14:13:17.747927+01:00 test-pve-01 kernel: [    0.185408] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
2025-01-15T14:13:17.747942+01:00 test-pve-01 kernel: [    0.185415] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
2025-01-15T14:13:17.747946+01:00 test-pve-01 kernel: [    0.185418] kvm-guest: setup PV IPIs
2025-01-15T14:13:17.753817+01:00 test-pve-01 kernel: [    0.527399] clocksource: Switched to clocksource kvm-clock
2025-01-15T14:23:58.077113+01:00 test-pve-01 QEMU[5351]: kvm: check_section_footer: Read section footer failed: -5
2025-01-15T14:23:58.085319+01:00 test-pve-01 QEMU[5351]: kvm: load of migration failed: Invalid argument
2025-01-15T14:44:08.717865+01:00 test-pve-01 QEMU[13133]: kvm: check_section_footer: Read section footer failed: -5
2025-01-15T14:44:08.718721+01:00 test-pve-01 QEMU[13133]: kvm: load of migration failed: Invalid argument
2025-01-15T14:48:20.594322+01:00 test-pve-01 systemd-coredump[15088]: Process 13660 (kvm) of user 0 dumped core.#012#012Module libsystemd.so.0 from deb systemd-252.33-1~deb12u1.amd64#012Module libudev.so.1 from deb systemd-252.33-1~deb12u1.amd64#012Stack trace of thread 13660:#012#0  0x0000593961935b00 bdrv_primary_child (qemu-system-x86_64 + 0x80eb00)#012#1  0x000059396195e6d6 bdrv_co_flush (qemu-system-x86_64 + 0x8376d6)#012#2  0x0000593961920882 bdrv_co_flush_entry (qemu-system-x86_64 + 0x7f9882)#012#3  0x0000593961aa4f7b coroutine_trampoline (qemu-system-x86_64 + 0x97df7b)#012#4  0x00007cb0fb6a69c0 n/a (libc.so.6 + 0x519c0)#012ELF object binary architecture: AMD x86-64

And I attached the associated gdb-log file.

Best regards.

Fabien
 


Unfortunately, it still shows the same issue.
 
Could you please try again with the following script together with the new QEMU package and debug package:
Code:
handle SIGUSR1 noprint nostop
handle SIGPIPE noprint nostop
break blk_aio_flush
break bdrv_flush
break bdrv_co_flush
break blk_co_do_flush
break blk_aio_flush
break blk_remove_bs
break bdrv_graph_wrlock
break bdrv_graph_wrunlock
break bdrv_root_unref_child
break blk_drain
break bdrv_replace_child_noperm
commands 1-11
bt
c
end
c
Hopefully that finally gives a clear picture of what happens.
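A sketch of how the script would typically be run, assuming it is saved as /root/migrate-debug.gdb and that VM 102 is the one being migrated (file name and VM ID are just examples):

Code:
# attach GDB with the breakpoint script to the running QEMU/KVM process of the VM
gdb -x /root/migrate-debug.gdb -p "$(cat /var/run/qemu-server/102.pid)"

# with GDB attached, start the live migration from another shell or from the web UI;
# the backtraces printed at the breakpoints can then be copied into a log file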
 
I'm having a similar issue on a three-node cluster running the 6.11 kernel with replicated ZFS storage. I'm just starting to dig into troubleshooting and will work through the logging suggestions above. The problem only appeared in the last few days, so it's a bit odd.
 
I tested the patched QEMU 9.0.2 package and it does seem to resolve my issue. I'm using Proxmox in a homelab, so I am adventurous with the testing repo, and the QEMU 9.2.0 package might have been causing this behavior.
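In case it helps anyone else, pinning back to the patched build is a plain apt operation; the version string below is only an example and has to match what apt actually offers on your system:

Code:
# see which pve-qemu-kvm versions the configured repositories provide
apt list -a pve-qemu-kvm

# install a specific (older/patched) build, e.g. a 9.0.2 package
apt install pve-qemu-kvm=9.0.2-4

# note: already running VMs keep using the old binary until they are restarted or migrated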
 
Hi,
can you please post the exact errors you got (excerpt from the system logs/journal) as well as the VM configuration for an affected VM? Was the failure also happening during live migration?
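Something along these lines is enough to collect that (the time window and VM ID are placeholders):

Code:
# excerpt of the source node's journal around the time of the failed migration
journalctl --since "1 hour ago" > journal-excerpt.txt

# configuration of the affected VM
qm config 201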
 
Hi Fiona,

I will dig back through my logs, but an example as logged to Graylog is below:

Code:
{
  "process_id": "217525",
  "gl2_accounted_message_size": 219,
  "gl2_receive_timestamp": "2025-02-08 06:33:56.967",
  "level": 4,
  "gl2_remote_ip": "xxx",
  "gl2_remote_port": 52320,
  "streams": [
    "000000000000000000000001"
  ],
  "gl2_message_id": "01JKJ47KZ6001W51QWXE0XD9XA",
  "source": "gilgamesh",
  "message": "query migrate failed: VM 201 not running#012",
  "gl2_source_input": "64e4f960c34f8025bd861d05",
  "gl2_processing_timestamp": "2025-02-08 06:33:56.968",
  "application_name": "pve-ha-lrm",
  "facility_num": 3,
  "gl2_source_node": "c202ff1e-feae-4a45-8aac-2d2168db5a54",
  "_id": "ac6a9680-e5e6-11ef-ba36-aecf83b738dd",
  "facility": "system daemon",
  "gl2_processing_duration_ms": 1,
  "timestamp": "2025-02-08T06:33:56.966Z"
}

The "query migrate failed" error repeats until the live migration fails. The errors stopped after reverting to the patched qemu 9.0.2 package linked above, if helpful, I could upgrade and attempt to recreate the error to give you more details. I'm fairly certain the error will reappear if I do that.
 
Apologies, here is the VM config from the referenced VM 201 as well:

Code:
agent: 1,fstrim_cloned_disks=1
balloon: 1024
bios: ovmf
boot: order=scsi1;net0
cores: 4
cpu: max
efidisk0: local-zfs:vm-201-disk-0,efitype=4m,format=raw,size=528K
machine: pc-i440fx-9.0
memory: 4096
meta: creation-qemu=7.2.0,ctime=1685513687
name: home
net0: virtio=xxx,bridge=vmbr0
numa: 1
onboot: 1
ostype: l26
scsi1: local-zfs:vm-201-disk-1,discard=on,format=raw,iothread=1,size=50G
scsihw: virtio-scsi-single
smbios1: uuid=cb3ebf8c-24e7-40b9-909d-c590fcadc6f9
sockets: 1
vga: memory=32
vmgenid: f04d4443-8d95-4ebb-be1c-919676c6dd93
 
Okay, so this is ZFS, not Ceph/RBD. Please share the system journal from the source node of the migration around the time the error happened; there should be more details in there.

Should you get around to testing with the newer QEMU again, please use apt install pve-qemu-kvm-dbgsym gdb systemd-coredump (it doesn't really hurt to have these packages installed in any case, but APT will not like it if the version of the dbgsym package doesn't match the version of the main package, so you'd need to downgrade that too). Then, the next time a crash happens, you can run coredumpctl -1 gdb and, at the GDB prompt, thread apply all backtrace.
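In other words, something like this on the source node (versions of the main and dbgsym package have to match, as noted above):

Code:
# install debug symbols and the tooling for inspecting coredumps
apt install pve-qemu-kvm-dbgsym gdb systemd-coredump

# after the next crash, open the most recent coredump in GDB ...
coredumpctl -1 gdb

# ... and at the (gdb) prompt collect a backtrace of all threads:
# (gdb) thread apply all backtrace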
 
Yes, my apologies for confusing the original topic. When I have some time, I can attempt to recreate the issue and provide the logging as requested. The behavior is very similar to what the previous poster described, although I am indeed using replicated ZFS storage.
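For the record, a minimal way to recreate it would be to trigger the live migration from the CLI while watching the journal (the target node name is a placeholder):

Code:
# in one shell on the source node, follow the journal
journalctl -f

# in another shell (or via the web UI), start an online (live) migration of the affected VM
qm migrate 201 pve-node-02 --online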