Proxmox VE 8.3: live migration problems

Thanks.

I tested, but I still have problems. The error is now different, though:
Code:
2025-01-14 14:31:45 set migration parameters
2025-01-14 14:31:45 start migrate command to unix:/run/qemu-server/102.migrate
2025-01-14 14:31:47 migration active, transferred 107.2 MiB of 1.5 GiB VM-state, 161.4 MiB/s
2025-01-14 14:31:48 migration active, transferred 234.3 MiB of 1.5 GiB VM-state, 695.3 MiB/s
2025-01-14 14:31:49 migration active, transferred 329.5 MiB of 1.5 GiB VM-state, 878.6 MiB/s
2025-01-14 14:31:50 migration active, transferred 500.7 MiB of 1.5 GiB VM-state, 34.5 MiB/s
query migrate failed: VM 102 qmp command 'query-migrate' failed - client closed connection

2025-01-14 14:32:04 query migrate failed: VM 102 qmp command 'query-migrate' failed - client closed connection
query migrate failed: VM 102 not running

2025-01-14 14:32:05 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running

2025-01-14 14:32:07 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running

2025-01-14 14:32:08 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running

2025-01-14 14:32:09 query migrate failed: VM 102 not running
query migrate failed: VM 102 not running

2025-01-14 14:32:10 query migrate failed: VM 102 not running
2025-01-14 14:32:10 ERROR: online migrate failure - too many query migrate failures - aborting
2025-01-14 14:32:10 aborting phase 2 - cleanup resources
2025-01-14 14:32:10 migrate_cancel
2025-01-14 14:32:10 migrate_cancel error: VM 102 not running
2025-01-14 14:32:10 ERROR: query-status error: VM 102 not running
2025-01-14 14:32:13 ERROR: migration finished with problems (duration 00:00:34)

TASK ERROR: migration problems
 
Please check the system logs and coredumps; it still looks like the VM is crashing.
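For reference, commands along these lines can be used on the source node to check the logs and coredumps (the time window and grep pattern below are only examples):

Code:
# host journal around the time of the failed migration
journalctl --since "2025-01-14 14:30" --until "2025-01-14 14:35" | grep -iE 'kvm|segfault|coredump'

# list recent KVM coredumps and show details of the most recent one
coredumpctl list kvm
coredumpctl -1 info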
 
Here are the system logs:

Code:
2025-01-14T14:31:51.920161+01:00 test-pve-01 kernel: [ 3214.087254] kvm[18733]: segfault at 41b8 ip 00005d1ea425fb00 sp 00007b07488fef38 error 4 in qemu-system-x86_64[5d1ea3d7c000+6a4000] likely on CPU 1 (core 1, socket 0)
2025-01-14T14:32:04.378628+01:00 test-pve-01 systemd-coredump[20239]: Process 18733 (kvm) of user 0 dumped core.#012#012Module libsystemd.so.0 from deb systemd-252.31-1~deb12u1.amd64#012Module libudev.so.1 from deb systemd-252.31-1~deb12u1.amd64#012Stack trace of thread 18733:#012#0  0x00005d1ea425fb00 bdrv_primary_child (qemu-system-x86_64 + 0x80eb00)#012#1  0x00005d1ea42886d6 bdrv_co_flush (qemu-system-x86_64 + 0x8376d6)#012#2  0x00005d1ea424a882 bdrv_co_flush_entry (qemu-system-x86_64 + 0x7f9882)#012#3  0x00005d1ea43cef7b coroutine_trampoline (qemu-system-x86_64 + 0x97df7b)#012#4  0x00007b074e0a69c0 n/a (libc.so.6 + 0x519c0)#012ELF object binary architecture: AMD x86-64
2025-01-15T13:49:23.118482+01:00 test-pve-01 kernel: [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
2025-01-15T13:49:23.118483+01:00 test-pve-01 kernel: [    0.000001] kvm-clock: using sched offset of 16430787326 cycles
2025-01-15T13:49:23.118484+01:00 test-pve-01 kernel: [    0.000003] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
2025-01-15T13:49:23.118656+01:00 test-pve-01 kernel: [    0.057410] kvm-guest: APIC: eoi() replaced with kvm_guest_apic_eoi_write()
2025-01-15T13:49:23.118657+01:00 test-pve-01 kernel: [    0.057428] kvm-guest: KVM setup pv remote TLB flush
2025-01-15T13:49:23.118658+01:00 test-pve-01 kernel: [    0.057432] kvm-guest: setup PV sched yield
2025-01-15T13:49:23.118715+01:00 test-pve-01 kernel: [    0.058259] kvm-guest: PV spinlocks enabled
2025-01-15T13:49:23.119040+01:00 test-pve-01 kernel: [    0.183520] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
2025-01-15T13:49:23.119085+01:00 test-pve-01 kernel: [    0.183527] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
2025-01-15T13:49:23.119095+01:00 test-pve-01 kernel: [    0.183530] kvm-guest: setup PV IPIs
2025-01-15T13:49:23.121082+01:00 test-pve-01 kernel: [    0.377328] clocksource: Switched to clocksource kvm-clock
2025-01-15T14:13:17.746968+01:00 test-pve-01 kernel: [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
2025-01-15T14:13:17.746970+01:00 test-pve-01 kernel: [    0.000001] kvm-clock: using sched offset of 22372401189 cycles
2025-01-15T14:13:17.746972+01:00 test-pve-01 kernel: [    0.000004] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
2025-01-15T14:13:17.747239+01:00 test-pve-01 kernel: [    0.058603] kvm-guest: APIC: eoi() replaced with kvm_guest_apic_eoi_write()
2025-01-15T14:13:17.747241+01:00 test-pve-01 kernel: [    0.058621] kvm-guest: KVM setup pv remote TLB flush
2025-01-15T14:13:17.747243+01:00 test-pve-01 kernel: [    0.058626] kvm-guest: setup PV sched yield
2025-01-15T14:13:17.747365+01:00 test-pve-01 kernel: [    0.059445] kvm-guest: PV spinlocks enabled
2025-01-15T14:13:17.747927+01:00 test-pve-01 kernel: [    0.185408] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
2025-01-15T14:13:17.747942+01:00 test-pve-01 kernel: [    0.185415] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
2025-01-15T14:13:17.747946+01:00 test-pve-01 kernel: [    0.185418] kvm-guest: setup PV IPIs
2025-01-15T14:13:17.753817+01:00 test-pve-01 kernel: [    0.527399] clocksource: Switched to clocksource kvm-clock
2025-01-15T14:23:58.077113+01:00 test-pve-01 QEMU[5351]: kvm: check_section_footer: Read section footer failed: -5
2025-01-15T14:23:58.085319+01:00 test-pve-01 QEMU[5351]: kvm: load of migration failed: Invalid argument
2025-01-15T14:44:08.717865+01:00 test-pve-01 QEMU[13133]: kvm: check_section_footer: Read section footer failed: -5
2025-01-15T14:44:08.718721+01:00 test-pve-01 QEMU[13133]: kvm: load of migration failed: Invalid argument
2025-01-15T14:48:20.594322+01:00 test-pve-01 systemd-coredump[15088]: Process 13660 (kvm) of user 0 dumped core.#012#012Module libsystemd.so.0 from deb systemd-252.33-1~deb12u1.amd64#012Module libudev.so.1 from deb systemd-252.33-1~deb12u1.amd64#012Stack trace of thread 13660:#012#0  0x0000593961935b00 bdrv_primary_child (qemu-system-x86_64 + 0x80eb00)#012#1  0x000059396195e6d6 bdrv_co_flush (qemu-system-x86_64 + 0x8376d6)#012#2  0x0000593961920882 bdrv_co_flush_entry (qemu-system-x86_64 + 0x7f9882)#012#3  0x0000593961aa4f7b coroutine_trampoline (qemu-system-x86_64 + 0x97df7b)#012#4  0x00007cb0fb6a69c0 n/a (libc.so.6 + 0x519c0)#012ELF object binary architecture: AMD x86-64

And I attached the associated gdb-log file.

Best regards.

Fabien
 


Unfortunately, it still shows the same issue.
 
Could you please try again with the following script together with the new QEMU package and debug package:
Code:
handle SIGUSR1 noprint nostop
handle SIGPIPE noprint nostop
break blk_aio_flush
break bdrv_flush
break bdrv_co_flush
break blk_co_do_flush
break blk_aio_flush
break blk_remove_bs
break bdrv_graph_wrlock
break bdrv_graph_wrunlock
break bdrv_root_unref_child
break blk_drain
break bdrv_replace_child_noperm
commands 1-11
bt
c
end
c
Hopefully that finally gives a clear picture of what happens.
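A sketch of how the script would typically be run, assuming it is saved as /root/migrate-debug.gdb and that VM 102 is the one being migrated (file name and VM ID are just examples):

Code:
# attach GDB with the breakpoint script to the running QEMU/KVM process of the VM
gdb -x /root/migrate-debug.gdb -p "$(cat /var/run/qemu-server/102.pid)"

# with GDB attached, start the live migration from another shell or from the web UI;
# the backtraces printed at the breakpoints can then be copied into a log file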
 
I'm having a similar issue on a three-node cluster running the 6.11 kernel with replicated ZFS storage. I'm just starting to dig into troubleshooting and will work through the logging suggestions above. The problem only appeared in the last few days, so it's a bit odd.
 
I tested the patched QEMU 9.0.2 package and it does seem to resolve my issue. I'm using Proxmox in a homelab, so I am adventurous with the testing repo, and the QEMU 9.2.0 package might have been causing this behavior.
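In case it helps anyone else, pinning back to the patched build is a plain apt operation; the version string below is only an example and has to match what apt actually offers on your system:

Code:
# see which pve-qemu-kvm versions the configured repositories provide
apt list -a pve-qemu-kvm

# install a specific (older/patched) build, e.g. a 9.0.2 package
apt install pve-qemu-kvm=9.0.2-4

# note: already running VMs keep using the old binary until they are restarted or migrated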
 
Hi,
can you please post the exact errors you got (excerpt from the system logs/journal) as well as the VM configuration for an affected VM? Was the failure also happening during live migration?
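Something along these lines is enough to collect that (the time window and VM ID are placeholders):

Code:
# excerpt of the source node's journal around the time of the failed migration
journalctl --since "1 hour ago" > journal-excerpt.txt

# configuration of the affected VM
qm config 201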
 
Hi Fiona,

I will dig back through my logs, but an example as logged to Graylog is below:

Code:
{
  "process_id": "217525",
  "gl2_accounted_message_size": 219,
  "gl2_receive_timestamp": "2025-02-08 06:33:56.967",
  "level": 4,
  "gl2_remote_ip": "xxx",
  "gl2_remote_port": 52320,
  "streams": [
    "000000000000000000000001"
  ],
  "gl2_message_id": "01JKJ47KZ6001W51QWXE0XD9XA",
  "source": "gilgamesh",
  "message": "query migrate failed: VM 201 not running#012",
  "gl2_source_input": "64e4f960c34f8025bd861d05",
  "gl2_processing_timestamp": "2025-02-08 06:33:56.968",
  "application_name": "pve-ha-lrm",
  "facility_num": 3,
  "gl2_source_node": "c202ff1e-feae-4a45-8aac-2d2168db5a54",
  "_id": "ac6a9680-e5e6-11ef-ba36-aecf83b738dd",
  "facility": "system daemon",
  "gl2_processing_duration_ms": 1,
  "timestamp": "2025-02-08T06:33:56.966Z"
}

The "query migrate failed" error repeats until the live migration fails. The errors stopped after reverting to the patched qemu 9.0.2 package linked above, if helpful, I could upgrade and attempt to recreate the error to give you more details. I'm fairly certain the error will reappear if I do that.
 
Apologies, here is the VM config from the referenced VM 201 as well:

Code:
agent: 1,fstrim_cloned_disks=1
balloon: 1024
bios: ovmf
boot: order=scsi1;net0
cores: 4
cpu: max
efidisk0: local-zfs:vm-201-disk-0,efitype=4m,format=raw,size=528K
machine: pc-i440fx-9.0
memory: 4096
meta: creation-qemu=7.2.0,ctime=1685513687
name: home
net0: virtio=xxx,bridge=vmbr0
numa: 1
onboot: 1
ostype: l26
scsi1: local-zfs:vm-201-disk-1,discard=on,format=raw,iothread=1,size=50G
scsihw: virtio-scsi-single
smbios1: uuid=cb3ebf8c-24e7-40b9-909d-c590fcadc6f9
sockets: 1
vga: memory=32
vmgenid: f04d4443-8d95-4ebb-be1c-919676c6dd93
 
Okay, so this is ZFS, not Ceph/RBD. Please share the system journal from the source node of the migration around the time the error happened; there should be more details in there.

Should you get around to testing with the newer QEMU again, please use apt install pve-qemu-kvm-dbgsym gdb systemd-coredump (it doesn't really hurt to have these packages installed in any case, but APT will not like it if the version of the dbgsym package doesn't match the version of the main package, so you'd need to downgrade that too). Then, the next time a crash happens, you can run coredumpctl -1 gdb and, at the GDB prompt, thread apply all backtrace.
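In other words, something like this on the source node (versions of the main and dbgsym package have to match, as noted above):

Code:
# install debug symbols and the tooling for inspecting coredumps
apt install pve-qemu-kvm-dbgsym gdb systemd-coredump

# after the next crash, open the most recent coredump in GDB ...
coredumpctl -1 gdb

# ... and at the (gdb) prompt collect a backtrace of all threads:
# (gdb) thread apply all backtrace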
 
Yes, my apologies for confusing the original topic. When I have some time, I can attempt to recreate the issue and provide the logging as requested. The behavior is very similar to what the previous poster described, although I am indeed using replicated ZFS storage.
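For the record, a minimal way to recreate it would be to trigger the live migration from the CLI while watching the journal (the target node name is a placeholder):

Code:
# in one shell on the source node, follow the journal
journalctl -f

# in another shell (or via the web UI), start an online (live) migration of the affected VM
qm migrate 201 pve-node-02 --online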