Originally posted as a reply here: https://forum.proxmox.com/threads/migration-fails-pve-6-2-12-query-migrate-failed.78954/post-429694
"We're seeing this as well, though without the overflow during RAM transmission. More info further into this reply."
This problem was discovered following a cluster-of-4 upgrade wave from PVE 6.4 latest to 7.0 latest.
Upgrades went successfully however the problem has emerged since completion. Many VMs moved around before, during and after the migration without problems, however this issue that we can replicate is occurring between 2 of the 4 nodes in particular, in a single direction.
Weirdly, VMs will move between most nodes and in most directions, however in this instance the VM's SSD will move successfully and before it starts the RAM transfer the VM suddenly dies, which results in a failed live migration. To me, that smells of a kernel fault or something else similarly low-level, as the VM cannot continue.
(SSD moved successfully as part of live migration ahead of this log excerpt, with VM427 running fine as always.)
The VM was then powered down (crashed, etc) on the old host node. To clarify, this VM is ordinarily stable.
We were able to replicate the problem with another VM of different SSD/RAM sizes.
pveversion -v as below, noting that node1 had an extra older kernel that node 3 did not. Removed and thus output is now identical, problem re-verified to exist following the removal of that extra kernel version (pve-kernel-5.4.34-1-pve) (which autoremove wasn't talking about... odd).
No VM has the QEMU guest agent installed or configured, so I'd say it's not relevant here.
As for the kernel versions on the VMs, they're all patched via KernelCare. We could boot into an alternative kernel instead. OS is always RHEL-based (CloudLinux, AlmaLinux, etc), nothing older than RHEL7 family. I wouldn't have thought that the VM would have any/much insight into the migration happening in the background?
We're hunting around here, and would appreciate any pointers around what could be causing the problem. Thank you!
"We're seeing this as well, though without the overflow during RAM transmission. More info further into this reply."
We discovered this problem after upgrading a 4-node cluster from the latest PVE 6.4 to the latest PVE 7.0.
The upgrades themselves went fine, but the problem has emerged since they completed. Many VMs moved around before, during and after the upgrade without problems; however, this reproducible issue occurs between 2 of the 4 nodes in particular, and only in one direction.
Weirdly, VMs migrate between most nodes and in most directions without trouble. In this case, though, the VM's SSD moves successfully, and then, just before the RAM transfer starts, the VM suddenly dies, which results in a failed live migration. To me that smells of a kernel fault or something similarly low-level, since the VM cannot continue running.
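For reference, a minimal sketch of how the failing migration can be reproduced from the CLI (the target node name here is a placeholder, not necessarily the real destination node):
Code:
# run on the source node; "node3" stands in for the destination that triggers the crash
qm migrate 427 node3 --online --with-local-disks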
(The SSD had already moved successfully as part of the live migration ahead of this log excerpt, with VM 427 running fine as always.)
Code:
all 'mirror' jobs are ready
2021-11-12 11:20:25 starting online/live migration on unix:/run/qemu-server/427.migrate
2021-11-12 11:20:25 set migration capabilities
2021-11-12 11:20:25 migration speed limit: 600.0 MiB/s
2021-11-12 11:20:25 migration downtime limit: 100 ms
2021-11-12 11:20:25 migration cachesize: 4.0 GiB
2021-11-12 11:20:25 set migration parameters
2021-11-12 11:20:25 start migrate command to unix:/run/qemu-server/427.migrate
query migrate failed: VM 427 not running
2021-11-12 11:20:26 query migrate failed: VM 427 not running
query migrate failed: VM 427 not running
2021-11-12 11:20:28 query migrate failed: VM 427 not running
query migrate failed: VM 427 not running
2021-11-12 11:20:30 query migrate failed: VM 427 not running
query migrate failed: VM 427 not running
2021-11-12 11:20:32 query migrate failed: VM 427 not running
query migrate failed: VM 427 not running
2021-11-12 11:20:34 query migrate failed: VM 427 not running
query migrate failed: VM 427 not running
2021-11-12 11:20:36 query migrate failed: VM 427 not running
2021-11-12 11:20:36 ERROR: online migrate failure - too many query migrate failures - aborting
2021-11-12 11:20:36 aborting phase 2 - cleanup resources
2021-11-12 11:20:36 migrate_cancel
2021-11-12 11:20:36 migrate_cancel error: VM 427 not running
drive-scsi0: Cancelling block job
2021-11-12 11:20:36 ERROR: VM 427 not running
2021-11-12 11:20:59 ERROR: migration finished with problems
The VM then ended up powered off (crashed) on the old host node. To clarify, this VM is ordinarily stable.
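For anyone digging along with us, this is a sketch of the sort of checks we can run on the source node around the failure window (timestamps taken from the log excerpt above; adjust as needed):
Code:
# on the source node: everything logged around the failure window
journalctl --since "2021-11-12 11:20:00" --until "2021-11-12 11:21:00"

# kernel-level hints: OOM kills, segfaults, KVM complaints
dmesg -T | grep -iE 'oom|segfault|kvm'

# anything in syslog mentioning the VMID alongside QEMU/KVM/migration
grep 427 /var/log/syslog | grep -iE 'qemu|kvm|migrat'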
We were able to replicate the problem with another VM of different SSD/RAM sizes.
Code:
root@node1:~# qm config 427
balloon: 0
bootdisk: scsi0
cores: 5
memory: 24576
name: vm427
net0: e1000=*removed*,bridge=vmbr0,rate=200
numa: 0
ostype: l26
scsi0: local-lvm:vm-427-disk-0,backup=0,format=raw,size=500G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=3dcb69d6-671a-4b23-8ed4-6cfcfc85683d
sockets: 2
vmgenid: 5c7e1dbb-a9e4-4517-bb95-030748de1db1
root@node1:~#
pveversion -v output as below. Note that node1 had an extra, older kernel installed (pve-kernel-5.4.34-1-pve) that node3 did not. We removed it, so the output below is now identical on both nodes, and re-verified that the problem still exists following that removal (removal steps sketched after the output). Oddly, apt autoremove never suggested removing that kernel.
Code:
# pveversion -v
proxmox-ve: 7.0-2 (running kernel: 5.11.22-7-pve)
pve-manager: 7.0-14+1 (running version: 7.0-14+1/08975a4c)
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-7
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.4.143-1-pve: 5.4.143-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-12
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-13
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.13-1
proxmox-backup-file-restore: 2.0.13-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.1-1
pve-docs: 7.0-5
pve-edk2-firmware: 3.20210831-1
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.1.0-1
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-18
smartmontools: 7.2-pve2
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
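For completeness, a minimal sketch of the kernel cleanup on node1, assuming a standard apt-based removal (the package name is the one mentioned above):
Code:
# list the PVE kernels installed on node1
dpkg -l | grep pve-kernel

# remove the extra old kernel that node3 didn't have
apt remove pve-kernel-5.4.34-1-pve

# see what autoremove would clean up (it never offered this kernel, oddly)
apt autoremove --dry-run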
No VM has the QEMU guest agent installed or configured, so I'd say it's not relevant here.
As for the kernel versions inside the VMs, they're all patched via KernelCare, though we could boot into an alternative kernel instead. The guest OS is always RHEL-based (CloudLinux, AlmaLinux, etc.), nothing older than the RHEL 7 family. That said, I wouldn't have thought the guest would have much insight into a migration happening in the background?
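If it's useful, a rough sketch of how we'd confirm what each guest is actually running (the kcarectl usage is an assumption based on the standard KernelCare client; grubby is stock on RHEL-family systems):
Code:
# inside a guest: the kernel it booted with
uname -r

# the effective, KernelCare-patched kernel version (assumes the usual kcarectl client)
kcarectl --uname

# list the other installed kernels we could boot into instead
grubby --info=ALL | grep -E '^kernel|^title'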
We're still hunting around here and would appreciate any pointers on what could be causing the problem. Thank you!