[SOLVED] VMs freeze with 100% CPU

@showiproute @VivienM please make sure you have the pve-qemu-kvm-dbgsym package installed. If you have, are you sure the VM was already running with the new QEMU binary? Otherwise, the debugger's trace output shouldn't be all question marks.
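For reference, something along these lines on the host (VMID 105 is just an example):
Code:
# install the QEMU debug symbols so gdb can resolve QEMU's own frames
apt install pve-qemu-kvm-dbgsym
# a "(deleted)" suffix on the link target means the VM was started before the
# QEMU upgrade and is still running the old binary
ls -l /proc/$(cat /var/run/qemu-server/105.pid)/exe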

Questions for all:
  • Are the hangs related to certain actions inside the guest, e.g. reboot?
  • Do you see any messages about failing QMP commands in the system logs? (example command after this list)
  • What can you see when accessing the VM's Console in the web UI?
  • Any interesting logs inside the guest (after you reset it or get it working by live-migration if that works for you)?
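For the second question, something like this on the host should surface failing QMP commands (the grep pattern is only a rough example):
Code:
# search the host's journal of the last two days for QMP-related errors
journalctl --since "2 days ago" | grep -i qmp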
I checked whether that package was installed; I don't think it was. I've installed it now.

To answer your questions:
1) No. If anything, it's related to a lack of actions, i.e. the guest not being interacted with. (Note that in my case, the problematic guests are two Windows VMs in a home lab; they might not be interacted with for days.)
3) Console is completely dark.

I will look at logs later, but I don't think there's anything notable.

I should note something interesting: last night, after gathering that data, I tried to hibernate the frozen VM (I think someone in this thread gave me that idea). I came back this morning, resumed it, and it seems to have come back to life...
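(From memory, the CLI equivalent of what I did is roughly the following; 105 is just my VMID:)
Code:
# hibernate (suspend to disk) the stuck VM ...
qm suspend 105 --todisk 1
# ... then start it again; it resumes from the saved state
qm start 105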
 
Has this issue been reported anywhere beyond this forum, or are we alone here?

I am having the same issue on oVirt 4.4 / CentOS Stream 8 (on CentOS 7 we never saw this issue).
Guest OS, uptime, tools, KSM, and ballooning make no difference.
If anything, I have mostly seen KSM errors when VMs have issues (other issues; I have not seen KSM errors with this one).

So maybe this comes from all the new memory management code in the kernel that was patched in to fix Meltdown/Spectre.
 
Hello,

I'm having the same issue too (or at least I suspect so); here is a bit of info on our setup and context.

We have a hybrid cluster with some hosts being Dell m640 (CPU: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz) and others Dell m620 (CPU: Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz). We only run Debian GNU/Linux as guests, of varying versions (mostly 10 to 12). We use VirtIO SCSI for I/O, VirtIO for networking, and we have ballooning disabled.

For a long time we were using Proxmox 7.3 with a 5.15 kernel, and everything was working fine, except that migrating guests started on an m640 to an m620 would make them freeze (a known FPU-related bug). So we upgraded to Proxmox 7.4 with a 6.2 kernel, and that bug was fixed.

But since then we have occasional freezes of VMs (stuck at 100% CPU, not responding on the network nor on the console, nothing in the logs). Crashes seem to happen more often on guests that have a heavy load, but it's hard to say for sure. They have happened on both the m640s and the m620s, so it doesn't seem to be related to specific hardware.

I'll try to run gdb and give more detailed info the next time we have a crash, but crashes don't happen often (which is actually a good thing for me, since it's a production cluster, but it does make troubleshooting harder).

Regards,
 
Hi @fiona,

Had another freezing VM just now. Here is the usual data:
root@xenon:~# strace -c -p $(cat /var/run/qemu-server/105.pid)
strace: Process 1629 attached
^Cstrace: Process 1629 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.71   20.595787        5937      3469           ppoll
  0.86    0.180207          13     13382           write
  0.24    0.050447          14      3444           read
  0.12    0.024196           7      3273           recvmsg
  0.06    0.013304         201        66           sendmsg
  0.00    0.000018           1        15           close
  0.00    0.000018           0        30           fcntl
  0.00    0.000014           0        15           accept4
  0.00    0.000004           0        15           getsockname
  0.00    0.000001           0         4           ioctl
------ ----------- ----------- --------- --------- ----------------
100.00   20.863996         879     23713           total

root@xenon:~# gdb --batch --ex 't a a bt' -p $(cat /var/run/qemu-server/105.pid)
[New LWP 1630]
[New LWP 1674]
[New LWP 1675]
[New LWP 1676]
[New LWP 1679]
[New LWP 1778]
[New LWP 2729046]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fcd41cac0f6 in __ppoll (fds=0x559ed7664400, nfds=147, timeout=<optimized out>, timeout@entry=0x7ffd562e5bd0, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
42 ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.

Thread 8 (Thread 0x7fcd36dec280 (LWP 2729046) "iou-wrk-1629"):
#0 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x0

Thread 7 (Thread 0x7fcb0cbd96c0 (LWP 1778) "vnc_worker"):
#0 __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x559ed6d67f98) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (futex_word=futex_word@entry=0x559ed6d67f98, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at ./nptl/futex-internal.c:87
#2 0x00007fcd41c35d9b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x559ed6d67f98, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3 0x00007fcd41c383f8 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x559ed6d67fa8, cond=0x559ed6d67f70) at ./nptl/pthread_cond_wait.c:503
#4 ___pthread_cond_wait (cond=cond@entry=0x559ed6d67f70, mutex=mutex@entry=0x559ed6d67fa8) at ./nptl/pthread_cond_wait.c:618
#5 0x0000559ed49686fb in qemu_cond_wait_impl (cond=0x559ed6d67f70, mutex=0x559ed6d67fa8, file=0x559ed49f5bb4 "../ui/vnc-jobs.c", line=248) at ../util/qemu-thread-posix.c:225
#6 0x0000559ed43cefdd in vnc_worker_thread_loop (queue=queue@entry=0x559ed6d67f70) at ../ui/vnc-jobs.c:248
#7 0x0000559ed43cfce8 in vnc_worker_thread (arg=arg@entry=0x559ed6d67f70) at ../ui/vnc-jobs.c:361
#8 0x0000559ed4967be8 in qemu_thread_start (args=0x559ed6755770) at ../util/qemu-thread-posix.c:541
#9 0x00007fcd41c38fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#10 0x00007fcd41cb95bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 6 (Thread 0x7fcb25bff6c0 (LWP 1679) "SPICE Worker"):
#0 0x00007fcd41cabfff in __GI___poll (fds=0x7fcb1c0014f0, nfds=2, timeout=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007fcd4353e9ae in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2 0x00007fcd4353ecef in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3 0x00007fcd43cdafa9 in ?? () from /lib/x86_64-linux-gnu/libspice-server.so.1
#4 0x00007fcd41c38fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#5 0x00007fcd41cb95bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 5 (Thread 0x7fcb27bff6c0 (LWP 1676) "CPU 2/KVM"):
#0 __GI___ioctl (fd=32, request=request@entry=44672) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1 0x0000559ed47d455f in kvm_vcpu_ioctl (cpu=cpu@entry=0x559ed670f230, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3127
#2 0x0000559ed47d46b5 in kvm_cpu_exec (cpu=cpu@entry=0x559ed670f230) at ../accel/kvm/kvm-all.c:2939
#3 0x0000559ed47d5cfd in kvm_vcpu_thread_fn (arg=arg@entry=0x559ed670f230) at ../accel/kvm/kvm-accel-ops.c:51
#4 0x0000559ed4967be8 in qemu_thread_start (args=0x559ed6717c80) at ../util/qemu-thread-posix.c:541
#5 0x00007fcd41c38fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6 0x00007fcd41cb95bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 4 (Thread 0x7fcd34bff6c0 (LWP 1675) "CPU 1/KVM"):
#0 __GI___ioctl (fd=31, request=request@entry=44672) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1 0x0000559ed47d455f in kvm_vcpu_ioctl (cpu=cpu@entry=0x559ed6705cd0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3127
#2 0x0000559ed47d46b5 in kvm_cpu_exec (cpu=cpu@entry=0x559ed6705cd0) at ../accel/kvm/kvm-all.c:2939
#3 0x0000559ed47d5cfd in kvm_vcpu_thread_fn (arg=arg@entry=0x559ed6705cd0) at ../accel/kvm/kvm-accel-ops.c:51
#4 0x0000559ed4967be8 in qemu_thread_start (args=0x559ed670e890) at ../util/qemu-thread-posix.c:541
#5 0x00007fcd41c38fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6 0x00007fcd41cb95bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 3 (Thread 0x7fcd361ba6c0 (LWP 1674) "CPU 0/KVM"):
#0 __GI___ioctl (fd=30, request=request@entry=44672) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1 0x0000559ed47d455f in kvm_vcpu_ioctl (cpu=cpu@entry=0x559ed66d51c0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3127
#2 0x0000559ed47d46b5 in kvm_cpu_exec (cpu=cpu@entry=0x559ed66d51c0) at ../accel/kvm/kvm-all.c:2939
#3 0x0000559ed47d5cfd in kvm_vcpu_thread_fn (arg=arg@entry=0x559ed66d51c0) at ../accel/kvm/kvm-accel-ops.c:51
#4 0x0000559ed4967be8 in qemu_thread_start (args=0x559ed66dbe10) at ../util/qemu-thread-posix.c:541
#5 0x00007fcd41c38fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6 0x00007fcd41cb95bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 2 (Thread 0x7fcd36c7e6c0 (LWP 1630) "call_rcu"):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x0000559ed4968d6a in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at ./include/qemu/futex.h:29
#2 qemu_event_wait (ev=ev@entry=0x559ed5278ce8 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
#3 0x0000559ed49725c2 in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4 0x0000559ed4967be8 in qemu_thread_start (args=0x559ed630ecf0) at ../util/qemu-thread-posix.c:541
#5 0x00007fcd41c38fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6 0x00007fcd41cb95bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7fcd36dec280 (LWP 1629) "kvm"):
#0 0x00007fcd41cac0f6 in __ppoll (fds=0x559ed7664400, nfds=147, timeout=<optimized out>, timeout@entry=0x7ffd562e5bd0, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
#1 0x0000559ed497dbee in ppoll (__ss=0x0, __timeout=0x7ffd562e5bd0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:64
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=2179731832) at ../util/qemu-timer.c:351
#3 0x0000559ed497b4ee in os_host_main_loop_wait (timeout=2179731832) at ../util/main-loop.c:308
#4 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:592
#5 0x0000559ed4597af7 in qemu_main_loop () at ../softmmu/runstate.c:731
#6 0x0000559ed47dea46 in qemu_default_main () at ../softmmu/main.c:37
#7 0x00007fcd41bd718a in __libc_start_call_main (main=main@entry=0x559ed43a4390 <main>, argc=argc@entry=98, argv=argv@entry=0x7ffd562e5de8) at ../sysdeps/nptl/libc_start_call_main.h:58
#8 0x00007fcd41bd7245 in __libc_start_main_impl (main=0x559ed43a4390 <main>, argc=98, argv=0x7ffd562e5de8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffd562e5dd8) at ../csu/libc-start.c:381
#9 0x0000559ed43a5e71 in _start ()
[Inferior 1 (process 1629) detached]
root@xenon:~# qm config 105
agent: 1
audio0: device=ich9-intel-hda,driver=spice
balloon: 512
bios: ovmf
bootdisk: scsi0
cores: 3
cpu: kvm64
efidisk0: local-lvm:vm-105-disk-1,efitype=4m,pre-enrolled-keys=1,size=4M
ide0: none,media=cdrom
ide2: none,media=cdrom
machine: pc-q35-7.2
memory: 8192
name: win10
net0: virtio=9A:00:B8:64:D0:29,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsi0: local-lvm:vm-105-disk-0,discard=on,size=128G
scsihw: virtio-scsi-pci
smbios1: uuid=81e5d251-b177-4bc0-aba0-c0df15073315
sockets: 1
tpmstate0: local-lvm:vm-105-disk-2,size=4M,version=v2.0
vga: qxl
vmgenid: 098c9138-7808-48bb-9d4c-c7987143cabf
root@xenon:~# pveversion -v
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-5.15: 7.4-4
pve-kernel-6.2.16-3-pve: 6.2.16-3
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-5.15.108-1-pve: 5.15.108-1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.6
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.2
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.1-1
proxmox-backup-file-restore: 3.0.1-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.2
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.2
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
 
Also another hang from an Ubuntu 22.04 VM:

Code:
^Cstrace: Process 26267 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   10.121025     2530256         4           ppoll
------ ----------- ----------- --------- --------- ----------------
100.00   10.121025     2530256         4           total


Code:
root@proxmox2:~# gdb --batch --ex 't a a bt' -p $(cat /var/run/qemu-server/118.pid)
[New LWP 26268]
[New LWP 26269]
[New LWP 27241]
[New LWP 27243]
[New LWP 27244]
[New LWP 27245]
[New LWP 27295]
[New LWP 739092]
[New LWP 745138]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f519d3550f6 in __ppoll (fds=0x5618bb7e6990, nfds=76, timeout=<optimized out>, timeout@entry=0x7ffca1c06600, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
42      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.

Thread 10 (Thread 0x7f5199a616c0 (LWP 745138) "iou-wrk-26269"):
#0  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x0

Thread 9 (Thread 0x7f5199a616c0 (LWP 739092) "iou-wrk-26269"):
#0  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x0

Thread 8 (Thread 0x7f507a7bf6c0 (LWP 27295) "vnc_worker"):
#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5618ba778dfc) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x5618ba778dfc, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at ./nptl/futex-internal.c:87
#2  0x00007f519d2ded9b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x5618ba778dfc, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007f519d2e13f8 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5618ba778e08, cond=0x5618ba778dd0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=cond@entry=0x5618ba778dd0, mutex=mutex@entry=0x5618ba778e08) at ./nptl/pthread_cond_wait.c:618
#5  0x00005618b80f36fb in qemu_cond_wait_impl (cond=0x5618ba778dd0, mutex=0x5618ba778e08, file=0x5618b8180bb4 "../ui/vnc-jobs.c", line=248) at ../util/qemu-thread-posix.c:225
#6  0x00005618b7b59fdd in vnc_worker_thread_loop (queue=queue@entry=0x5618ba778dd0) at ../ui/vnc-jobs.c:248
#7  0x00005618b7b5ace8 in vnc_worker_thread (arg=arg@entry=0x5618ba778dd0) at ../ui/vnc-jobs.c:361
#8  0x00005618b80f2be8 in qemu_thread_start (args=0x5618bb72d830) at ../util/qemu-thread-posix.c:541
#9  0x00007f519d2e1fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#10 0x00007f519d3625bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 7 (Thread 0x7f508abfd6c0 (LWP 27245) "CPU 3/KVM"):
#0  __GI___ioctl (fd=33, request=request@entry=44672) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1  0x00005618b7f5f55f in kvm_vcpu_ioctl (cpu=cpu@entry=0x5618ba6be330, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3127
#2  0x00005618b7f5f6b5 in kvm_cpu_exec (cpu=cpu@entry=0x5618ba6be330) at ../accel/kvm/kvm-all.c:2939
#3  0x00005618b7f60cfd in kvm_vcpu_thread_fn (arg=arg@entry=0x5618ba6be330) at ../accel/kvm/kvm-accel-ops.c:51
#4  0x00005618b80f2be8 in qemu_thread_start (args=0x5618ba6c6ec0) at ../util/qemu-thread-posix.c:541
#5  0x00007f519d2e1fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6  0x00007f519d3625bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 6 (Thread 0x7f508b3fe6c0 (LWP 27244) "CPU 2/KVM"):
#0  __GI___ioctl (fd=32, request=request@entry=44672) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1  0x00005618b7f5f55f in kvm_vcpu_ioctl (cpu=cpu@entry=0x5618ba6b4f10, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3127
#2  0x00005618b7f5f6b5 in kvm_cpu_exec (cpu=cpu@entry=0x5618ba6b4f10) at ../accel/kvm/kvm-all.c:2939
#3  0x00005618b7f60cfd in kvm_vcpu_thread_fn (arg=arg@entry=0x5618ba6b4f10) at ../accel/kvm/kvm-accel-ops.c:51
#4  0x00005618b80f2be8 in qemu_thread_start (args=0x5618ba6bd960) at ../util/qemu-thread-posix.c:541
#5  0x00007f519d2e1fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6  0x00007f519d3625bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 5 (Thread 0x7f508bbff6c0 (LWP 27243) "CPU 1/KVM"):
#0  __GI___ioctl (fd=31, request=request@entry=44672) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1  0x00005618b7f5f55f in kvm_vcpu_ioctl (cpu=cpu@entry=0x5618ba6ab9d0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3127
#2  0x00005618b7f5f6b5 in kvm_cpu_exec (cpu=cpu@entry=0x5618ba6ab9d0) at ../accel/kvm/kvm-all.c:2939
#3  0x00005618b7f60cfd in kvm_vcpu_thread_fn (arg=arg@entry=0x5618ba6ab9d0) at ../accel/kvm/kvm-accel-ops.c:51
#4  0x00005618b80f2be8 in qemu_thread_start (args=0x5618ba6b4540) at ../util/qemu-thread-posix.c:541
#5  0x00007f519d2e1fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6  0x00007f519d3625bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 4 (Thread 0x7f51992606c0 (LWP 27241) "CPU 0/KVM"):
#0  __GI___ioctl (fd=30, request=request@entry=44672) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1  0x00005618b7f5f55f in kvm_vcpu_ioctl (cpu=cpu@entry=0x5618ba67be80, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3127
#2  0x00005618b7f5f6b5 in kvm_cpu_exec (cpu=cpu@entry=0x5618ba67be80) at ../accel/kvm/kvm-all.c:2939
#3  0x00005618b7f60cfd in kvm_vcpu_thread_fn (arg=arg@entry=0x5618ba67be80) at ../accel/kvm/kvm-accel-ops.c:51
#4  0x00005618b80f2be8 in qemu_thread_start (args=0x5618ba3032a0) at ../util/qemu-thread-posix.c:541
#5  0x00007f519d2e1fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6  0x00007f519d3625bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 3 (Thread 0x7f5199a616c0 (LWP 26269) "kvm"):
#0  0x00007f519d3550f6 in __ppoll (fds=0x7f518c0030c0, nfds=8, timeout=<optimized out>, timeout@entry=0x0, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
#1  0x00005618b8108c45 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:64
#2  0x00005618b80f0889 in fdmon_poll_wait (ctx=0x5618ba56de70, ready_list=0x7f5199a5c108, timeout=-1) at ../util/fdmon-poll.c:80
#3  0x00005618b80efd2d in aio_poll (ctx=0x5618ba56de70, blocking=blocking@entry=true) at ../util/aio-posix.c:680
#4  0x00005618b7fa2176 in iothread_run (opaque=opaque@entry=0x5618ba336b00) at ../iothread.c:63
#5  0x00005618b80f2be8 in qemu_thread_start (args=0x5618ba56e4c0) at ../util/qemu-thread-posix.c:541
#6  0x00007f519d2e1fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7  0x00007f519d3625bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 2 (Thread 0x7f519a3636c0 (LWP 26268) "call_rcu"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00005618b80f3d6a in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at ./include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x5618b8a03ce8 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
#3  0x00005618b80fd5c2 in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4  0x00005618b80f2be8 in qemu_thread_start (args=0x5618ba306a50) at ../util/qemu-thread-posix.c:541
#5  0x00007f519d2e1fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6  0x00007f519d3625bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7f519a5c7400 (LWP 26267) "kvm"):
#0  0x00007f519d3550f6 in __ppoll (fds=0x5618bb7e6990, nfds=76, timeout=<optimized out>, timeout@entry=0x7ffca1c06600, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
#1  0x00005618b8108bee in ppoll (__ss=0x0, __timeout=0x7ffca1c06600, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:64
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=1079438303) at ../util/qemu-timer.c:351
#3  0x00005618b81064ee in os_host_main_loop_wait (timeout=1079438303) at ../util/main-loop.c:308
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:592
#5  0x00005618b7d22af7 in qemu_main_loop () at ../softmmu/runstate.c:731
#6  0x00005618b7f69a46 in qemu_default_main () at ../softmmu/main.c:37
#7  0x00007f519d28018a in __libc_start_call_main (main=main@entry=0x5618b7b2f390 <main>, argc=argc@entry=74, argv=argv@entry=0x7ffca1c06818) at ../sysdeps/nptl/libc_start_call_main.h:58
#8  0x00007f519d280245 in __libc_start_main_impl (main=0x5618b7b2f390 <main>, argc=74, argv=0x7ffca1c06818, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffca1c06808) at ../csu/libc-start.c:381
#9  0x00005618b7b30e71 in _start ()
[Inferior 1 (process 26267) detached]
 
- Hangs seem completely unrelated to guest actions: VM uptime ranges from a few hours to ~50 days, CPU and memory usage at hang time varies a lot from almost nothing to very high, and the same goes for disk and network I/O. It has happened with Windows guests (10, 2019, 2022) and Linux guests (Ubuntu 18, 20, 22, Debian Buster and Bullseye). All have the QEMU agent configured and running. I've tried to reproduce the issue by generating different workloads on guest VMs, without success.

- No QMP messages in syslog. When the VM is hung it does not reply to QMP commands (ping, reboot, stop).

- The web UI console of the VM shows a frozen screen with whatever the VM was displaying at that moment, but it's unresponsive and neither keyboard nor mouse input works. The date/time shown on the Windows welcome screen does not refresh, i.e. graphics output doesn't work either.

- Nothing relevant in any guest OS. From the guest's perspective it seems as if time had paused and simply unpaused after the live migration ended. Nothing inside the guest works while the VM is hung (tested with logger writing to syslog every second).
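(The in-guest heartbeat mentioned above was essentially a loop like this:)
Code:
# write a timestamped heartbeat to syslog every second; after a hang, look
# for a gap in the logged timestamps
while true; do logger -t heartbeat "alive $(date --iso-8601=seconds)"; sleep 1; done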

Adding to that:

- It has happened on at least 5 different clusters, some Intel, some AMD EPYC, of different generations.

- It seems to happen more often on memory-constrained hosts, where KSM has some amount of memory merged, though it has happened on hosts with lots of free memory too.


Any chance we could get kernel 5.15 for PVE 8 while this issue gets sorted out? For me PVE has been very stable for years, until this issue arose. Having the option to use 5.15 on PVE 8 would help us deploy the newer version without the risk of suffering these hangs.
That's exactly what we're seeing as well. The freezing guests can be tight on memory, and we have tried with and without memory ballooning, all to no avail.
 
Also another hang from an Ubuntu 22.04 VM:

Code:
^Cstrace: Process 26267 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   10.121025     2530256         4           ppoll
------ ----------- ----------- --------- --------- ----------------
100.00   10.121025     2530256         4           total
That looks rather different from the other reports and very strange. There are only 4 polling calls in 10 seconds and no other syscalls. Was the server under heavy load at the time? Any issues with network or storage? Can you share the VM configuration?
 
That looks rather different from the other reports and very strange. There are only 4 polling calls in 10 seconds and no other syscalls. Was the server under heavy load at the time? Any issues with network or storage? Can you share the VM configuration?
The VM config would be:
Code:
agent: 1
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 4
cpu: host
efidisk0: SSD_1TB:vm-118-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide2: none,media=cdrom
machine: q35
memory: 4096
meta: creation-qemu=7.1.0,ctime=1672437547
name: k8s-3
net0: virtio=76:A4:B3:0C:AD:51,bridge=vmbr0,firewall=1,tag=20
numa: 1
onboot: 1
ostype: l26
scsi0: SSD_1TB:vm-118-disk-1,discard=on,iothread=1,size=128G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=a8a97882-cb7f-4d6f-b05f-8ba1e1c380cc
sockets: 1
vmgenid: 38359f81-d253-4e10-9ee9-9409e443a7d9

In general I was not putting any heavy load on it. At 7 am and 7 pm I run a backup to PBS, which sits on the same physical server.
 
At least I didn't find any upstream reports about this issue, and no promising patches on the QEMU development mailing list. But more eyes certainly can't hurt. Neither I nor any of my coworkers have run into this bug yet, so it's hard for us to debug further. We did look at a customer's machine, but in one case it was the PLT corruption bug, which likely is more low-level than QEMU, and the other time it was a different bug (the QEMU process was not stuck in ppoll).
Would it help if you got access to one of the nodes when the freeze occurs?

If so, I'd be happy to upgrade that node to Basic support and open an official support request. I guess others would be happy to do the same, because this is really growing into a showstopper.
 
Would it help if you got access to one of the nodes when the freeze occurs?

If so, I'd be happy to upgrade that node to Basic support and open an official support request. I guess others would be happy to do the same, because this is really growing into a showstopper.
We can certainly take a closer look then and better see how the QEMU process behaves, but I can't give you any guarantees that we can actually identify the root cause (especially if it's the PLT corruption with the internal_fallocate64 calls, it might be very difficult to find).
 
I'd suggest getting outside help: kernel-devel / qemu-devel ... We've been going around in circles since May with no progress. If nothing changes, we'll flood this ticket with more strace and gdb pastes.

We use Proxmox in business, so this annoying bug hurts us ...
 
I'd suggest getting outside help: kernel-devel / qemu-devel ... We've been going around in circles since May with no progress. If nothing changes, we'll flood this ticket with more strace and gdb pastes.

We use Proxmox in business, so this annoying bug hurts us ...
I know and we'd like to fix these issues too, but if there is nothing concrete to work with, it's just not possible. If you'd like us to have a closer look at your hanging VMs, please open a support ticket. As I said, we can't guarantee finding the root cause, but it can't hurt to take a closer look either.

I did ask on the QEMU developer mailing list now:
https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02073.html

If we are lucky somebody has an idea, but it's hard to tell if there is no reproducer or commonality in the VM, software or hardware configurations that can be identified as the culprit.
 
We can certainly take a closer look then and better see how the QEMU process behaves, but I can't give you any guarantees that we can actually identify the root cause (especially if it's the PLT corruption with the internal_fallocate64 calls, it might be very difficult to find).
OK, then I will open a support ticket the next time one of our VMs freezes. Not sure how soon that will happen, but it will, sooner or later ... I'll keep you posted.
 
I got a freeze on my setup; here are the traces from my side:

strace:
Code:
$  strace -c -p $(cat /var/run/qemu-server/511.pid)
strace: Process 10876 attached
^Cstrace: Process 10876 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.28    6.916085        3083      2243           ppoll
  2.70    0.194239          23      8416           write
  0.57    0.040779          19      2060           recvmsg
  0.34    0.024498          11      2164           read
  0.11    0.007892         493        16           fcntl
  0.00    0.000018           0        40           sendmsg
  0.00    0.000012           0        36           ioctl
  0.00    0.000008           1         8           close
  0.00    0.000004           0         8           getsockname
  0.00    0.000004           0         8           accept4
------ ----------- ----------- --------- --------- ----------------
100.00    7.183539         478     14999           total

gdb:
Code:
$  gdb --batch --ex 't a a bt' -p $(cat /var/run/qemu-server/511.pid)
[New LWP 10877]
[New LWP 10916]
[New LWP 10917]
[New LWP 10918]
[New LWP 10919]
[New LWP 10922]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f046527fa66 in __ppoll (fds=0x55a24a524440, nfds=83, timeout=<optimized out>, timeout@entry=0x7ffc5e169840, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
44      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.

Thread 7 (Thread 0x7f00bbfff700 (LWP 10922) "vnc_worker"):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x55a24a54a64c) at ../sysdeps/nptl/futex-internal.h:186
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55a24a54a658, cond=0x55a24a54a620) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x55a24a54a620, mutex=mutex@entry=0x55a24a54a658) at pthread_cond_wait.c:638
#3  0x000055a248ff49cb in qemu_cond_wait_impl (cond=0x55a24a54a620, mutex=0x55a24a54a658, file=0x55a24906b434 "../ui/vnc-jobs.c", line=248) at ../util/qemu-thread-posix.c:220
#4  0x000055a248a835c3 in vnc_worker_thread_loop (queue=0x55a24a54a620) at ../ui/vnc-jobs.c:248
#5  0x000055a248a84288 in vnc_worker_thread (arg=arg@entry=0x55a24a54a620) at ../ui/vnc-jobs.c:361
#6  0x000055a248ff3e89 in qemu_thread_start (args=0x7f00bbffa570) at ../util/qemu-thread-posix.c:505
#7  0x00007f046536bea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8  0x00007f046528ba2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f00cbbff700 (LWP 10919) "CPU 3/KVM"):
#0  0x00007f0465281237 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x000055a248e6c997 in kvm_vcpu_ioctl (cpu=cpu@entry=0x55a249df6720, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3035
#2  0x000055a248e6cb01 in kvm_cpu_exec (cpu=cpu@entry=0x55a249df6720) at ../accel/kvm/kvm-all.c:2850
#3  0x000055a248e6e17d in kvm_vcpu_thread_fn (arg=arg@entry=0x55a249df6720) at ../accel/kvm/kvm-accel-ops.c:51
#4  0x000055a248ff3e89 in qemu_thread_start (args=0x7f00cbbfa570) at ../util/qemu-thread-posix.c:505
#5  0x00007f046536bea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f046528ba2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f04589cd700 (LWP 10918) "CPU 2/KVM"):
#0  0x00007f0465281237 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x000055a248e6c997 in kvm_vcpu_ioctl (cpu=cpu@entry=0x55a249deea80, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3035
#2  0x000055a248e6cb01 in kvm_cpu_exec (cpu=cpu@entry=0x55a249deea80) at ../accel/kvm/kvm-all.c:2850
#3  0x000055a248e6e17d in kvm_vcpu_thread_fn (arg=arg@entry=0x55a249deea80) at ../accel/kvm/kvm-accel-ops.c:51
#4  0x000055a248ff3e89 in qemu_thread_start (args=0x7f04589c8570) at ../util/qemu-thread-posix.c:505
#5  0x00007f046536bea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f046528ba2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f04591ce700 (LWP 10917) "CPU 1/KVM"):
#0  0x00007f0465281237 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x000055a248e6c997 in kvm_vcpu_ioctl (cpu=cpu@entry=0x55a249de6c70, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3035
#2  0x000055a248e6cb01 in kvm_cpu_exec (cpu=cpu@entry=0x55a249de6c70) at ../accel/kvm/kvm-all.c:2850
#3  0x000055a248e6e17d in kvm_vcpu_thread_fn (arg=arg@entry=0x55a249de6c70) at ../accel/kvm/kvm-accel-ops.c:51
#4  0x000055a248ff3e89 in qemu_thread_start (args=0x7f04591c9570) at ../util/qemu-thread-posix.c:505
#5  0x00007f046536bea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f046528ba2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f0459c0f700 (LWP 10916) "CPU 0/KVM"):
#0  0x00007f0465281237 in ioctl () at ../sysdeps/unix/syscall-template.S:120
#1  0x000055a248e6c997 in kvm_vcpu_ioctl (cpu=cpu@entry=0x55a249db70f0, type=type@entry=44672) at ../accel/kvm/kvm-all.c:3035
#2  0x000055a248e6cb01 in kvm_cpu_exec (cpu=cpu@entry=0x55a249db70f0) at ../accel/kvm/kvm-all.c:2850
#3  0x000055a248e6e17d in kvm_vcpu_thread_fn (arg=arg@entry=0x55a249db70f0) at ../accel/kvm/kvm-accel-ops.c:51
#4  0x000055a248ff3e89 in qemu_thread_start (args=0x7f0459c0a570) at ../util/qemu-thread-posix.c:505
#5  0x00007f046536bea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f046528ba2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f045a511700 (LWP 10877) "call_rcu"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x000055a248ff504a in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /build/pve-qemu/pve-qemu-kvm-7.2.0/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x55a249856328 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:430
#3  0x000055a248ffd94a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4  0x000055a248ff3e89 in qemu_thread_start (args=0x7f045a50c570) at ../util/qemu-thread-posix.c:505
#5  0x00007f046536bea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f046528ba2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f045a673040 (LWP 10876) "kvm"):
#0  0x00007f046527fa66 in __ppoll (fds=0x55a24a524440, nfds=83, timeout=<optimized out>, timeout@entry=0x7ffc5e169840, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x000055a249008e11 in ppoll (__ss=0x0, __timeout=0x7ffc5e169840, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=1379648640) at ../util/qemu-timer.c:351
#3  0x000055a249006675 in os_host_main_loop_wait (timeout=1379648640) at ../util/main-loop.c:315
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:606
#5  0x000055a248c23191 in qemu_main_loop () at ../softmmu/runstate.c:739
#6  0x000055a248a5caa7 in qemu_default_main () at ../softmmu/main.c:37
#7  0x00007f04651b3d0a in __libc_start_main (main=0x55a248a57c60 <main>, argc=81, argv=0x7ffc5e169a08, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc5e1699f8) at ../csu/libc-start.c:308
#8  0x000055a248a5c9da in _start ()
[Inferior 1 (process 10876) detached]


qm config
Code:
$  qm config 511
acpi: 1
agent: 1
args: -vnc 0.0.0.0:511
balloon: 0
boot: order=scsi0;net0
cores: 2
cpu: IvyBridge
hotplug: disk,network,usb
keyboard: fr
kvm: 1
memory: 14336
name: ut-kult-front2
net0: virtio=36:0C:08:77:5D:25,bridge=vmbr0,tag=3548
numa: 1
onboot: 1
ostype: l26
scsi0: iscivg:vm-511-disk-0,size=824633720
scsi1: iscivg:vm-511-disk-1,size=150G
scsihw: virtio-scsi-pci
smbios1: uuid=cf74aaf0-e459-48c1-9b12-76ba85fc8a67
sockets: 2
vmgenid: 23e307af-3fe0-4a35-ae98-6738e0324653

pveversions
Code:
$  pveversion -v
proxmox-ve: 7.4-1 (running kernel: 6.2.11-2-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-6.2: 7.4-3
pve-kernel-5.15: 7.4-3
pve-kernel-5.13: 7.1-9
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.11.22-7-pve: 5.11.22-12
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-1
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.6
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
openvswitch-switch: residual config
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-1
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 
In my cluster with 3 PVE 8 nodes on the 6.2 kernel I saw VM 100% CPU spikes, VM packet loss, and rare complete freezes; setting “mitigations=off” helped.
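(For anyone wanting to try this on a GRUB-booted host, it roughly boils down to the following; on ZFS-on-root hosts with systemd-boot you would edit /etc/kernel/cmdline and run proxmox-boot-tool refresh instead. Keep in mind this disables CPU vulnerability mitigations, so weigh the security trade-off.)
Code:
# append mitigations=off to the kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=off"
nano /etc/default/grub
update-grub
reboot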
 
Hi, I recently reported a similar 100% CPU freeze problem here:

https://forum.proxmox.com/threads/p...pu-issue-with-windows-server-2019-vms.130727/

In our case we have a reproducible situation in which all our Windows 2019 VMs go wild with 100% CPU spikes, including temporary but complete Proxmox console freezes while the spikes last. In addition, with constant ICMP ping requests sent to the affected VMs we can see that after startup the ping response times at some point increase from < 1 ms to up to 100 seconds, until the VM recovers for some time. After some minutes these VMs recover, just until the issue happens again at some point. The probability of the issue also seems to be related to the amount of memory assigned to the affected VMs: we could get quite smoothly running Windows 2019 VMs when assigning just about 8 GB or 32 GB of RAM rather than the 300-400 GB we usually do. As mentioned in the referenced thread, in our case the problem seems to be limited to our Windows 2019 VMs, and we haven't noticed any issue with Linux or Windows 10/Windows 7 VMs so far.

After some investigation we were able to work around the issue by pinning the affected PVE hosts in our cluster to the latest 5.15 kernel; with that the issue immediately disappeared and the VMs are running perfectly fine, even under Proxmox 8.
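(In case it helps others: pinning roughly looks like this on hosts managed by proxmox-boot-tool; the 5.15.108-1-pve version string is just an example, use whatever the kernel list shows on your host.)
Code:
# show which kernels are still installed
proxmox-boot-tool kernel list
# pin one of the 5.15 kernels so it stays the boot default across updates
proxmox-boot-tool kernel pin 5.15.108-1-pve
reboot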

After having read through all the postings here, this indeed looks like a kernel > 5.15 issue related to memory management (because less assigned memory and a lower CPU socket count reduce the probability of the 100% CPU stalling). With 5.15 on the affected hosts (AMD EPYC and Intel Xeon CPUs) the issue immediately disappeared and the VMs are working nicely, like they did with Proxmox 7 + kernel 5.15. Therefore it would IMHO really be a good idea to also supply kernel 5.15 for the Proxmox 8 environment right from the Debian repositories. Right now we could only pin to 5.15 because these hosts were previously running Proxmox 7 and we haven't deleted the older kernels yet.

What we haven't tried yet is the mitigations=off kernel command line option mentioned by @Whatever in the previous post. But what we can say is that this does not seem to be related to QEMU v7 vs. v8, since a simple downgrade to kernel 5.15 is enough to make the VMs work smoothly again.

Furthermore, as the problem is perfectly reproducible in our case we are also highly willing to cooperate with the Proxmox developers in case they want to investigate the issue and could arrange remote access to the affected systems.
 
Hi,
In my cluster with 3 PVE 8 nodes on the 6.2 kernel I saw VM 100% CPU spikes, VM packet loss, and rare complete freezes; setting “mitigations=off” helped.
thank you for sharing the workaround! It might be worth checking if/which new mitigations came in between 5.15 and 6.2.
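A quick way to compare what is actually active on a 5.15 host versus a 6.2 host (both kernels expose this via sysfs):
Code:
# lists each known CPU vulnerability and the mitigation currently in effect
grep . /sys/devices/system/cpu/vulnerabilities/*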

In our case we have a reproducible situation in which all our Windows 2019 VMs go wild with 100% CPU spikes, including temporary but complete Proxmox console freezes while the spikes last. In addition, with constant ICMP ping requests sent to the affected VMs we can see that after startup the ping response times at some point increase from < 1 ms to up to 100 seconds, until the VM recovers for some time. After some minutes these VMs recover, just until the issue happens again at some point. The probability of the issue also seems to be related to the amount of memory assigned to the affected VMs: we could get quite smoothly running Windows 2019 VMs when assigning just about 8 GB or 32 GB of RAM rather than the 300-400 GB we usually do. As mentioned in the referenced thread, in our case the problem seems to be limited to our Windows 2019 VMs, and we haven't noticed any issue with Linux or Windows 10/Windows 7 VMs so far.
interesting, I'm not sure we even have any machines with that much RAM around to try to reproduce this.

After having read through all the postings here, this indeed looks like a kernel > 5.15 issue related to memory management (because less assigned memory and a lower CPU socket count reduce the probability of the 100% CPU stalling). With 5.15 on the affected hosts (AMD EPYC and Intel Xeon CPUs) the issue immediately disappeared and the VMs are working nicely, like they did with Proxmox 7 + kernel 5.15. Therefore it would IMHO really be a good idea to also supply kernel 5.15 for the Proxmox 8 environment right from the Debian repositories. Right now we could only pin to 5.15 because these hosts were previously running Proxmox 7 and we haven't deleted the older kernels yet.
IIRC, some people also reported issues with 5.15, but as already mentioned, there are multiple different issues reported in this thread. It's great that you were able to clearly identify the kernel as the culprit for one of them!

What we haven't tried yet is the mitigations=off kernel command line option mentioned by @Whatever in the previous post. But what we can say is that this does not seem to be related to QEMU v7 vs. v8, since a simple downgrade to kernel 5.15 is enough to make the VMs work smoothly again.
Would be good to know.

Furthermore, as the problem is perfectly reproducible in our case we are also highly willing to cooperate with the Proxmox developers in case they want to investigate the issue and could arrange remote access to the affected systems.
Unfortunately, the difference between 5.15 and 6.2 is huge. In the worst case, we'd need to do a full kernel bisect, but knowing the first working and non-working versions would already help narrow it down. You'd need a dedicated host, as it involves a lot of rebooting to bisect the kernel version. If you are up to it, you can find Debian packages for upstream kernels here: https://kernel.ubuntu.com/~kernel-ppa/mainline/
You need to download and install (with dpkg -i <debs>) the packages
Code:
amd64/linux-image-unsigned-XYZ.deb
amd64/linux-modules-XYZ.deb
Bisecting means starting with 5.15 and 6.2 (just to make sure the upstream kernels behave the same as ours) and then always picking the "interesting" middle one. So the first one is 5.19, and then, depending on whether it works or not, you can halve the interval. The next one would then be either 6.0 or 5.17, respectively.
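For example, grabbing and installing the 5.19 mainline build would look roughly like this (the XYZ placeholders stand for the actual file names listed in that directory):
Code:
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.19/amd64/linux-image-unsigned-XYZ.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.19/amd64/linux-modules-XYZ.deb
dpkg -i linux-image-unsigned-XYZ.deb linux-modules-XYZ.deb
reboot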

EDIT: I should note that these kernels do not have ZFS support, so you'd need to make sure your test setup doesn't require that.

EDIT2: You might also want to test one of the newest kernels, e.g. 6.4.3, to see if there has been a fix already. If that's the case, it might be better to bisect between 6.2 and 6.4.3 to find the fixing commit rather than the breaking commit.
 
Hi, I recently reported a similar 100% CPU freeze problem here:

https://forum.proxmox.com/threads/p...pu-issue-with-windows-server-2019-vms.130727/

In our case we have a reproducible situation in which all our Windows 2019 VMs go wild with 100% CPU spikes, including temporary but complete Proxmox console freezes while the spikes last. In addition, with constant ICMP ping requests sent to the affected VMs we can see that after startup the ping response times at some point increase from < 1 ms to up to 100 seconds, until the VM recovers for some time. After some minutes these VMs recover, just until the issue happens again at some point. The probability of the issue also seems to be related to the amount of memory assigned to the affected VMs: we could get quite smoothly running Windows 2019 VMs when assigning just about 8 GB or 32 GB of RAM rather than the 300-400 GB we usually do. As mentioned in the referenced thread, in our case the problem seems to be limited to our Windows 2019 VMs, and we haven't noticed any issue with Linux or Windows 10/Windows 7 VMs so far.
This sounds like a different issue than mine:
1) I've seen the issue with recent builds of Windows 11 insider and the latest Windows 10 release build.
2) My issue was present in Proxmox 7.x; I'm not sure which kernel version first showed it, but it was at least 6 months ago. My vague recollection is that I switched to the newer kernels for 7.x precisely hoping to find a fix, so that likely means 5.15 had the problem too, but I just don't remember.
3) The VMs don't recover without manual intervention (either hibernate/resume, based on what someone posted here, or a hard power off). Unlike you, I'm just running a home lab, not doing anything serious, so... potentially I might not notice for many hours that a VM crashed at 100% CPU. If they recovered after a few minutes, I would probably never notice these CPU freezes.
 
This sounds like a different issue than mine:
3) The VMs don't recover without manual intervention (either hibernate/resume, based on what someone posted here, or a hard power off). Unlike you, I'm just running a home lab, not doing anything serious, so... potentially I might not notice for many hours that a VM crashed at 100% CPU. If they recovered after a few minutes, I would probably never notice these CPU freezes.

My setup is a production setup, but I can confirm that an affected VM will never recover without a manual hard power off and restart (I didn't try hibernate/resume).

Also, as a side note: since it seems the crash happens more often under heavy load, I tried running stress-testing programs on a test VM, hoping I could trigger the bug (which would make further investigation easier), but despite completely overloading the VM with both CPU and I/O for 24 hours, it didn't crash.
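(For reference, the sort of load I generated inside the test VM, e.g. with stress-ng:)
Code:
# load all vCPUs plus some I/O and memory workers for 24 hours
stress-ng --cpu 0 --io 4 --vm 2 --vm-bytes 75% --timeout 24h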
 
