Unfortunately, my celebration was premature. The system died in the same way again after working fine for four days. This is pretty strange, because before the update the failure was very easy to trigger (even on a freshly booted system), and immediately after the update I consistently could not reproduce it. This time, I managed to capture the following backtraces of the KVM processes using sysrq. Again, other processes subsequently go into D state after the KVM ones lock up; I can provide those too if useful, but I assume the KVM ones are the most important.
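In case it helps anyone reproduce the capture, this is roughly how I'm grabbing the traces (a minimal sketch, assuming the standard /proc/sysrq-trigger interface and root; sysrq 'w' dumps the stacks of tasks stuck in D state to the kernel log, which I then pull out of dmesg):

```python
# Rough sketch: trigger sysrq 'w' (dump blocked/D-state tasks) and save the
# resulting backtraces from the kernel ring buffer. Needs root and kernel.sysrq
# enabled; output path is just an example.
import subprocess

def dump_blocked_tasks(out_path="/tmp/blocked-tasks.txt"):
    # 'w' asks the kernel to dump stack traces of all tasks in
    # uninterruptible (D) sleep to the kernel log.
    with open("/proc/sysrq-trigger", "w") as f:
        f.write("w")
    # Read the ring buffer; the traces appear as "task:... state:D ..." entries.
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    with open(out_path, "w") as f:
        f.write(log)
    return out_path

if __name__ == "__main__":
    print("wrote", dump_blocked_tasks())
```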
@fiona I know you have been asking for some backtraces, so perhaps these might mean something to you.
I may have more time to mess around with it in a while. I have a few random ideas to check, mainly NFS volume free space (maybe 4 days of backups brought it down to some critical level?) and running a memory test on the server (it passed fine about a year or two ago).
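For the free-space idea, I'll probably just log the mount's usage around backup time. Something like this sketch is what I have in mind (/mnt/pve/TrueNAS is my mount point; adjust for yours, and note the call will itself block if the mount hangs):

```python
# Quick sketch: periodically log free space on the NFS mount so I can see
# whether the nightly backups push it toward some critical level around the
# time of the lockups.
import shutil, time

MOUNT = "/mnt/pve/TrueNAS"  # my NFS mount point

def log_free_space(interval_s=300):
    while True:
        usage = shutil.disk_usage(MOUNT)
        free_gib = usage.free / 2**30
        pct_used = 100 * usage.used / usage.total
        print(f"{time.strftime('%F %T')}  free={free_gib:.1f} GiB  used={pct_used:.1f}%",
              flush=True)
        time.sleep(interval_s)

if __name__ == "__main__":
    log_free_space()
```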
One thing I find strange is that the hanging KVM processes belong to the TrueNAS VM, which has nothing stored on NFS (the NFS share is hosted by this VM, so it doesn't even exist yet when the VM boots). If something is causing the TrueNAS VM to block waiting on its own NFS server, that would explain the deadlock. The question then becomes why the TrueNAS KVM processes are accessing NFS at all if they have nothing stored on it.
I suppose the next step would be trying to figure out what files those processes are accessing over NFS.
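To that end, my plan is roughly the following: walk /proc/<pid>/fd and report any process holding a file open under the NFS mount (a rough sketch; /mnt/pve/TrueNAS is my mount point, it needs root to see other processes' fds, and it's essentially what running lsof against the mount point reports):

```python
# Rough sketch: find processes with files open under the NFS mount by walking
# /proc/<pid>/fd. Needs root to read other processes' fd symlinks.
import os

MOUNT = "/mnt/pve/TrueNAS"  # my NFS mount point; adjust for yours

def open_files_under(mount=MOUNT):
    hits = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            fds = os.listdir(fd_dir)
        except (PermissionError, FileNotFoundError):
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(f"{fd_dir}/{fd}")
            except OSError:
                continue
            if target == mount or target.startswith(mount + "/"):
                try:
                    with open(f"/proc/{pid}/comm") as f:
                        comm = f.read().strip()
                except OSError:
                    comm = "?"
                hits.append((int(pid), comm, target))
    return hits

if __name__ == "__main__":
    for pid, comm, path in open_files_under():
        print(pid, comm, path)
```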
EDIT: Just triggered a lockup again (it took copying 150GB of random junk this time), but I wasn't able to see any new file opens or existing open files on /mnt/pve/TrueNAS (the NFS mount).