Hello, I am using pve-qemu-kvm 9.2.0-1 to verify whether the fleecing feature works properly under high IO load.
I set up a PBS instance and an NFS server on Host 1 (PBS uses the NFS server); created VM 100101004 on Host 2 and launched a stress-ng container inside it for high-pressure testing; meanwhile, from Host 1, I continuously copied a 2GB ISO file to VM 100101004 via VM2 in a loop, overwriting the file each time. Then I initiated the PBS backup with the following command:
vzdump 100101004 --compress 0 --fleecing '1,storage=nvme-pool' --remove 0 --node node60 --mode snapshot --storage storage_pbs66
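For reference, the IO load was generated roughly like this (a hypothetical sketch only; the exact tool, paths and container setup differed slightly):

# inside the stress-ng container in the guest: keep the disks busy
stress-ng --hdd 4 --hdd-bytes 2G --io 4

# on Host 1: overwrite a 2GB ISO inside the guest in an endless loop (path and host are illustrative)
while true; do scp /root/test-2G.iso root@<guest-ip>:/root/test-2G.iso; done

The results of the repeated backup runs: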
- 1st backup: 600GB of data backed up successfully.
- 2nd backup: 50GB of data backed up successfully.
- ...
- 11th backup: hung

Task log of the hung (11th) backup:

INFO: using storage: nvme-pool to create fleecing disk
INFO: starting new backup job: vzdump 100101004 --compress 0 --fleecing '1,storage=nvme-pool' --remove 0 --node node60 --mode snapshot --storage storage_pbs66
INFO: Starting Backup of VM 100101004 (qemu)
INFO: Backup started at 2025-09-30 12:39:51
INFO: status = running
INFO: VM Name: c70
INFO: include disk 'scsi0' 'nvme-pool:vm-100101004-disk-1' 200G
INFO: include disk 'scsi3' 'nvme-pool:vm-100101004-disk-2' 100G
INFO: include disk 'scsi4' 'nvme-pool:vm-100101004-disk-3' 100G
INFO: include disk 'scsi5' 'nvme-pool:vm-100101004-disk-4' 100G
INFO: include disk 'scsi6' 'hdd-pool:vm-100101004-disk-0' 100G
INFO: include disk 'efidisk0' 'nvme-pool:vm-100101004-disk-0' 64M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/100101004/2025-09-30T04:39:51Z'
INFO: drive-scsi0: attaching fleecing image nvme-pool:vm-100101004-fleece-0 to QEMU
INFO: drive-scsi3: attaching fleecing image nvme-pool:vm-100101004-fleece-1 to QEMU
INFO: drive-scsi4: attaching fleecing image nvme-pool:vm-100101004-fleece-2 to QEMU
INFO: drive-scsi5: attaching fleecing image nvme-pool:vm-100101004-fleece-3 to QEMU
INFO: drive-scsi6: attaching fleecing image nvme-pool:vm-100101004-fleece-4 to QEMU
INFO: started backup task '6b0b7649-3331-4e27-b821-3ca4a9b49bdb'
INFO: resuming VM again
INFO: efidisk0: dirty-bitmap status: OK (drive clean)
INFO: scsi0: dirty-bitmap status: OK (2.5 GiB of 200.0 GiB dirty)
INFO: scsi3: dirty-bitmap status: OK (drive clean)
INFO: scsi4: dirty-bitmap status: OK (drive clean)
INFO: scsi5: dirty-bitmap status: OK (drive clean)
INFO: scsi6: dirty-bitmap status: OK (drive clean)
INFO: using fast incremental mode (dirty-bitmap), 2.5 GiB dirty of 600.1 GiB total
INFO: 0% (0.0 B of 2.5 GiB) in 3s, read: 0 B/s, write: 0 B/s
At this point, the IO of the VM showed 0 on the web interface.
Meanwhile, on the VM being backed up, the following messages were displayed in the terminal:
Message from syslogd@localhost at Sep 30 15:45:48 ...
kernel:[ 1779.146039] watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [swapper/14:0]
Message from syslogd@localhost at Sep 30 15:45:48 ...
kernel:[ 1779.146043] watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [swapper/13:0]
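As an additional generic check (not something I captured during the original run), the backup block job can be inspected through the VM's HMP monitor; when the iothread is spinning, the monitor itself may also stop responding:

qm monitor 100101004
# at the qm> prompt:
info block-jobs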
Checking with `top -H -p`, three threads of the `/usr/bin/kvm` process showed high CPU usage: two KVM threads and one CPU thread. The call stacks of the two KVM threads are as follows:
Thread 3 (LWP 1255119 "kvm"):
#0 virtio_device_disabled (vdev=<optimized out>) at ./include/hw/virtio/virtio.h:528
#1 virtio_queue_split_empty (vq=0xaaaac29d9b18) at ../hw/virtio/virtio.c:694
#2 virtio_queue_empty (vq=0xaaaac29d9b18) at ../hw/virtio/virtio.c:743
#3 0x0000aaaab22b1d1c in virtio_queue_host_notifier_aio_poll (opaque=<optimized out>) at ../hw/virtio/virtio.c:3776
#4 0x0000aaaab255a97c in run_poll_handlers_once (timeout=<synthetic pointer>, now=189532829631780, ready_list=0xffff9b0689c0, ctx=0xaaaabf4b2160) at ../util/aio-posix.c:442
#5 run_poll_handlers (timeout=<synthetic pointer>, max_ns=<optimized out>, ready_list=0xffff9b0689c0, ctx=0xaaaabf4b2160) at ../util/aio-posix.c:545
#6 try_poll_mode (timeout=<synthetic pointer>, ready_list=0xffff9b0689c0, ctx=0xaaaabf4b2160) at ../util/aio-posix.c:596
#7 aio_poll (ctx=0xaaaabf4b2160, blocking=blocking@entry=true) at ../util/aio-posix.c:630
#8 0x0000aaaab23c5788 in iothread_run (opaque=opaque@entry=0xaaaabf077000) at ../iothread.c:63
#9 0x0000aaaab255db5c in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#10 0x0000ffff9f9fee18 in __GI___pthread_get_minstack (attr=<optimized out>) at ./nptl/nptl-stack.c:145
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 1 (LWP 1255117 "kvm"):
#0 0x0000ffff9fa5dfa8 in __faccessat (fd=<optimized out>, file=<optimized out>, mode=<optimized out>, flag=<optimized out>) at ../sysdeps/unix/sysv/linux/faccessat.c:75
#1 0x0000aaaab1f2367c in qemu_main_loop () at ../system/runstate.c:835
#2 0x0000aaaab1f2366c in main_loop_should_exit (status=<synthetic pointer>) at ../system/runstate.c:824
#3 qemu_main_loop () at ../system/runstate.c:834
#4 0x0000ffffa306cb90 in __stack_chk_guard () from /lib/ld-linux-aarch64.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
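For anyone wanting to capture the same information, the stacks can be dumped by attaching gdb to the VM's kvm process (PID taken from the PVE pidfile), e.g.:

gdb -p $(cat /var/run/qemu-server/100101004.pid) -batch -ex 'set pagination off' -ex 'thread apply all bt'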
AI analysis indicated that Thread 3 (LWP 1255119) was stuck in an infinite loop in `virtio_queue_host_notifier_aio_poll` while handling AIO event notifications for the virtio device. The fleecing mechanism and the virtio disk driver competed for resources under high IO pressure, leading to CPU starvation and a system-level deadlock.
Is this a known issue?