Hello, I am using pve-qemu-kvm 9.2.0-1 to verify whether the fleecing feature works properly under high IO load.
I set up a PBS instance and an NFS server on Host 1 (PBS uses the NFS server); created VM 100101004 on Host 2 and launched a stress-ng container inside it for high-pressure testing; meanwhile, from Host 1, I continuously copied a 2GB ISO file to VM 100101004 via VM2 in a loop, overwriting the file each time. Then I initiated the PBS backup with the following command:
vzdump 100101004 --compress 0 --fleecing '1,storage=nvme-pool' --remove 0 --node node60 --mode snapshot --storage storage_pbs66
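For reference, the IO load was generated roughly like this (a hypothetical sketch only; the exact tool, paths and container setup differed slightly):

# inside the stress-ng container in the guest: keep the disks busy
stress-ng --hdd 4 --hdd-bytes 2G --io 4

# on Host 1: overwrite a 2GB ISO inside the guest in an endless loop (path and host are illustrative)
while true; do scp /root/test-2G.iso root@<guest-ip>:/root/test-2G.iso; done

The results of the repeated backup runs: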
- 1st backup: 600GB of data backed up successfully.
- 2nd backup: 50GB of data backed up successfully.
- ...
- 11th backup: hung

Task log of the hung (11th) backup:

INFO: using storage: nvme-pool to create fleecing disk
INFO: starting new backup job: vzdump 100101004 --compress 0 --fleecing '1,storage=nvme-pool' --remove 0 --node node60 --mode snapshot --storage storage_pbs66
INFO: Starting Backup of VM 100101004 (qemu)
INFO: Backup started at 2025-09-30 12:39:51
INFO: status = running
INFO: VM Name: c70
INFO: include disk 'scsi0' 'nvme-pool:vm-100101004-disk-1' 200G
INFO: include disk 'scsi3' 'nvme-pool:vm-100101004-disk-2' 100G
INFO: include disk 'scsi4' 'nvme-pool:vm-100101004-disk-3' 100G
INFO: include disk 'scsi5' 'nvme-pool:vm-100101004-disk-4' 100G
INFO: include disk 'scsi6' 'hdd-pool:vm-100101004-disk-0' 100G
INFO: include disk 'efidisk0' 'nvme-pool:vm-100101004-disk-0' 64M
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/100101004/2025-09-30T04:39:51Z'
INFO: drive-scsi0: attaching fleecing image nvme-pool:vm-100101004-fleece-0 to QEMU
INFO: drive-scsi3: attaching fleecing image nvme-pool:vm-100101004-fleece-1 to QEMU
INFO: drive-scsi4: attaching fleecing image nvme-pool:vm-100101004-fleece-2 to QEMU
INFO: drive-scsi5: attaching fleecing image nvme-pool:vm-100101004-fleece-3 to QEMU
INFO: drive-scsi6: attaching fleecing image nvme-pool:vm-100101004-fleece-4 to QEMU
INFO: started backup task '6b0b7649-3331-4e27-b821-3ca4a9b49bdb'
INFO: resuming VM again
INFO: efidisk0: dirty-bitmap status: OK (drive clean)
INFO: scsi0: dirty-bitmap status: OK (2.5 GiB of 200.0 GiB dirty)
INFO: scsi3: dirty-bitmap status: OK (drive clean)
INFO: scsi4: dirty-bitmap status: OK (drive clean)
INFO: scsi5: dirty-bitmap status: OK (drive clean)
INFO: scsi6: dirty-bitmap status: OK (drive clean)
INFO: using fast incremental mode (dirty-bitmap), 2.5 GiB dirty of 600.1 GiB total
INFO: 0% (0.0 B of 2.5 GiB) in 3s, read: 0 B/s, write: 0 B/s
At this point, the IO of the VM showed 0 on the web interface.
Meanwhile, on the VM being backed up, the following messages were displayed in the terminal:
Message from syslogd@localhost at Sep 30 15:45:48 ...
kernel:[ 1779.146039] watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [swapper/14:0]
Message from syslogd@localhost at Sep 30 15:45:48 ...
kernel:[ 1779.146043] watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [swapper/13:0]
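As an additional generic check (not something I captured during the original run), the backup block job can be inspected through the VM's HMP monitor; when the iothread is spinning, the monitor itself may also stop responding:

qm monitor 100101004
# at the qm> prompt:
info block-jobs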
Checking with `top -H -p`, three threads of the `/usr/bin/kvm` process showed high CPU usage: two KVM threads and one CPU thread. The call stacks of the two KVM threads are as follows:
Thread 3 (LWP 1255119 "kvm"):
#0 virtio_device_disabled (vdev=<optimized out>) at ./include/hw/virtio/virtio.h:528
#1 virtio_queue_split_empty (vq=0xaaaac29d9b18) at ../hw/virtio/virtio.c:694
#2 virtio_queue_empty (vq=0xaaaac29d9b18) at ../hw/virtio/virtio.c:743
#3 0x0000aaaab22b1d1c in virtio_queue_host_notifier_aio_poll (opaque=<optimized out>) at ../hw/virtio/virtio.c:3776
#4 0x0000aaaab255a97c in run_poll_handlers_once (timeout=<synthetic pointer>, now=189532829631780, ready_list=0xffff9b0689c0, ctx=0xaaaabf4b2160) at ../util/aio-posix.c:442
#5 run_poll_handlers (timeout=<synthetic pointer>, max_ns=<optimized out>, ready_list=0xffff9b0689c0, ctx=0xaaaabf4b2160) at ../util/aio-posix.c:545
#6 try_poll_mode (timeout=<synthetic pointer>, ready_list=0xffff9b0689c0, ctx=0xaaaabf4b2160) at ../util/aio-posix.c:596
#7 aio_poll (ctx=0xaaaabf4b2160, blocking=blocking@entry=true) at ../util/aio-posix.c:630
#8 0x0000aaaab23c5788 in iothread_run (opaque=opaque@entry=0xaaaabf077000) at ../iothread.c:63
#9 0x0000aaaab255db5c in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#10 0x0000ffff9f9fee18 in __GI___pthread_get_minstack (attr=<optimized out>) at ./nptl/nptl-stack.c:145
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 1 (LWP 1255117 "kvm"):
#0 0x0000ffff9fa5dfa8 in __faccessat (fd=<optimized out>, file=<optimized out>, mode=<optimized out>, flag=<optimized out>) at ../sysdeps/unix/sysv/linux/faccessat.c:75
#1 0x0000aaaab1f2367c in qemu_main_loop () at ../system/runstate.c:835
#2 0x0000aaaab1f2366c in main_loop_should_exit (status=<synthetic pointer>) at ../system/runstate.c:824
#3 qemu_main_loop () at ../system/runstate.c:834
#4 0x0000ffffa306cb90 in __stack_chk_guard () from /lib/ld-linux-aarch64.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
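For anyone wanting to capture the same information, the stacks can be dumped by attaching gdb to the VM's kvm process (PID taken from the PVE pidfile), e.g.:

gdb -p $(cat /var/run/qemu-server/100101004.pid) -batch -ex 'set pagination off' -ex 'thread apply all bt'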
AI analysis indicated that Thread 3 (LWP 1255119) was stuck in an infinite loop in `virtio_queue_host_notifier_aio_poll` while handling AIO event notifications for the virtio device. The fleecing mechanism and the virtio disk driver competed for resources under high IO pressure, leading to CPU starvation and a system-level deadlock.
Is this a known issue?