Super slow backups, timeouts, and stuck VMs after updating to PVE 9.1.1 and PBS 4.0.20

Heracleos

Member
Mar 7, 2024
Hello everyone,
Since yesterday, after updating everything to PVE 9.1.1 and PBS 4.0.20 (all on the no-subscription repository), backups have become super slow.
Obviously, after the updates, I restarted both the nodes and the VMs.

I found my daily backup job still running after several hours, and some VMs were even frozen. I had to kill all the backup processes, both on the PVE nodes and on the PBS.
Once they were killed, I also had to unlock the VMs that were being backed up and force-stop them.

Then, once normal operation was restored, I launched a manual backup, but for some VMs the backup is still very slow, and in some cases it is even aborted due to a timeout.
For example:

Code:
INFO: starting new backup job: vzdump 103 --storage BackupRepository --notes-template '{{guestname}}' --node pve03 --mode snapshot --notification-mode notification-system --remove 0
INFO: Starting Backup of VM 103 (qemu)
INFO: Backup started at 2025-11-21 07:09:48
INFO: status = running
INFO: VM Name: HomeSrvWin01
INFO: include disk 'scsi0' 'CephRBD:vm-103-disk-0' 90G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/103/2025-11-21T13:09:48Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '798f2410-2ddf-4552-81a9-df71b384288f'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: created new
INFO:   0% (700.0 MiB of 90.0 GiB) in 3s, read: 233.3 MiB/s, write: 141.3 MiB/s
INFO:   1% (1.0 GiB of 90.0 GiB) in 6s, read: 112.0 MiB/s, write: 30.7 MiB/s
INFO:   2% (1.8 GiB of 90.0 GiB) in 13s, read: 115.4 MiB/s, write: 29.1 MiB/s
INFO:   3% (2.8 GiB of 90.0 GiB) in 20s, read: 144.0 MiB/s, write: 17.1 MiB/s
INFO:   4% (3.6 GiB of 90.0 GiB) in 54s, read: 25.4 MiB/s, write: 9.5 MiB/s
ERROR: VM 103 qmp command 'query-backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 103 qmp command 'backup-cancel' failed - unable to connect to VM 103 qmp socket - timeout after 5973 retries
INFO: resuming VM again
ERROR: Backup of VM 103 failed - VM 103 qmp command 'cont' failed - unable to connect to VM 103 qmp socket - timeout after 449 retries
INFO: Failed at 2025-11-21 07:35:47
INFO: Backup job finished with errors
INFO: notified via target `HeracleosMailServer`
TASK ERROR: job errors

When this happens, the VM is completely frozen.
The VM console is inaccessible, and according to the summary page the QEMU guest agent no longer appears to be active.
The only way to unblock it is to force-stop it.
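For reference, a minimal sketch of how such a stuck guest can be checked and force-stopped from the CLI (using VMID 103 as in the log above; these are the generic qm commands, not necessarily the exact steps taken here):

Code:
qm agent 103 ping   # check whether the guest agent still responds
qm unlock 103       # clear the stale backup lock
qm stop 103         # force-stop the frozen VM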

All VM disks are on a Ceph RBD pool shared by the 3 PVE nodes.
PBS runs as a VM, but its disks (including the backup repository) are qcow2 files on an NFS share on a Synology NAS.

This is the first time this has happened since I installed the cluster months ago.
What steps can I take to diagnose the problem?
Thank you for your suggestions.
 
Please try to generate and post a backtrace the next time this happens by following the steps as described here https://forum.proxmox.com/threads/proxmox-virtual-environment-9-1-available.176255/post-818210

Also, make sure the datastore's backing storage is online and reachable while this happens. Most likely, though, the VM is I/O-starved because the backup to the PBS cannot make progress (which you can avoid by using a fleecing image; see the sketch below). Please also check and post the corresponding backup task log, which can be found on the PBS.
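A minimal sketch of enabling fleecing, assuming the vzdump fleecing option of recent PVE releases; the storage name local-lvm is only a placeholder and should point to fast local storage:

Code:
# one-off manual backup with a fleecing image
vzdump 103 --storage BackupRepository --mode snapshot --fleecing enabled=1,storage=local-lvm
# or as a node-wide default in /etc/vzdump.conf:
# fleecing: enabled=1,storage=local-lvm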
 
I can confirm this behavior. The problem started for us on November 6. We are using a Ceph 3/2 cluster with enterprise hardware, and PBS has been running continuously for two years without any problems. The problem does not always occur on the same VM or host, but jumps around sporadically. Neither the network nor the NVMe drives are at full load, as there is a 40Gb link per host and several Kioxias. PBS is always reachable; there is no network timeout.

Attached are the logs, I hope it helps.
 

Attachments

Hi,
The problem started for us on November 6.
Could you check if/what upgrades were made just before that, both on the PVE and the PBS side? You can check /var/log/apt/history.log and its rotations like /var/log/apt/history.log.1.gz.
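For example, a quick sketch to skim the upgrade history (rotated logs are gzip-compressed):

Code:
grep -E 'Start-Date|Commandline|Upgrade' /var/log/apt/history.log
zgrep -E 'Start-Date|Commandline|Upgrade' /var/log/apt/history.log.1.gz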

Is the VM stuck at full CPU for you too? If yes, what do you get when you run timeout 10 strace -c -p $(cat /run/qemu-server/123.pid) (replacing 123 with the correct ID) while the backup is slow or while it is stuck? Both might be interesting.
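For convenience, the same command as a copy-pasteable sketch (123 is a placeholder VMID; strace may need to be installed first):

Code:
apt install strace                                              # if not already present
VMID=123                                                        # ID of the affected VM
timeout 10 strace -c -p "$(cat /run/qemu-server/${VMID}.pid)"   # 10-second syscall summary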
 
Hey fiona,

Here are the logs.

I'll run the command you provided during the next stuck backup and let you know right away. That could take a few days, since the behavior cannot be easily reproduced.
 

Attachments

Same issue on my test cluster and on another quick one I threw together to confirm; the issue persists.


Note: there are zero hardware issues and no bandwidth constraints between nodes, and we had no issues prior to 9.1, so whatever it is appeared in this version.

A few more observations:

All PVE nodes are affected; it does not matter which one. There is a 20Gb LAGG between everything, with separate NICs and VLANs, 9000 MTU, and X710 NICs. I/O delay hovers around 0.05-0.10% during load.

With fleecing enabled the VMs no longer crash and hard-lock, which is good, but the backups will just stop and never complete. I had one stuck at over 31% for 8+ hours with no movement and no increase in network traffic (it was showing pretty much idle). I reran the backup and it completed fine (again, random).

Without fleecing, CPU and RAM usage on the VM would peak (sometimes both, sometimes only memory, sometimes only CPU), which would require a SIGTERM to kill it and bring it back.

On the PBS side, CPU usage remains low; however, system memory will spike to over 80%, which is new as of the 4.x version. The system has 256GB of RAM, so there is typically plenty of headroom. In htop, disk writes are well within tolerance, and when the VM stops backing up, the traffic/I/O drops accordingly.
 
Hello,

We are experiencing the same problems using LVM over iSCSI storage and PBS backups. AFAICT this started after updating on 2025-11-18:

Code:
Start-Date: 2025-11-18  10:02:05
Commandline: /usr/bin/apt-get dist-upgrade
Install: netavark:amd64 (1.14.0-2, automatic), containernetworking-plugins:amd64 (1.1.1+ds1-3+b17, automatic), golang-github-containers-image:amd64 (5.34.2-1, automatic), proxmox-kernel-6.17.2-1-pve-signed:amd64 (6.17.2-1, automatic), python3-cffi-backend:amd64 (1.17.1-3, automatic), python3-importlib-resources:amd64 (6.5.2-1, automatic), python3-pefile:amd64 (2024.8.26-2.1, automatic), proxmox-kernel-6.17:amd64 (6.17.2-1, automatic), aardvark-dns:amd64 (1.14.0-3, automatic), skopeo:amd64 (1.18.0+ds1-1+b5, automatic), python3-bcrypt:amd64 (4.2.0-2.1+b1, automatic), python3-virt-firmware:amd64 (24.11-2, automatic), golang-github-containers-common:amd64 (0.62.2+ds1-2, automatic), python3-cryptography:amd64 (43.0.0-3, automatic)
Upgrade: pve-docs:amd64 (9.0.8, 9.0.9), pve-edk2-firmware-ovmf:amd64 (4.2025.02-4, 4.2025.05-2), console-setup:amd64 (1.240, 1.242~deb13u1), udev:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), proxmox-default-kernel:amd64 (2.0.0, 2.0.1), proxmox-widget-toolkit:amd64 (5.0.6, 5.1.2), libpve-rs-perl:amd64 (0.10.10, 0.11.3), rrdcached:amd64 (1.7.2-4.2+pve3, 1.7.2-4.2+pve4), libldb2:amd64 (2:2.11.0+samba4.22.4+dfsg-1~deb13u1, 2:2.11.0+samba4.22.6+dfsg-0+deb13u1), libpam-systemd:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), pve-qemu-kvm:amd64 (10.1.2-1, 10.1.2-3), libpve-cluster-api-perl:amd64 (9.0.6, 9.0.7), pve-edk2-firmware-legacy:amd64 (4.2025.02-4, 4.2025.05-2), pve-ha-manager:amd64 (5.0.5, 5.0.8), libpve-apiclient-perl:amd64 (3.4.0, 3.4.2), swtpm-libs:amd64 (0.8.0+pve2, 0.8.0+pve3), swtpm-tools:amd64 (0.8.0+pve2, 0.8.0+pve3), librrds-perl:amd64 (1.7.2-4.2+pve3, 1.7.2-4.2+pve4), libpve-storage-perl:amd64 (9.0.14, 9.0.18), openssl-provider-legacy:amd64 (3.5.1-1+deb13u1, 3.5.4-1~deb13u1), libsystemd0:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), libnss-systemd:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), libwbclient0:amd64 (2:4.22.4+dfsg-1~deb13u1, 2:4.22.6+dfsg-0+deb13u1), swtpm:amd64 (0.8.0+pve2, 0.8.0+pve3), libxml2:amd64 (2.12.7+dfsg+really2.9.14-2.1+deb13u1, 2.12.7+dfsg+really2.9.14-2.1+deb13u2), pve-yew-mobile-i18n:amd64 (3.6.1, 3.6.2), pve-cluster:amd64 (9.0.6, 9.0.7), systemd:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), libudev1:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), console-setup-linux:amd64 (1.240, 1.242~deb13u1), libcurl3t64-gnutls:amd64 (8.14.1-2, 8.14.1-2+deb13u2), lxc-pve:amd64 (6.0.5-1, 6.0.5-3), systemd-cryptsetup:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), systemd-boot-efi:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), proxmox-backup-file-restore:amd64 (4.0.16-1, 4.0.20-1), systemd-boot-tools:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), virtiofsd:amd64 (1.13.2-1, 1.13.2-1+deb13u1), pve-xtermjs:amd64 (5.5.0-2, 5.5.0-3), qemu-server:amd64 (9.0.23, 9.0.29), libpve-access-control:amd64 (9.0.3, 9.0.4), libsmbclient0:amd64 (2:4.22.4+dfsg-1~deb13u1, 2:4.22.6+dfsg-0+deb13u1), pve-container:amd64 (6.0.13, 6.0.17), pve-i18n:amd64 (3.6.1, 3.6.2), base-files:amd64 (13.8+deb13u1, 13.8+deb13u2), libtdb1:amd64 (2:1.4.13+samba4.22.4+dfsg-1~deb13u1, 2:1.4.13+samba4.22.6+dfsg-0+deb13u1), monitoring-plugins-basic:amd64 (2.4.0-3, 2.4.0-3+deb13u1), libcurl4t64:amd64 (8.14.1-2, 8.14.1-2+deb13u2), proxmox-backup-client:amd64 (4.0.19-1, 4.0.20-1), libpve-network-api-perl:amd64 (1.1.8, 1.2.1), distro-info-data:amd64 (0.66, 0.66+deb13u1), libtevent0t64:amd64 (2:0.16.2+samba4.22.4+dfsg-1~deb13u1, 2:0.16.2+samba4.22.6+dfsg-0+deb13u1), smbclient:amd64 (2:4.22.4+dfsg-1~deb13u1, 2:4.22.6+dfsg-0+deb13u1), libwtmpdb0:amd64 (0.73.0-3, 0.73.0-3+deb13u1), proxmox-firewall:amd64 (1.2.0, 1.2.1), pve-manager:amd64 (9.0.11, 9.0.17), libpve-common-perl:amd64 (9.0.13, 9.0.15), libpve-network-perl:amd64 (1.1.8, 1.2.1), samba-libs:amd64 (2:4.22.4+dfsg-1~deb13u1, 2:4.22.6+dfsg-0+deb13u1), libsystemd-shared:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), libpve-notify-perl:amd64 (9.0.6, 9.0.7), monitoring-plugins-common:amd64 (2.4.0-3, 2.4.0-3+deb13u1), keyboard-configuration:amd64 (1.240, 1.242~deb13u1), libssl3t64:amd64 (3.5.1-1+deb13u1, 3.5.4-1~deb13u1), systemd-sysv:amd64 (257.8-1~deb13u2, 257.9-1~deb13u1), samba-common:amd64 (2:4.22.4+dfsg-1~deb13u1, 2:4.22.6+dfsg-0+deb13u1), librrd8t64:amd64 (1.7.2-4.2+pve3, 1.7.2-4.2+pve4), pve-yew-mobile-gui:amd64 (0.6.2, 0.6.3), curl:amd64 (8.14.1-2, 8.14.1-2+deb13u2), pve-firewall:amd64 (6.0.3, 6.0.4), 
proxmox-backup-docs:amd64 (4.0.19-1, 4.0.20-1), libtalloc2:amd64 (2:2.4.3+samba4.22.4+dfsg-1~deb13u1, 2:2.4.3+samba4.22.6+dfsg-0+deb13u1), proxmox-backup-server:amd64 (4.0.19-1, 4.0.20-1), pve-edk2-firmware:amd64 (4.2025.02-4, 4.2025.05-2), postfix:amd64 (3.10.4-1~deb13u1, 3.10.5-1~deb13u1), openssl:amd64 (3.5.1-1+deb13u1, 3.5.4-1~deb13u1), proxmox-termproxy:amd64 (2.0.2, 2.0.3), libpve-cluster-perl:amd64 (9.0.6, 9.0.7)
End-Date: 2025-11-18  10:04:10

The VM hangs/crashes can be mitigated by enabling fleecing, but as JCNED wrote, backups still hang randomly most of the time. In our tests it doesn't matter whether snapshot or stop mode is used. At one point, iSCSI read speeds will drop to around 100 KB/s (this figure includes the "normal" iSCSI read load from multiple VMs, so the backup's share is probably much lower than that). The backup will still make progress, but even a 1 GB dirty bitmap takes ages to back up.
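One way to watch whether the slowdown is on the iSCSI path itself is extended I/O statistics on the PVE node (a sketch; iostat is part of the sysstat package, and which device to watch is setup-specific):

Code:
apt install sysstat   # if not already installed
iostat -xm 2          # watch throughput, await and %util of the iSCSI/multipath device at 2s intervals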

Here's a backup job from last night that I killed this morning:

Code:
INFO: Backup started at 2025-11-24 21:00:05
INFO: status = running
INFO: VM Name: redacted
INFO: include disk 'scsi0' 'iscsi-lvm:vm-100-disk-0' 20G
INFO: include disk 'scsi1' 'iscsi-lvm:vm-100-disk-1' 300G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/100/2025-11-24T20:00:05Z'
INFO: drive-scsi0: attaching fleecing image local-zfs-pool:vm-100-fleece-0 to QEMU
INFO: drive-scsi1: attaching fleecing image local-zfs-pool:vm-100-fleece-1 to QEMU
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '382ce8cd-3bd1-4a0a-8af3-4e99a79a9def'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: OK (3.0 GiB of 20.0 GiB dirty)
INFO: scsi1: dirty-bitmap status: OK (14.8 GiB of 300.0 GiB dirty)
INFO: using fast incremental mode (dirty-bitmap), 17.8 GiB dirty of 320.0 GiB total
INFO:   4% (748.0 MiB of 17.8 GiB) in 3s, read: 249.3 MiB/s, write: 249.3 MiB/s
INFO:   7% (1.4 GiB of 17.8 GiB) in 6s, read: 220.0 MiB/s, write: 220.0 MiB/s
INFO:  11% (2.0 GiB of 17.8 GiB) in 9s, read: 225.3 MiB/s, write: 225.3 MiB/s
INFO:  15% (2.7 GiB of 17.8 GiB) in 12s, read: 217.3 MiB/s, write: 217.3 MiB/s
INFO:  18% (3.3 GiB of 17.8 GiB) in 15s, read: 222.7 MiB/s, write: 222.7 MiB/s
INFO:  22% (4.0 GiB of 17.8 GiB) in 18s, read: 245.3 MiB/s, write: 245.3 MiB/s
INFO:  26% (4.7 GiB of 17.8 GiB) in 21s, read: 233.3 MiB/s, write: 233.3 MiB/s
INFO:  27% (4.9 GiB of 17.8 GiB) in 3h 35m 30s, read: 13.3 KiB/s, write: 13.3 KiB/s
INFO:  28% (5.0 GiB of 17.8 GiB) in 4h 25m 54s, read: 43.3 KiB/s, write: 43.3 KiB/s
INFO:  29% (5.2 GiB of 17.8 GiB) in 4h 25m 57s, read: 72.0 MiB/s, write: 72.0 MiB/s
INFO:  30% (5.3 GiB of 17.8 GiB) in 7h 3m 45s, read: 12.1 KiB/s, write: 12.1 KiB/s
INFO:  31% (5.6 GiB of 17.8 GiB) in 7h 28m 52s, read: 184.8 KiB/s, write: 184.8 KiB/s
INFO:  32% (5.7 GiB of 17.8 GiB) in 7h 28m 55s, read: 44.0 MiB/s, write: 44.0 MiB/s
INFO:  33% (5.9 GiB of 17.8 GiB) in 8h 6m 47s, read: 73.9 KiB/s, write: 73.9 KiB/s
INFO:  34% (6.2 GiB of 17.8 GiB) in 8h 35m 27s, read: 161.9 KiB/s, write: 161.9 KiB/s
INFO:  35% (6.4 GiB of 17.8 GiB) in 10h 2m 13s, read: 46.4 KiB/s, write: 46.4 KiB/s
INFO:  36% (6.6 GiB of 17.8 GiB) in 10h 59m 40s, read: 55.8 KiB/s, write: 55.8 KiB/s
INFO:  38% (6.8 GiB of 17.8 GiB) in 10h 59m 43s, read: 74.7 MiB/s, write: 74.7 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
INFO: resuming VM again
INFO: removing (old) fleecing image 'local-zfs-pool:vm-100-fleece-0'
INFO: removing (old) fleecing image 'local-zfs-pool:vm-100-fleece-1'
ERROR: Backup of VM 100 failed - interrupted by signal
INFO: Failed at 2025-11-25 08:15:46
 
At one point, iSCSI read speeds will drop to around 100 KB/s (this figure includes the "normal" iSCSI read load from multiple VMs, so the backup's share is probably much lower than that).
Are other VMs bottlenecked at the same time or would they still read at higher speeds during this time?


For all, please run apt install pve-qemu-kvm-dbgsym gdb libproxmox-backup-qemu0-dbgsym to install the debug symbols for the backup library and then, while the backup is slowed down:
Code:
qm config 123
qm status 123 --verbose
gdb --batch --ex 't a a bt' -p $(cat /run/qemu-server/123.pid)
timeout 10 strace -c -p $(cat /run/qemu-server/123.pid)
gdb --batch --ex 't a a bt' -p $(cat /run/qemu-server/123.pid)
replacing 123 with the ID of a slowed-down VM. The strace command is expected to take 10 seconds. The second GDB command is not a mistake; I'd like to see how the backtrace changes. We have only got one such log so far (from @JCNED, and that was after the backup was already aborted, IIUC).

Are there any messages in the system logs of Proxmox VE or PBS around the time of the issues?
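To check the logs, something along these lines on both the PVE node and the PBS narrows the window (timestamps are just examples):

Code:
journalctl --since "2025-11-24 21:00" --until "2025-11-25 08:30" -p warning   # warnings and errors in that window
journalctl -f                                                                 # or follow live while a backup is running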
 
Sounds good. I'll kick it off
Hi Fiona,

See attached for 3 VMs. Nothing odd in the system logs for either PVE or PBS.
 

Attachments

Hello everyone,
here we have two PVE 8.4.14 clusters with 4 nodes and Ceph. After the PBS 4.0.22 upgrade and reboot, random backups fail and freeze the VM! Using the old PBS (latest 3.x version), everything works fine, so I think the problem lies in the latest PBS upgrade.

Thanks in advance for the upcoming fix.