Super slow backups, timeouts, and VM stuck while backing up after updating to PVE 9.1.1 and PBS 4.0.20

I don't want to celebrate too soon, but after two days with PBS running kernel 6.17.4-2-pve, all the scheduled backups have worked smoothly.
In fact, it seems even faster to me. Let's keep our fingers crossed for the next few days...
In the meantime, Merry Christmas to all the Proxmox staff and to all of you!
 
We finally see the light! :cool: Merry Xmas to everybody.
 
Are these upgrades mandatory in order to retain paid support, or are they optional? Stuff like this makes me really hesitate to commit to a product. I'm very much of the old-school mindset that 'if it ain't broke, don't fix it', and surely major security issues would be stopped by a correct firewall setup?
 
Kernel 6.17.4-2-pve is now available in the no-subscription repositories, so I updated, unpinned the old 6.14.11 kernel, and rebooted all nodes (the rough command sequence is sketched after the output below). The first backup job went fine; I will report back on Friday if any issues arise.
Until then, a big thank you to the Proxmox developers for their hard work, and a good new year to all of you!
Code:
# uname -a
Linux pve2 6.17.4-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.17.4-2 (2025-12-19T07:49Z) x86_64 GNU/Linux

# proxmox-boot-tool kernel list
Manually selected kernels:
None.

Automatically selected kernels:
6.14.11-5-pve
6.17.4-1-pve
6.17.4-2-pve

# pveversion
pve-manager/9.1.4/5ac30304265fbd8e (running kernel: 6.17.4-2-pve)

# proxmox-backup-manager versions
proxmox-backup-server 4.1.1-1 running version: 4.1.1
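For anyone doing the same, the rough sequence on each node was something like this (a sketch from memory; double-check the kernel/package versions for your own setup):
Code:
apt update && apt dist-upgrade          # pulls in the 6.17.4-2-pve kernel
proxmox-boot-tool kernel unpin          # drop the pin that held the nodes on 6.14.11
proxmox-boot-tool kernel list           # verify which kernels will be offered at boot
reboot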
 
Edit: I just deleted this whole wall of text from yesterday because it seems unrelated now.

AFAICT, with the latest kernel 6.17.4-2-pve and PBS 4.1.1, everything is working fine.

Something to be aware of: we use a PBS pull sync job between PBS nodes. The first sync after upgrading took far longer than usual (1 hour instead of 2 minutes), which sent me down the path of downgrading both the kernel and the PBS packages again because I thought the sync job was hanging. It just needed a bit of patience: after this initial long sync, all subsequent sync jobs finished within minutes.
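If someone else hits this: before downgrading anything, it is worth checking on the PBS side whether the sync task is still making progress, roughly like this (the UPID is a placeholder copied from the task list):
Code:
proxmox-backup-manager sync-job list    # configured sync jobs
proxmox-backup-manager task list        # recent/running tasks
proxmox-backup-manager task log <UPID>  # print the sync task's log to see if it is still moving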
 
Hi everyone!

I created an account so I could post that I've faced a similar problem and could not figure out what was going on until I found this topic. Yesterday I reinstalled/recreated my PBS 3.x as the newest 4.x: a simple default install with an empty datastore running on a second VM disk formatted as XFS.

Facts:
Dell R640, datacenter SSDs, hardware RAID.

No-subscription Proxmox:
PVE 8.x with all the latest updates
PBS 4.x with all the latest updates

Currently the only problematic VM is a Win2k22 guest: VirtIO SCSI single with 2 disks (800 GB and 1 TB), latest VirtIO drivers, discard on, iothread on. However, I believe it could be any other VM as well; this Windows VM is simply the only one I have with such big disks.
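For context, the relevant part of the VM config looks roughly like this (storage name and exact disk sizes here are only illustrative):
Code:
# /etc/pve/qemu-server/108.conf (excerpt)
agent: 1
scsihw: virtio-scsi-single
scsi0: local-lvm:vm-108-disk-0,discard=on,iothread=1,size=800G
scsi1: local-lvm:vm-108-disk-1,discard=on,iothread=1,size=1000G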

So, after reinstalling a fresh PBS, on the first run my Win2k22 backup got stuck at 34%, and Zabbix started to panic that my VM was offline.

The first thing I noticed was that the backup was still running. I tried to stop the task; it somewhat stopped, but the VM stayed locked.

Task log:
Code:
....
INFO:  31% (567.1 GiB of 1.8 TiB) in 1h 17m 26s, read: 852.3 MiB/s, write: 79.8 MiB/s
INFO:  32% (583.7 GiB of 1.8 TiB) in 1h 17m 50s, read: 706.3 MiB/s, write: 82.7 MiB/s
INFO:  33% (603.1 GiB of 1.8 TiB) in 1h 19m 8s, read: 255.4 MiB/s, write: 93.9 MiB/s
INFO:  34% (621.4 GiB of 1.8 TiB) in 1h 19m 35s, read: 693.0 MiB/s, write: 87.3 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job

PVE syslog:
Code:
Dec 31 20:06:40 pve-1 pvedaemon[1635155]: VM 108 qmp command failed - VM 108 qmp command 'query-backup' failed - got timeout
Dec 31 20:08:54 pve-1 pvedaemon[1520992]: <root@pam> successful auth for user 'root@pam'
Dec 31 20:09:05 pve-1 pvedaemon[1520991]: VM 108 qmp command failed - VM 108 qmp command 'guest-ping' failed - got timeout
Dec 31 20:09:29 pve-1 pvedaemon[1520991]: VM 108 qmp command failed - VM 108 qmp command 'guest-ping' failed - got timeout
Dec 31 20:09:48 pve-1 pvedaemon[1520993]: VM 108 qmp command failed - VM 108 qmp command 'guest-ping' failed - unable to connect to VM 108 qga socket - timeout after 31 retries
Dec 31 20:10:10 pve-1 pvedaemon[1520991]: VM 108 qmp command failed - VM 108 qmp command 'guest-ping' failed - unable to connect to VM 108 qga socket - timeout after 31 retries
Dec 31 20:10:51 pve-1 pvedaemon[1520991]: VM 108 qmp command failed - VM 108 qmp command 'guest-ping' failed - unable to connect to VM 108 qga socket - timeout after 31 retries
Dec 31 20:11:02 pve-1 pvedaemon[1520992]: <root@pam> starting task UPID:pve-1:001A2EAF:01A98813:69556736:vncproxy:108:root@pam:
Dec 31 20:11:02 pve-1 pvedaemon[1715887]: starting vnc proxy UPID:pve-1:001A2EAF:01A98813:69556736:vncproxy:108:root@pam:
Dec 31 20:11:03 pve-1 pvedaemon[1715890]: starting vnc proxy UPID:pve-1:001A2EB2:01A9889B:69556737:vncproxy:108:root@pam:
Dec 31 20:11:03 pve-1 pvedaemon[1520991]: <root@pam> starting task UPID:pve-1:001A2EB2:01A9889B:69556737:vncproxy:108:root@pam:
Dec 31 20:11:08 pve-1 qm[1715889]: VM 108 qmp command failed - VM 108 qmp command 'set_password' failed - unable to connect to VM 108 qmp socket - timeout after 51 retries
Dec 31 20:11:08 pve-1 pvedaemon[1715887]: Failed to run vncproxy.
Dec 31 20:11:08 pve-1 pvedaemon[1520992]: <root@pam> end task UPID:pve-1:001A2EAF:01A98813:69556736:vncproxy:108:root@pam: Failed to run vncproxy.
Dec 31 20:11:09 pve-1 qm[1715892]: VM 108 qmp command failed - VM 108 qmp command 'set_password' failed - unable to connect to VM 108 qmp socket - timeout after 51 retries
Dec 31 20:11:09 pve-1 pvedaemon[1715890]: Failed to run vncproxy.
Dec 31 20:11:09 pve-1 pvedaemon[1520991]: <root@pam> end task UPID:pve-1:001A2EB2:01A9889B:69556737:vncproxy:108:root@pam: Failed to run vncproxy.
Dec 31 20:11:13 pve-1 pvedaemon[1520993]: VM 108 qmp command failed - VM 108 qmp command 'guest-ping' failed - unable to connect to VM 108 qga socket - timeout after 31 retries
Dec 31 20:11:23 pve-1 pveproxy[1641711]: worker exit
Dec 31 20:11:23 pve-1 pveproxy[1488]: worker 1641711 finished
Dec 31 20:11:23 pve-1 pveproxy[1488]: starting 1 worker(s)
Dec 31 20:11:23 pve-1 pveproxy[1488]: worker 1716020 started
Dec 31 20:11:25 pve-1 pvedaemon[1635155]: VM 108 qmp command failed - VM 108 qmp command 'backup-cancel' failed - interrupted by signal
Dec 31 20:11:30 pve-1 pvedaemon[1520992]: <root@pam> end task UPID:pve-1:0018F353:0196BA03:69553712:vzdump::root@pam: unexpected status

At this point no more active tasks were running on PVE; the VM was locked but still on/running, only not responsive.

While trying to figure out what was happening, I noticed that on PBS the backup task was still running (even though I had stopped it from PVE). Well, let's give it a try and stop that task as well... and voilà: after a few seconds my VM started to respond again and everything seemed fine from there on. Nothing crashed and everything continued to work, apart from the downtime/freeze.
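In case someone ends up in the same state, the equivalent from the command line would be roughly the following (a sketch; make sure the backup really is dead on both sides first):
Code:
# on the PBS node: stop the stale backup task (UPID taken from the task list or the GUI)
proxmox-backup-manager task stop <UPID>

# on the PVE node: clear a leftover backup lock if the VM stays locked
qm unlock 108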

Windows VM event logs:
Code:
vioscsci: Reset to device, \Device\RaidPort1, was issued.

Kernel PNP: The application \Device\HarddiskVolume3\Program Files\Qemu-ga\qemu-ga.exe with process id 3540 stopped the removal or ejection for the device PCI\VEN_1AF4&DEV_1003&SUBSYS_00031AF4&REV_00\5&2490727a&0&4008F0. Process command line: "C:\Program Files\Qemu-ga\qemu-ga.exe" -d --retry-path

Storahci: Reset to device, \Device\RaidPort0, was issued.

What I tried:
1) chkdsk on all disks - everything fine
2) Updated everything I could to the latest versions (including the VirtIO drivers on Windows)
3) Multiple backup attempts ended the same way - with a freeze at a different percentage each time

In the end I completely shut down the VM and the backup was successful. Currently the VM is back up and running with an incremental backup task (as snapshot) in progress; we'll see how that ends.

Edit: The incremental backup via snapshot finished successfully. However, only a little changed overnight, so not much actually had to be transferred.

Code:
INFO:  98% (1.7 TiB of 1.8 TiB) in 1h 4m 18s, read: 3.0 GiB/s, write: 0 B/s
INFO:  99% (1.8 TiB of 1.8 TiB) in 1h 4m 24s, read: 3.0 GiB/s, write: 0 B/s
INFO: 100% (1.8 TiB of 1.8 TiB) in 1h 4m 31s, read: 2.3 GiB/s, write: 75.4 KiB/s
INFO: backup is sparse: 713.79 GiB (39%) total zero data
INFO: backup was done incrementally, reused 1.78 TiB (99%)
INFO: transferred 1.78 TiB in 3871 seconds (482.5 MiB/s)
INFO: adding notes to backup
INFO: Finished Backup of VM 108 (01:04:38)
INFO: Backup finished at 2026-01-01 11:30:57
INFO: Backup job finished successfully
INFO: skipping disabled target 'mail-to-root'

In my case, I can only reproduce the error/bug on a full backup of *large* disks (1.8 TiB total) while the VM is actually running.
 
Hello,

a few days ago, we upgraded our 5-node cluster (with Ceph 19.2.3) from PVE 8.4 to PVE 9.1.1 and PBS from 3 to 4.1.0.
After these upgrades, we started experiencing the issues described in this thread.
Now, after carefully reading this thread, I understand that installing the 6.17.4-2-pve kernel (or 6.14.11-4-pve) on PBS should resolve the issue.
Given that we have an Enterprise subscription, how can I install the 6.17.4-2-pve kernel (or the 6.14.11-4-pve kernel)?
Do I need to manually add any repositories? If so, which ones? Thank you very much.


 
... Do I need to manually add any repositories? If so, which ones? Thank you very much...
Hello,
If kernel 6.17.4-2-pve is not available in the enterprise repository, I think the way to get it installed is to enable (at least temporarily) the no-subscription repository, at least on PBS.
However, it is best to ask Proxmox official support for confirmation.
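If it helps, the classic one-line form of that repository for PBS 4 (Debian 13 "Trixie") should be roughly the following; please double-check it against the official documentation, as I am writing it from memory:
Code:
# /etc/apt/sources.list.d/pbs-no-subscription.list
deb http://download.proxmox.com/debian/pbs trixie pbs-no-subscription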
As a workaround with the old kernel, one thing that worked for me was to enable "Fleecing" (using local-lvm as the fleecing storage) on the backup job.
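For reference, fleecing can also be passed on the CLI when testing a single run; a rough sketch (VMID, target storage, and fleecing storage are only examples):
Code:
vzdump 108 --mode snapshot --storage pbs-datastore --fleecing enabled=1,storage=local-lvm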