New PBS installation blocked IO on all cluster VMs indefinitely

bobf

New Member
Jan 26, 2026
I installed a fresh version of Proxmox Backup Server and added it to my 7 node ceph backed production cluster. I set up a test backup job to backup a set of test VMs.

When I ran the backup job, the backups ran for a few GB and then appeared to stall. While stalled, all the VMs in the backup job had their IO blocked. If I left the job running in this state, the VM OS would eventually give up and mark all its drives as failed/read-only/etc.

Stopping the backup job restored access to their drives. I tried setting up fleecing, with bandwidth limits, without bandwidth limits, etc. I tried jobs with multiple VMs and jobs with a single VM. No difference.

The storage and backup networks are both running over bonded 10Gbps links.

Eventually I figured out I had installed package updates on the PBS and hadn't rebooted it before running the backup job. Once I rebooted the PBS, the backups ran normally and did not cause problems.

My concern is that we're planning to deploy this PBS to back up every VM on the cluster. This is a production cluster supporting customers, and had I not tested first and instead run a backup against all the VMs on the cluster, it would have been... bad.

I have significant reservations about deploying a backup solution that has a possible failure mode of locking every VM on the cluster. We have other clusters with a PBS backing them and those didn't exhibit this behavior even when I tried to replicate it.

I understand that not rebooting the PBS after running an update was an error on my part, but I'm still stunned that an apt upgrade on a single server can effectively crash every VM on the entire cluster.

I installed PBS from the 4.1 ISO available from the site and ran an update against the no-sub repos.

At this point the server is working as expected. I just wanted to report the behavior as this is a pretty bad failure mode.

Thanks!
 
Hi,
there was unfortunately a kernel bug in kernel 6.17.2 which could lead to slow or even stalled TCP connections during backups. Newer kernel builds are not affected, which is why it worked after you upgraded and rebooted the system. This thread [0] contains the debugging efforts which finally led us to revert the problematic commits for newer kernels. The I/O interdependence between the VMs and the PBS upload is, however, more fundamental; you must use backup fleecing [1] to decouple them.

[0] https://forum.proxmox.com/threads/s...r-updated-to-pve-9-1-1-and-pbs-4-0-20.176444/
[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_vm_backup_fleecing
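As a rough sketch, fleecing can be enabled for a one-off run on the vzdump command line, or persisted as a node-wide default; the storage IDs `pbs-store` and `local-zfs` below are placeholders, not names from this thread:

```shell
# One-off backup of VM 100 with fleecing enabled; fleecing images are
# allocated on the (assumed) local storage "local-zfs" so slow uploads
# to the backup target don't block guest writes.
vzdump 100 --storage pbs-store --fleecing enabled=1,storage=local-zfs

# Or persist it as a node-wide default in /etc/vzdump.conf:
# fleecing: enabled=1,storage=local-zfs
```

Backup jobs configured in the GUI expose the same option under Advanced > Fleecing.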
 
Thanks for the quick reply! I was hoping this was a known and fixed issue, and the fact that it was a specific kernel revision pretty much explains why I couldn't replicate the problem with my other PBS instances, since they were already on different kernel versions.

I did try fleecing while testing this, which, given the kernel bug, obviously made no difference. I'm planning on implementing fleecing on all the jobs, but we have some VMs that are far larger than the local storage available on the cluster nodes. The nodes are all deployed on ZFS, so I'm going to put some guard rails in place on maximum dataset size and test to ensure fleecing is thin-provisioning as expected before I go all in on it.
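For what it's worth, the kind of guard rails described above might look like the following on ZFS; the dataset names (`rpool/data`, `rpool/data/fleecing`) and the 200G cap are assumptions for illustration:

```shell
# Cap how large a (hypothetical) dedicated fleecing dataset can grow,
# so runaway fleecing images can't fill the pool:
zfs set quota=200G rpool/data/fleecing

# Spot-check that zvols are thin-provisioned: a sparse zvol reports
# refreservation=none, while a thick one reserves its full volsize.
zfs get -r -t volume refreservation rpool/data
```

On the PVE side, a `zfspool` storage only creates sparse zvols when `sparse 1` is set in its /etc/pve/storage.cfg entry, so that's worth checking as part of the same test.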

I'll push forward and hope for the best.