I installed a fresh Proxmox Backup Server and added it to my 7-node, Ceph-backed production cluster. I set up a test backup job covering a set of test VMs.
When I ran the job, the backups proceeded for a few GB and then appeared to stall. While stalled, every VM in the backup job had its IO blocked. If I left the job running in this state, the guest OS would eventually give up and mark all of its drives as failed/read-only/etc.
Stopping the backup job restored access to the drives. I tried enabling fleecing, with and without bandwidth limits, and jobs with multiple VMs as well as with a single VM. No difference.
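For reference, this is roughly how I was exercising those variations from the CLI. The storage name and VM ID are placeholders, and the fleecing syntax is from my reading of the vzdump man page, so check it against your PVE version:

```sh
# Hypothetical test invocation (VM ID 101 and storage "pbs-test" are placeholders).
# Fleecing writes dirty blocks to a local scratch storage so a slow or stalled
# backup target shouldn't block guest IO.
vzdump 101 --storage pbs-test --fleecing enabled=1,storage=local-lvm

# Same test with a bandwidth cap (KiB/s) instead of fleecing:
vzdump 101 --storage pbs-test --bwlimit 102400
```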
The storage and backup networks both run over bonded 10Gbps links.
Eventually I realized I had installed package updates on the PBS but hadn't rebooted it before running the backup job. Once I rebooted the PBS, the backups ran normally and caused no problems.
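In case it helps anyone else, here's the sanity check I've since added to my runbook. A minimal sketch, assuming a Debian-based PBS host where each installed kernel gets a directory under /lib/modules:

```sh
#!/bin/sh
# Minimal sketch: flag a pending reboot by comparing the running kernel
# to the newest installed one (assumes /lib/modules holds one directory
# per installed kernel, as on Debian-based PBS hosts).
running=$(uname -r)
newest=$(ls -1 /lib/modules | sort -V | tail -n 1)
if [ "$running" != "$newest" ]; then
    echo "Reboot pending: running $running, newest installed is $newest"
fi
```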
My concern is that we're planning to deploy this PBS to back up every VM on the cluster. This is a production cluster supporting customers, and had I not tested first but instead run a backup against all the VMs on the cluster, it would have been... bad.
I have significant reservations about deploying a backup solution with a possible failure mode of locking up every VM on the cluster. We have other clusters backed by a PBS, and those didn't exhibit this behavior even when I tried to reproduce it.
I understand that not rebooting the PBS after running an update was an error on my part, but I'm still stunned that an apt upgrade on a single server can effectively crash every VM on the entire cluster.
I installed PBS from the 4.1 ISO available from the site and ran an update against the no-subscription repos.
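For completeness, this is the repository line I had enabled, in the classic one-line format (a PBS 4.x install may use the deb822 .sources format instead, and the "trixie" suite name is my assumption for a Debian 13 based PBS):

```sh
# /etc/apt/sources.list.d/pbs-no-subscription.list
# (suite name "trixie" assumed for a Debian 13 based PBS 4.x install)
deb http://download.proxmox.com/debian/pbs trixie pbs-no-subscription
```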
At this point the server is working as expected. I just wanted to report the behavior as this is a pretty bad failure mode.
Thanks!