Hello,
For a few weeks now, some of our VMs (5 out of 250) have had problems during backup to PBS.
The problem follows the VMs to whichever cluster node we move them to.

What we can see, but can't fully explain, is that these VMs get stuck at a certain point of the backup (always the same point). Both the backup and the VM itself hang: the VM shows high CPU and RAM usage and is no longer accessible via the Proxmox interface.
If we move the disks to another Ceph storage (slower, because it is HDD instead of SSD), the backup works and the VM keeps running normally after the backup.
If we move the disks back, the problem occurs again.
The main storage (SSD) works fine with all other VMs. Even restoring an affected VM from a backup taken months ago, before the problem first occurred, still shows the problem.
Moving the disks to a SAN and checking them with qemu-img check shows no errors. Checking the disks with chkdsk shows no errors either.
All affected VMs are Windows machines running SQL Server (but we have many more of those without this problem).
If we shut the VMs down and run a "Stop" mode backup, it works; only the "Snapshot" mode backup fails.
It seems so mysterious... Any ideas?
PS: this is a typical CPU/RAM graph:

Matteo