Backup Slowdown and Windows VM Corruption Issues after Recent Update


Aug 21, 2023
Hello everyone,

I've been running a Proxmox VE (PVE) server and a Proxmox Backup Server (PBS) on high-scale machines in OVH's data center flawlessly for the past two years. However, after upgrading both products to their latest versions, I began encountering some peculiar issues with my backups. The first anomaly I noticed was that backup speeds for some VMs dropped at random, from 100-200 MB/s to a mere 100-500 KB/s. Because of this slowdown, some backups were still running the next morning, and I had to terminate them before they finished.
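To put the slowdown in perspective, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that a 65 GiB disk has to be read in full at a constant rate; real backup jobs may transfer far less data, so treat the numbers as an order-of-magnitude estimate only. The function name and the chosen rates are mine, not anything from Proxmox.

```python
def transfer_hours(size_gib, rate_bytes_per_s):
    """Hours needed to move size_gib GiB at a constant byte rate."""
    return size_gib * 1024**3 / rate_bytes_per_s / 3600


# Illustrative assumption: one 65 GiB disk, read completely.
DISK_GIB = 65

# Rates taken from the range described above (decimal MB/KB).
for label, rate in [("100 MB/s", 100e6), ("500 KB/s", 500e3), ("100 KB/s", 100e3)]:
    print(f"{label}: {transfer_hours(DISK_GIB, rate):.1f} h")
```

Under these assumptions the same disk goes from well under an hour at normal speed to more than a day at the degraded speed, which fits with jobs not finishing overnight.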

Subsequently, I discovered that several of my Windows VMs failed to boot up. When I connected to their consoles, I was greeted by a Windows recovery screen prompting language selection. No matter what I tried, I was unable to recover from this screen.

Upon further investigation, an additional 4-5 Windows VMs displayed similar boot issues. I tried accessing the affected Windows VMs offline using bootable Windows recovery tools. Running chkdsk revealed a large number of errors, and attempting to fix them still left me with data loss. Ultimately, I couldn't boot or recover the corrupted machines, so I had no choice but to restore the latest backup, which again meant losing some data. I've now reluctantly disabled backups, because I no longer trust them: the risk of corruption means potentially losing a full day's work.

I've come across some older posts describing similar issues but haven't found a definitive solution. I'm uncertain whether this issue is related to the recent update or if there's an error in my configuration.

I appreciate any insights or assistance.
Can you post the config of a VM?
qm config <vmid> --current

Did you reboot those VMs regularly prior to those issues, or did they have a relatively long uptime before?
These are machines that restart regularly, for reasons such as Windows updates.

The most recently crashed Windows VM has the following configuration.

agent: 1,fstrim_cloned_disks=1
balloon: 8192
boot: order=scsi0
cores: 8
machine: pc-i440fx-5.2
memory: 12288
name: VTS11039.XXX.XXX
net0: virtio=FE:63:2F:8A:EC:85,bridge=vmbr1,firewall=1
numa: 0
onboot: 1
ostype: win10
scsi0: DATA:vm-11039-disk-0,discard=on,size=65G
scsihw: virtio-scsi-pci
smbios1: uuid=2fab6ecd-5358-4a99-a300-7456dafe8ca3
sockets: 1
vmgenid: 9cd6d419-3974-431f-b2ec-206c387e040b
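For anyone comparing many VM configs like the one above, here is a minimal Python sketch that turns the `key: value` lines into a dictionary and splits a drive entry into its volume and options. The function names are hypothetical helpers, not part of any Proxmox tooling, and the sample is just an excerpt of the config posted above.

```python
def parse_qm_config(text):
    """Parse 'key: value' lines (as printed by `qm config`) into a dict."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only; values may contain colons too.
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip()
    return config


def parse_drive(value):
    """Split a drive entry into (volume, options dict)."""
    volume, *parts = value.split(",")
    options = dict(part.split("=", 1) for part in parts if "=" in part)
    return volume, options


SAMPLE = """\
boot: order=scsi0
machine: pc-i440fx-5.2
ostype: win10
scsi0: DATA:vm-11039-disk-0,discard=on,size=65G
scsihw: virtio-scsi-pci
"""

config = parse_qm_config(SAMPLE)
volume, options = parse_drive(config["scsi0"])
print(config["scsihw"], config["machine"])
print(volume, options)
```

With something like this it becomes easy to check whether the affected VMs share settings such as the SCSI controller, the machine type, or `discard=on`.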
Is this bug still an issue? Can you please point out which Proxmox versions are affected, and describe the exact circumstances under which it can occur?
The storage type is ZFS. Almost all of the roughly 70 servers are set up with SCSI; a few may be using IDE, but I haven't had any issues with those. One thing I am sure of: when I brought the crashed VMs up offline and ran a disk check, it found a considerable number of corrupted files, marked some of them as CHK, and left them that way, which prevented Windows from booting. Interestingly, when I examined the partition table of the most recently crashed VM, there was an additional small 2 MB partition at the end of the disk.
I switched my backup method from snapshot mode to shutting the VM down for the backup, and surprisingly, I haven't experienced any machine crashes in the past two weeks.

I didn't originally make this change as a deliberate reliability test, but the results have been remarkable. All of my virtual machines have been running smoothly without any interruptions. I plan to continue using this approach for a while longer to confirm that it holds up.

After monitoring stability for a bit longer, I will switch the backups back to snapshot mode to see whether any Windows VMs crash again.

I will make sure to provide an update once I've gathered more data. If anyone else has tried a similar approach or has insights to share, I'd be interested to hear about it.


