Issues during backups: VMs blocked and corrupted

Switching to VirtIO Block removed the error messages about resets of the SCSI device. The messages about slow IO are still there, but no corruption has occurred since then. I will switch all our machines to virtio-block and re-enable backups. If any corruption occurs, I will report back here.
So my current guess is that it's a problem with the virtio-scsi driver, which may have improved in the last update (no corruption, but we did not test it for very long and still saw error messages about device resets).

CORRECTION:
The error messages also occur with virtio-block (viostor: Reset to device, "\Device\RaidPort1", was issued.). But no corruption yet.
 
OK. I have now installed Proxmox Backup Server locally and changed the backup job for all VMs to back up there. I have had no hiccups since then: no warnings about slow IO, no SCSI resets and no corruption.

My current diagnosis: The VirtIO driver (block AND scsi) can corrupt the filesystem when a backup to a Proxmox Backup Server runs over a slow/unreliable network connection. I recommend backing up to a local PBS instance and then syncing it to a remote one.
 
I have several Proxmox 8 PVE installations, some in the cloud, some local at customer offices, and I use a PBS service offered by tuxis.nl.
Everything was working smoothly until a few days ago, when a Windows 2019 VM stopped working and hung completely.
The previous day I had stopped the PBS backup because it was taking too long (the customer's internet speed is low, around 2-3 Mbit/s) and the machine was stuck. After the backup was interrupted, the machine worked again.
But the day after, following the hang, the Windows machine rebooted into a "repair" loop; it seems the filesystem was completely corrupted.
I tried restoring from PBS, but the backup, or at least the VM image, was also corrupted and showed the same repair loop message.

Of course I can install a local PBS, but it will cost the customer more, and I would have to do the same for all customers = $$$$

I still have some vanilla KVM machines with ZFS replication (I use zrepl) and I never had such problems.

If I have to shut down the machine before making a remote PBS backup, OK, but again, this seems to be a bug, not a feature.

I'll continue to experiment. I hope this problem, if it is one, will be fixed quickly, very quickly, or it's bye-bye PBS remote backup.
 
Hello T. Oster, thank you for the updates and tests you're conducting. I don't believe the issue lies solely with a slow connection, or at least, it can't be the only factor. For example, in my case, the physical servers are hosted on OVH, and the PVE host and PBS are on two different machines in different locations, connected via 1 Gbit fiber. From what I've observed, even during backup issues, the connection remains stable and excellent.

However, I would expect that even if there were occasional slow connection issues with PBS, it should only slow down the backups, which wouldn't be a problem. The issue arises when the VM is either halted or significantly slowed down due to the ongoing backup, which doesn't make sense.
 
Yes, these are the same issues I'm experiencing as well. Unfortunately, when it randomly happens that the backup is extremely slow, it's advisable to let it finish even if the VM is very sluggish or frozen. Stopping it prematurely can damage the VM's file system, as I've unfortunately experienced several times.

In my case, with the latest updates, things have improved a bit regarding backups on external NFS disks; I haven't encountered any issues. Therefore, I assume that some updates have addressed the problem. However, with PBS, occasional issues still occur.
 
Here is another thread with probably the same problem https://forum.proxmox.com/threads/b...corruption-issues-after-recent-update.132468/.

However, in my case a local PBS instance seems to fix the problem, so I still bet the network connection is to blame. But as Bic72 said, this should only slow down the backup process and never touch the VM's filesystem.

I guess Proxmox should also use ZFS snapshots for the backup of VMs, as it does for LXC containers. I know this is a solution which does not work for other filesystems, but the current solution seems to affect the performance of the running VM and put the filesystem integrity in danger, which is IMHO a no-go for a hypervisor/backup system. There should at least be an opt-in solution for using ZFS snapshots instead of whatever KVM solution is currently in use.
 
As far as I know, the Qemu backup process works like this (Proxmox, please correct me if I'm wrong):

I use the word block a lot, with two different meanings. As a verb 'To Block' and as a noun 'a block on a harddisk'. Read carefully :)
1: Backup starts
2: Start at block 0, and sequentially back up each block
3: If the VM wants to write to a block, temporarily block that write
4: Back up the block that the VM wants to write to
5: After the block has been backed up, allow the VM to finish the write
6: Resume the sequential backup

So, if your PBS, or the upload to your PBS, is slow, step 4 will take some time. This might cause a series of drivers/kernelmodules/kernelthreads to get confused and cause issues in your filesystem. That is why @t.oster's problem seems to be fixed by adding a local PBS.
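
The steps above can be sketched as a toy model in Python. This is only an illustration of the copy-before-write idea as described in this post, not QEMU's actual code; the names `BackupJob`, `upload_seconds` and `guest_stall` are made up for the example, and the time per block is an assumed constant.

```python
from dataclasses import dataclass, field


@dataclass
class BackupJob:
    disk: list                 # live disk contents, one entry per block
    upload_seconds: float      # assumed time to send one block to the backup target
    backup: dict = field(default_factory=dict)  # block index -> point-in-time data
    guest_stall: float = 0.0   # total seconds guest writes spent waiting

    def _save(self, index):
        """Copy one block to the backup target unless already saved; return time spent."""
        if index in self.backup:
            return 0.0
        self.backup[index] = self.disk[index]
        return self.upload_seconds

    def guest_write(self, index, data):
        # Steps 3-5: hold the write, back up the old block, then let the write through.
        self.guest_stall += self._save(index)
        self.disk[index] = data

    def run(self, writes_during_backup):
        """Steps 2 and 6: walk the disk sequentially.

        writes_during_backup maps a sequential position to the guest writes
        (index, data) that arrive just before that position is backed up.
        """
        for index in range(len(self.disk)):
            for target, data in writes_during_backup.get(index, []):
                self.guest_write(target, data)
            self._save(index)
        return [self.backup[i] for i in range(len(self.disk))]


job = BackupJob(disk=["a", "b", "c", "d"], upload_seconds=2.0)
# While the job is at block 1, the guest writes to block 3 (not yet backed up)
# and to block 0 (already backed up).
image = job.run({1: [(3, "D"), (0, "A")]})
print(image, job.disk, job.guest_stall)
```

In this model the backup image keeps the state from the moment the backup started, the live disk gets the new data, and only the write to the not-yet-saved block has to wait. That wait is one full upload of the old block, which is why a slow target shows up as slow IO inside the guest.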

Mark
 
Perhaps this is the same thing I am facing with virtio-scsi and virtio-block.
Copying 5 TB inside the VM from one SCSI disk to another leads to a BSOD with the cause: swap page checksum mismatch. Sometimes I get the error "Critical process died", though swapped memory corruption is the most frequent one.
Also, trying to trim the 5 TB drive occupies 80 GB of RAM.

None of these issues exist when I reconnect the target drive to the VM as SATA, though the whole server feels sluggish due to high IO wait when copying to SATA.
 
After the recent updates, I haven't experienced any more issues with corruption or backup failures. No problems on NAS or PBS nor NFS. I want to emphasize that I haven't made any changes to the configuration or drivers, so at the moment, I consider the issue resolved directly by the Proxmox team through their updates. I appreciate the team's efforts!
 
Hi Bic72,

We have the same issue: VMs blocked and data corruption. This is the most CRITICAL situation for an IT department!
I'm thinking of reverting to VMware ESXi (free) and Acronis Cyber Protect Backup.

We are currently on PVE 7.4.13 in the OVH cloud with PBS running locally. You suggest upgrading to the latest version; could you tell me your PVE and PBS versions?

Have you had any further issues?

I also appreciate the team's efforts.
 
Hi Bob67
I have everything updated to the latest version and I haven't had any issues for several months now. The advice I can give you is this: update both PVE and PBS. If you still encounter problems, check that the servers have sufficient resources and aren't too slow during the backup phase. Additionally, backing up over a local network could be very slow and cause problems for you. But the main thing you need to do is update, and that should resolve everything (especially qemu-server).

My version:
PVE: 8.1.4 bare metal OVH
PBS: 3.1-4 bare metal OVH
 
PBS needs to be local to PVE, then synced to a remote location.
Sync doesn't impact the VM, but backing up to a slow/unreliable PBS slows the VM down and can even crash/corrupt it.
 
Hi Bic72
Perfect!!!

Our PVE is also on OVH bare metal (but not upgraded, still 7.4.13), while PBS is on our local network.
We don't have a support subscription; we use this setup just to try it out in some pilot projects in small production environments,
and every upgrade worries us... fingers crossed :)

Thanks for your support.
Long live the free ICT community
 
Hi _gabriel,
great information!
(OVH PVE) <> (OVH PBS1) <rsync> (LOCAL PBS2)

Thanks for your support.
Long live the free ICT community
 
I've just had a similar issue on my cluster. Snapshot backups failed because the PBS datastore was full and the backup could not complete.
It seems to have wiped the partition tables of 3 of my guest VMs, making them unbootable. Unfortunately I was not paying enough attention to the storage volume, as I was not aware that this could happen (ignorance isn't a good defense though).

We are currently running 2 PBS VMs (on another cluster) with NFS shares on 2x Synology 1812 (RAID 6) NAS, over 4x1Gb links. Backup performance was OK (until the datastore was full), however GCs and verifications take over a week, meaning cleanup jobs are always behind the 8-ball when removing unused chunks. The first PBS runs the initial backup; the second syncs to its own NAS and then replicates to S3-compatible services using rclone.

The consensus seems to be that PBS needs to run with local volumes, and it's wise to spend the time fine-tuning retention to ensure the datastore volume doesn't fill up and cause huge IO delays.

For now I have changed some critical backup jobs to use the STOP backup mode and we are looking at buying another server to run PBS locally on a RAID10 for decent IO to then replicate to a NAS.

Is there a good way to convert SATA configured VMs to use VirtIO SCSI?
 
I usually perform updates on Proxmox and also on the PBS server about once a week. After the updates in May, the issues with backups in snapshot mode on PBS have reappeared.

Let me explain better: there are two baremetal servers on OVH, one with Proxmox VE and one with Proxmox Backup Server. When launching backups in snapshot mode on PBS, the virtual machine gets corrupted (for example, a VM with a Zimbra server stops the IMAP service during the backup, and the VM needs to be restarted every time the backup is performed). Similar problems occur with other VMs during backups.

This issue had already occurred in the early months of 2023, but it was resolved with the updates. I did not make any changes. Now, since May, the problem has reappeared.

Any further advice?
 
I don't think the updates were the main help.
IMO, the failures are due to a slow, intermittent destination PBS.
(If you use HDDs and your PBS datastore reaches a significant number of files, it becomes too slow.)
Solutions are:
1/ The new Fleecing option should help.
2/ If the VM can be stopped during backup, schedule a VM shutdown, back up offline, then automatically restart the VM after the backup with a hook script.
3/ Install PBS alongside PVE to do local backups, then configure a "Sync" job on the remote PBS to pull the backups.
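
To give a rough feel for why options 1/ and 3/ should help, here is a back-of-the-envelope sketch in Python. It assumes that a guest write to a not-yet-saved block has to wait until the old block has been copied away, and that with fleecing (or a local PBS) this copy goes to fast local storage instead of over the slow link. The 4 MiB block size, the 2 Mbit/s uplink and the 200 MiB/s local disk are illustrative assumptions, not measured values, and `guest_write_stall` is a made-up helper name.

```python
def guest_write_stall(block_bytes, target_bytes_per_s, fleecing_bytes_per_s=None):
    """Seconds one guest write waits while the old block is copied away.

    Without fleecing the old block goes straight to the backup target;
    with fleecing it is assumed to go to local storage first.
    """
    if fleecing_bytes_per_s is not None:
        rate = fleecing_bytes_per_s
    else:
        rate = target_bytes_per_s
    return block_bytes / rate


BLOCK = 4 * 1024 * 1024          # assumed block size: 4 MiB
SLOW_LINK = 2_000_000 / 8        # assumed uplink: 2 Mbit/s, in bytes per second
LOCAL_DISK = 200 * 1024 * 1024   # assumed local storage: 200 MiB/s

without_fleecing = guest_write_stall(BLOCK, SLOW_LINK)
with_fleecing = guest_write_stall(BLOCK, SLOW_LINK, LOCAL_DISK)
print(f"direct to slow PBS: {without_fleecing:.1f} s per blocked write")
print(f"via local storage:  {with_fleecing:.2f} s per blocked write")
```

Under these assumptions a single blocked write waits for many seconds when the old data has to cross the slow link, but only a few hundredths of a second when it lands on local storage first. The upload to the remote PBS still takes just as long; it simply no longer holds up the guest.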
 
Thanks _gabriel,
I appreciate your advice and I will definitely try the "Fleecing option." However, I find it very strange that the backups have been working for years and then suddenly stop working after updates. I understand that there might be a slowdown, but not with every backup for more than a month. To give you an idea, my servers on OVH are in different locations but connected to each other with 1 Gbit fiber; if that's not enough, I don't know what else is needed :-). Additionally, I also do other backups to a NAS, also on OVH, and that works perfectly even though it is much slower than my PBS. So, I don't know, but I believe that the qemu server/client updates affect the backups.
Officially, does the PBS need to reside in the same rack or on the same local network as Proxmox VE?

Thanks again _gabriel for your help.
 
