Issues during backups: VMs blocked and corrupted


New Member
Sep 19, 2022
Description: I'm in a panic because I've encountered problems during the backups of my VMs on my Proxmox servers. I have 4 dedicated servers on OVH and 1 PBS, along with a local server. Recently, I upgraded all servers to version 7.4-13 using "apt update" and "apt upgrade," as I have done once or twice a month for several years.

After the recent updates, I've been experiencing problems during snapshot backups. The backups are scheduled at night, but in the morning, I found some of the VMs blocked and sometimes randomly corrupted. Below, I provide further details about each server and the problems encountered:

First server:
  • Description: Dedicated SSD Proxmox 7.4-13 server with a VM (Ubuntu + ISPConfig) dedicated to a web server.
  • Problem: Immediately after the backup starts, all online websites that use MySQL get completely blocked, causing MySQL itself to become corrupted. MySQL cannot be restarted unless I restore a previous backup. The backup process completes successfully with "task ok" status, but the live VM no longer functions. Backups are performed on another PBS server.
Second and third servers:
  • Description: Proxmox 7.4-13 dedicated servers used for mail management, with two Zimbra VMs (one on Ubuntu and one on CentOS).
  • Problem: In the morning, I sometimes find some of the Zimbra services blocked and mail delivery halted. Fortunately, everything resumes correctly after rebooting the VM. I encountered the same issue when performing backups on PBS or a dedicated NFS storage on OVH.
Fourth server:
  • Description: Proxmox 7.4-13 local server with a Nextcloud VM on Ubuntu.
  • Problem: In the morning, I occasionally find Nextcloud not responding, and I'm forced to restart the VM. An "initramfs" error is displayed, followed by executing "fsck" that detects lost inodes. However, the VM eventually restarts successfully. In this case, backups are performed on a locally networked NAS using NFS.
I kindly ask if anyone can help me understand what is happening. Could it be a Proxmox bug or a problem stemming from Debian updates? I want to emphasize that I haven't made any changes to the VMs or their configurations, and backups have been working fine for several years. For now, I've disabled all backups because I cannot afford the risk of corrupting production VMs.

Thank you in advance.
I have a couple of questions:

If I perform backups in suspend mode instead of snapshot mode, will the issue be bypassed? Or am I required to use stop mode to ensure that the backup process does not corrupt the VMs?

Could the qemu-server component be responsible for the problem? If so, can I install an older version without causing any harm?
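For context on the modes being asked about, vzdump supports all three and they can be tried per run from the command line on the PVE host. This is a sketch; the VM ID (100) and storage name (pbs1) are placeholders you must adapt:

```shell
# snapshot (default): live backup while the VM keeps running,
# relying on QEMU's backup mechanism and, if the guest agent is
# enabled, an fs-freeze/fs-thaw inside the guest
vzdump 100 --mode snapshot --storage pbs1

# suspend: the VM is suspended for the duration of the backup
vzdump 100 --mode suspend --storage pbs1

# stop: the VM is shut down, backed up cold, then started again;
# slowest, but avoids any in-guest consistency questions
vzdump 100 --mode stop --storage pbs1
```

Note that stop mode gives the strongest consistency guarantee at the cost of downtime, so it is a reasonable temporary workaround while the snapshot-mode problem is being diagnosed.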

Thank you, everyone.
Proxmox relies on the QEMU Guest Agent to flush all data to the virtual disk before the backup (an operation called fsfreeze, regardless of the backup mode). I have not experienced your issue with my MythTV backend container that uses MySQL. If MySQL needs to be informed that a backup is about to begin (and to flush everything to disk in a consistent state), you can probably attach a script to the QEMU Guest Agent running inside your VM. I have no personal experience with this, but I hope the link can get you started.
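For reference, on Debian/Ubuntu guests the qemu-guest-agent package supports a hook (/etc/qemu/fsfreeze-hook, dispatching to scripts in /etc/qemu/fsfreeze-hook.d/) that is invoked with "freeze" before the filesystems are frozen and "thaw" after they are released. Below is a minimal skeleton only; the MySQL actions are placeholder echoes, because a real hook must keep the session that holds FLUSH TABLES WITH READ LOCK open for the entire freeze window (e.g. via a named pipe) since the lock is released as soon as the client exits:

```shell
#!/bin/sh
# Sketch of a fsfreeze-hook.d script. The guest agent calls it with
# "freeze" before fsfreeze and "thaw" after fsthaw. Replace the echoes
# with real MySQL handling (held lock via a background session/fifo).

hook() {
  case "$1" in
    freeze)
      # placeholder: start a session issuing FLUSH TABLES WITH READ LOCK
      # and keep it open until thaw
      echo "freeze: flush and lock tables before fsfreeze"
      ;;
    thaw)
      # placeholder: signal the held session to release the lock and exit
      echo "thaw: release the lock after fsthaw"
      ;;
    *)
      echo "usage: freeze|thaw" >&2
      return 1
      ;;
  esac
}

hook freeze
hook thaw
```

The important design point is symmetry: whatever "freeze" acquires, "thaw" must reliably release, otherwise writes inside the guest stall indefinitely.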
We have the same problem: some VMs (not LXC containers, only VMs) are stuck in a reboot loop with randomly corrupted filesystems (Windows) in the morning. Only rolling back to a previous ZFS snapshot fixes the issue.
We suspect either a bug in KVM/QEMU or a problem with the PBS backup. We have seen this on three different servers, on three different VMs with different Windows versions.
I've been experiencing the same issues since June 23, after updating QEMU. Now, after additional updates, something has changed. Snapshot backups to the NAS over NFS are stable and no longer causing problems. However, backups to PBS are a disaster! I've tried updating both Proxmox from version 7 to 8.0.4 and PBS to 3.0-2, but the backups are not reliable. Sometimes they work fine, but other times they are extremely slow and either freeze or damage the VM. Essentially, I've stopped using PBS for backups because they are not stable. I've also tried various solutions, but I can't seem to resolve the issue. These are dedicated servers on OVH; could it be their fault? When backups go wrong, I've checked the connection speed, and it's excellent.
Can anyone tell me how to address this issue? Where should I start?
Thanks in advance to everyone.
Hello LnxBil, no, I don't have a paid subscription; I'm using the No-Subscription Repository.
Hi, I think it could be related to the fs-freeze issued in the VM by the backup process. Here is an old thread showing similar symptoms:

We have another cluster which has exactly the same backup target, but the VM does not have the guest agent installed and we never experienced the filesystem corruption there.

Could it be that issuing the fs-freeze/thaw leads to corrupted vm filesystems?
Could it be that issuing the fs-freeze/thaw leads to corrupted vm filesystems?
Yes, I think that is the problem, too. A missing thaw postpones any further writes indefinitely; after some buffering, something fills up and the problems start to arise. I have also encountered this numerous times when my backup server was temporarily unreachable.
This sounds plausible. I once again had the error on a machine where I created a backup in stop mode (not snapshot mode). However, chkdsk on older ZFS snapshots of this machine proves that the filesystem was already broken before that backup, so maybe it was just dormant corruption and has nothing to do with the stop-mode backup.
If I run chkdsk on the last bootable snapshot, I get an error along the lines of "the volume bitmap is incorrect, or errors in file 5, index $I30", which suggests something related to VSS copies, which are, as far as I know, used for the fs-freeze command.

There seems to be a hidden option (freeze-fs-on-backup) to disable the fs-freeze while keeping the guest agent in place, so I will try that and see whether VMs still get corrupted.
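For anyone wanting to try the same thing: the switch is a sub-option of the VM's agent property and is set on the PVE host. A sketch, with VM ID 100 as a placeholder:

```shell
# Keep the guest agent enabled, but skip fs-freeze/fs-thaw on backup
qm set 100 --agent enabled=1,freeze-fs-on-backup=0

# Verify the resulting agent line in the VM config
qm config 100 | grep '^agent'
```

Keep in mind this trades away the filesystem-consistency step the freeze is meant to provide, so it is a diagnostic measure rather than a fix.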

vssadmin list shadows
showed 2 shadow copies. I deleted them with vssadmin delete shadows /All. Another VM did not show any shadow copies, so I suspect those two copies are leftovers from the backups that should have been deleted, and they may have contributed to the problem.
In my case, on all of my servers, backups sometimes become extremely slow (several hours instead of a few minutes) and the VM becomes unreachable, for example, websites go offline. At that point, after hours of inactivity, I'm forced to stop both the backup on PBS and the VM, and the VM's filesystem ends up corrupted.

I have already tried with the Guest Agent both enabled and disabled, but the problem persists. The issue remains even when performing manual backups instead of scheduled ones.

If I understand correctly, considering my limited English skills, it doesn't seem to be an fs-freeze problem. Instead, it appears to be a known and still unresolved issue, as seen here:

Can you confirm this? I'm already aware of the problem, and are they actively working on it? What I don't understand is why it only affects some people and not everyone, given that my servers are bare metal with standard Proxmox installations.
Hi, I can confirm that disabling fs-freeze with freeze-fs-on-backup=0 does NOT solve the problem. I could not yet reproduce it with manual backups, but the scheduled backup broke the filesystem twice this weekend. We reset it to a working snapshot; running chkdsk then shows no problems. After the backup (which runs to completion), chkdsk shows errors, and sometimes the machine crashes and does not boot anymore.

We have a ZFS-based Proxmox cluster and use a remote Proxmox Backup Server. LXC containers are backed up without problems.

The Windows event log shows two important messages (translated from German):
  • Oct 01 19:43:50 0.510 ESENT taskhostw (1172,D,0) WebCacheLocal: A request to write to the file "C:\Users\worker\AppData\Local\Microsoft\Windows\WebCache\WebCacheV01.dat" at offset 2228224 (0x0000000000220000) for 32768 (0x00008000) bytes succeeded, but took an unusually long time (66 seconds) to be serviced by the operating system. In addition, 3 other I/O requests to this file have also taken an unusually long time to be serviced since the last message regarding this problem was posted 42668 seconds ago. This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing this problem.
  • Oct 01 19:45:33 32772.129 vioscsi A reset to device "\Device\RaidPort1" was issued.
The first just indicates that I/O is terribly slow during the backup, but the second says there was a reset in the SCSI driver. No idea what that means, but I guess it could be related to the filesystem corruption.

The host server has ECC RAM and mirrored ZFS vdevs, so I can rule out hardware problems (we had the same issue on different servers). It must be related to Proxmox, KVM, QEMU and/or PBS. If it's not fs-freeze/VSS then maybe the dirty-bitmap/fast incremental backup feature? Or the VIOSCSI driver. The next thing I will try is enabling IO Thread, but since the VM has only one disk, I don't think it will make a difference.
There are other threads with those error messages, which suggest that the virtio-scsi driver is the problem.
Did anyone experience the corruption on a machine NOT using virtio-scsi? I will upgrade the virtio drivers to 0.1.240 and continue testing.
If that fails, I will try the virtio-block driver, and after that change io_uring etc.
Just out of curiosity, to rule out that this is caused by a locking / virtio-dataplane issue: could those who are affected check the latency of the virtual machines during backup and report whether you see any noticeable latency issues (i.e. ping times much larger than the average ping)?

You can use mtr to get min/max ping times to a VM.

Or use any other tool that pings the VM for a longer period of time and prints min/max/avg values.
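A quick way to collect min/avg/max latency over a backup window. This is a sketch; 192.0.2.10 is a placeholder for the VM's address:

```shell
# mtr in report mode: 300 probes, wide output including
# min/avg/max/worst per hop
mtr --report --report-wide --report-cycles 300 192.0.2.10

# Plain ping alternative: the final summary line prints
# "rtt min/avg/max/mdev = ..." after 300 probes
ping -c 300 -i 1 192.0.2.10 | tail -n 2
```

Run it once during normal operation and once while the backup is in progress, so the two min/avg/max summaries can be compared.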
I just experienced the corruption on a VM with scsihw: virtio-scsi-single, aio=native,cache=directsync,discard=on,iothread=1 so it is probably not io_uring.
The ping time during backup was not very different from the normal operation, but I did only ping for a few minutes.
I will now try updating the virtio drivers, and I changed max-workers to 1 for the backup job.
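For reference, the same worker limit can be made the default for all backup jobs on the host via the performance option in /etc/vzdump.conf (a sketch; only the performance line is relevant here):

```
# /etc/vzdump.conf
performance: max-workers=1
```

Lowering max-workers reduces the number of parallel I/O requests the backup issues against the VM's disk, which is exactly the knob to turn when the backup itself is suspected of starving guest I/O.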
Since I updated the virtio drivers, I have not experienced complete corruption, but I still get the error messages in the log about very slow I/O and resets of the SCSI device. I will report whether switching to virtio-block changes anything.

