Veeam Silent Data Corruption

bbgeek17

Hello Everyone,

Our development team has been super busy qualifying Veeam integration for our Proxmox customers.

Using internal tools designed to test the integrity of snapshots, we're seeing that Veeam 12.2 backups of live virtual machines contain silent data corruption and that the backups are not "point in time" consistent.

We can reliably reproduce the issue with LVM, ZFS, and other backing storage (including Blockbridge). The issue does not reproduce on PBS.

Is anyone else seeing this? Is anyone else out there testing their backups? @Pavel Tide is this a known issue?


A brief update on this:

Here is an easy way to test your backup strategy as well as reproduce the issue with an open-source tool: dm-integrity (https://docs.kernel.org/admin-guide/device-mapper/dm-integrity.html).

The dm-integrity kernel module presents a block device that stores and validates integrity metadata inline with every access (at a performance penalty). Essentially, dm-integrity keeps a separate tag/checksum computed over the contents of each sector. The tag lets the block device verify its own data: if the tag doesn't match the sector contents, corruption is present.

You can reproduce the Veeam data corruption with any live VM running on any Proxmox storage type (a sketch of these steps follows the list):
  1. create a VM (any Linux distro will do)
  2. attach an additional disk for testing
  3. initialize and activate the disk with dm-integrity
  4. format the resulting device with a filesystem
  5. kick off some filesystem traffic (e.g., a bash loop, bonnie++, etc.)
  6. execute a backup while the filesystem traffic is active
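For anyone who wants to follow along, here is a minimal sketch of steps 2 through 6; the device name /dev/sdb and the mount point are assumptions (integritysetup ships with the cryptsetup package):

Code:
# Steps 2-3: initialize the spare disk with dm-integrity and activate it
# (device name /dev/sdb is an assumption; "format" wipes the disk).
integritysetup format /dev/sdb
integritysetup open /dev/sdb integritytest

# Step 4: format and mount the integrity-protected device.
mkfs.ext4 /dev/mapper/integritytest
mkdir -p /mnt/itest
mount /dev/mapper/integritytest /mnt/itest

# Step 5: keep filesystem traffic running while the backup executes.
i=0
while true; do
    echo "write $i at $(date)" > /mnt/itest/file_$((i % 1000)).txt
    i=$((i + 1))
done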
When you restore the virtual machine to any type of storage, you can activate the dm-integrity device and try to mount the filesystem. If you're lucky, it will mount successfully. However, if the filesystem was heavily used during the backup, the mount is likely to fail. In either case, you'll notice integrity issues reported in dmesg. To uncover all inconsistencies, you can run a dd command to copy the device to /dev/null. In our tests, we've encountered tens of thousands of inconsistencies.
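A minimal sketch of that verification pass, again assuming the restored disk shows up as /dev/sdb:

Code:
# Activate the restored dm-integrity device. Do NOT re-run "format" here;
# that would wipe the tags.
integritysetup open /dev/sdb integritytest

# The mount may or may not succeed depending on how badly the filesystem was hit.
mount /dev/mapper/integritytest /mnt/itest || echo "mount failed"

# Force a full read of the device to surface every inconsistency,
# then check the kernel log for checksum errors.
dd if=/dev/mapper/integritytest of=/dev/null bs=1M status=progress
dmesg | grep -i integrity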

Given the degree and pervasiveness of the corruption, this feels like an architectural issue in how Veeam has integrated with Proxmox/QEMU.

@Pavel Tide this seems pretty severe; what's the best way to get some attention on this? We are happy to assist if needed.


Hi,

just FYI, Pavel did respond on our devel list to the question of where to report bugs with the following:

Hi Dominik,

For now the best course of action would be to post a message on our forums (forums.veeam.com) in this subsection:

https://forums.veeam.com/kvm-rhv-olvm-proxmox-f62/

I am in the process of arranging some external bug-tracker (we don’t have one right now). If you have any preferences please let me know.

So maybe posting in their forum would be best?

EDIT: see here: https://lore.proxmox.com/pve-devel/mailman.50.1727091601.332.pve-devel@lists.proxmox.com/
 
Hi @dcsapak thanks! I'll start to drive this through a few different channels. Someone, somewhere, cares that backups are broken. I know we have several existing Veeam customers that do.

@news We deal mainly with service providers and enterprise customers with mission-critical infrastructure. Many of these people simultaneously run multiple platforms (e.g., PVE, ESX, Hyper-V, bare metal, k8s, etc.). In these environments, people have legacy Veeam infrastructure. After all, it is the go-to solution for ESX.

We think PBS is the best solution for PVE. However, reworking someone's backup infrastructure and recovery workflows takes time, money, and analysis. It also raises lots of questions relative to the status quo. A common question: "What do I do with my existing ESX Veeam backups while I'm transitioning to PVE?".

The Veeam integration for Proxmox is incomplete. However, it's our job to support our customers, advise them on migration paths that fit with their business, and test everything for correctness and stability (to the best of our ability).

We've already developed and tested workarounds for Veeam's PVE integration to enable our customers. We're holding back public release until someone addresses the corruption in the backup flow. We suspect that restoring VMs backed up on other platforms into PVE is functional, which would provide a strategy for recovering existing Veeam backups into Proxmox.


And I assume that the guest agent was installed in all cases?
Yes, the test VMs are standard Linux cloud images (e.g., AlmaLinux 9) with the QEMU guest agent installed and enabled by default. The agent was also enabled in the PVE config.

That said, agent synchronization is not really the concern here and is unrelated to the corruption. Our testing focuses on validating that the backups are crash-consistent (i.e., point-in-time) and have data integrity (at the block level). This is ultimately determined by how Veeam integrates with QEMU. We do not believe this is a QEMU issue.
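For reference, the PVE side of that configuration is just the following (VMID 100 is an assumption):

Code:
# Enable the guest agent in the VM config and verify it responds
# from inside the guest.
qm set 100 --agent enabled=1
qm agent 100 ping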


We have experienced what I assume is this issue on a Proxmox Linux VM hosting Minecraft. Essentially, it seems Veeam is not grabbing data at a specific point in time, as it does in VMware (w/ a snapshot). The result is corruption of any chunks being written to at the time the backup was taken.

In testing the same scenario in VMware w/ Veeam, this does not occur.
 

Attachments: Bad Chunks Screenshot 2.png (971.3 KB)
We have experienced what I assume is this issue on a Proxmox Linux VM hosting Minecraft. Essentially, it seems Veeam is not grabbing data at a specific point in time, as it does in VMware (w/ a snapshot). The result is corruption of any chunks being written to at the time the backup was taken.
Hi @DLEM , this is exactly what we saw in our testing. Thanks for sharing your experience.


With backups that are testing bad, what happens if you try to do a guest file restore? Can Veeam mount the backup at all?
 
Hi @NatO ,
Unfortunately, I don't have an answer to the question. Once the folks at Veeam diagnose and reproduce the issue described above, we'll have a clearer picture.

If corruption occurs in the backup flow, then the source data for restoration has questionable integrity. A restore might produce valid contents, invalid contents, or none at all. TBH, I am not familiar enough with Veeam's features and functions to know how their file-level restore works. Maybe @Pavel Tide can shed some light on this.

For reference, here's everything we've learned in order:
  1. We found that a live backup of a Linux VM showed signs of corruption while updating our storage plugin to support Veeam.
  2. To eliminate variables and learn about the pattern of corruption, we ran our "snapshot_consistency" developer tool during a live VM backup cycle. Our tool detected misdirected writes and dropped writes in the restored image. This led to our conclusion that the backup was not point-in-time. (A simplified sketch of the idea follows this list.)
  3. We then reproduced the behavior on LVM and ZFS to eliminate our storage as a variable.
  4. Finally, we used dm-integrity to provide a reproducible test to assist folks in debugging the issues and double-checking our findings.
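Our snapshot_consistency tool is internal, but a stripped-down sketch of the same idea looks roughly like this; it assumes a spare 512-byte-sector test disk at /dev/sdb inside the guest, bash, and GNU dd. It sequence-stamps blocks in order with direct I/O, and after restore checks that the stamps describe a single point-in-time cut:

Code:
# Write phase, inside the live VM while the backup runs. Each iteration
# stamps one of 1024 blocks (round-robin) with an increasing sequence
# number; oflag=direct keeps the on-disk order equal to the issue order.
DEV=/dev/sdb
i=0
while true; do
    printf '%016d%496s' "$i" '' |
        dd of="$DEV" bs=512 seek=$((i % 1024)) count=1 \
           iflag=fullblock oflag=direct conv=notrunc status=none
    i=$((i + 1))
done

# Verify phase, on the restored image (let the writer finish at least one
# full pass over all 1024 blocks before starting the backup).
mapfile -t s < <(for k in $(seq 0 1023); do
    dd if="$DEV" bs=512 skip="$k" count=1 status=none | head -c 16; echo
done)

# In a true point-in-time image the stamps, read in block order, form one
# rotation of consecutive integers: every step is +1 except a single wrap
# of -1023 where the backup "cut" the write stream. Anything else is a
# dropped or misdirected write (non-numeric stamps abort the script, which
# is also a failure).
wraps=0 errors=0
for k in $(seq 1 1023); do
    d=$(( 10#${s[k]} - 10#${s[k-1]} ))
    if   [ "$d" -eq 1 ];     then :
    elif [ "$d" -eq -1023 ]; then wraps=$((wraps + 1))
    else errors=$((errors + 1)); fi
done
[ "$wraps" -le 1 ] && [ "$errors" -eq 0 ] \
    && echo "point-in-time consistent" \
    || echo "NOT point-in-time: $errors gaps, $wraps wraps"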
We've been having an off-forum conversation with Veeam, trying to assist. Give them some time to troubleshoot and wait for guidance.


Thanks for your reply @bbgeek17

I've done some very simple testing and realised that the disks within a VM are not snapped at the same time. So when you get the Veeam backup, one disk will have more recent files than the other. Clearly it is snapshotting the disks sequentially as it backs them up, rather than snapshotting them all at once and then backing up (like it does with VMware/Hyper-V).

My test was simple: create a VM with 2 disks.

Create the required folders for the test files and then run a PowerShell script something like this. It writes one file every second.

Code:
# Write a timestamp file to both disks once per second; after restoring
# the backup, the newest file on each disk shows when that disk was snapped.
$path1 = "d:\z_snaptest"
$path2 = "c:\z_snaptest"
$path3 = "d:\a_snaptest"
$path4 = "c:\a_snaptest"

for ($i = 100; $i -le 459; $i++) {
    $currentTime = Get-Date -Format "yyyy-MM-dd HH:mm:ss"
    $currentTime | Out-File -FilePath "$path1\time_$i.txt"
    $currentTime | Out-File -FilePath "$path2\time_$i.txt"
    $currentTime | Out-File -FilePath "$path3\time_$i.txt"
    $currentTime | Out-File -FilePath "$path4\time_$i.txt"
    Start-Sleep -Seconds 1
}

I know Veeam doesn't support application-aware processing on Proxmox, so if you were running anything like Exchange or SQL you'd want to be using the Veeam agent.

I'm just surprised by this. I'm sure there's a reason for it, but I've checked the release notes and help guides again and can't find any mention of it. Just not something I expected coming from VMware.

This happens whether the VM is on LVM (storage over FC) or on lvm-thin on the local host; lvm-thin supports snapshots, but from what I saw Veeam doesn't use them.
 
lvm-thin supports snapshots, but from what I saw Veeam doesn't use them
Unfortunately, QEMU/KVM snapshots and storage-level disk snapshots are both called "snapshots", yet have nothing to do with each other. During a backup job (at least with the Proxmox backup routine, whether PVE-internal or PBS), a QEMU snapshot is created that ensures the disks are snapshotted in memory, so that you read consistent data; it is discarded after the backup is finished.
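For comparison, the PVE-native flow is just the following (the storage name "local" is an assumption):

Code:
# PVE-native backup of VMID 100 in snapshot mode. QEMU copies the old
# data of in-flight writes aside before they land, so the resulting
# image is point-in-time consistent without any storage-level snapshot.
vzdump 100 --mode snapshot --storage local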

Just not something I expected coming from VMware.
Do you mean Veeam?

if you were running anyhing like Exchange or SQL you'd want to be using the Veeam agent.
There is already the QEMU guest agent, which is able to do application-specific work if you add the hooks in the guest. It is already called by the QEMU backup routine.
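For example, a hypothetical hook (the service name is made up) could quiesce an application around the filesystem freeze; qemu-guest-agent supports a freeze/thaw hook script, which on many distros you enable via its -F/--fsfreeze-hook option:

Code:
#!/bin/sh
# Hypothetical /etc/qemu/fsfreeze-hook.d/10-minecraft: the guest agent
# calls hooks with "freeze" before fsfreeze and "thaw" after it.
case "$1" in
    freeze) systemctl stop minecraft ;;   # quiesce the app before the snapshot
    thaw)   systemctl start minecraft ;;  # resume once the snapshot is set up
esac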
 
