Veeam Silent Data Corruption

bbgeek17

Hello Everyone,

Our development team has been super busy qualifying Veeam integration for our Proxmox customers.

Using internal tools designed to test the integrity of snapshots, we're seeing that Veeam 12.2 backups of live virtual machines contain silent data corruption and that the backups are not "point in time" consistent.

We can reliably reproduce the issue with LVM, ZFS, and other backing storage (including Blockbridge). The issue does not reproduce on PBS.

Is anyone else seeing this? Is anyone else out there testing their backups? @Pavel Tide, is this a known issue?


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
A brief update on this:

Here is an easy way to test your backup strategy as well as reproduce the issue with an open-source tool: dm-integrity (https://docs.kernel.org/admin-guide/device-mapper/dm-integrity.html).

The dm-integrity kernel module presents a block device that stores and validates data integrity inline with access (at a performance cost). Essentially, dm-integrity keeps a separate tag: a checksum computed over the contents of each sector. The tag ensures that the block device has referential integrity; if the tag doesn't match the data, corruption is present.

You can reproduce the Veeam data corruption with any live VM running on any Proxmox storage type:
  1. create a VM (any Linux distro will do)
  2. attach an additional disk for testing
  3. initialize and activate the disk with dm-integrity
  4. format the resulting device with a file system
  5. kick off some filesystem traffic (e.g., a bash loop, bonnie++, etc.)
  6. execute a backup while the filesystem traffic is active.
When you restore the virtual machine to any type of storage, you can activate the dm-integrity device and try to mount the filesystem. If you're lucky, it will mount successfully. However, if the filesystem was heavily used during the backup, the mount is likely to fail. In either case, you'll notice integrity errors reported in dmesg. To uncover all inconsistencies, you can run a dd command to copy the device to /dev/null. In our tests, we've encountered tens of thousands of inconsistencies. A sketch of the full procedure follows.
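For those who want to try this, here's a minimal sketch using integritysetup, the userspace tool for dm-integrity that ships with cryptsetup. The device /dev/vdb, the mapping name itest, and the mount point are assumptions for illustration; substitute your own.

Code:
# --- in the guest, before the backup (steps 3-6 above) ---
integritysetup format /dev/vdb          # initialize integrity tags (wipes the disk)
integritysetup open /dev/vdb itest      # activate the checked device
mkfs.ext4 /dev/mapper/itest
mkdir -p /mnt/itest && mount /dev/mapper/itest /mnt/itest

# generate filesystem traffic while the backup runs
while true; do
    dd if=/dev/urandom of=/mnt/itest/churn bs=1M count=64 oflag=direct
done

# --- in the restored VM: verification ---
integritysetup open /dev/vdb itest
mount /dev/mapper/itest /mnt/itest      # may fail outright if corruption is severe

# read every sector; conv=noerror keeps scanning past bad sectors,
# and each tag mismatch is logged by the kernel
dd if=/dev/mapper/itest of=/dev/null bs=1M conv=noerror
dmesg | grep -i integrity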

Given the degree and pervasiveness of the corruption, this feels like an architectural issue in how Veeam has integrated with Proxmox/QEMU.

@Pavel Tide this seems pretty severe; what's the best way to get some attention on this? We are happy to assist if needed.


 
Hi,

just FYI, Pavel responded on our devel list to the question of where to report bugs, with the following:

Hi Dominik,

For now the best course of action would be to post a message on our forums (forums.veeam.com) in this subsection:

https://forums.veeam.com/kvm-rhv-olvm-proxmox-f62/

I am in the process of arranging some external bug-tracker (we don’t have one right now). If you have any preferences please let me know.

so maybe posting in their forum would be best?

EDIT: see here: https://lore.proxmox.com/pve-devel/mailman.50.1727091601.332.pve-devel@lists.proxmox.com/
 
Hi @dcsapak thanks! I'll start to drive this through a few different channels. Someone, somewhere, cares that backups are broken. I know we have several existing Veeam customers that do.

@news We deal mainly with service providers and enterprise customers with mission-critical infrastructure. Many of these people simultaneously run multiple platforms (e.g., PVE, ESX, Hyper-V, bare metal, k8s, etc.). In these environments, people have legacy Veeam infrastructure. After all, it is the go-to solution for ESX.

We think PBS is the best solution for PVE. However, reworking someone's backup infrastructure and recovery workflows takes time, money, and analysis. It also raises lots of questions relative to the status quo. A common question: "What do I do with my existing ESX Veeam backups while I'm transitioning to PVE?".

The Veeam integration for Proxmox is incomplete. However, it's our job to support our customers, advise them on migration paths that fit with their business, and test everything for correctness and stability (to the best of our ability).

We've already developed and tested workarounds for Veeam's PVE integration to enable our customers. We're holding back public release until someone addresses the corruption in the backup flow. We suspect that restoring VMs backed up outside of PVE into PVE is functional, which provides a strategy for restoring existing Veeam backups into Proxmox.


 
And I assume that the guest agent was installed in all cases?
Yes, the test VMs are standard Linux cloud images (e.g., AlmaLinux 9) with the QEMU agent installed and enabled by default. The agent was also enabled in the PVE config.

That said, agent synchronization is not really a concern here and is unrelated to the corruption. Our testing focuses on validating that the backups are crash-consistent (i.e., point-in-time) and have block-level data integrity. This is ultimately determined by how Veeam integrates with QEMU. We do not believe this is a QEMU issue.
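As an aside, for anyone who wants to rule out agent configuration in their own tests, it's quick to check from the PVE host (the VMID 100 is a placeholder):

Code:
qm config 100 | grep agent     # confirm the agent option is set in the VM config
qm set 100 --agent enabled=1   # enable it if needed (takes effect on next VM start)
qm agent 100 ping              # verify the guest agent is actually responding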


 
We have experienced what I assume is this issue on a Proxmox Linux VM hosting Minecraft. Essentially, it seems Veeam is not grabbing data at a specific point in time, as it does in VMware (w/ a snapshot). The result is corruption of any chunks being written to at the time the backup was taken.

In testing the same scenario in VMware w/ Veeam, this does not occur.
 

Attachment: Bad Chunks Screenshot 2.png (screenshot of the corrupted Minecraft chunks)
We have experienced what I assume is this issue on a Proxmox Linux VM hosting Minecraft. Essentially, it seems Veeam is not grabbing data at a specific point in time, as it does in VMware (w/ a snapshot). The result is corruption of any chunks being written to at the time the backup was taken.
Hi @DLEM , this is exactly what we saw in our testing. Thanks for sharing your experience.


 
With backups that test bad, what happens if you try to do a guest file restore? Can Veeam mount the backup at all?
 
Hi @NatO ,
Unfortunately, I don't have an answer to the question. Once the folks at Veeam diagnose and reproduce the issue described above, we'll have a clearer picture.

If corruption occurs in the backup flow, then the source data for restoration has questionable integrity. A restore might return valid contents, invalid contents, or nothing at all. TBH, I am not familiar enough with Veeam's features and functions to know how their file-level restore works. Maybe @Pavel Tide can shed some light on this.

For reference, here's everything we've learned in order:
  1. We found that a live backup of a Linux VM showed signs of corruption while updating our storage plugin to support Veeam.
  2. To eliminate variables and learn about the pattern of corruption, we ran our "snapshot_consistency" developer tool during a live VM backup cycle. Our tool detected misdirected writes and dropped writes in the restored image. This led to our conclusion that the backup was not point-in-time (a simplified sketch of the idea appears after this list).
  3. We then reproduced the behavior on LVM and ZFS to eliminate our storage as a variable.
  4. Finally, we used dm-integrity to provide a reproducible test to assist folks in debugging the issues and double-checking our findings.
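To illustrate the kind of check our tool performs (this is not our actual tool, just a minimal sketch of the idea): write sequence-numbered markers to a scratch disk one synchronous sector at a time, then verify after restore that the markers form an unbroken prefix. The device /dev/vdc and the sector count are placeholders, and the scratch disk should start zeroed.

Code:
# --- writer: run in the guest while the backup executes ---
# each marker is written synchronously (O_DIRECT, padded to 512 bytes),
# so a true point-in-time image must contain markers 0..N with no holes
i=0
while [ "$i" -lt 100000 ]; do
    printf 'seq=%08d\n' "$i" |
        dd of=/dev/vdc bs=512 seek="$i" conv=notrunc,sync oflag=direct status=none
    i=$((i + 1))
done

# --- checker: run in the restored VM ---
# a dropped or misdirected write shows up as a hole: fewer unique
# markers than highest + 1
dd if=/dev/vdc bs=512 status=none | grep -ao 'seq=[0-9]*' | sort -u > /tmp/markers
present=$(wc -l < /tmp/markers)
highest=$((10#$(tail -1 /tmp/markers | cut -d= -f2)))
echo "markers present: $present, highest seq: $highest (expect present == highest + 1)"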
We've been having an off-forum conversation trying to assist. Give them some time to troubleshoot and wait for guidance.


 
Thanks for your reply @bbgeek17

I've done some very simple testing and realised that the disks within a VM are not snapped at the same time. So when you restore the Veeam backup, one disk will have more recent files than the other. Veeam is clearly snapshotting the disks sequentially as it backs them up, rather than snapshotting them all at once and then backing up (like it does with VMware/Hyper-V).

My test was simple: create a VM with two disks.

Create the required folders for the test files, then run a PowerShell script something like the one below. It writes one file every second to each folder.

Code:
# the four target folders (two on each disk) must exist before running
$path1 = "d:\z_snaptest"
$path2 = "c:\z_snaptest"
$path3 = "d:\a_snaptest"
$path4 = "c:\a_snaptest"

# write one timestamped file per second into each folder (360 iterations)
for ($i = 100; $i -le 459; $i++) {
    $currentTime = Get-Date -Format "yyyy-MM-dd HH:mm:ss"
    $currentTime | Out-File -FilePath "$path1\time_$i.txt"
    $currentTime | Out-File -FilePath "$path2\time_$i.txt"
    $currentTime | Out-File -FilePath "$path3\time_$i.txt"
    $currentTime | Out-File -FilePath "$path4\time_$i.txt"
    Start-Sleep -Seconds 1
}

I know Veeam doesn't support application-aware processing, so if you were running anything like Exchange or SQL you'd want to be using the Veeam agent.

I'm just surprised by this. I'm sure there's a reason for it, but I've checked the release notes and help guides again and can't find any mention of it. Just not something I expected coming from VMware.

This happens whether the VM is on LVM (storage over FC) or on lvm-thin on the local host, which supports snapshots but, from what I saw, Veeam doesn't use them.
 
which supports snapshots but from what I saw doesn't use them
Unfortunately, QEMU/KVM snapshots and disk snapshots are both called snapshots, yet they have nothing to do with each other. On a backup job (at least with the Proxmox backup routine, whether PVE-internal or PBS), a QEMU-level snapshot is created that ensures the disks are snapshotted in memory, so that the backup reads consistent data; it is discarded after the backup is finished.
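For reference, this is a minimal sketch of driving that built-in flow from the CLI; vzdump's snapshot mode uses the QEMU-level mechanism described above (the VMID and storage names are placeholders):

Code:
# live backup of VM 100 using the QEMU-level snapshot mechanism
vzdump 100 --mode snapshot --storage local

# the same mode applies when the target is a Proxmox Backup Server storage
vzdump 100 --mode snapshot --storage pbs-store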

Just not something I expected coming from VMware.
Do you mean Veeam?

if you were running anything like Exchange or SQL you'd want to be using the Veeam agent.
There is already the QEMU guest agent, which can do application-specific work via its freeze/thaw hooks if you set that up in the guest. It is already called by the QEMU backup routine.
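For example, the agent's freeze/thaw cycle can be exercised manually from the host to see what the backup routine invokes (VMID 100 is a placeholder; application-specific quiescing belongs in the guest's fsfreeze hook scripts):

Code:
qm agent 100 fsfreeze-freeze   # quiesce guest filesystems via the agent
qm agent 100 fsfreeze-status   # confirm the frozen state
qm agent 100 fsfreeze-thaw     # resume I/O, as the backup routine does when done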
 
