PBS backup destroyed VMs.

liptech · Jun 4, 2024

Hello everybody,

Today I had a surprise when I arrived at work: some PBS backup jobs were showing errors, and the worst part was that some of the machines in those jobs were corrupted during the process. I had already seen this on faulty equipment, but this time it was a brand-new server.

What caused the error was a faulty 10Gb transceiver, exactly the one that connects this server to the switch dedicated to backups. I have already replaced the device and everything is going well now.

However, what brings me here is this: why was the virtual machine damaged if the backup process is done via snapshot?

I'll attach the logs to see if they can help me find out what happened.

jlauro · Jun 4, 2024

The snapshot is stored on PBS, unless you configure fleecing storage to have it somewhere else. That probably means there is a risk of corruption if the network or PBS has an issue during the backup.

I wonder if that's eliminated/reduced if you set the fleecing to local?

jlauro · Jun 4, 2024

Thanks for the heads up... this is good to know... as it probably means you should pause backups during any sort of network maintenance... normally I figure it will just fail and clean up after itself on the next backup run...

liptech · Jun 5, 2024

jlauro said:
Thanks for the heads up... this is good to know... as it probably means you should pause backups during any sort of network maintenance... normally I figure it will just fail and clean up after itself on the next backup run...

I think you misunderstood: I wasn't doing maintenance in my environment. A connection failure occurred during the backup and caused the problem, but what doesn't make sense is that the disk being backed up ended up corrupted.
Did I miss some necessary configuration?
Will it always be like this when a connection failure occurs?

jlauro · Jun 5, 2024

liptech said:
I think you misunderstood: I wasn't doing maintenance in my environment. A connection failure occurred during the backup and caused the problem, but what doesn't make sense is that the disk being backed up ended up corrupted.
Did I miss some necessary configuration?
Will it always be like this when a connection failure occurs?

I understood... a failure from the transceiver could be similar to a brief outage from maintenance, and a little bit more concerning because maintenance happens more frequently than a transceiver failing...

By default, the snapshot works through PBS I believe, so I am guessing losing the connection would lose the changes made during the snapshot.

Under that backup job, click Advanced, and then under fleecing storage set that to local-lvm (or whatever you have named your local storage). That might be more resilient, since it keeps the delta for the snapshot local instead of using PBS. I don't know if that would be enough to prevent corruption, but from what I understand of how the backup works, it makes sense that it corrupted if you left that setting at the default and the connection to PBS was lost in the middle of the backup.
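For anyone who runs backups from a script instead of the GUI, here is a rough sketch of the same setting. It only builds and prints the command line, it does not run anything. As far as I know the CLI equivalent is the `--fleecing` option of `vzdump`, but check `man vzdump` on your version before relying on it. The VM ID `100` and the storage ID `pbs-backup` are made-up placeholders, and `local-lvm` is just the default local storage name mentioned above.

```python
import shlex


def vzdump_fleecing_cmd(vmid, backup_storage, fleecing_storage="local-lvm"):
    """Build (but do not run) a vzdump command with local fleecing enabled.

    Assumption: the installed vzdump supports the --fleecing option.
    All storage IDs passed in here are placeholders for your own names.
    """
    return [
        "vzdump", str(vmid),
        "--storage", backup_storage,   # backup target, e.g. the PBS storage ID
        "--mode", "snapshot",
        "--fleecing", f"enabled=1,storage={fleecing_storage}",
    ]


# Hypothetical VM ID and PBS storage ID, for illustration only.
cmd = vzdump_fleecing_cmd(100, "pbs-backup")
print(shlex.join(cmd))
```

Again, I can't promise this is enough to avoid the corruption, it just moves the snapshot delta to local storage.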

fabian · Jun 5, 2024

@liptech what exactly does the corruption look like?

you have to understand how it is implemented under the hood, the backup target basically sits between the guest and the actual storage (fleecing changes that a bit, but not fundamentally). some kinds of interruptions/outages/errors can cause guest writes to not arrive at the actual target (think of it like unplugging the disks of a system while it is running). while this can cause some level of inconsistency/corruption depending on how exactly the VM and OS inside are configured, normally this should be similar to a cold crash (lose the last writes, maybe require some cleanup on next boot) and not fatal.
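to make the "sits between the guest and the actual storage" part more concrete, here is a toy model of the copy-before-write idea. it is a deliberately simplified sketch and not the real QEMU code: all class and function names are invented, and the choice that a failed copy makes the guest write fail is an assumption of this sketch.

```python
class BackupTargetDown(Exception):
    """Raised when the place that should receive old data is unreachable."""


class Target:
    """Stand-in for wherever old blocks are saved (PBS or a local fleecing image)."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.blocks = {}

    def store(self, index, data):
        if not self.online:
            raise BackupTargetDown(self.name)
        self.blocks[index] = data


class Disk:
    """Toy VM disk with a copy-before-write hook while a backup is running."""

    def __init__(self, blocks, cbw_target):
        self.blocks = list(blocks)
        self.cbw_target = cbw_target
        self.saved = set()  # blocks whose old content is already safe

    def guest_write(self, index, data):
        # The old content must be saved before the guest may overwrite it.
        if index not in self.saved:
            self.cbw_target.store(index, self.blocks[index])
            self.saved.add(index)
        self.blocks[index] = data


def demo(cbw_target):
    disk = Disk(["a0", "b0", "c0"], cbw_target)
    try:
        disk.guest_write(1, "b1")
        return "write ok", disk.blocks
    except BackupTargetDown:
        # The guest write never reached the disk, like a pulled cable.
        return "write failed", disk.blocks


print(demo(Target("pbs", online=False)))
print(demo(Target("local-fleecing")))
```

in this model an unreachable target means the guest's write is lost while the old data on disk stays intact, which is the "cold crash" kind of damage, and a local target keeps the guest write independent of the network.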

liptech · Jun 20, 2024

fabian said:
@liptech what exactly does the corruption look like?

you have to understand how it is implemented under the hood, the backup target basically sits between the guest and the actual storage (fleecing changes that a bit, but not fundamentally). some kinds of interruptions/outages/errors can cause guest writes to not arrive at the actual target (think of it like unplugging the disks of a system while it is running). while this can cause some level of inconsistency/corruption depending on how exactly the VM and OS inside are configured, normally this should be similar to a cold crash (lose the last writes, maybe require some cleanup on next boot) and not fatal.

@fabian

Hi Fabian,

In this case, the virtual machine crashed, and when it was restarted the boot failed. I ran a check on the virtual disk, which did not report any damage, so I booted the VM from a CD to see whether the disk was readable and could be recovered, but the virtual disk had no partitions.
As it was a critical machine, I decided to restore a valid backup.
I attached the logs in my original post.

fabian · Jun 21, 2024

how are the VM and storage configured?
