PBS backup destroyed VMs.

liptech

Member
Jan 14, 2021
Brasil
Hello everybody,

Today I had a surprise when I arrived at work: some PBS backup jobs were showing an error, and worse, some of the VMs in those jobs were corrupted during the process. I had already seen this on faulty equipment, but this time it was a brand new server.

The cause of the error was a faulty 10Gb transceiver, exactly the one that connects this server to the switch dedicated to backups. I have already replaced it and everything is working well now.

However, what brings me here is this: why was the virtual machine damaged if the backup is done via snapshot?

I'll attach the logs to see if they can help me find out what happened.
 

Attachments

  • 800005VM SYSLOG.txt
    101.6 KB · Views: 5
  • 800005VM.txt
    9.3 KB · Views: 6
The snapshot is stored on PBS, unless you configure fleecing storage to have it somewhere else. That probably means there is a risk of corruption if a network or PBS issue occurs during the backup.

I wonder if that's eliminated/reduced if you set the fleecing to local?
 
Thanks for the heads-up, this is good to know. It probably means you should pause backups during any sort of network maintenance. Normally I would expect a backup to just fail and clean up after itself on the next run.
 
I think you misunderstood: I wasn't doing maintenance in my environment. A connection failure occurred during the backup, and that failure caused the problem. What doesn't make sense is that the VM being backed up ended up corrupted.
Did I miss a required setting?
Will it always be like this when a connection failure occurs?
 
I understood... a failure from the transceiver could be similar to a brief outage from maintenance, and a little bit more concerning because maintenance happens more frequently than a transceiver failing...

By default, the snapshot works through PBS, I believe, so I am guessing that losing the connection would lose the changes made during the snapshot.

Under that backup job, click Advanced, and then set Fleecing Storage to local-lvm (or whatever your local storage is named). That might be more resilient, since it keeps the delta for the snapshot local instead of on PBS. I don't know if that would be enough to prevent corruption, but from what I understand of how the backup works, it makes sense that the VM was corrupted if you left that setting at the default and the connection to PBS was lost in the middle of the backup.
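For a one-off test of the same setting from the shell, the command could be assembled roughly as in the sketch below. It assumes a PVE version that supports the `--fleecing` option of `vzdump`. The VM ID `100` and the storage names `pbs-backup` and `local-lvm` are placeholders, not values from this thread, and the helper name `build_vzdump_command` is made up for illustration.

```python
import shlex


def build_vzdump_command(vmid, backup_storage, fleecing_storage):
    """Return the argument list for one snapshot-mode vzdump run with fleecing.

    All three arguments are placeholders supplied by the caller; nothing here
    checks that the VM or the storages actually exist.
    """
    return [
        "vzdump", str(vmid),
        "--storage", backup_storage,   # where the backup is written (e.g. the PBS storage)
        "--mode", "snapshot",
        "--fleecing", f"enabled=1,storage={fleecing_storage}",  # keep fleecing data local
    ]


if __name__ == "__main__":
    # Hypothetical values: VM 100, a PBS storage called "pbs-backup", local-lvm for fleecing.
    cmd = build_vzdump_command(100, "pbs-backup", "local-lvm")
    print(shlex.join(cmd))  # review the command before running it on the PVE node
```

The script only prints the command so it can be reviewed first. Whether local fleecing would have prevented the corruption described here is not confirmed anywhere in this thread.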
 
@liptech what exactly does the corruption look like?

you have to understand how it is implemented under the hood, the backup target basically sits between the guest and the actual storage (fleecing changes that a bit, but not fundamentally). some kinds of interruptions/outages/errors can cause guest writes to not arrive at the actual target (think of it like unplugging the disks of a system while it is running). while this can cause some level of inconsistency/corruption depending on how exactly the VM and OS inside are configured, normally this should be similar to a cold crash (lose the last writes, maybe require some cleanup on next boot) and not fatal.
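To make the "backup target sits between the guest and the storage" idea concrete, here is a deliberately simplified toy model in Python. It is not how QEMU or PVE are actually implemented, and all class and function names are invented for illustration. It only shows the general principle: while a block has not been backed up yet, its old content has to be saved somewhere before a guest write may overwrite it, so the guest write depends on that "somewhere" being reachable.

```python
class TargetUnreachable(Exception):
    """Raised when the place that should receive old block data cannot be reached."""


class BackupTarget:
    """Toy stand-in for either a remote backup target or a local fleecing image."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.blocks = {}

    def save(self, index, data):
        if not self.reachable:
            raise TargetUnreachable("target not reachable")
        self.blocks[index] = data


class Disk:
    """Toy VM disk with a running backup: every block starts as 'not yet backed up'."""

    def __init__(self, blocks, target):
        self.blocks = list(blocks)
        self.target = target
        self.pending = set(range(len(self.blocks)))

    def guest_write(self, index, data):
        if index in self.pending:
            # The old content must be saved before it is overwritten.
            self.target.save(index, self.blocks[index])
            self.pending.discard(index)
        self.blocks[index] = data


def try_write(disk, index, data):
    """Return True if the guest write reached the disk, False if it was lost."""
    try:
        disk.guest_write(index, data)
        return True
    except TargetUnreachable:
        return False


if __name__ == "__main__":
    remote = Disk(["a", "b"], BackupTarget(reachable=False))  # link to the target is down
    local = Disk(["a", "b"], BackupTarget(reachable=True))    # old data saved locally
    print(try_write(remote, 0, "A"), remote.blocks)
    print(try_write(local, 0, "A"), local.blocks)
```

In this toy model the write is lost when the target is unreachable and goes through when the old data can be saved locally. How a real guest behaves in that situation depends on the QEMU/PVE version and on the VM and storage configuration.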
 
@fabian

Hi Fabian,

In this case, the virtual machine crashed, and when it was restarted the boot failed. I ran a check on the virtual disk, which did not report any damage, so I booted the VM from a CD to see whether the disk was readable and could be recovered, and the virtual disk had no partitions.
As it was a critical machine, I decided to restore a valid backup.
In my original post I attached the logs.
 
how are the VM and storage configured?
 
