PBS killed a virtual server

sa10

During a restore attempt, the virtual server was destroyed because the backup copy contained an error.

Code:
Formatting '/MainPool/4Files/images/273/vm-273-disk-0.qcow2', fmt=qcow2 cluster_size=65536 preallocation=metadata compression_type=zlib size=128849018880 lazy_refcounts=off refcount_bits=16
new volume ID is 'MainPoolFiles:273/vm-273-disk-0.qcow2'
restore proxmox backup image: /usr/bin/pbs-restore --repository xxxxx@pbs@xxx.xxx.xxx.xxx:PBS1 vm/273/2022-05-11T22:00:02Z drive-scsi0.img.fidx /MainPool/4Files/images/273/vm-273-disk-0.qcow2 --verbose --format qcow2 --skip-zero
connecting to repository 'xxxxx@pbs@xxx.xxx.xxx.xxx:PBS1'
open block backend for target '/MainPool/4Files/images/273/vm-273-disk-0.qcow2'
starting to restore snapshot 'vm/273/2022-05-11T22:00:02Z'
download and verify backup index
restore failed: reading file "/mnt/datastore/PBS1/main/.chunks/7710/7710b15cb179064131b30556c0f068072297494cd6fda4dad8e401aeed09bced" failed: No such file or directory (os error 2)
temporary volume 'MainPoolFiles:273/vm-273-disk-0.qcow2' sucessfuly removed
TASK ERROR: command '/usr/bin/pbs-restore --repository xxxxx@pbs@xxx.xxx.xxx.xxx:PBS1 vm/273/2022-05-11T22:00:02Z drive-scsi0.img.fidx /MainPool/4Files/images/273/vm-273-disk-0.qcow2 --verbose --format qcow2 --skip-zero' failed: exit code 255

Backup Server 2.0-10

pveversion
pve-manager/6.2-11/22fb4983 (running kernel: 5.4.60-1-pve)

Datastore file system: ZFS

1. Why could the file disappear?
2. Why is the disk deleted before the archive is verified?
3. Is there any way to restore data from a corrupted backup?
 
1. Check that atime is enabled on your ZFS backup pool. atime is used to mark chunks that are still in use during garbage collection.
zfs get atime
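If it turns out to be off, it can be re-enabled on the pool; a minimal sketch (the pool name PBS1 is just an example here):

Code:
# re-enable atime updates so garbage collection can see which chunks are still in use
zfs set atime=on PBS1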
 
Do you also regularly re-verify your backups? If chunks are missing or got corrupted on the PBS, you would see this before starting the actual restore (provided nothing got corrupted after the last verify task).
 
1. Check that atime is enabled on your ZFS backup pool. atime is used to mark chunks that are still in use during garbage collection.
zfs get atime
NAME PROPERTY VALUE SOURCE
PBS1 atime on default
 
Do you also regularly re-verify your backups? If chunks are missing or got corrupted on the PBS, you would see this before starting the actual restore (provided nothing got corrupted after the last verify task).

A verification is performed after each backup, but it is a long process and the customer did not pay attention to the status of the archive - Verify State = None

I see in the log that
first this is done: Formatting '/MainPool/4Files/images/273/vm-273-disk-0.qcow2'
and only then this is performed: download and verify backup index
 
when restoring from a backup into an existing vmid, the current vm is first deleted (this is even written in the confirmation popup). while the missing chunks are bad, an error can always happen on restore, regardless of whether it's pbs or some backup file.
there is no way to check that aside from reading the backup completely before actually restoring, which would make the restore very slow.

what you can always do though is to restore to a *different* vmid, that way the current one will not be touched. after the restore was successful, you can remove the old vm
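just as an illustration, such a restore to a spare vmid can also be done from the shell; a rough sketch, assuming the pbs datastore is added in pve as a storage named 'pbs-storage' (hypothetical name) and using the target storage and snapshot from the log above:

Code:
# restore the snapshot into an unused vmid (10273) instead of overwriting vm 273
qmrestore pbs-storage:backup/vm/273/2022-05-11T22:00:02Z 10273 --storage MainPoolFiles
# once the restored guest has been checked, the old vm can be removed
qm destroy 273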
 
when restoring from a backup into an existing vmid, the current vm is first deleted (this is even written in the confirmation popup). while the missing chunks are bad, an error can always happen on restore, regardless of whether it's pbs or some backup file.
there is no way to check that aside from reading the backup completely before actually restoring, which would make the restore very slow.

what you can always do though is to restore to a *different* vmid, that way the current one will not be touched. after the restore was successful, you can remove the old vm
I think that if there were an option to run a backup verify before recovery, it would be used almost always.
Without that option I will be forced to stop using Proxmox Backup.
The risk of losing the server during restoration is much worse than a slow recovery.


About the integration with Proxmox VE:
1. It is impossible to run the verification from Proxmox VE.
2. Proxmox VE always shows the verification status as None
 
I think that if there were an option to run a backup verify before recovery, it would be used almost always.
Without that option I will be forced to stop using Proxmox Backup.
The risk of losing the server during restoration is much worse than a slow recovery.
you can do that manually by logging onto the pbs and initiating a verify. normally this should be done on a schedule, so that you know your backups are impacted *before* you need to restore
also you can do a 'file-restore', which does not touch your existing vms, and you can restore to a different vmid first
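a rough sketch of kicking off such a verification from the pbs shell (datastore name PBS1 as in this thread; the exact cli syntax can differ between pbs versions):

Code:
# manually start a verification task for the whole datastore
proxmox-backup-manager verify PBS1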
 
>> you can do that manually ...
I can, but my customer cannot. He can only press the button to restore :(

And it remains a mystery to me why that chunk disappeared.
I suspect the reason is incorrect operation of the garbage collection.
Do you have any idea?
 
>> you can do that manually ...
I can, but my customer cannot. He can only press the button to restore :(
ok but you can still create a verify schedule on pbs. that way you catch such things before you need the data
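as a sketch, such a schedule could also be created on the pbs cli (job id and datastore name are only examples; option names may vary between pbs versions):

Code:
# create a daily verify job that re-checks snapshots whose last verification is older than 7 days
proxmox-backup-manager verify-job create daily-verify --store PBS1 --schedule daily --outdated-after 7 --ignore-verified true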

And it remains a mystery to me why that chunk disappeared.
I suspect the reason is incorrect operation of the garbage collection.
Do you have any idea?
not really. did your pbs crash / hard shut off at one point? what is the underlying filesystem on the pbs?
 
ok but you can still create a verify schedule on pbs. that way you catch such things before you need the data
Yes, I have a scheduled verification task after the backup and garbage collection, but the verification takes 3 days, so VMs remain unverified for a long time.
not really. did your pbs crash / hard shut off at one point? what is the underlying filesystem on the pbs?
Sequence of tasks: backup (23h), garbage collection (37h), verification (3 days)
ZFS raidz2, 12 x Western Digital WD6003FRYZ 6TB
Space used for backups: about 20TB
 
And how do you like this idea?
Rename the existing disk to image_name.tmp and delete it only after a successful recovery.
 
i mean, in principle, you can open a feature request here: https://bugzilla.proxmox.com - maybe my colleagues have a different opinion than me,
but i think this would make things only marginally better.

1. it would require double the space in contrast to now
2. the backup itself can contain corrupt data (you don't know until you do restore tests)
3. i think not all storage types allow for renaming a disk (think iscsi for example)
4. again, you can restore onto a different vmid which already does what you want, you get a full copy of your vm, can check if all is there, then delete the old vm

since errors can also be inside the backup (e.g. when the data was corrupted during backup already) ideally one would do regular restore tests anyway (and check the content of it!),
but this is a process that cannot really be automated without deeper knowledge about the content of the guests
 
I also got a question, as last week someone also asked how to change the VMID of a guest.
Is there a CLI command to change the VMID of an existing guest without needing to do a backup+restore?

It would make it easier to temporarily restore a VM (with VMID 100, for example) as VMID 10100 and then delete the old VM (VMID 100) later if that worked, if you could then rename the new VM from VMID 10100 back to 100.
 
Is there a CLI command to change the VMID of an existing guest without needing to do a backup+restore?
no, there is no such command, but you could open a feature request, might be interesting (and probably not that hard to implement?): https://bugzilla.proxmox.com
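purely as an illustration of what a manual vmid change involves in the simplest case (stopped vm, a single qcow2 disk on directory storage, no snapshots/replication/HA; the vmids are hypothetical and the paths are the ones from this thread), it boils down to renaming the config and the disk:

Code:
# NOT an official procedure - rough sketch only
mv /etc/pve/qemu-server/100.conf /etc/pve/qemu-server/10100.conf
mkdir -p /MainPool/4Files/images/10100
mv /MainPool/4Files/images/100/vm-100-disk-0.qcow2 /MainPool/4Files/images/10100/vm-10100-disk-0.qcow2
# then edit /etc/pve/qemu-server/10100.conf so the disk entry points at
# MainPoolFiles:10100/vm-10100-disk-0.qcow2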
 
i mean, in principle, you can open a feature request here: https://bugzilla.proxmox.com - maybe my colleagues have a different opinion than me,
but i think this would make things only marginally better.

1. it would require double the space in contrast to now
2. the backup itself can contain corrupt data (you don't know until you do restore tests)
Do you think these problems are worse than losing the server?
Is it a serious problem to find space for one image if you have 50 VMs on each node? You only need this space when restoring a single VM.
The backup contains corrupt data? But then you still have a copy of the previous disk.

3. i think not all storage types allow for renaming a disk (think iscsi for example)
Making this option available depending on the storage type would not be difficult.
4. again, you can restore onto a different vmid which already does what you want, you get a full copy of your vm, can check if all is there, then delete the old vm
I can do this, but my customer (owner of the VM) cannot.
The customer does not suspect that he may lose his server.
Yes, he gets a warning that his disk will be overwritten, but he assumes it will be overwritten with the backup copy; the hidden surprise is that he is left with no copy of the data at all.
since errors can also be inside the backup (e.g. when the data was corrupted during backup already) ideally one would do regular restore tests anyway (and check the content of it!),
but this is a process that cannot really be automated without deeper knowledge about the content of the guests
 
Is it possible to put an empty file in place of the lost chunk, ignore the read error, restore at least part of the file system and try to extract some of the data?
 
