PBS killed a virtual server

sa10 · May 12, 2022

During the attempt to restore, the virtual server was destroyed because the backup copy contained an error.

Code:

Formatting '/MainPool/4Files/images/273/vm-273-disk-0.qcow2', fmt=qcow2 cluster_size=65536 preallocation=metadata compression_type=zlib size=128849018880 lazy_refcounts=off refcount_bits=16
new volume ID is 'MainPoolFiles:273/vm-273-disk-0.qcow2'
restore proxmox backup image: /usr/bin/pbs-restore --repository xxxxx@pbs@xxx.xxx.xxx.xxx:PBS1 vm/273/2022-05-11T22:00:02Z drive-scsi0.img.fidx /MainPool/4Files/images/273/vm-273-disk-0.qcow2 --verbose --format qcow2 --skip-zero
connecting to repository 'xxxxx@pbs@xxx.xxx.xxx.xxx:PBS1'
open block backend for target '/MainPool/4Files/images/273/vm-273-disk-0.qcow2'
starting to restore snapshot 'vm/273/2022-05-11T22:00:02Z'
download and verify backup index
restore failed: reading file "/mnt/datastore/PBS1/main/.chunks/7710/7710b15cb179064131b30556c0f068072297494cd6fda4dad8e401aeed09bced" failed: No such file or directory (os error 2)
temporary volume 'MainPoolFiles:273/vm-273-disk-0.qcow2' sucessfuly removed
TASK ERROR: command '/usr/bin/pbs-restore --repository xxxxx@pbs@xxx.xxx.xxx.xxx:PBS1 vm/273/2022-05-11T22:00:02Z drive-scsi0.img.fidx /MainPool/4Files/images/273/vm-273-disk-0.qcow2 --verbose --format qcow2 --skip-zero' failed: exit code 255

Backup Server 2.0-10

pveversion
pve-manager/6.2-11/22fb4983 (running kernel: 5.4.60-1-pve)

Datastore file system: ZFS

1. Why could the file disappear?
2. Why is the disk deleted before checking the verify of the archive?
3. Is there any way to restore data from corrupted backup?

mvs · May 12, 2022

1. Check that atime is enabled on your ZFS backup pool. atime is used to mark chunks that are still in use during garbage collection.
zfs get atime

Dunuin · May 12, 2022

Do you also regularily reverify your backups? If chunks are missing or got corrupted on the PBS then you would see this before starting the actual restore (in cases nothing corrupted later after the reverify task).

sa10 · May 12, 2022

mvs said:
1. Check that atime is enabled on your ZFS backup pool. atime is used to mark chunks that are still in use during garbage collection.
zfs get atime

NAME PROPERTY VALUE SOURCE
PBS1 atime on default

sa10 · May 12, 2022

Dunuin said:
Do you also regularily reverify your backups? If chunks are missing or got corrupted on the PBS then you would see this before starting the actual restore (in cases nothing corrupted later after the reverify task).

Each time after archiving, a check is performed, but this is a long process and the customer didn't pay attention to the status of the archive - Verify State = None

I see in the log
First, this is done: - Formatting '/MainPool/4Files/images/273/vm-273-disk-0.qcow2'
and then this is performed: - download and verify backup index

dcsapak · May 13, 2022

when restoring from backup into an existing vmid, the current vm is first deleted (this is even written in the confirmation popup). while the missing chunks are bad, an error can always happen on restore, regardless if its pbs or some backup file.
there is no way to check that aside from reading the backup completely before actually restoring, which would make the restore very slow.

what you can always do though is to restore to a *different* vmid, that way the current one will not be touched. after the restore was successful, you could remove the old vm

sa10 · May 16, 2022

dcsapak said:
when restoring from backup into an existing vmid, the current vm is first deleted (this is even written in the confirmation popup). while the missing chunks are bad, an error can always happen on restore, regardless if its pbs or some backup file.
there is no way to check that aside from reading the backup completely before actually restoring, which would make the restore very slow.

what you can always do though is to restore to a *different* vmid, that way the current one will not be touched. after the restore was successful, you could remove the old vm

I think that if there was an option to run a backup verify before recovery, then it was used almost always.
Without that option I will be forced to refuse to use the Proxmox-backup.
The risk of server loss during restoration is much worse than slow recovery.

About integration with the Proxmox VE:
1. It is impossible to run the verification from Proxmox VE.
2. Proxmox VE always shows the status of verification - None

dcsapak · May 16, 2022

sa10 said:
I think that if there was an option to run a backup verify before recovery, then it was used almost always.
Without that option I will be forced to refuse to use the Proxmox-backup.
The risk of server loss during restoration is much worse than slow recovery.

you can do that manually by logging onto the pbs and initiate a verify, normally this should be done on a schedule, so that you know your backups are impacted *before* you need to restore
also you can do a 'file-restore' which does not touch your existing vms, and you can restore to a different vmid first

sa10 · May 16, 2022

>> you can do that manually ...
I can, but my customer cannot. He can only press the button to restore

And for me it remained a mystery why that chunk disappeared.
I suspect a reason - incorrect work of the Garbage Collection.
Do you have any idea?

dcsapak · May 16, 2022

sa10 said:
>> you can do that manually ...
I can, but my customer cannot. He can only press the button to restore

ok but you can still create a verify schedule on pbs. that way you catch such things before you need the data

sa10 said:
And for me it remained a mystery why that chunk disappeared.
I suspect a reason - incorrect work of the Garbage Collection.
Do you have any idea?

not really. did you pbs crash / hard shutoff at one point? what is the underlying filesystem on pbs ?

sa10 · May 16, 2022

dcsapak said:
ok but you can still create a verify schedule on pbs. that way you catch such things before you need the data

Yes, I have a scheduled task verification after performing a backup and garbage collection, but the verification takes 3 days and VMs for a long time are not verified.

dcsapak said:
not really. did you pbs crash / hard shutoff at one point? what is the underlying filesystem on pbs ?

Sequence of tasks - backup (23h), garbage collection (37h), verification (3days)
ZFS raidz2 12 x Western Digital WD6003FRYZ 6TB
The space used for backup - about 20TB

sa10 · May 16, 2022

And how do you like this idea?
Rename the disk to the image_name.tmp and delete it after a successful recovery

dcsapak · May 17, 2022

i mean, in principle, you can open a feature request here: https://bugzilla.proxmox.com maybe my colleagues have a different opinion than me,
but i think this would make things only marginally better.

1. it would require double the space in contrast to now
2. the backup itself can contain corrupt data (you don't know until you do restore tests)
3. i think not all storage types allow for renaming a disk (think iscsi for example)
4. again, you can restore onto a different vmid which already does what you want, you get a full copy of your vm, can check if all is there, then delete the old vm

since errors can also be inside the backup (e.g. when the data was corrupted during backup already) ideally one would do regular restore tests anyway (and check the content of it!),
but this is a process that cannot really be automated without deeper knowledge about the content of the guests

Dunuin · May 17, 2022

I also got a question as last weeks someone also asked how to change the VMID of a guest.
Is there a CLI command to change the VMID of an existing guest without needing to do a backup+restore?

It would make it easier to temporarily restore a VM (with VMID 100) for example as VMID 10100 and then delete the old VM (the VMID 100) later if that worked, if you could then rename the new VM from VMID 10100 to 100.

dcsapak · May 17, 2022

Dunuin said:
Is there a CLI command to change the VMID of an existing guest without needing to do a backup+restore?

no there is no such command, but you could open a feature request, might be interesting (and probably not that hard to implement?): https://bugzilla.proxmox.com

Dunuin · May 17, 2022

dcsapak said:
no there is no such command, but you could open a feature request, might be interesting (and probably not that hard to implement?): https://bugzilla.proxmox.com

Created one: https://bugzilla.proxmox.com/show_bug.cgi?id=4061

sa10 · May 17, 2022

dcsapak said:
i mean, in principle, you can open a feature request here: https://bugzilla.proxmox.com maybe my colleagues have a different opinion than me,
but i think this would make things only marginally better.

1. it would require double the space in contrast to now

dcsapak said:
2. the backup itself can contain corrupt data (you don't know until you do restore tests)

Do you think this problems are more difficult than losing the server?
is it serious problem to find space for one image If you have 50 vm on the each nodes? This space you will need if you restore only one of VM.
There is corrupt data? But you have copy of previous disk.

dcsapak said:
3. i think not all storage types allow for renaming a disk (think iscsi for example)

The availability of this option is not difficult to make depending on the type of storage.

dcsapak said:
4. again, you can restore onto a different vmid which already does what you want, you get a full copy of your vm, can check if all is there, then delete the old vm

I can do this, but my customer (owner of the VM) cannot.
The customer does not suspect that he may lose his server.
Yes, he gets a warning that his disk will be overwritten, but he thinks that it will be overwritten by a backup copy, but there is hidden surprise - he doesn't have no one copy of data.

dcsapak said:
since errors can also be inside the backup (e.g. when the data was corrupted during backup already) ideally one would do regular restore tests anyway (and check the content of it!),
but this is a process that cannot really be automated without deeper knowledge about the content of the guests

sa10 · May 18, 2022

Is it possible to put an empty file instead of a lost chink, ignore the file reading error and restore at least part of the file system and try to extract part of the data?

fabian · May 19, 2022

it would be nice to keep the discussion either here or in the bug report you opened, I already answered that question there

https://bugzilla.proxmox.com/show_bug.cgi?id=4062

PBS killed a virtual server

Renowned Member

Member

Distinguished Member

Renowned Member

Renowned Member

Proxmox Staff Member

Renowned Member

Proxmox Staff Member

Renowned Member

Proxmox Staff Member

Renowned Member

Renowned Member

Proxmox Staff Member

Distinguished Member

Proxmox Staff Member

Distinguished Member

Renowned Member

Renowned Member

Proxmox Staff Member