Corrupted restored VM

Eric Delaet

Hello,

I have a Linux VM (Ubuntu 20.04) that is backed up every night. The backup is taken incrementally with mode "snapshot". qemu-guest-agent is installed (needed to freeze the filesystem?). The VM works fine, the backup always seems to run successfully (marked "OK"), the verify runs successfully, and so does the restore (to a new VM ID).
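
For reference, the nightly job boils down to something like this manual run (a sketch; the storage ID here is a placeholder for my PBS-backed storage, and the qm agent call is just a quick check that the guest agent answers, since it is used for the fs-freeze in snapshot mode):

# manual equivalent of the nightly backup job (storage ID is a placeholder)
vzdump 104 --storage <pbs-storage> --mode snapshot
# quick check that the guest agent inside VM 104 responds
qm agent 104 ping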

But when I try to power on the new VM, I see a lot of filesystem errors on the root filesystem, and I get segfaults because of corrupted binaries.

1) How is that possible? Shouldn't the verify notice if something is corrupted?
2) How do I fix this? Can I force Proxmox to take a new full backup? Or do I need to remove all the backups for this VM from the backup store?

After further investigation I found errors like these in the logs:

2020-11-03T15:04:24+01:00: download chunk "/mnt/pve/nfs-sata-storage/proxmox-backup-server/.chunks/dcb4/dcb42e7318545f81d449bb1665b60432c1073d210c17fe0c3acaa3c1af003abe"
2020-11-03T15:04:25+01:00: TASK ERROR: connection error: Transport endpoint is not connected (os error 107)

The file above, however, does exist. I'm not entirely sure what that error means, but even if a read error occurred, the final result of a restore or a verify should also fail, right?

Thanks!
 
which versions are your pve/backup server running? does that occur on all backups or only on some? can you send a log from such a backup task?
 
Hello,

It doesn't seem to be only this VM; I tried two other Linux machines and hit the same problem. I first booted the VMs from a separate rescue CD and did a complete filesystem check: no errors. I then restored a backup and ran a filesystem check with the same rescue CD: filesystem errors popped up. I also restored a Windows VM and did a filesystem check there as well; it showed no errors.

My Proxmox version:

proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.2-7
proxmox-backup-client: 0.9.4-1

Proxmox Backup Server: 0.9-4

When I go to the backup server, I see "Verify state: OK" next to all the jobs.

I don't see errors in the backup logs, nor in the restore log when I restore a VM. I only saw an error in the dashboard under "Longest running tasks" (I think this is about the restore), at "Datastore read objects", which ended with TASK ERROR: connection error: Transport endpoint is not connected (os error 107). I restored a VM and have also included the output of the restore log (in Proxmox) and the log of the task viewer in Proxmox Backup Server (which shows an error). I don't actually know where those logs are kept, so I grabbed the output from "Running tasks" - after the restore is done, that log disappears...


Thanks!
 

do you have very old backups for those vms? or are those new vms? we changed some things with encrypted chunks a while back, maybe there are simply some chunks that are wrong (from before our changes)?
 
It depends on what you mean by "very old"... I have been using the backup server since 22/09/2020.

Can I force a new full backup? As far as I know, there is no option to do a new full backup on demand or, for example, every week. Or do I need to remove all the backups from the store?

Thanks
 
It depends on what you mean by "very old"... I have been using the backup server since 22/09/2020.
yeah, not 'old', but not from the past few days/weeks; we try to move fast in the beta to eliminate most bugs...

Can I force a new full backup? As far as I know, there is no option to do a new full backup on demand or, for example, every week. Or do I need to remove all the backups from the store?
yes you would have to remove all snapshots for that backup-group to start 'fresh'
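
roughly, from the cli that would be (the repository and the snapshot path here are placeholders; the 'forget'/remove action in the web ui does the same):

# list all snapshots of the backup group
proxmox-backup-client snapshots vm/104 --repository root@pam@<pbs-host>:<datastore>
# remove ('forget') each listed snapshot
proxmox-backup-client forget vm/104/<snapshot-timestamp> --repository root@pam@<pbs-host>:<datastore>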
 
Ok,

I tried to remove a snapshot via the backup server and got this error:

proxmox-backup-client failed: Error: removing backup snapshot "/mnt/pve/nfs-sata-storage/proxmox-backup-server/vm/104/2020-11-05T04:00:02Z" failed - Directory not empty (os error 39) at /usr/share/perl5/PVE/API2/Storage/Content.pm line 350. (500)

In Proxmox VE the snapshot shows as "1 byte" now.

The directory /mnt/pve/nfs-sata-storage/proxmox-backup-server/vm/104/2020-11-05T04:00:02Z is empty, but I don't know whether the underlying chunks have been removed now...

What's the best way to clean this up?

It would be a nice feature to be able to make a new full backup, for example every week, so that when there is a failure for some reason, at least the next week you start from a fresh full backup.
 
What's the best way to clean this up?
i would do that via the backup server web ui, that should work there... (no clue why it does not work on the pve side, you can post the output of 'ls -lhaR <DIR>' to debug that)

It would be a nice feature to be able to make a new full backup, for example every week, so that when there is a failure for some reason, at least the next week you start from a fresh full backup.
that makes no sense really because of how the chunking works, but we already upload all chunks again if the last backup was not fully verified. the problem here is that for encrypted chunks, the server can only verify the crc32 checksum, not the actual content, and we changed how this gets written (but only during the beta; if we had done this for a 'production release' we would have made it backwards compatible)
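
for context on the chunking: each chunk is stored content-addressed, named by its digest under a sub-directory made of the digest's first four hex characters, which is why identical data is only ever stored once, e.g. the chunk from the error earlier in this thread:

ls -l /mnt/pve/nfs-sata-storage/proxmox-backup-server/.chunks/dcb4/dcb42e7318545f81d449bb1665b60432c1073d210c17fe0c3acaa3c1af003abe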
 
Via the backup server I also get the error "proxmox-backup-client failed: Error: removing backup snapshot "/mnt/pve/nfs-sata-storage/proxmox-backup-server/vm/104/2020-11-05T12:42:42Z" failed - Directory not empty (os error 39) at /usr/share/perl5/PVE/API2/Storage/Content.pm line 350. (500)"

When I remove the snapshot (the dialog asks: "Are you sure you want to remove entry 'PBS:backup/vm/104/2020-11-05T12:42:42Z'? This will permanently erase all data."), I get this error.

After removing the snapshot, however, the directory /mnt/pve/nfs-sata-storage/proxmox-backup-server/vm/104/2020-11-05T12:42:42Z is empty (but the dir itself still exists), and one level up, under /mnt/pve/nfs-sata-storage/proxmox-backup-server/vm/104, I see a file "owner".

I removed the empty directory and I don't see any backups anymore in the interface.

I made a new backup in the hope of getting a clean backup, but in the backup log I still see:

INFO: backup was done incrementally, reused 12.71 GiB (63%)

And when I restore that backup, the machine is corrupted.

So it seems that, although I purged all the backups for that VM, there is still some data left that gets reused from somewhere...
 

Via the backup server I also get the error "proxmox-backup-client failed: Error: removing backup snapshot "/mnt/pve/nfs-sata-storage/proxmox-backup-server/vm/104/2020-11-05T12:42:42Z" failed - Directory not empty (os error 39) at /usr/share/perl5/PVE/API2/Storage/Content.pm line 350. (500)"

When I remove the snapshot (the dialog asks: "Are you sure you want to remove entry 'PBS:backup/vm/104/2020-11-05T12:42:42Z'? This will permanently erase all data."), I get this error.
i did not mean the 'pbs storage' selection in the tree, but the actual web interface of the proxmox-backup-server ( https://<ip-of-pbs>:8007/ )

I made a new backup in the hope of getting a clean backup, but in the backup log I still see:

INFO: backup was done incrementally, reused 12.71 GiB (63%)

And when I restore that backup, the machine is corrupted.
ah yes... to get rid of the actual chunks, you'd either have to
* delete all backups that reference the chunks, wait a day and do a garbage collect (see the sketch below)
* or (possibly easier) change the encryption key (caution: pve will not know that some backups were done with an older key, and restoring those would fail)

alternatively, you could add a new datastore and make a backup there (just to test)
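
a sketch of the first option from the pbs host's shell (the datastore name is a placeholder):

# after the snapshots are deleted and the grace period has passed, remove unreferenced chunks
proxmox-backup-manager garbage-collection start <datastore>
# check the status/result of the run
proxmox-backup-manager garbage-collection status <datastore>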
 
Waiting a day didn't help, even after a garbage collect.

I wanted to test it (just to see whether backups can be made and restored correctly) with a new datastore, but even with all the backups for that machine gone, when I create a new datastore and back up there, the backup is still done incrementally.

I think it will be best to wipe the whole datastore and start over in my case.
 
when I create a new datastore and back up there, the backup is still done incrementally.
no that cannot happen... deduplication is per datastore, never across datastores.... there has to be something else that was wrong (wrong parameter? pve storage not changed to the new datastore?)
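
a quick way to check that on the pve side (the storage id is a placeholder): the 'datastore' line of the storage definition shows which pbs datastore that storage id actually writes to:

grep -A 5 '^pbs: <storage-id>' /etc/pve/storage.cfg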
 
No, I am backing up to the right store.

On the Proxmox Backup Server:

Add datastore: I added a new store, PBS-TEST, with an empty path.

On the Proxmox VE cluster level:

Edit the old PBS store, uncheck "Enable".
Add a store: Proxmox Backup Server, ID TEST, datastore PBS-TEST, enabled.

On the virtual machine level:

Backup: make sure all backups are deleted.
Backup now, storage: TEST.

I see in the log that the command INFO: starting new backup job: vzdump 104 --node pm03 --remove 0 --storage TEST --mode snapshot is executed.

After the backup I see that the newly created, empty dir on the PBS server is being populated with new data.

But I still see:

INFO: 100% (20.0 GiB of 20.0 GiB) in 1m 16s, read: 2.7 GiB/s, write: 28.0 MiB/s
INFO: backup is sparse: 12.53 GiB (62%) total zero data
INFO: backup was done incrementally, reused 12.71 GiB (63%)
INFO: transferred 20.00 GiB in 76 seconds (269.5 MiB/s)
INFO: Finished Backup of VM 104 (00:01:16)
 
INFO: 100% (20.0 GiB of 20.0 GiB) in 1m 16s, read: 2.7 GiB/s, write: 28.0 MiB/s
INFO: backup is sparse: 12.53 GiB (62%) total zero data
INFO: backup was done incrementally, reused 12.71 GiB (63%)
that line is also printed when chunks are reused inside a single backup
(e.g. backing up a completely empty disk with only 0 bytes will print ~99% reused since there is only one chunk that is deduplicated)
 
Sorry, I meant: for the new store I provide a new path, which is an empty directory (the path exists, but there is no data in it).

If it really is the case that a new, full backup is taken on a new store, then unfortunately that doesn't help with the original problem. I restored the machine from the new store, but it is still corrupted.
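
For reference, the restore I did is roughly equivalent to this command (the snapshot timestamp, the new VM ID and the target storage are placeholders):

# restore the backup from the new store (TEST) to a fresh VM ID
qmrestore TEST:backup/vm/104/<snapshot-timestamp> 9104 --storage <target-storage>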

I have included log files of the backup (backup.txt), the restore (restore), and the read log (read.txt) of the Proxmox Backup Server (I already found out it is logged in user.log).

There are two interesting lines:

Nov 6 14:50:41 pm04 proxmox-backup-proxy[1318908]: protocol upgrade done

and the error that I also saw when using the original datastore:

Nov 6 14:51:42 pm04 proxmox-backup-proxy[1318908]: TASK ERROR: connection error: Transport endpoint is not connected (os error 107)
 
