Proxmox Backups causing data loss with regards to partitions

neuron

More times than I can count, across about half a dozen VMs over several months, I have lost partition tables due to Proxmox backups. Each time I have to boot the VM with a partition recovery tool (EaseUS) and run a scan; thankfully it always finds the partitions, and I recover them. Then I have to repair the Windows install to get it to boot correctly again (the drive letters and active partitions get messed up).

I am running the latest versions of everything (PVE and PBS), everywhere.

Is anyone else experiencing this terrible issue? Why is a backup process actually causing VM partition tables to go missing?
 
Thanks @fabian for making me aware of this bugzilla thread. Glad I'm not the only one, but also not too thrilled that this has been a non-reproducible bug for 2+ years now. I'll add my own experiences here:

- I have been replacing old on-prem servers (we are an MSP) with Micro ITX rackmounts. Very simple stuff: they all have an NVMe drive and maybe a larger SATA SSD for storage.
- Since it's usually a single disk, we use the installer defaults, which automatically set up LVM-Thin volumes
- They are brand new, fresh installs of Proxmox 7.2
- We always create a new management VM, usually Windows 10 or Windows Server
- We do a p2v migration of the old physical server using Veeam, into a new VM created on PVE
- We have a PBS in our colo, about 50TB storage in a ZFS raidz2. We set up a new datastore for each client, on the same ZFS volume, but dumping images in a different path
- Almost always, without fail, we do an initial backup of a newly created VM, and then the partition table gets hosed. I'm just today learning that it's the 512-byte MBR. Curious how others are repairing this at all? I've been using the EaseUS partition recovery tool, and while it thankfully always works, it never gets the active partition or drive letters right for NTFS, so I have to boot again from a Windows image and use the CLI for things like diskpart (assign letter=c) and bootrec switches such as /fixmbr, /scanos, and /rebuildbcd

Now I'm learning that previous backups might be bad, with the missing 512-byte MBR baked in, and that it's only noticed when the VM is rebooted? I've only experienced immediate issues: after a backup job the server is "down", we go and check it, and sure enough it's in a boot loop with "missing hard disk". We always reboot servers within 30 days and have never had a manual reboot come up with a missing disk issue.

I'd like to contribute in any way I can to try to replicate this problem. How can we dump the boot disk before and after the breakage? Is there a way to automate that? I also want to know how people are fixing the missing MBR quickly and easily.
 
thanks for the info!

Now I'm learning that previous backups might be bad, with the missing 512-byte MBR baked in, and that it's only noticed when the VM is rebooted? I've only experienced immediate issues: after a backup job the server is "down", we go and check it, and sure enough it's in a boot loop with "missing hard disk". We always reboot servers within 30 days and have never had a manual reboot come up with a missing disk issue.

if regular reboots work for you, but (sometimes?) backups make the VM crash and then the partition table is hosed, it might be a different (albeit possibly related) issue. just to confirm - you only see the partition table corruption if the VM crashes during the backup?

I'd like to contribute in any way I can to try to replicate this problem. How can we dump the boot disk before and after the breakage? Is there a way to automate that? I also want to know how people are fixing the missing MBR quickly and easily.

if your VM disks are on LVM thin, dumping the first 512 bytes of the boot disk volume before and after the backup (e.g., using dd) and providing pveversion -v, the VM config, the full backup task log and any journal messages during the backup run would be helpful!
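As a rough illustration of the before/after dump suggested above, here is a minimal Python sketch that does the same job as a one-sector dd (something like dd if=DEVICE of=before.bin bs=512 count=1). It is not an official Proxmox tool. The function names and output file names are made up for this example, and the device path in the usage note is an assumption about how a default local-lvm storage exposes its volumes; check the real path on your host before running anything.

```python
import hashlib
from pathlib import Path

SECTOR_SIZE = 512  # a classic MBR occupies the first 512-byte sector


def dump_first_sector(device: str, out_file: str) -> str:
    """Copy the first 512 bytes of `device` to `out_file` and return their SHA-256.

    `device` can be a block device (needs root) or a plain disk image file.
    """
    with open(device, "rb") as src:
        sector = src.read(SECTOR_SIZE)
    if len(sector) != SECTOR_SIZE:
        raise ValueError(f"short read from {device}: got {len(sector)} bytes")
    Path(out_file).write_bytes(sector)
    return hashlib.sha256(sector).hexdigest()


def sectors_match(before_file: str, after_file: str) -> bool:
    """Return True if two previously saved sector dumps are byte-identical."""
    return Path(before_file).read_bytes() == Path(after_file).read_bytes()
```

To use it, call dump_first_sector("/dev/pve/vm-101-disk-0", "before.bin") before the backup job and again with "after.bin" once it finishes (the /dev/pve/... path is hypothetical), then compare with sectors_match("before.bin", "after.bin"). Running the two dumps from whatever pre- and post-backup hook mechanism you already use would automate it; the two small .bin files are what you would attach to the bug report.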
 
This just happened again during tonight's backup run on a new server with a P2V-migrated VM that was created this past weekend. I hadn't had a successful backup all week. This time around, it finally bombed out and blew away the partition table (probably the 512-byte MBR loss, but I don't know how to officially check that). Here are my logs:

INFO: VM Name: NHFR-DC
INFO: include disk 'sata0' 'local-lvm:vm-101-disk-0' 256G
INFO: include disk 'sata1' 'local-lvm:vm-101-disk-1' 1T
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/101/2022-08-20T05:05:29Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 101 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task '4cf2b5c7-28f5-4377-adb1-a01ad00ab1b5'
INFO: resuming VM again
INFO: sata0: dirty-bitmap status: created new
INFO: sata1: dirty-bitmap status: created new
INFO: 0% (460.0 MiB of 1.2 TiB) in 3s, read: 153.3 MiB/s, write: 128.0 MiB/s
INFO: 1% (12.8 GiB of 1.2 TiB) in 1h 9m 5s, read: 3.1 MiB/s, write: 2.6 MiB/s
ERROR: VM 101 qmp command 'query-backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 101 qmp command 'backup-cancel' failed - unable to connect to VM 101 qmp socket - timeout after 5984 retries
INFO: resuming VM again
ERROR: Backup of VM 101 failed - VM 101 qmp command 'cont' failed - unable to connect to VM 101 qmp socket - timeout after 450 retries
INFO: Failed at 2022-08-20 00:25:30
INFO: Backup job finished with errors
TASK ERROR: job errors

The Syslog:

Aug 20 00:48:54 nhfr-pve pvestatd[1037]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 31 retries
Aug 20 00:48:55 nhfr-pve pvestatd[1037]: status update time (6.665 seconds)
Aug 20 00:49:01 nhfr-pve pvedaemon[336160]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 31 retries
Aug 20 00:49:01 nhfr-pve postfix/smtp[888371]: connect to mxa.mailgun.org[3.93.221.84]:25: Connection timed out
Aug 20 00:49:04 nhfr-pve pvestatd[1037]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 31 retries
Aug 20 00:49:05 nhfr-pve pvestatd[1037]: status update time (6.691 seconds)
Aug 20 00:49:09 nhfr-pve pvedaemon[317571]: <root@pam> successful auth for user 'root@pam'
Aug 20 00:49:14 nhfr-pve pvestatd[1037]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 31 retries
Aug 20 00:49:14 nhfr-pve pvestatd[1037]: status update time (6.632 seconds)
Aug 20 00:49:15 nhfr-pve pvedaemon[336160]: VM 101 qmp command failed - VM 101 qmp command 'guest-ping' failed - got timeout
Aug 20 00:49:17 nhfr-pve pvedaemon[336160]: <root@pam> starting task UPID:nhfr-pve:000D8F7C:00DC4AD8:630091FD:qmstop:101:root@pam:
Aug 20 00:49:17 nhfr-pve pvedaemon[888700]: stop VM 101: UPID:nhfr-pve:000D8F7C:00DC4AD8:630091FD:qmstop:101:root@pam:
Aug 20 00:49:20 nhfr-pve pvedaemon[888700]: VM 101 qmp command failed - VM 101 qmp command 'quit' failed - unable to connect to VM 101 qmp socket - timeout after 31 retries
Aug 20 00:49:20 nhfr-pve pvedaemon[888700]: VM quit/powerdown failed - terminating now with SIGTERM
Aug 20 00:49:20 nhfr-pve pvedaemon[317571]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 31 retries
Aug 20 00:49:24 nhfr-pve pvestatd[1037]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 31 retries
Aug 20 00:49:25 nhfr-pve pvestatd[1037]: status update time (6.624 seconds)
Aug 20 00:49:30 nhfr-pve pvedaemon[888700]: VM still running - terminating now with SIGKILL
Aug 20 00:49:30 nhfr-pve kernel: vmbr0: port 7(tap101i0) entered disabled state
Aug 20 00:49:30 nhfr-pve kernel: vmbr0: port 7(tap101i0) entered disabled state
Aug 20 00:49:30 nhfr-pve qmeventd[711]: read: Connection reset by peer
Aug 20 00:49:30 nhfr-pve systemd[1]: 101.scope: Succeeded.
Aug 20 00:49:30 nhfr-pve systemd[1]: 101.scope: Consumed 11h 25min 39.338s CPU time.
Aug 20 00:49:30 nhfr-pve pvestatd[1037]: VM 101 qmp command failed - VM 101 not running
Aug 20 00:49:31 nhfr-pve qmeventd[888755]: Starting cleanup for 101
Aug 20 00:49:31 nhfr-pve qmeventd[888755]: trying to acquire lock...
Aug 20 00:49:31 nhfr-pve qmeventd[888755]: OK
Aug 20 00:49:31 nhfr-pve pvedaemon[336160]: <root@pam> end task UPID:nhfr-pve:000D8F7C:00DC4AD8:630091FD:qmstop:101:root@pam: OK
Aug 20 00:49:31 nhfr-pve qmeventd[888755]: Finished cleanup for 101
Aug 20 00:49:31 nhfr-pve postfix/smtp[888371]: connect to mxa.mailgun.org[52.23.14.211]:25: Connection timed out
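
On checking whether the 512-byte MBR is really what got lost: one possible way, sketched below in Python, is to read the first sector of the boot disk and look at the standard MBR layout (four 16-byte partition entries starting at byte 446, then the 0x55 0xAA signature in the last two bytes). This assumes the disk uses a classic MBR rather than GPT, which has not been confirmed in this thread, and parse_mbr is a name made up for this example. A damaged sector is not necessarily all zeros, so the all_zero field is only a hint.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446         # partition table starts at byte 446 (0x1BE)
ENTRY_SIZE = 16            # four entries of 16 bytes each
SIGNATURE = b"\x55\xaa"    # boot signature in bytes 510-511


def parse_mbr(sector: bytes) -> dict:
    """Summarize the first sector of a disk, assuming classic MBR layout."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError(f"need {SECTOR_SIZE} bytes, got {len(sector)}")
    partitions = []
    for index in range(4):
        offset = TABLE_OFFSET + index * ENTRY_SIZE
        # status, CHS start (3 bytes), type, CHS end (3 bytes),
        # first LBA (little-endian u32), sector count (little-endian u32)
        status, _chs_start, ptype, _chs_end, lba_start, num_sectors = struct.unpack_from(
            "<B3sB3sII", sector, offset
        )
        if ptype == 0:
            continue  # type 0x00 marks an unused entry
        partitions.append(
            {
                "index": index,
                "active": status == 0x80,
                "type": ptype,
                "lba_start": lba_start,
                "sectors": num_sectors,
            }
        )
    return {
        "signature_ok": bytes(sector[510:512]) == SIGNATURE,
        "all_zero": not any(sector[:SECTOR_SIZE]),
        "partitions": partitions,
    }
```

Feeding it the first 512 bytes read from the boot disk volume (or from a saved one-sector dump) before and after a failed backup shows whether the signature and partition entries disappeared, or whether the sector still looks intact and the damage is somewhere else.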