VM disk corruption

elgingero

Hi

Weird one here.

I have run a number of VMs on Proxmox for years, one of which is Home Assistant. Today I went to reboot the guest and it failed to boot: it hung at the EFI menu, implying there is no valid data on the disk.

I perform a daily PBS backup, and it appears that a few days ago the partition table on the disk was hosed.

This is a backup from 19th March, where a valid partition table is present: -

1711125763924.png
But in subsequent backups the partition table is no longer there: -

1711125803272.png
If I boot the VM from a rescue disk and look at the first 1024 bytes of the disk I get this: -

1711125825209.png

I've checked, and the "100EFI PART" bit of that screenshot is exactly 512 bytes into the disk, so that appears to be the GPT header (its signature is "EFI PART"), but the first 512 bytes are clearly some Home Assistant CSS code!!
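For anyone who wants to repeat that check without eyeballing a hex dump, here is a minimal Python sketch. It assumes a GPT disk with 512-byte sectors; the function name and the example device path are hypothetical, not anything from PVE or Home Assistant. It only looks at three things: the 0x55AA boot signature at the end of sector 0, a protective MBR entry of type 0xEE, and the "EFI PART" signature at the start of sector 1.

```python
import sys

SECTOR = 512  # assumption: 512-byte logical sectors


def inspect_first_sectors(data: bytes) -> dict:
    """Check the first 1024 bytes of a disk for MBR and GPT signatures."""
    if len(data) < 2 * SECTOR:
        raise ValueError("need at least 1024 bytes")
    sector0 = data[:SECTOR]
    sector1 = data[SECTOR:2 * SECTOR]

    # A valid (protective) MBR ends with the bytes 0x55 0xAA.
    boot_sig = sector0[510:512] == b"\x55\xaa"

    # The four MBR partition entries start at offset 446, 16 bytes each;
    # the partition type is byte 4 of each entry. 0xEE marks a protective MBR.
    types = [sector0[446 + 16 * i + 4] for i in range(4)]
    protective = boot_sig and 0xEE in types

    # The GPT header lives in sector 1 and starts with "EFI PART".
    gpt = sector1[:8] == b"EFI PART"

    return {
        "mbr_boot_signature": boot_sig,
        "protective_mbr_entry": protective,
        "gpt_header_signature": gpt,
    }


if __name__ == "__main__":
    # Hypothetical usage from a rescue system: python3 check.py /dev/sda
    with open(sys.argv[1], "rb") as disk:
        print(inspect_first_sectors(disk.read(2 * SECTOR)))
```

A disk in the state described above would report the GPT header signature as present while both MBR checks fail, which fits only sector 0 having been overwritten.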

The way I see it, there are two ways this could happen: -

* The operating system saw fit to write a configuration file directly to /dev/sda or similar...
* Something went wrong in the hypervisor and it jumbled blocks around.

I will of course be posing the former question to the Home Assistant community. My question to this community is: are there any known issues where PVE (or the underlying thin LVM) mixes up data blocks or the pointers to them? In other words, could what I'm seeing be a pointer for the first block that has gone awry?

My intention is to restore the last known good backup to another file and copy over just the first 512 bytes to recover the partition table, as I know I've made lots of changes since that backup. But of course, if it's an underlying LVM/Proxmox issue I'd like to know, because there could be other corrupty-weirdness going on.
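A rough sketch of that "copy just the first 512 bytes" step in Python, under stated assumptions: the known-good backup has already been restored to a separate raw image file, the sector size is 512 bytes, and the function name and all three paths are hypothetical. It saves the current sector 0 to a file before overwriting it, so the change can be undone.

```python
import os

SECTOR = 512  # assumption: 512-byte logical sectors


def restore_sector0(good_image: str, damaged_disk: str, save_old: str) -> None:
    """Copy sector 0 from good_image onto damaged_disk, keeping a copy of the old one."""
    with open(good_image, "rb") as src:
        good = src.read(SECTOR)

    # Refuse to write anything that does not look like a valid MBR sector.
    if len(good) != SECTOR or good[510:512] != b"\x55\xaa":
        raise ValueError("good_image has no 0x55AA boot signature in sector 0")

    with open(damaged_disk, "r+b") as dst:
        # Keep the overwritten sector so the operation is reversible.
        old = dst.read(SECTOR)
        with open(save_old, "wb") as bak:
            bak.write(old)

        dst.seek(0)
        dst.write(good)
        dst.flush()
        os.fsync(dst.fileno())
```

Hypothetical usage from a rescue system, with the VM shut down: `restore_sector0("/mnt/restore/good.raw", "/dev/sda", "/root/sector0.old")`. Only the first 512 bytes of the target are touched; everything after that, including the GPT header in sector 1, is left as it was.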

For info, I'm on 7.2.7 and yes, I know it needs updating, but I'm posting the version number in case there are any known bugs in this version.

Thanks
 
There was a problem in the past where using SATA could cause the partition table to get corrupted during backups. I thought this got patched, but I'm not totally sure. As your PVE is badly outdated, this might be the problem. ;)
You might want to switch from SATA to virtio SCSI anyway, as it has more features, is faster and is best practice.
And upgrade to PVE 8.1.5, of course. PVE 7.x will be End-of-Life this summer and you aren't even on the latest 7.4, so that's 2 years of unpatched bugs and serious vulnerabilities...

You might want to search this forum or the bugtracker for that thread I have in mind.
 
Hi,

Not only during backups, but whenever the guest reset the SATA device while writes were in flight. Yes, this was fixed in Proxmox VE 8/pve-qemu-kvm >= 8.0.2-7: https://git.proxmox.com/?p=pve-qemu...f;hp=816077299c92b2e20b692548c7ec40c9759963cf

Perfect - thanks! Looks like I need to expedite an upgrade :)

I successfully recovered my guest by restoring the partition table from an older backup, and this gives me confidence that it is likely only sector 0 that was impacted and not potentially other random bits of data.

Thanks
 