[SOLVED] VMs cloned or restored from backups have corrupted filesystems

Apr 22, 2024
Hi,

I am having a problem with a Proxmox 8.1 server that I'd really appreciate some help with.

Whenever I clone or restore a VM from a backup, the cloned/restored guest's filesystem is corrupted: boot errors, LVM errors, obvious corruption.

Host storage:
- RAID: Raid 5 (Hardware)
- Drives: 4x INTEL SSDSC2KB960G8 (960GB)

VMs are stored on an LVM thin pool.
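For completeness, the thin pool storage is defined in /etc/pve/storage.cfg roughly like this (a sketch based on the VG/LV names shown further down; the exact content types may differ):

lvmthin: lvm_thin_local
        thinpool lvm_thin_local
        vgname pve
        content images,rootdir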

I currently have only two VMs set up on the machine, both of which I've tried to clone and restore:
1) Guest 1 / Debian 12 - LVM / ext4 partitioning and filesystem for the guest OS
2) Guest 2 / CentOS 9 - XFS / no LVM

The issue is the same when restoring or cloning either of these VMs: filesystem errors, or boot errors related to the filesystem.

  • This is a newly deployed server
  • There is no indication at all of any hardware failure on the host
  • The VMs I am cloning/restoring work fine; only the copies have the issue
  • The server has been burnt in with memtest for days, but I'm starting to run more tests
  • I have tested restoring backups and cloning about 10 times now with the same results (roughly the commands shown below)
  • The issue happens for both "stop" and "snapshot" backups
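For reference, the backups and test copies were made with commands along these lines (the spare VMIDs and the backup path are placeholders, not the exact ones used):

# "stop" and "snapshot" mode backups of the CentOS guest (VMID 1006)
vzdump 1006 --mode stop --compress zstd
vzdump 1006 --mode snapshot --compress zstd

# restore the resulting archive to a spare VMID on the thin pool
qmrestore /var/lib/vz/dump/vzdump-qemu-1006-<timestamp>.vma.zst 1007 --storage lvm_thin_local

# full clone straight from the existing (working) VM
qm clone 1006 1008 --full 1 --storage lvm_thin_local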

qm config for the CentOS guest (VMID 1006):
agent: 1
boot: order=scsi0;ide2;net0
cores: 4
cpu: x86-64-v2-AES
ide2: local-lvm:iso/CentOS-Stream-9-latest-x86_64-dvd1.iso,media=cdrom,size=10021824K
memory: 2048
meta: creation-qemu=8.1.5,ctime=1713732599
name: Centos9Base
net0: virtio=02:00:00:c8:1e:d6,bridge=vmbr0,firewall=1,link_down=1
numa: 0
ostype: l26
scsi0: lvm_thin_local:vm-1006-disk-0,cache=writethrough,discard=on,iothread=1,size=250G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=20aef623-18b6-48c7-abbe-9f2ec476913c
sockets: 1
vmgenid: 544953e7-66a6-47c4-98c1-dff113d92b55

Any ideas are greatly appreciated. Thanks.
 
Hi _gabriel,

Thanks for the reply.

I've attempted the following:
- Restored a backup made on the problem server to a working server: NOT corrupted.
- Restored a known-working backup from a working server to the problem server: IS corrupted.

So it appears the corruption is happening when the backups are restored or the VM is cloned.
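A rough way to double-check that directly would be to make a full clone with the source VM shut down, keep both VMs off, and compare the two thin LVs byte-for-byte on the host, something like this (vm-1008-disk-0 stands in for whatever the clone's disk ends up being named):

lvchange -ay pve/vm-1006-disk-0 pve/vm-1008-disk-0   # make sure both thin LVs are active
cmp /dev/pve/vm-1006-disk-0 /dev/pve/vm-1008-disk-0 && echo identical

If cmp reports a difference, the corruption is happening during the copy itself, not later inside the guest.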

Unfortunately I do not have access to a Proxmox Backup Server.


Any ideas? The server has never been shut down uncleanly.

The server uses a MegaRAID 9560-8i RAID controller managing four 960GB Intel SSDs in a RAID5 configuration. The resulting virtual disk has one partition for BIOS boot, one for the EFI System, and a main 2.6TiB Linux LVM partition.

The LVM setup consists of a volume group named 'pve' with five logical volumes for Proxmox and VM storage: a 60GB 'data' volume for temporary storage or backups, a 2.48TiB thin pool for sparse VM disk storage, a 20GB root volume for the OS, a 4GB swap volume, and a thin volume for a VM disk.

Hardware:
- RAID: MegaRAID 9560-8i (4GB cache)
- Drives: 4x INTEL SSDSC2KB960G8 (960GB)

Configuration:
- RAID5 - 3 data, 1 parity

The RAID device is /dev/sda
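As a sanity check, the capacity adds up: each 960 GB drive is roughly 894 GiB, and with RAID5 three of the four drives are usable, so 3 × 894 GiB ≈ 2.62 TiB, which matches the VG size reported below.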

Partitions on /dev/sda:
Device      Start        End    Sectors  Size Type
/dev/sda1      34       2047       2014 1007K BIOS boot
/dev/sda2    2048    2099199    2097152    1G EFI System
/dev/sda3 2099200 5622988766 5620889567  2.6T Linux LVM

All storage except the boot and EFI partitions is in a single partition (sda3), using Linux LVM.

sda3 contains a single LVM volume group (VG):
# vgdisplay
  --- Volume group ---
  VG Name               pve
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  51
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                5
  Open LV               3
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <2.62 TiB
  PE Size               4.00 MiB
  Total PE              686143
  Alloc PE / Size       672850 / <2.57 TiB
  Free PE / Size        13293 / <51.93 GiB
  VG UUID               m9W22T-ULSY-GeMQ-msDY-pkCv-Nbxj-Nm27P2

# vgs
  VG  #PV #LV #SN Attr   VSize  VFree
  pve   1   5   0 wz--n- <2.62t <51.93g


And that volume group contains these logical volumes (LVs):
# lvs
  LV             VG  Attr       LSize   Pool           Origin Data%  Meta%
  data           pve -wi-ao----  59.99g
  lvm_thin_local pve twi-aot---   2.48t                        0.34   10.55
  root           pve -wi-ao----  20.00g
  swap           pve -wi-ao----   4.00g
  vm-1001-disk-0 pve Vwi-a-t--- 300.00g lvm_thin_local         2.84

Thanks again