Corrupt GPT causing VM boot failures

proximutt

New Member
Nov 9, 2024
Our VM bringup procedure frequently results in VMs that don't start and need manual fixing with a partition utility before they will boot. It happens like this:

1. On creation, pvesh create prints a warning about detecting an existing signature on both of the volumes it creates. An example is [1] at the bottom of this post.

2. When the VM is started, it hangs for a while waiting for the root device before dropping into the initramfs prompt. What resolves the problem is using gdisk to repair the GPT partition table of the VM's disk 0; gdisk's recovery submenu has some options that make this easy (roughly the session sketched below). Then we stop the VM, restart it, and it boots just fine. An abbreviated boot log is [2] at the bottom of this post.
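
Something along these lines, with the device path being the example LV from [1] (which recovery option applies depends on what gdisk reports for the particular disk):

Code:
# run against the VM's disk on the host while the VM is stopped
gdisk /dev/STOR2-PVE1-01/vm-136-disk-0
# at the gdisk prompt:
#   r   recovery and transformation options
#   b   use the backup GPT header to rebuild the main one
#   v   verify the disk
#   w   write the repaired table and exit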

We are running Proxmox 6.4 with an LVM storage backend (the data store is an iSCSI SAN device). The VMs are always based on recent Ubuntu server ISOs: 20.04 and 22.04.

Our current hypothesis is that some vestige of ZFS or mdadm metadata exists on the newly created volumes. Maybe so, but I don't completely understand how that would happen: I wouldn't think a recognizable signature would appear on every "chunk" of the datastore allocated by Proxmox.
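
At least the leftovers can be listed without modifying anything (device path again taken from the example in [1]):

Code:
# list any filesystem/RAID/partition-table signatures on the fresh LV, read-only
wipefs --no-act /dev/STOR2-PVE1-01/vm-136-disk-0
# or probe with blkid's low-level superblock scan
blkid -p /dev/STOR2-PVE1-01/vm-136-disk-0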

It does seem that the way we do the creation is not quite right. We run pvesh create from a script, i.e. non-interactively, and it has no way to answer 'y' to the request to wipe the partition table. Do we need to figure out a way to do this? Is there an option to the create subcommand that instructs it to wipe?

More generally, conceptually (and practically, I guess), what goes into the "preparation process" when Proxmox creates a new block device for use by a VM? Regardless of this particular problem, I'm curious about how it works.

[1]
Code:
RUN: pvesh create /nodes/pve1-c1n2/qemu/10004/clone --newid 136 --target pve1-c4n3 --name server7.dc1.internal --full true
create full clone of drive scsi0 (storage2:vm-10004-disk-0)
WARNING: PMBR signature detected on /dev/STOR2-PVE1-01/vm-136-disk-0 at offset 510. Wipe it? [y/n]: [n]  Aborted wiping of PMBR.
Logical volume "vm-136-disk-0" created.  1 existing signature left on the device.
... many transfer progress messages ...
create full clone of drive ide2 (storage2:vm-10004-cloudinit)
WARNING: iso9660 signature detected on /dev/STOR2-PVE1-01/vm-136-cloudinit at offset 32769. Wipe it? [y/n]: [n]
Aborted wiping of iso9660.  Logical volume "vm-136-cloudinit" created.
1 existing signature left on the device.
"UPID:pve1-c1n2:00006BA5:276EC1D88:67303F0E:qmclone:10004:root@pam:"

[2]
Code:
Booting from Hard Disk...
[    0.000000] Linux version 5.15.0-46-generic (buildd@lcy02-amd64-115) (gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0, GNU ld (GNU Binutils for Ubuntu) 2.38) )
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-46-generic root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0
... boot messages ...
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... [    3.568375] Btrfs loaded, crc32c=crc32c-intel, zoned=yes, fsverity=yes
Scanning for Btrfs filesystems
done.

Begin: Waiting for root file system ... Begin: Running /scripts/local-block ... mdadm: No arrays found in config file or automatically
done.
mdadm: No arrays found in config file or automatically
... duplicates of above ....
mdadm: error opening /dev/md?*: No such file or directory
mdadm: No arrays found in config file or automatically
... duplicates of above ....
done.
Gave up waiting for root file system device.  Common problems:
 - Boot args (cat /proc/cmdline)
   - Check rootdelay= (did the system wait long enough?)
 - Missing modules (cat /proc/modules; ls /dev)
ALERT!  LABEL=cloudimg-rootfs does not exist.  Dropping to a shell!


BusyBox v1.30.1 (Ubuntu 1:1.30.1-7ubuntu3) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs)
 
An upgrade isn't feasible in the short term, so we will continue to fix new VMs manually on creation.
 
Maybe

yes | pvesh create .....

is worth a try?
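
For example, with the clone command from [1] (untested, and it only helps if the prompt actually reads from stdin):

Code:
# pipe an endless stream of "y" into the clone so the wipe prompts are confirmed
yes | pvesh create /nodes/pve1-c1n2/qemu/10004/clone --newid 136 --target pve1-c4n3 --name server7.dc1.internal --full true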

Not sure if PVE 9 has made progress with this issue, as I don't know how to reproduce it.
Anyhow, I searched a bit, and that "PMBR signature detected" message seems to originate from lvcreate, which is presumably what is used under the hood.

lvcreate has "-y" and "--wipesignatures" options:
-W|--wipesignatures y|n
Controls detection and subsequent wiping of signatures on new LVs. There is a prompt for each signature detected to confirm its wiping (unless --yes is used to override confirmations.) When not specified, signatures are wiped whenever zeroing is done (see --zero). This behaviour can be configured with lvm.conf(5) allocation/wipe_signatures_when_zeroing_new_lvs. If blkid wiping is used (lvm.conf(5) allocation/use_blkid_wiping) and LVM is compiled with blkid wiping support, then the blkid(8) library is used to detect the signatures (use blkid -k to list the signatures that are recognized). Otherwise, native LVM code is used to detect signatures (only MD RAID, swap and LUKS signatures are detected in this case.) The LV is not wiped if the read only flag is set.

-y|--yes
Do not prompt for confirmation interactively but always assume the answer yes. Use with extreme caution. (For automatic no, see -qq.)
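
So a manual equivalent would look roughly like this (VG and LV names taken from [1]; the size is just a placeholder):

Code:
# create the LV, wiping any detected signatures and auto-confirming the prompts
lvcreate --wipesignatures y --yes --size 32G --name vm-136-disk-0 STOR2-PVE1-01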

It also seems you can modify the default behaviour of lvcreate by setting wipe_signatures_when_zeroing_new_lvs in /etc/lvm/lvm.conf:

# Configuration option allocation/wipe_signatures_when_zeroing_new_lvs.
# Look for and erase any signatures while zeroing a new LV.
# The --wipesignatures option overrides this setting.
# Zeroing is controlled by the -Z/--zero option, and if not specified,
# zeroing is used by default if possible. Zeroing simply overwrites the
# first 4KiB of a new LV with zeroes and does no signature detection or
# wiping. Signature wiping goes beyond zeroing and detects exact types
# and positions of signatures within the whole LV. It provides a
# cleaner LV after creation as all known signatures are wiped. The LV
# is not claimed incorrectly by other tools because of old signatures
# from previous use. The number of signatures that LVM can detect
# depends on the detection code that is selected (see
# use_blkid_wiping.) Wiping each detected signature must be confirmed.
# When this setting is disabled, signatures on new LVs are not detected
# or erased unless the --wipesignatures option is used directly.
# This configuration option has an automatic default value.
# wipe_signatures_when_zeroing_new_lvs = 1
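
i.e. roughly this on the node:

Code:
# /etc/lvm/lvm.conf -- inside the allocation { } section
wipe_signatures_when_zeroing_new_lvs = 1

The currently effective value can be checked with "lvmconfig allocation/wipe_signatures_when_zeroing_new_lvs".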


@fiona, wouldn't it make sense for Proxmox to do this by default when creating logical volumes?
I think it's unfortunate that interactive questions are asked at a point, and by a command, where answering them interactively doesn't seem to make sense at all.
 