Proxmox Boot Issue - Missing Disk by UUID

dandelion_crab

New Member
Jan 11, 2025
Hi All,

I had an issue with a power outage where, regrettably, the UPS settings meant that a host in a cluster dropped suddenly instead of shutting down gracefully. The end result is that it now won't boot.

The initial error seemed to be around importing a local ZFS pool, so as part of troubleshooting I have disabled the mounting of this pool at boot for now. However, a new problem has arisen: a disk that fsck cannot check by its UUID.
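(For anyone wondering how to do the same: on a stock PVE install the import is driven by systemd units, so disabling the import target is one way to skip it while troubleshooting; unit names assume a default setup:)

    # see which ZFS units run at boot
    systemctl list-units --all 'zfs*'

    # keep the pool from being imported on the next boot
    systemctl disable zfs-import.target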

I have confirmed that this UUID does not exist. I thought it might perhaps be part of the ZFS pool, given the pool's problem with the initial import, but I had checked the status of the pool prior to disabling the imports and it was reporting no errors or issues.
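The status check was roughly this, with "tank" standing in for my pool's actual name:

    # pool health; this reported no errors before I disabled the import
    zpool status -v tank

    # also handy: list pools that are exported/available for import
    zpool import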

Where else can I check, or what else can I do, to try to isolate/identify why fsck is trying to check a non-existent disk? Regrettably, the journal was mostly just showing me errors about the quorum not being able to be checked. Would it be fair to assume that the host was perhaps trying to migrate something, failed, and this UUID is a foreign disk that was mounted, which is why it can't be found?
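For what it's worth, this is roughly how I have been pulling the errors out of the journal:

    # errors from the previous (failed) boot only
    journalctl -b -1 -p err

    # anything fsck logged during that boot
    journalctl -b -1 | grep -i fsck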
 
blkid didn't show any missing partitions or disks, or anything with that UUID either, but I will triple-check to be safe.
 
Thanks for the tip on blkid. Sorry, I mistook it for lsblk, which didn't yield a result; blkid did, and I have found which disk it is.
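In case it helps anyone else: plain lsblk doesn't print UUIDs by default, which is presumably why it showed me nothing. You have to ask for the column explicitly:

    # blkid prints filesystem UUIDs directly
    blkid

    # lsblk only shows UUIDs if you request the column
    lsblk -o NAME,SIZE,FSTYPE,UUID,MOUNTPOINT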

Will progress this further when I have a bit more time, but now I know which disk it is... it's a step forward. I am not sure what purpose this disk fulfils for now, but I will re-enable the ZFS import target and see if that helps, just in case it's in the ZFS pool.
 
Okay, so it's got nothing to do with the ZFS disks.

The UUID is one of the partitions on the PVE OS disk. I suspect I am going to need to boot into the PVE ISO and then run an fsck against the root LVM partition to see if that fixes it.
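My plan for tracing it is roughly the following, with the placeholder standing in for the UUID from the boot error:

    # which fstab entry references the problem UUID?
    grep -i 'UUID-FROM-ERROR' /etc/fstab

    # map the existing UUID symlinks back to device nodes
    ls -l /dev/disk/by-uuid/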

My primary concern is that this host is ID 1 in the cluster, so I am unsure what it means if I need to reinstall the PVE OS.
 
fsck runs against a filesystem, not against an LVM partition, and fsck will tell you so. You would run it against an LVM volume that contains a filesystem.
 
Yep, so I know I need to use a boot ISO to run it against /dev/mapper/pve-root. I found that from my searching on how to check the disk properly, because even in recovery mode the disk is listed as active, which is obviously the case.
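From what I have read, the rough sequence from the rescue shell would be something like this, assuming the default pve volume group name:

    # make the LVM volumes visible to the rescue environment
    vgscan
    vgchange -ay

    # then check the root LV while it is NOT mounted
    fsck -f /dev/mapper/pve-root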

If I get 5 minutes tonight, my time, I'll try to get the boot ISO running an fsck against the PVE OS disk and see how I go.

Do you know, if I do need to reinstall, what the implications are for the cluster? Will this host need to leave the cluster and then rejoin it later?
 
I don't know the size of your cluster; it depends on whether you have 2, 3 or more nodes, and with 3 or more it's no problem.
I don't expect you'll need a new installation. It looks to me like the filesystem just hit its periodic check threshold after going unchecked for a while (every x mounts or x months).
You don't have to remove the node from the cluster beforehand.
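If it is just the periodic check, for ext4 you can read those counters with tune2fs; roughly like this, assuming the default pve-root LV:

    # when was the filesystem last checked, and what are the limits?
    tune2fs -l /dev/mapper/pve-root | grep -iE 'mount count|check'

    # optionally disable the periodic checks entirely
    tune2fs -c -1 -i 0 /dev/mapper/pve-root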
 
So I ran the boot installer, chose Advanced Options, then Rescue Boot, and for some damn reason it kept booting back onto the PVE OS disk in the host, meaning I can't actually run the fsck.

Will give this a shot next time: the Install in Debug Mode option from the boot ISO.

 
I thought you had a non-ZFS boot PVE and an additional ZFS pool (which doesn't import) ... or not ... ?
 
I meant just using the Install in Debug Mode option, so I can access a console that is not mounting the PVE OS disk (sda) - not ZFS.

Sorry, that was confusing.

So my disk layout is as follows:

sda: the PVE OS disk, partitioned into 3 LVM partitions; sda2 is the partition with the UUID that fsck is having issues with on boot.
sdb: ZFS mirror disk 1
sdc: ZFS mirror disk 2
sdd: a leftover VMFS disk I need to reformat once the host is back up reliably.

When I booted from the boot ISO (sde, I think, as it's a USB boot ISO) and chose Advanced Options, Rescue Boot, for some reason it would then start to boot off sda and put me back into emergency mode.

Will report back how it goes.
 
So I used debug mode to start trying to repair the disk.

There were a few errors associated with that disk during boot. I am not sure what I am doing incorrectly, but fsck wants to access /etc/fstab to do its repair, and it's complaining that no /etc/fstab exists.

I can definitely see the disk but no joy on doing a repair.
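In hindsight, I suspect calling the filesystem-specific checker directly would have sidestepped the /etc/fstab lookup, something like:

    # e2fsck takes the device directly and doesn't consult /etc/fstab
    e2fsck -f /dev/mapper/pve-root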

In the end, I reinstalled the OS, and I used XFS instead of ext4, as some research suggested that, should this happen again, I will have a better chance at recovery.
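One note for anyone following along: XFS doesn't use fsck for repairs; the equivalent tool is xfs_repair, run against the unmounted volume from a rescue shell:

    # dry run first: report what would be fixed without changing anything
    xfs_repair -n /dev/mapper/pve-root

    # then the actual repair
    xfs_repair /dev/mapper/pve-root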

Thanks all for your help and assistance.
 