Host Broken after PowerLoss

rechena

New Member
Apr 27, 2020
14
0
1
42
Hi, I know it's been discussed a lot and I've tried everything I could find, but I'm at a loss at this stage :(

This morning we had a power loss and the UPS couldn't keep up. TL;DR: my Proxmox host went down hard, and when I try to bring it back up I get the No Boot message.
I've tried to recover GRUB, but I keep running into issues; it just won't boot.

This is the process I've followed: https://pve.proxmox.com/wiki/Recover_From_Grub_Failure

But when I try to mount the boot partition:


Code:
# mount /dev/sdd1 /media/RESCUE/boot/
mount: /media/RESCUE/boot: wrong fs type, bad option, bad superblock on /dev/sdd1, missing codepage or helper program, or other error.

This is my fdisk output:

Code:
# fdisk -l /dev/sdd
Disk /dev/sdd: 113 GiB, 121332826112 bytes, 236978176 sectors
Disk model: APPLE SSD TS128A
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 822A7CF2-48A5-490D-9093-39071F944796

Device       Start       End   Sectors   Size Type
/dev/sdd1       34      2047      2014  1007K BIOS boot
/dev/sdd2     2048   1050623   1048576   512M EFI System
/dev/sdd3  1050624 236978142 235927519 112.5G Linux LVM

I also tried fsck, but no joy:

Code:
# fsck /dev/sdd1
fsck from util-linux 2.33.1
e2fsck 1.44.5 (15-Dec-2018)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/sdd1

The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>

On /dev/sdd2 there seems to be a dirty bit, which I can't remove either :(

Code:
root@sauron:/# fsck /dev/sdd2
fsck from util-linux 2.33.1
fsck.fat 4.1 (2017-01-24)
0x41: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
1) Remove dirty bit
2) No action
?

Any other ideas what I can do? Thanks for the help.

Should I assume at this stage I need to reinstall the system?
 

rechena

Eventually I had to get the server up and running so I just did a fresh install and reimported the zpools...

One question, though: any idea why this would get corrupted so easily? Is it because it's an SSD?
 

avw

Active Member
May 31, 2020
387
58
28
the Netherlands
Glad to hear you recovered the system. These are just my thoughts on this subject and maybe a partial answer to your question.

There are more posts about boot and/or root partition issues after an unexpected power loss when using non-battery-backed/consumer SSDs. I have personally had a ZFS pool go bad because of a power interruption on a cheap SSD mirror. Even ZFS on a mirror! It did, however, detect that the metadata was broken.

Please note that the BIOS boot partition is not a filesystem, so mounting it and running fsck on it will not work. It contains binary code for running GRUB.
The ESP is a FAT32 filesystem that contains the actual boot files (for UEFI at least), which on some setups include the kernel and initramfs. It still has the dirty bit set because of the hard power-off. However, unless you were actually writing to it (e.g., running apt-get dist-upgrade), I would not expect any errors in that particular filesystem.
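As a rough illustration of the point above, the "Type" column in the fdisk output already tells you what to expect from each partition before you try to mount or fsck it. The helper name `partition_hint` is hypothetical and the mapping only covers the three types shown in this thread; treat it as a sketch, not a complete tool.

```shell
#!/bin/sh
# Hypothetical helper: map the fdisk "Type" column to what the partition
# is expected to contain, so you know whether mount/fsck make sense at all.
partition_hint() {
    case "$1" in
        "BIOS boot")
            echo "raw GRUB code, no filesystem: do not mount or fsck" ;;
        "EFI System")
            echo "FAT filesystem (ESP): check with fsck.fat" ;;
        "Linux LVM")
            echo "LVM physical volume: activate with vgchange -ay, then work on the logical volumes" ;;
        *)
            echo "unknown: inspect with blkid or lsblk -f" ;;
    esac
}

# Example for /dev/sdd1 from the fdisk listing in the first post.
partition_hint "BIOS boot"
```

On a real system, `lsblk -f /dev/sdd` or `blkid /dev/sdd2` shows which filesystem signature (if any) each partition actually carries.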

Is your system booting from UEFI or BIOS (or UEFI with CSM enabled)? Please note that the wiki page is about Proxmox 4 and is out of date (i.e., there was no ESP at that time).
I'm not sure what the "No Boot message" is exactly. Maybe it was a problem with the Proxmox root filesystem on LVM? Were you using thinly-provisioned LVM? There were probably disk writes going on to that filesystem when the power went out.
 

rechena

Thanks so much for the detailed reply, I'll try to answer as well as I can :)

How do I check whether I'm booting from UEFI or BIOS? I tried to look for an indication and couldn't find one.
Fair point on the fsck, I didn't think of that.

Hmm, I'm only using the boot disk for Proxmox with the default settings, nothing else. I do use thin provisioning for the VMs and LXCs, but that's on the zpool. Not sure if this is the info you were looking for?

Thanks
 

avw

rechena said: How do I check whether I'm booting from UEFI or BIOS? I tried to look for an indication and couldn't find one.
Are there files in /sys/firmware/efi/efivars (UEFI with CSM disabled) or does that directory not exist (Legacy/BIOS/UEFI with CSM enabled)?
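A minimal sketch of that check as a script: it reports UEFI when the efivars directory exists and is non-empty, and BIOS otherwise. The function name `boot_mode` and its optional directory argument are assumptions added for illustration (the argument only exists to make the function easy to try against a test directory).

```shell
#!/bin/sh
# Hypothetical helper: report the firmware mode the running system booted in.
# $1 (optional): directory to test; defaults to the real efivars path.
boot_mode() {
    efidir="${1:-/sys/firmware/efi/efivars}"
    # UEFI boots expose firmware variables here; on legacy/CSM boots the
    # directory does not exist (or is empty).
    if [ -d "$efidir" ] && [ -n "$(ls -A "$efidir" 2>/dev/null)" ]; then
        echo "UEFI"
    else
        echo "BIOS"
    fi
}

boot_mode
```

Note that this describes the currently running system, so from a rescue medium it tells you how the rescue medium was booted, not necessarily how the installed system boots.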
rechena said: Hmm, I'm only using the boot disk for Proxmox with the default settings, nothing else. I do use thin provisioning for the VMs and LXCs, but that's on the zpool. Not sure if this is the info you were looking for?
I think it is wise to separate the Proxmox host installation from the VM/CT storage (as you have done), because it allows easier reinstallation if it is needed.

From your initial post, I cannot determine what was broken and why it was not booting. But I guess it does not matter anymore (and we cannot check anyway).
 

rechena

avw said: Are there files in /sys/firmware/efi/efivars (UEFI with CSM disabled) or does that directory not exist (Legacy/BIOS/UEFI with CSM enabled)?

Yep

Code:
root@sauron:~# ls -la /sys/firmware/efi/efivars |wc -l
136

avw said: I think it is wise to separate the Proxmox host installation from the VM/CT storage (as you have done), because it allows easier reinstallation if it is needed.
Yeah, that was a good call all right. I also have a mirror for the VMs, just not for the boot disk, but I suppose it wouldn't have made a difference in this case.
 

avw

Using a mirror (Proxmox handles keeping multiple ESPs up to date) and ZFS allows one to lose a drive and also to detect data corruption if one of the drives is slowly failing.
However, when you use identical drives that perform the same writes at the same time (without proper battery backup), they can still corrupt data together on a power loss. I've seen the same sectors corrupted on both drives (nice to have checksums) and thus not recoverable; luckily it was just a log file.
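To see whether ZFS checksums have flagged anything after an event like this, one option is to look at `zpool status -x`, which prints "all pools are healthy" when nothing is wrong. The wrapper below is only a sketch; the function name `pools_healthy` is hypothetical, and it reads the command output from stdin so it can be tried without a pool.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if `zpool status -x` (read from stdin)
# reports that every pool is healthy.
pools_healthy() {
    grep -qx 'all pools are healthy'
}

# Intended use on a host with ZFS (not executed here):
#   zpool status -x | pools_healthy || zpool status -v
```

If the check fails, `zpool status -v` lists the affected files, and `zpool scrub <pool>` re-reads all data to verify the checksums.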

In your case, I would make sure that the UPS works and that the system shuts down quickly enough (maybe even kill the VMs, which can be restored from backup, if necessary), as this is the battery backup for your drives. Or, if you don't have a working UPS, buy an enterprise SSD with battery backup to mirror or replace your Proxmox drive.
PS: Note that you only need an 8 GB disk for Proxmox itself and it doesn't need to be fast, so get the cheapest drive with (real) power loss protection (PLP).
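A rough sketch of the "shut down quickly enough" idea: a script your UPS software could call when running on battery, which asks every running VM to shut down and hard-stops it after a timeout. The function names, the 30-second timeout, and the `DRY_RUN` switch are all assumptions for illustration; with `DRY_RUN=1` (the default here) it only prints the `qm` commands it would run.

```shell
#!/bin/sh
# Sketch only: stop all running VMs before the UPS battery runs out.
# DRY_RUN=1 (default) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

running_vmids() {
    # stdin: output of `qm list`; prints the VMID of every running VM.
    awk 'NR > 1 && $3 == "running" { print $1 }'
}

stop_vm() {
    # Request a clean shutdown; force a stop if the VM is still up after 30 s.
    if [ "$DRY_RUN" = "1" ]; then
        echo "qm shutdown $1 --timeout 30 --forceStop 1"
    else
        qm shutdown "$1" --timeout 30 --forceStop 1
    fi
}

emergency_stop() {
    # stdin: output of `qm list`
    running_vmids | while read -r vmid; do
        stop_vm "$vmid"
    done
}

# Intended use on the Proxmox host (not executed here):
#   qm list | emergency_stop
```

Containers would need the equivalent `pct` commands, and how the script gets triggered depends on the UPS software in use.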
 