Frozenstiff

Member
Feb 10, 2020
I am running version 6.1-7 on a 2+1 node setup. After I ran updates on both nodes, I got a message that a new kernel was installed, so I rebooted one server after migrating the active VMs to the other.

I waited a little over 10 minutes for the server to come back up, but nothing. I connected to the Dell IPMI console interface and it just shows a grub> prompt (not grub rescue>).

The only post I found concerning this was the Proxmox Grub Failure article, but I don't have physical access to the server at the moment to perform the steps. Walking someone else through the steps isn't an option either.

I've attempted to follow the solution in the highest-rated answer from this link (pictures below), but after rebooting I'm back at the same grub prompt:
The boot process can't find the root partition (the part of the disk that contains the information for starting up the system), so you have to specify its location yourself.


I think you have to look at something like this article: how-rescue-non-booting-grub-2-linux


In short: at the grub rescue> command line, type

ls


... to list all available devices. Then go through each one, typing something like this (depending on what the ls command shows):

ls (hd0,1)/
ls (hd0,2)/


... and so on, until you find

(hd0,1)/boot/grub OR (hd0,1)/grub


In case of EFI:

(hd0,1)/efi/boot/grub OR (hd0,1)/efi/grub


... now set the boot parameters accordingly; type the following with the correct numbers, pressing Return after each line:

set prefix=(hd0,1)/grub
set root=(hd0,1)
insmod linux
insmod normal
normal


Now it should boot. Once it has booted, open a command line (a terminal) and execute

sudo update-grub


... this should correct the missing information and it should boot next time.


If not, you have to go through the steps again and might have to repair or reinstall GRUB (see this article: https://help.ubuntu.com/community/Boot-Repair)
Attachments: hd0 GPT 1-3.png, lvm pve-root boot grub.png, grub params.png

It references another walkthrough, which I've also tried, but when I attempted the "boot from grub" section I didn't know how to set the parameters when using LVM.
The following is the example it gives:
Booting From grub>

This is how to set the boot files and boot the system from the grub> prompt. We know from running the ls command that there is a Linux root filesystem on (hd0,1), and you can keep searching until you verify where /boot/grub is. Then run these commands, using your own root partition, kernel, and initrd image:

grub> set root=(hd0,1)
grub> linux /boot/vmlinuz-3.13.0-29-generic root=/dev/sda1
grub> initrd /boot/initrd.img-3.13.0-29-generic
grub> boot


So, only one of my two servers is functional, and I don't want to reboot it in case it hits the same issue. Is there anything I can do remotely to fix this, or do I just hope the remaining server doesn't go down until I get back in a week? Also, am I entering the wrong parameters, or will the steps I've taken not work because it is an LVM partition?

If there is no other option than to physically be there, is the Proxmox article I previously linked about recovering from a grub failure still applicable?


Thanks in advance for any help that anyone can provide.
 
The message on your first screenshot indicates that you are using HW RAID and that your HW RAID controller lies about the size of the logical disk it creates. GRUB does not allow reading outside of the disk boundaries for obvious reasons. This is a firmware bug of your HW RAID controller. Possible solutions include moving /boot and the bootloader to a non-HW-RAID disk, or switching to UEFI if supported.
 
The hardware RAID controller is not in use (I think it is disabled on this node, in fact). I'm booting from a 128 GB flash drive and it is the only drive in the system. The storage and backups for the VMs are on an NFS share, so both nodes have access in case I need to migrate VMs over or a system fails.

Any other ideas as to what might be wrong, or other information I can provide to help with the diagnosis?

BTW, it's the same config on both nodes, with a boot USB and no other drives. One is an R610 (the one I haven't restarted yet), and the other is an R710, which is the one currently having the problem.
 
What does GRUB say about the whole disk's size? The same issue can of course also apply to the mainboard's firmware and USB drives...
 
It is a 128 GB drive.
This is all I've got, unless there is another way to get the information you want with a different command:
Attachment: hd0 size.png

I only updated Proxmox last night, so I'm unsure how the way the system interprets the partition information would have changed.
 
That looks okay. Can you try booting the kernel and initrd from the GRUB rescue shell? Alternatively, you could boot using a live environment, check whether the EFI partition is full/corrupt/..., and re-install GRUB on the USB drive.
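
If you do manage to attach a live ISO at some point, re-installing the bootloader from a chroot usually looks roughly like the sketch below. It assumes a legacy BIOS boot, that the USB stick shows up as /dev/sdX, and the default Proxmox LVM layout with the root LV at pve/root - verify the real device and LV names with lsblk and lvs first:

vgchange -ay                          # activate the LVM volume group(s)
mount /dev/pve/root /mnt              # root LV, which also holds /boot
for d in /dev /proc /sys; do mount --bind $d /mnt$d; done
chroot /mnt grub-install /dev/sdX     # write GRUB to the USB stick's MBR
chroot /mnt update-grub               # regenerate grub.cfg
# for a UEFI boot you would additionally mount the ESP at /mnt/boot/efi
# and call grub-install with --target=x86_64-efi instead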
 
How do I boot the rescue shell if it goes to the grub prompt before I can key anything in? I don't even get a GRUB boot menu. Is there a command to get to it from the plain grub prompt, or are they the same thing? If so, then I've attempted to do that using the links in my previous post, but I don't know the correct parameters to use because it is an LVM partition. I get stuck at the /dev/sda1 part in the example, since I don't know how to reference the correct partition. Though I'd assume it would be /dev/sda3, being the largest partition.

I would use a live environment, but I don't have an SD card installed in the IPMI to upload an image to boot from, and like I said before I don't have physical access at the moment.


BTW, thanks for the help so far. I'd normally be on site, but with the virus rate increasing I got stuck in another state.
 
I meant the GRUB prompt, yes. Since you are able to read files, you could try running the 'linux' and 'initrd' lines from /boot/grub/grub.cfg after setting root= accordingly. Usually IPMI should allow you to boot from an ISO via SMB or NFS as well, but if you don't have a second, running machine nearby, that's gonna be tricky as well ;)
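
For illustration, on a stock LVM install those grub.cfg lines usually boil down to something like the following at the prompt. Treat the kernel/initrd version as a placeholder for whatever ls (lvm/pve-root)/boot actually lists, and pve-root as the default root LV name:

set root=(lvm/pve-root)
linux /boot/vmlinuz-5.3.18-3-pve root=/dev/mapper/pve-root ro quiet
initrd /boot/initrd.img-5.3.18-3-pve
boot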
 
Yeah, the iDRAC IPMI hasn't been updated for functionality in these systems since they are about a decade old now. I've tried the network boot options, but the only one that works is PXE, and I don't have a server set up there to do that right now.

Do you know which command is used to read the grub.cfg file? Nano and gedit aren't installed; I also tried vi and vim but those aren't there either.
 
You can use help in the prompt to get a list of commands. set pager=1 enables a built-in pager so that you can actually read output that spans more than one screen ;). There is a cat command that displays file contents. The output of ls (lvm/pve-root)/boot should show you which kernel and initrd files are available.

https://help.ubuntu.com/community/Grub2/Troubleshooting should be a helpful resource
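
Putting that together, a session at the prompt might look roughly like this (sketch only; the LV name is the Proxmox default and the paths may differ on your install, so set pager=1 first and check what ls actually shows):

set pager=1
ls
ls (lvm/pve-root)/boot
cat (lvm/pve-root)/boot/grub/grub.cfg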
 
Hey again, sorry about the long delay in replying. I was working on a laptop and the power supply died. By the time my company sent me a new one, they shut down the office due to the shelter-in-place order.

Anyway, I'm back at work now and was able to mount Rescatux 0.73 (a rescue ISO) and reinstall GRUB on the R710 (I used the install GRUB option after it detected which partition GRUB was previously installed on). It booted, and I'm installing the next batch of upgrades.

Thanks for the help with that problem.

Let me know if I should make this another post, or if it is fine to continue here:

My current issue is that the R610, which was running a VM set to HA plus some CTs for databases and other things, seems to have powered off during a storm. Our battery backups only last about 15 minutes and no one was here to shut things down. So the VM moved back to the R710 as soon as I started it up, but the CTs are stuck on the R610, which won't boot now; I guess that's because they weren't set to HA.

When I booted the R610 I got a "RUN fsck MANUALLY" screen, but I forgot to capture it. After I ran the fsck command on dm-1 (which it said had the errors on it), it let me type "exit" and proceeded to boot into Proxmox. It gave me a number of errors during boot and got stuck at the "bring up network interfaces" section. I thought it had failed to initialize the connection because it said it failed after about 5 minutes, but it got an IP and let me ping other devices from the prompt after logging in. It won't let me access the web interface though, so I thought there might be some file corruption.

I tried running fsck -pf from the prompt (output in the attached screenshot). It seems the file system is mounted read-only, which is the only reason it ran at all, and the changes don't take. Also, it only lists dm-0 as having an issue, then it stalls until I hit Enter and takes me back to the prompt. I've restarted twice, but I get the same results. I've even tried the "Rescue Boot" option from the Proxmox ISO, even though I didn't think that would work, and it does the same thing.

Let me know if you'd like me to post the boot log, or anything else, but I think just fixing the file system errors at this point would be a good start before any other troubleshooting.

Any help would be appreciated with this new problem.

Attachment: R610 FSCK file errors after power outage.jpg
 
I'd boot from a live CD and attempt to save stuff in read-only mode before attempting to fix a potentially broken file system... Are the guest images local or on some shared storage? Do you have current backups?
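
Something along these lines from the live environment, assuming the default pve volume group and an ext4 root (the installer default) - adjust the names to whatever lvs/lsblk report, and /path/to/backup is just a placeholder for wherever you can copy things to:

vgchange -ay pve                      # activate the volume group
mount -o ro /dev/pve/root /mnt        # mount the root LV read-only
rsync -a /mnt/etc /mnt/var/lib/pve-cluster /path/to/backup/   # /etc plus the pmxcfs database backing /etc/pve
umount /mnt
fsck.ext4 -f /dev/pve/root            # only once everything is safely copied off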
 
All backups were current before the second server (the R610) lost power, since they are set to back up weekly, and all VM storage and backups are on a separate file server. The only thing on either of the VM servers is the 128 GB flash drive with the OS/config files. I don't suppose there is a way to copy the whole Proxmox configuration and restore it after installing from scratch? It seems like that might save some time, but it would have to include cluster, storage, and any other configs, and I haven't seen an option for that in the interface.
 
In that case I'd save/move the VM configs currently stored in /etc/pve/nodes/BROKEN_NODE/ from your other, working node, remove the broken node from the cluster (read https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node first!), re-install it and re-join the cluster - you only need to re-install, set up the network, and join the cluster. Storage is cluster-wide already, and the guest configs are stored on the synced cluster file system as well.
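
As a rough outline of those steps (the node name r610 and the IP are placeholders here, and do read the linked section on node removal before running delnode):

# on the working node: copy the dead node's guest configs somewhere safe
mkdir -p /root/r610-configs
cp -a /etc/pve/nodes/r610/qemu-server /etc/pve/nodes/r610/lxc /root/r610-configs/
# remove the broken node from the cluster
pvecm delnode r610
# later, on the freshly re-installed node, join it back to the cluster
pvecm add <IP-of-an-existing-cluster-node>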
 
Alright, I'll do that first thing when I get back into the office. It will take way less time to do it in person.

I'll get back to you with the results.
 
So I figured this out but got busy and forgot to close the thread. The solution was to stop using USB flash drives. I'm not sure if it is the R610/R710, but after ~3-6 reboots the boot information was getting corrupted. Brand new SanDisks, PNYs, and others all died in a few different ways within half a dozen or fewer reboots. I've been using external HDDs without any issues on the same USB port for a few months to test it out and have rebooted ~15 times without a failure.

I'm going to replace the RAID card and just boot from a drive connected to that from now on, which will let me do drive mirroring if there is another failure.

Thanks for all the help and sorry about the delay in closing this thread.
 