UEFI installation issue with ZFS on root on Dell R630

Okay, let me preface this with: I'm not new to Proxmox and I've run it on lots of Dell servers. I've also been working on this for a few days, so I might have missed a step or two of the actual troubleshooting.

This issue only occurs with UEFI boot, not BIOS boot.


The setup consists of an 11-node cluster made up of 3 Dell R620s, 4 Dell R630s, and 4 Supermicro servers.

The 4 Supermicro servers were for a 24-drive all-SSD cluster that I've decided to retire so I can go hyper-converged on the Dells and save power. During this transition I will be pulling out the 3 R620s and replacing them with R630s, so the end configuration will be 7 R630s with 10 bays each, 7 of those bays dedicated to Ceph and SSDs.

Because I'm removing the PERC H730s and replacing them with an LSI 9300-8i in IT mode, I'm dropping each server out of the cluster one by one, replacing the storage controller, installing Proxmox again, and adopting it back into the cluster under the same name. I've done this multiple times over the years, so no issue there.

The first two servers went without a hitch: Proxmox 7.4 was installed from the USB drive and I was able to set up ZFS on root using UEFI. The next server I tried failed; no matter what I attempted, it would install but wouldn't boot on the next boot. There were about 2 months between the first two and this one. When I came back I was originally trying to install over iDRAC and virtual media, then I remembered that that would fail and I had to boot from USB for UEFI to work. So I tried another install from USB and it failed as well. I got fed up with it, so I pulled the server from the rack and drove home where I'd have more time to mess with it. Once I got it home and powered on, I put the USB drive back in, connected to the server through the IPMI, completed the install, and the server rebooted into Proxmox with no issue. I scratched my head on this and decided that maybe it was the USB drive, because I'd reformatted it and written the ISO back onto it.

I took this working server back to the data center, racked it, and everything was perfect. It joined the cluster, I installed Ceph, did a somewhat sketchy migration where I pulled the drives from another Ceph node, discovered the physical volumes and then the logical volumes, and started all the Ceph OSDs, and boom, the server was online and had replaced one of the existing Supermicros. Don't try this unless you are willing to lose data. YMMV.
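For the curious, once the drives were physically moved the migration went roughly like the outline below. The exact steps will vary by Ceph release, so treat this as a sketch of what I did rather than a recipe:

  pvscan --cache                    # let LVM re-discover the moved OSD physical volumes
  vgchange -ay                      # activate the OSD volume groups and logical volumes
  ceph-volume lvm activate --all    # recreate the OSD tmpfs dirs and enable the systemd units
  systemctl start ceph-osd.target   # make sure all the OSD services come up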

OK, now back to the first server that will not boot after installation. Using the same USB stick as the previous day, I did the install and then the server hung on the first UEFI boot.

I've tried upgrading the iDRAC, Lifecycle Controller, and BIOS, going up through the releases, while also trying Proxmox versions from 6.4 to 8.0, and I've tried going back down on the Dell iDRAC, Lifecycle Controller, and BIOS and running through all the Proxmox versions again each time.

The three servers that have completed the installation with the new LSI cards are running:
BIOS 2.13.0
Firmware 2.82.82.82
Lifecycle Controller Firmware 2.82.82.82

On this server, BIOS 2.13 and that firmware cause a UEFI boot exception and a crash. The only similar information I can find from Dell is a known issue when the firmware isn't upgraded in the correct order: https://www.dell.com/support/kbdoc/...protection-fault-during-uefi-pre-boot-startup

Dell says that to prevent this issue, updates should be performed in the following order:
  1. iDRAC
  2. LCC
  3. BIOS
I also reset the Lifecycle Controller at some point so I had a clean iDRAC and BIOS, then proceeded to try various up/downgrades to attempt to get it to boot.

On BIOS 2.17.0 and firmware 2.84.84.84, the latest for this server, the server hangs after enumerating boot devices and will not continue to boot. At least with 2.13.0 I get an error code.

What I don't understand is that I can have a working server with the same BIOS and firmware as the one I'm trying to install on, and it still fails.

Tonight I'm going to go down one release older than the versions I want to end up on and then upgrade in the order Dell says one more time. Hopefully this will work, or someone here can shed some light on something that I've missed.

I've got 3 more new-to-me servers to do this on after this one is completed. Trust me, the thought of mdadm and vanilla Debian has crossed my mind, since I don't really need ZFS on root. I just need an easy way to fail over; GRUB on both boot drives plus a hot swap has worked for many years.
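By that I mean the old pattern of Debian on an mdadm RAID 1 with the bootloader written to both disks so the box survives either one dying. A minimal sketch, assuming a legacy-BIOS install with /dev/sda and /dev/sdb as the mirror members:

  grub-install /dev/sda   # put GRUB on the first mirror member
  grub-install /dev/sdb   # and on the second, so it boots with either disk missing
  update-grub             # regenerate the GRUB config
  cat /proc/mdstat        # confirm the md mirror is healthy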
 
"The next server I tried failed; no matter what I attempted, it would install but wouldn't boot on the next boot."
How exactly did it fail? Where does it hang, or does it reset itself?
Were there any warnings when you installed from the ISO?
Does it work if you install Debian first and then PVE on top of it:
https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_12_Bookworm
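The short version of that page, from memory, looks roughly like this; please double-check the wiki for the current repository and key locations:

  # on a fresh Debian 12 install, as root
  echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" > /etc/apt/sources.list.d/pve-install-repo.list
  wget https://enterprise.proxmox.com/debian/proxmox-release-bookworm.gpg -O /etc/apt/trusted.gpg.d/proxmox-release-bookworm.gpg
  apt update && apt full-upgrade
  apt install proxmox-ve postfix open-iscsi
  # then reboot into the Proxmox kernel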

A screenshot of where the boot fails might help.

Otherwise, please compare the actual settings in the BIOS between a working machine and one that does not work (in my experience Dell does not ship them with consistent settings).
 
While I was scratching my head I decided to put the H730 Mini back into the server; I was going to create a RAID setup in an attempt to rule out the LSI 3008. While I was in there I found that there is an option to turn each disk into a non-RAID disk. I'm assuming some sort of configuration might have been lurking on these drives. I've put them into a batch of servers with PERC controllers to clear the RAID arrays and also to make them non-RAID disks. They are currently running a "clear" on them that is going to take a few hours. I will report back to see if this helps resolve the issue.
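For what it's worth, if these drives had been sitting behind a plain HBA I probably would have just wiped the stale metadata from a live Linux environment instead of running the controller-level clear. A rough sketch, with /dev/sdX standing in for each data disk (destructive, obviously):

  wipefs -a /dev/sdX         # strip old RAID/ZFS/filesystem signatures
  sgdisk --zap-all /dev/sdX  # wipe the GPT and protective MBR structures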
 
I have a similar problem with a Dell R650xs and I'm currently running some tests. How is it going with BIOS mode? I also tried with the disks set to non-RAID and creating a ZFS RAID 1, and it also fails, although my failure is a little different: Proxmox does start, but an LXC container set to start at boot generates errors and I have to manually restart the container for it to work.
 
BIOS mode works fine, and after all the head scratching and Lifecycle Controller and BIOS up/downgrades I decided I'd move on to another server.

For the next 3 servers I had to flash the LSI 3008s into IT mode, as they had not already been done.
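In case it saves someone a search, the rough flashing sequence I followed from an EFI shell was along these lines. The firmware and boot ROM file names are just placeholders for whatever is in Broadcom's current 9300-8i IT package:

  sas3flash -listall                                  # confirm the controller is visible
  sas3flash -o -e 6                                   # erase the existing flash (do not power off after this)
  sas3flash -o -f SAS9300_8i_IT.bin -b mptsas3.rom    # write the IT firmware and boot ROM
  sas3flash -listall                                  # verify the new firmware version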

Server #2 BIOS 2.17 Lifecycle 2.84.84.84.0
I did the install but dropped to a shell and ran the following commands.

proxmox-boot-tool init /dev/sda2 completed, but gave an error that I didn't record; it was something along the lines of it couldn't find the kernel.
I then ran the same command on /dev/sdb2 and it completed and wrote out the information, so I ran it again on /dev/sda2 and it completed with no errors. Next I ran proxmox-boot-tool reinit and then rebooted. It got through the first boot stage but dropped me to the initramfs shell complaining that it couldn't import the pool. I imported it with disk-by-id and typed exit so initramfs would attempt the boot again, and soon I was presented with a login screen. I then did the updates and restarted the server for the second time. It went straight to the login, so I was super happy. I had also removed the USB stick <---- important.
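For reference, the sequence from the shell and the initramfs prompt looked roughly like this (sda2/sdb2 are the ESPs on my two boot disks; the -N flag and the status check are extras I'd normally add, not something I recorded at the time):

  # from the installer shell after the install finished
  proxmox-boot-tool init /dev/sda2
  proxmox-boot-tool init /dev/sdb2
  proxmox-boot-tool reinit
  proxmox-boot-tool status                 # sanity check that both ESPs are registered

  # at the initramfs prompt after the failed first boot
  zpool import -d /dev/disk/by-id -N rpool # import without mounting, like the initramfs would
  exit                                     # let the initramfs continue booting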

At this point, I started upgrading server #1 to the same Lifecycle Controller and BIOS while also doing the other servers.

Server #3 BIOS 2.17 Lifecycle 2.84.84.84.0
I did the exact same thing as server #2, but I had started the installer in debug mode, so it dropped me to a shell without having to use F3 on the reboot. While the server was rebooting I ran out to the garage and pulled the USB drive before it started booting. I don't know if it made a difference, but this server jumped right to the login prompt without any issue. So I updated and rebooted and everything was fine.


Server #4 BIOS 2.17 Lifecycle 2.84.84.84.0
Same as #3, so not removing the USB on the first boot must be what caused server #2 to hang on importing the ZFS rpool.

Once server #4 was working, I decided that I would put the drives in server #1 and see if it would boot. Well, it hung just like it had been doing above, so I pulled the cables and the 3008 card and swapped those too. Well, wouldn't you know, server #1 that I've been fighting with booted just fine. This is starting to make some sense now, and it's why server #0 didn't boot at the DC but did boot when I got it home: I had gotten the SAS cards swapped while they were lying on the bench, and when I put #1's card into #0 it booted. I thought that was strange at the time, but it really wasn't, so I took it back and put it into production.

I didn't know it at the time, but either I have one bad SAS card or bad cables. Either way, I've spent way too much time on this project, so I hope that tomorrow I can rule out whether it's the eBay SAS cables or the controller.

Notable posts are:

#20 from https://forum.proxmox.com/threads/error-preparing-initrd-bad-buffer-size.129427/#post-581068
#34 from https://forum.proxmox.com/threads/failed-to-import-pool-rpool.29942/page-2
I'm not sure if I needed to do that ^ on servers 2-4, but I did and they are working.
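In case it helps someone landing here, the workaround usually suggested in threads like that for the "cannot import rpool" race is to give the disks a few extra seconds before the initramfs tries the import. A minimal sketch, assuming the stock Debian/Proxmox zfs-initramfs hooks (the variable names are what I see in /etc/default/zfs; verify on your own install):

  # /etc/default/zfs
  ZFS_INITRD_PRE_MOUNTROOT_SLEEP='5'   # wait before mounting the root dataset
  ZFS_INITRD_POST_MODPROBE_SLEEP='5'   # wait after loading the zfs module

  # then rebuild the initramfs and refresh the ESPs
  update-initramfs -u -k all
  proxmox-boot-tool refresh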
 
I'm having an issue with PVE 8.0 booting. I get "EFI boot mode detected, mounting efivars filesystem" and it never completes booting. When I try PVE 7.4 it works fine. I did previously have an issue with 7.4, but after disabling Secure Boot it worked. Not sure what's happening with 8.0.
 
What kind of SAS controller was it? LSI? I took out the stock Dell H730 Mini Mono and this is happening with the Dell HBA330 Mini I put in. I never tested with the stock controller. I updated the HBA330 firmware after installing it, but I didn't do any sort of "clearing" procedure. Are you just saying you did a full BIOS reset to stock while you were troubleshooting?

I'll post a picture of my error when I get home. I'm curious if you have any insight. It's a UEFI001 software crash. I'm honestly about to bust out the serial cable here and get some more details.
 
It was an eBay 9300-8i LSI SAS3008 that came "preflashed" to IT mode. I tried upgrading the FW on the card and going back and forth between RAID mode and IT mode; it didn't make any difference. I worked on it at night for about a week before I swapped the same card into another working server and noticed it wouldn't POST correctly. Sorry, I don't have much more information, it's been too long :(
 
No worries! I'm glad the issue is behind you. If nothing else, I appreciate the shared misery. I have another HBA330 Mini on the way to test against, and I'm going to test with the H730 in "HBA mode" (only for testing, not permanently, since it isn't a true HBA).

I've attached my cute little error. Did you ever attempt using Serial Over LAN to get diagnostic info like the error says? How did you go about that?
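If I end up doing it myself, my plan is plain IPMI Serial Over LAN through the iDRAC; a rough sketch, assuming IPMI over LAN is enabled in the iDRAC and console redirection is turned on under Serial Communication in the BIOS (the IP and credentials below are placeholders):

  ipmitool -I lanplus -H <idrac-ip> -U root -P <password> sol info      # check SOL is enabled
  ipmitool -I lanplus -H <idrac-ip> -U root -P <password> sol activate  # attach to the serial console
  # type ~. to drop the SOL session when done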
 

Attachments

  • dell-boot-error.jpg (133.1 KB)
