PVE 8.0 and 8.1 hang on boot

I think I'm a bit lost.
In our case:
Dell R240 or R340, booting from a BOSS RAID1 card, LVM, kernel 6.5.11-4-pve, UEFI.
The server stops at "loading initial ramdisk" and never comes up.
The same machines on 6.2.16-19-pve have no problem.

/etc/default/grub:
Code:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
GRUB_CMDLINE_LINUX=""

/etc/initramfs-tools/modules
Code:
simplefb

Proxmox & bios latest possible stable version
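(Side note: after editing /etc/default/grub or /etc/initramfs-tools/modules, the boot files have to be regenerated before the next boot; a typical sequence, depending on whether GRUB or proxmox-boot-tool manages the boot entries on this host, would be:)
Code:
# pick up changes to /etc/default/grub
update-grub
# rebuild the initramfs for all installed kernels after editing
# /etc/initramfs-tools/modules
update-initramfs -u -k all
# only on hosts where proxmox-boot-tool manages the boot partitions
proxmox-boot-tool refresh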
Hi,

Same here with an R240

Tobias
 
nomodeset was enabled.

This is what I see after adding earlyprintk=vga, which is exactly the same as what I saw before: no output whatsoever.
[Screenshot: attachment 58761]
To me it looks like a kernel segfault.
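(For anyone reproducing this: such debug parameters typically go on the kernel command line in /etc/default/grub, followed by update-grub; the exact set below is only an example:)
Code:
# example only -- adjust as needed, then run update-grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet nomodeset earlyprintk=vga"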


Networking is enabled (management interface):
$ ping -c1 esx2
PING esx2.datanom.net (172.16.3.9) 56(84) bytes of data.
64 bytes from esx2.datanom.net (172.16.3.9): icmp_seq=1 ttl=64 time=0.342 ms

--- esx2.datanom.net ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.342/0.342/0.342/0.000 ms

Not that it matters, but I was able to install 7.4 without any problems onto this hardware (below)!

Previously:

Having this exact issue!
ASRock B650D4U-2L2T/BCM
AMD Ryzen 9 7900X
32GB ECC DDR5
 
I think I'm a bit lost.
In our case:
Dell R240 or R340, booting from a BOSS RAID1 card, LVM, kernel 6.5.11-4-pve, UEFI.
The server stops at "loading initial ramdisk" and never comes up.
The same machines on 6.2.16-19-pve have no problem.

/etc/default/grub:
Code:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
GRUB_CMDLINE_LINUX=""

/etc/initramfs-tools/modules
Code:
simplefb

Proxmox & bios latest possible stable version

UPDATE:
On 6.5.11-6 same problem
Hi,

Yes, indeed. I can confirm: same problem as before, the system is unusable with kernel 6.5, unfortunately.

Tobias
 
*Any* update on this? It's quite annoying, to be honest...
You can keep using the 6.2 kernel if it's not affected by such specific HW issues; it still gets updates.

Please also note that this is the community forum. While we look into every issue eventually, the ones we get in through enterprise support naturally have priority, and as we sadly have no such broken HW in our test lab, it's a bit hard to bisect the issue quickly.
We'll still look into it, but it might need a bit more time.
 
EDID is display info stuff, and FWIW, I have some HW that also reports this (IIRC because I run it headless and the vendor/firmware just cannot believe anybody would do so and wants a display connected). It's annoying, but I just ignore it there because it's a cheap mini server that I don't expect too much from. That it happens on server HW does strike me as slightly odd, though.
You never know what crazy bugs you find in a system BIOS. I am not at all surprised that some hardware misbehaves if you don't connect a monitor. Fortunately, dummy monitor plugs cost only around $5. That's a pragmatic solution to this type of bug.

Plug one of those little gizmos into your computer, and it now believes a monitor is connected. No more EDID errors.

Of course, in many cases EDID errors are merely cosmetic, so fixing them wouldn't necessarily address the hangs that people have reported in this thread. But even cosmetic issues are worth addressing, as they can be confusing and hide real problems.
 
I think I'm a bit lost.
In our case:
Dell R240 or R340, booting from a BOSS RAID1 card, LVM, kernel 6.5.11-4-pve, UEFI.
The server stops at "loading initial ramdisk" and never comes up.
The same machines on 6.2.16-19-pve have no problem.

/etc/default/grub:
Code:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
GRUB_CMDLINE_LINUX=""

/etc/initramfs-tools/modules
Code:
simplefb

Proxmox & bios latest possible stable version

UPDATE:
On 6.5.11-6 same problem

I am on an R340 as well with the same issue. 6.5.11+ has issues and I haven't been able to find out why. Using debug just shows it stopping at LVM for some reason.
 
Whenever you suspect a problem with the initramfs, it's a good idea to try booting with the "break=top" kernel command line option. If the initramfs was loaded at all, this should give you a shell prompt, and that would be a useful additional data point.

Of course, you should still pass whatever command line options are necessary to even enable console output. That can be a little tricky to sort out and depends on the hardware in your computer. If in doubt, a serial console can make all the difference.
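A minimal sketch of how that test could look, assuming GRUB as the bootloader (the console= part is only an example and depends on the hardware):
Code:
# at the GRUB menu, press 'e' on the boot entry, find the line starting
# with "linux", and append (one-off, not persistent):
break=top console=tty0
# if the initramfs was loaded and started, you land in a busybox
# "(initramfs)" shell instead of continuing the normal boot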
 
Thanks for your feedback!

So the server is fully up, but fails to initialize the console correctly, suggesting a hang?


That would then point to the removal of simplefb from initramfs:
https://forum.proxmox.com/threads/o...st-no-subscription.135635/page-10#post-608958

Actually, I might have some idea why that could be the cause. Maybe you could try re-adding that module and see if that fixes the issue? That would help confirm the theory (until I find a host here that shows the same issue).
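If someone wants to try that, re-adding the module would look roughly like this (assuming the standard initramfs-tools setup):
Code:
# add simplefb back to the initramfs module list
echo simplefb >> /etc/initramfs-tools/modules
# rebuild the initramfs for all installed kernels
update-initramfs -u -k all
# on hosts where proxmox-boot-tool manages the ESPs (e.g. ZFS/UEFI),
# copy the new images over as well
proxmox-boot-tool refresh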
I have had a lot of trouble today.
3x Dell T140 don't come up. They only work with 6.2, and it was a mess sorting out the customers at the same time.

2x Dell T30 don't come up either.
One of them suddenly worked after some reboots and now runs fine with 6.5.11-6.

The other one only works after rolling back the rpool/ROOT/pve-1 dataset. Otherwise it has no network connection (ifconfig shows lo as the only network interface).
If I roll back to the old snapshot, it boots with both kernels, both with working network, but the 6.5 kernel is (of course) missing some modules, which prevents KVM from working. So this is NOT a kernel thing.
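(For reference, such a rollback would look roughly like this; the dataset and snapshot names are examples, check zfs list -t snapshot for the real ones:)
Code:
# list snapshots of the root dataset
zfs list -t snapshot -r rpool/ROOT/pve-1
# roll back to a known-good snapshot; add -r if newer snapshots
# exist and may be destroyed
zfs rollback rpool/ROOT/pve-1@known-good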
 
Tested several installations and/or upgrades from 7.4 to 8.0/8.1.
(All on older Dell R720 with H310 mini in IT mode, boot on mirrored ZFS, and the internal NIC daughtercard with 2x 10Gbit/2x 1Gbit.)

No success at all booting an upgraded system (7.4 to 8.x); booting to the rescue shell was possible in every case, at any time.
- all upgrades from 7.4 hung on bringing up the network
- downgrading pve-firmware to 3.8.5 was not successful
- simplefb in grub.conf doesn't work for me
- using the old kernel 5.15.126 or 5.15.131 with 8.1 doesn't work for me either

Only new installations came up successfully and could also successfully be upgraded to kernel 6.5.11-[5/6/7]-pve.
No other solution than to reinstall the cluster nodes step by step ...
No problems so far.
 
Same issue here. I had to pin a kernel older than 6.5 in order to get 5 DELL servers to boot. Otherwise it hangs at "loading initial ramdisk" and never comes up without a power cycle.
 
Same issue here. I had to pin a kernel older than 6.5 in order to get 5 DELL servers to boot. Otherwise it hangs at "loading initial ramdisk" and never comes up without a power cycle.
I don't know if this is anecdotal or not, but when I attempt to install Proxmox 8.1 on my external USB/NVMe dongle it has the problem. When I accidentally installed 8.1 on the internal SSD drive (where the guests are normally stored), it booted with no issues.
 
Happy Proxmox paying customer w/three nodes, all Dell and Supermicro 2U servers.
Unhappy that I can't upgrade the Supermicro to any 6.5.x kernel.
Not sure if the bug report should go here or a new thread, but it totally blocks the upgrade path.

Proxmox 8.x with kernel 6.2 and earlier works fine. Using the Enterprise repository, upgrading to any of the 6.5.x Linux kernels causes the fault.

If during a node reboot I select the 6.2 kernel from the boot menu, all is well. So no touching hardware or disks, just selecting a different kernel at the boot menu.

(Very) temporary fix: pin the 6.2 kernel using proxmox-boot-tool, so nobody accidentally boots into 6.5.x.
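For reference, the pin looks like this (the exact version string depends on what is installed; check with the list subcommand first):
Code:
# show the kernels proxmox-boot-tool knows about
proxmox-boot-tool kernel list
# pin the known-good 6.2 kernel (version string is an example)
proxmox-boot-tool kernel pin 6.2.16-20-pve
# once a fixed 6.5 kernel works again, remove the pin
proxmox-boot-tool kernel unpin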

Here's the bug:

Servers have a BMC (or lights-out, or remote management...) that lets you use a separate ethernet port to talk to the chassis, power it on/off, get a remote terminal with screen, keyboard, etc. The BMC also shows temperatures and controls the fans. So the BMC is sort of important: if you don't know the temperatures and can't control the fans, your $12,000 server can toast itself (or at least sit in thermal throttle mode, which isn't great either).

Using Proxmox on 6.2 or earlier the BMC works fine on Supermicro AS-2015CS-TNR which is an AMD Epyc.

Booting the Proxmox 6.5.x kernel causes ALL BMC sensor data to go away. There are normally many dozen entries for temperatures, voltages and fan RPM, and EVERY one of them is just gone; the BMC web page says NA for all of them. This is with no code added to Proxmox, totally out of the box.

With 6.2.x this all works from the BMC web page, and (optionally, I tried with and without) installing ipmitool and running the ipmitool sensor command shows them all too.
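For comparing the two kernels from the OS side, the in-band readout can be checked like this (standard Debian package and ipmitool subcommands, nothing Proxmox-specific):
Code:
apt install ipmitool
# read the sensors through the in-band interface (needs the ipmi_* modules)
ipmitool sensor list
# condensed SDR view as an alternative
ipmitool sdr elist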

So when 6.5.x fails, the first thing I tried was apt purge ipmitool (in case the user-space tool or the libraries it pulls in were causing the issue). Sadly, no improvement after the ipmitool purge and a reboot. But a reboot into 6.2.x is still OK.

I then did the obvious... read up on how the kernel's IPMI support figures out what it should do. Oh my, it uses ACPI, and by default trusts the ACPI tables in the BIOS to figure things out. Nothing like x86 BIOS vs. OS battles. Only the user loses :)

What I *think* I'd like to do is understand how I can blacklist the usual suspects so that the Linux kernel simply does NOT TOUCH the BMC stuff. I can live with accessing the BMC only through its separate ethernet interface to check temps and control fans. In the near term, I do not need the node to be able to touch the BMC at all, and would hope that having it not touch anything results in it not breaking the BMC.

The Proxmox boot pin command is really nice. Is there some similar way to blacklist the IPMI module(s) and maybe bisect the problem down to one of them?
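Not an official answer, but a sketch of how such a blacklist usually looks with modprobe.d; the module names below are the usual in-band IPMI drivers, and whether keeping them unloaded stops 6.5.x from upsetting this particular BMC is exactly what would need testing:
Code:
# /etc/modprobe.d/blacklist-ipmi.conf
# keep the kernel from auto-loading the in-band IPMI/BMC drivers
blacklist ipmi_si
blacklist ipmi_ssif
blacklist ipmi_devintf
blacklist ipmi_msghandler
blacklist acpi_ipmi
# "blacklist" only stops auto-loading; to block explicit loads too:
# install ipmi_si /bin/false
Afterwards rebuild the initramfs (update-initramfs -u -k all) so the blacklist also applies early during boot.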

I'm reluctant to try building kernels, as Proxmox builds their own, and even if I started with plain Debian and spent time bisecting it via the Debian live USB stick route, there's no telling whether that would help get it fixed in Proxmox.

Please advise what to do... Oh, and our (much older) Dell R730xd servers all seem perfectly happy with all the Enterprise updates and have been on 6.5 for a couple of weeks now... with their temps & fans just fine.
 
Hi,
same for me on a Dell PowerEdge R340 when upgrading from Proxmox 7 to 8: the system hangs on "loading initrd" without any output when trying to load a 6.5.x kernel; the last try was 6.5.11-7-pve, but it doesn't boot either. Putting "simplefb" in place made no difference. Older kernels work fine, so for the moment I have pinned 6.2.16-20-pve. But of course this can't be a permanent solution.
I upgraded several Proxmox 7 instances to 8, and they are all up & running; but those are Dell R6515 or R630. The R340 is the only one making this noise (SO FAR! And it can stay that way!).
Regards,
Marianne
 
Same issues on a Dell PowerEdge T140. More information here: https://forum.proxmox.com/threads/cant-boot-efi-stub-loaded.138549/#post-618446

I've pinned the 6.2.16-20-pve kernel at boot as well for now. In my case, the network cards blink, but I'm never able to ping the server or log in, even after physically moving the ethernet cable to other NICs. It's just dead. The keyboard is locked as well, and I have to hard power off the machine (or use iDRAC to reboot).
 
