Happy paying Proxmox customer with three nodes, a mix of Dell and Supermicro 2U servers.
Unhappy that I can't upgrade the Supermicro to any 6.5.x kernel.
Not sure if the bug report should go here or a new thread, but it totally blocks the upgrade path.
All 6.2 and earlier kernels on Proxmox 8.x work fine. With the Enterprise repository, upgrading to any of the 6.5.x Linux kernels triggers the fault.
If I select the 6.2 kernel from the boot menu during a node reboot, all is well. So no touching hardware or disks, just selecting a different kernel at the boot menu.
(very) temporary fix: pin the 6.2 kernel using proxmox-boot-tool, so nobody accidentally boots into 6.5.x
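For anyone wanting to do the same, a rough sketch of the pinning step follows. It assumes the node boots via proxmox-boot-tool; the KVER value and the run/pin_kernel helper names are placeholders of mine, and the real version string has to be copied from the kernel list output on your own node. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on the Proxmox node (as root) to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# KVER is a placeholder, not a real version string. Copy the exact
# 6.2 entry from the output of "proxmox-boot-tool kernel list".
KVER="${KVER:-REPLACE-WITH-6.2-VERSION}"

pin_kernel() {
    run proxmox-boot-tool kernel list      # show the installed kernels
    run proxmox-boot-tool kernel pin "$1"  # make this kernel the boot default
}

pin_kernel "$KVER"
# To undo later: proxmox-boot-tool kernel unpin
```

Keeping it dry-run by default means the script can be read and checked before anything changes on the boot configuration.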
Here's the bug:
Servers have a BMC (or lights-out or remote management...) that lets you use a separate ethernet port to talk to the chassis: power on/off, get a remote terminal with screen, keyboard etc. The BMC also shows temperatures and controls the fans. So the BMC is rather important: if you don't know the temperatures and can't control the fans, your $12,000 server can toast itself (or at least sit in thermal throttle mode, which isn't great either).
With Proxmox on kernel 6.2 or earlier, the BMC works fine on the Supermicro AS-2015CS-TNR, which is an AMD EPYC system.
Booting Proxmox with a 6.5.x kernel causes ALL BMC sensor data to go away. There are normally dozens of entries for temperatures, voltages and fan RPM, and EVERY one of them is just gone; the BMC web page says NA for all of them. This is with no code added to Proxmox, totally out of the box.
With 6.2.x this all works from the BMC web page, and after installing ipmitool (optional, I tried with and without) running ipmitool sensor shows them all too.
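To compare the two kernels with a number rather than by eyeballing the list, something like the sketch below could count how many sensor rows read "na". It assumes the usual pipe-separated layout of ipmitool sensor output (name, reading, unit, status, ...); count_na is a helper name I made up, and the two sample rows are invented purely to show the input shape. A few "na" rows can be normal even on a healthy box (unpopulated fan headers, for example), so the interesting case is when every row reads na.

```shell
#!/bin/sh
# Count how many rows of "ipmitool sensor" output have "na" as the reading.
# Input is expected on stdin in ipmitool's pipe-separated layout:
#   name | reading | unit | status | thresholds...
count_na() {
    awk -F'|' '
        NF > 1 {
            v = $2
            gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim spaces around the reading
            total++
            if (v == "na") na++
        }
        END { printf "%d of %d sensors report na\n", na, total }
    '
}

# Made-up sample rows, only to show the expected input shape.
count_na <<'EOF'
CPU Temp         | 45.000     | degrees C  | ok
FAN1             | na         | RPM        | na
EOF

# On the node itself (needs the IPMI kernel modules loaded):
#   ipmitool sensor | count_na
```

Running the commented-out line once under 6.2.x and once under 6.5.x would give two counts that are easy to paste into a bug report.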
So when 6.5.x failed, the first thing I tried was apt purge ipmitool (in case the user space tool or the libraries it pulls in were causing the issue). Sadly, no improvement after the ipmitool purge and a reboot. But rebooting into 6.2.x is still OK.
I then did the obvious... read up on how IPMI figures out what it should do. Oh my, it uses ACPI, and by default it relies on the ACPI tables in the BIOS to figure things out. Nothing like x86 BIOS vs OS battles. Only the user loses.
What I *think* I'd like to do is understand how to blacklist the usual suspects so that the Linux kernel simply does NOT TOUCH the BMC stuff. I can live with access to the BMC to check temps and control fans only through the separate ethernet interface. In the near term I do not need the node to be able to touch the BMC at all, and I would hope that having it touch nothing means it also breaks nothing on the BMC.
The Proxmox boot pin command is really nice. Is there some similar way to blacklist the IPMI module(s) and maybe bisect the problem down to one of them?
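A minimal sketch of what such a blacklist might look like, as far as I understand modprobe.d. It assumes the IPMI drivers are built as loadable modules in the Proxmox kernel (if they are built in, this does nothing), and the module list is my guess at the usual suspects, not something confirmed for this board. The blacklist_conf helper, the script name and the ipmi-blacklist.conf file name are hypothetical. The script only prints the config lines; nothing is written to /etc.

```shell
#!/bin/sh
# Print modprobe.d lines that keep the given modules from loading.
# "blacklist" stops automatic (alias-based) loading; the "install" line
# makes an explicit modprobe of that module fail as well.
blacklist_conf() {
    for m in "$@"; do
        echo "blacklist $m"
        echo "install $m /bin/false"
    done
}

# Assumed module list; verify on the node with: lsmod | grep ipmi
IPMI_MODULES="ipmi_si ipmi_ssif ipmi_devintf acpi_ipmi ipmi_msghandler"

# Word splitting of $IPMI_MODULES is intentional here.
blacklist_conf $IPMI_MODULES

# Possible use on the node (as root), then reboot into 6.5.x:
#   sh this-script.sh > /etc/modprobe.d/ipmi-blacklist.conf
#   update-initramfs -u -k all
#   proxmox-boot-tool refresh   # may be needed so the ESP gets the new initramfs
# To narrow it down, pass fewer modules to blacklist_conf and repeat.
```

Since the function takes the module names as arguments, bisecting would mean starting with the full list and removing one module per reboot until the BMC sensors disappear again.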
I'm reluctant to try building kernels, as Proxmox builds their own, and even if I started with plain Debian and spent time bisecting via the Debian USB live stick route, there's no telling whether that would help get it fixed in Proxmox.
Please advise what to do... Oh, and our (much older) Dell R730xd servers all seem perfectly happy with all the Enterprise updates and have been on 6.5 for a couple of weeks now... with their temps & fans just fine.