Proxmox now provides its own 6.11 kernel. There is no longer a need for the old Ubuntu kernel workaround.
How did you go back to 6.11.0-9?
Had to go back to 6.11.0-9; with 6.11.0-12 I got hangs again.
6.11.0-13 needs to be tested as well, and also pve-6.11.0-1.
6.11.0-2-pve
proxmox-kernel-6.8/stable 6.8.12-5 all [upgradable from: 6.8.12-4]
root@crashing-server:~# lspci | grep -E -i --color 'network|ethernet'
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
81:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
81:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
root@working-server:~# lspci | grep -E -i --color 'network|ethernet'
44:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GBASE-T (rev 02)
44:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GBASE-T (rev 02)
81:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
81:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
root@crashing-server:~# lspci | grep -E -i --color 'network|ethernet'
44:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GBASE-T (rev 02)
44:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GBASE-T (rev 02)
81:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
81:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
You didn't specify the Supermicro Motherboard you are using at all.
- The other two servers with identical hardware did not freeze once since their deployment (but they were purchased together, and later than the crashing server)... who knows.
You can use minicom over a serial connection to see what's going on. Obviously, make sure first that the BIOS configured the Serial Port correctly and do a test on the "Client" System. You should see e.g. the Boot Process & initial Kernel/Initrd Output via Serial.
/var/log/syslog and /var/log/kern.log and such are, I believe, of no use, since the Kernel is already frozen by that Time, so it cannot write to Disk what caused the Issue in the first Place (obviously).
TIMESTAMP=$(date +"%Y%m%d-%Hh%Mm%Ss"); minicom --capturefile="serial-log-${TIMESTAMP}.log" --device=/dev/ttyS0 --baudrate 115200
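For the freezing host to actually print its kernel messages on that serial line, the serial console also has to be enabled on its kernel command line. A minimal sketch, assuming ttyS0 at 115200 baud and a systemd-boot/proxmox-boot-tool setup (on a GRUB install, edit GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub instead):
sed -i '$ s/$/ console=tty0 console=ttyS0,115200n8/' /etc/kernel/cmdline
proxmox-boot-tool refresh
Then reboot and verify you see boot output in the minicom capture before relying on it to catch a crash.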
Not 100% sure that it is related, but I had a crontab entry for setting the CPU governor to `performance` after every reboot. I removed the entry yesterday and set the governor to `powersave` instead. So far it has been running without interruptions.
I used a tteck helper script for setting up the crontab. This one: https://github.com/tteck/Proxmox/blob/main/misc/scaling-governor.sh.
Edit: I also have the microcode for Intel installed.
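For anyone wanting to try the same change without the helper script, a minimal sketch of switching the governor by hand (assuming the cpufreq sysfs interface is exposed on your CPU; this does not persist across reboots):
echo powersave | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor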
You didn't specify the Supermicro Motherboard you are using at all.
I have zero Experience with AMD EPYC Systems, but if you say that they have Identical Hardware, did you also check the Hardware Revision of the Motherboard (e.g. 1.01 vs 1.20 or whatever) and/or the other Hardware (e.g. NIC) to see if they are any different ?
Possibly also different FW on the NIC and other Components (HBA etc).
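A quick way to compare board revision and BIOS build across the servers is a DMI dump on each host and diffing the output (a sketch using standard tools, nothing Supermicro-specific):
dmidecode -t baseboard -t bios | grep -E -i 'manufacturer|product|version|release date'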
Version: 3.3
Release Date: 03/28/2025
Version: 2.5
Release Date: 09/14/2022
[ 3.046149] mpt3sas_cm0: FW Package Ver(20.00.00.03)
[ 3.046826] mpt3sas_cm0: SAS3816: FWVersion(20.00.01.00), ChipRevision(0x00)
[ 4.452318] mpt3sas_cm0: FW Package Ver(16.00.08.01)
[ 4.471775] mpt3sas_cm0: SAS3816: FWVersion(16.00.08.00), ChipRevision(0x00)
The freezing issues reported here seem sporadic/inconsistent, which points strongly to hardware instability. One person even confirmed it was a RAM problem.
Of the fixes in the first post, one disables command queuing for SATA devices, which is a pretty big deal.
No issues for me on my 3 proxmox devices running 8.x and 6.8.x kernel.
My suggestions, if not done already:
- Update the BIOS.
- Disable PCIe power saving features in the BIOS.
- Make sure the RAM is configured safely: correct voltage, not overclocked, etc.
- Stress test the RAM.
- Disable SATA power management in the BIOS.
- Disable unused hardware in the BIOS.
- Make sure everything is seated properly on the board, and that there are no kinks in the cables.
- Disable special performance-related features in the BIOS, especially vendor-specific ones; these are often unsupported performance hacks.
If still unstable, try the following in the BIOS, if available:
- Disable package C-states.
- Disable core C-states higher than C6.
As always, back up / make a note of anything changed, in case you need to revert.
sed -i '$ s/$/ pcie_port_pm=off/' /etc/kernel/cmdline
sed -i '$ s/$/ libata.force=noncq/' /etc/kernel/cmdline
update-initramfs -u -k all && proxmox-boot-tool refresh
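If applying those, it may be worth double-checking after the reboot that the parameters actually made it onto the running kernel, e.g.:
cat /proc/cmdline | tr ' ' '\n' | grep -E 'pcie_port_pm|libata'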
HBA
HBA is Serial Attached SCSI controller: Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx, Subsystem: Super Micro Computer Inc AOC-S3816L-L16iT (NI22) Storage Adapter, Kernel driver in use: mpt3sas, Kernel modules: mpt3sas
(lspci -nn | grep -i 'raid\|sas\|sata')
The failing server has a newer firmware:
[ 3.046149] mpt3sas_cm0: FW Package Ver(20.00.00.03)
[ 3.046826] mpt3sas_cm0: SAS3816: FWVersion(20.00.01.00), ChipRevision(0x00)
Compared to the working servers:
[ 4.452318] mpt3sas_cm0: FW Package Ver(16.00.08.01)
[ 4.471775] mpt3sas_cm0: SAS3816: FWVersion(16.00.08.00), ChipRevision(0x00)
(sudo dmesg | grep mpt3sas)
So I might check what is the latest firmware here and try to update it.
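Before flashing anything, the currently running HBA firmware can also be read without digging through dmesg, a sketch assuming the mpt3sas driver exposes its usual version_fw sysfs attribute for this controller:
grep -H . /sys/class/scsi_host/host*/version_fw 2>/dev/null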
RAM / memory
I found that the failing server does have memory from Micron Technology (36ASF4G72PZ-3G2R1), the others are Samsung (M393A4K40EB3-CWE) - but both have 3200MT/s speed and look like they have the same specs.
For the other components it's also possible that there are minor version changes, but not really sure what else to check.
CPU / microcode
I found that the amd64-microcode package was not present on any of my 3 servers, I installed the package on the failing one and rebooted.
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_debian_firmware_repo
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_firmware_cpu
update-initramfs -k all -u
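For reference, installing it on PVE 8 / Debian Bookworm is roughly (a sketch, assuming the non-free-firmware component is already enabled in the APT sources as described in the first link above):
apt-get update && apt-get install amd64-microcode
followed by the update-initramfs command above and a reboot.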
"disables command queuing for SATA devices"
Done using the following, according to the thomas-krenn recommendations:
sed -i '$ s/$/ pcie_port_pm=off/' /etc/kernel/cmdline
sed -i '$ s/$/ libata.force=noncq/' /etc/kernel/cmdline
update-initramfs -u -k all && proxmox-boot-tool refresh
BIOS settings:
- already checked for PCIE power management - ASPM is already disabled.
- found the time / clock was exactly 2 hours off - changed that
- did not find any other power related settings e.g. for SATA
- did not find anything about package c-states
Other changes:
- pinned kernel "6.8.12-13-pve" instead of "6.8.12-12-pve" with proxmox-boot-tool and rebooted
- made sure all VMs on the failing server do not have CPU configured as [host] but default - had 3 with [host] (not really sure if this will affect anything)
Will now see if the issues happen again.
My next tasks, if this freeze happens again, are to:
- run memtest
- update firmware of HBA
- do some research on / try the scaling-governor in powersave mode (like @MagicHarri did)
- check if the hardware is in place / properly seated - I feel that it was much more stable after I opened the server and replaced the NIC card a year ago (probably also checking all connections at the time). After that hardware change it was good for a longer period. Maybe I'll reseat all the memory modules and maybe also the CPUs (I had another Supermicro server freeze completely and suddenly not even POST; reseating everything fixed that - completely different datacenter and usage).
So probably - I'll be back in 2-3 weeks when it's freeze time again!
It could indeed be a stubborn Firmware Quirk. I got no direct Experience with this specific one, only LSI 2118/3008 Series Chipsets so far.
I doubt that is an Issue *per se*, unless it would prevent you from booting at all, which clearly is NOT the case here.
Memory could be faulty though, which is why I suggested a Memtest86.
However, if you have lots of Memory it could very well take several DAYS to complete such Test.
Did you look in IPMI / BMC System Event Log (SEL) to see if there are any Memory or BIOS related Issues ?
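Reading the SEL from the OS side, if ipmitool is installed and the kernel IPMI interface is available (a sketch; the same log is also visible in the IPMI web UI):
apt-get install ipmitool
ipmitool sel elist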
Did you run update-initramfs -k all -u afterwards?
No Experience with those.
About the System Time, easiest is probably to apt-get install chrony & systemctl enable --now chrony.
Very weird that there is no C-State related Setting.
I believe you could, if you really wanted to check the Configuration 100% and see if there are any notable Differences that you might have overlooked, get a Configuration "dump" of the BIOS Settings via Supermicro SUM to XML File
See for Instance: https://www.supermicro.com/support/faqs/faq.cfm?faq=28095 and Supermicro Update Manager (SUM) at https://www.supermicro.com/en/solutions/management-software/supermicro-update-manager.
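For comparing two hosts, the in-band variant of that dump is roughly the following (a sketch from memory; check the SUM manual for the exact command names and options on your SUM version):
./sum -c GetCurrentBiosCfg --file bios_cfg_$(hostname).xml
Then diff the resulting XML files from the failing and working servers.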
I'd be surprised if that was the Case. I ALWAYS use the host CPU Type in all of my Guests.
Any notable Correlation with Crashing with regards to the Load / Temperature ? Or high Network Activity ? Are you 100% it's a Kernel Panic and not "just" the NIC shutting down for whatever Reason ?
Could it be caused by high Temperature of CPU / NIC / HBA ? That might be a Reason why powersave helps in some Cases.
It's also possible that you had a bad Silicon Lottery and your CPU needs slightly higher Voltage, but I'm surprised that the Defaults from OEM like Supermicro fail to provide that. I had some Issues when I was undervolting a CPU (crashes every 2-3 Weeks), but since you cannot do that with a Supermicro Motherboard ... unless of course you are using an undervolting Tool (you can find some on GitHub for both Intel and AMD) and the MSR "Trick" (or equivalent) works correctly for you.
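On the temperature question, the BMC sensors usually give a first idea even without extra monitoring (a sketch; add-on NICs/HBAs are often not covered by these readings, which is exactly the blind spot mentioned above):
ipmitool sdr type Temperature
or, from the OS side, apt-get install lm-sensors && sensors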
Alright ... Everything clean in IPMI / Event Log - no issues at all. Also compared the RAM again; their performance specs are 1:1 the same.
Sometimes the OEM / BIOS Vendor either puts them in Places that are impossible to find, or they are altogether hidden.
Maybe I'm just too stupid to find them. Will look that up again.
(uefitool to extract the BIOS, ifrextractor to dump that to a Text File, edit using e.g. the HxD Editor, rebuild a BIOS with a LEGACY version of uefitool, and finally either flash via IPMI OR you may need to use a CH341.)
I'm not monitoring any other addon cards / PCIe temperatures - so technically it could be that one of the NICs is suddenly too hot (but as the system is stable even when I cause severe load on the network, I doubt that's the issue).
And no - not really sure it's a kernel panic. The system just freezes with no (sys)logs that could give a hint. (Also nothing on the virtual HTML5 console - just the Proxmox login.)
I agree.
Hm, could be - but would this really be the case if the server works for a whole year and then „suddenly" breaks? Something must have happened… Never had a faulty CPU, and I've had over 50 Supermicro Servers within the last 10 years. It would be my first. But that's also why reseating everything, including the CPUs, is on my list.
Thanks a lot for your reply!! Really appreciate it.
Kernel Version and ZFS Version if applicable ?
root@failing-server:~# uname -r
6.5.13-6-pve
root@failing-server:~# modinfo zfs | grep ^version:
version: 2.2.3-pve1
root@failing-server:~# zfs version
zfs-2.2.8-pve1
zfs-kmod-2.2.3-pve1
But you cannot type or do anything with the Keyboard, right ? That in itself is NOT conclusive of a Hardware Freeze though. On my Supermicro Motherboards sometimes I get a KVM Freeze instead, thus I need to reboot the KVM and sometimes even the IPMI/BMC altogether (cold Reset).
That can happen e.g. when I have Network Issues and a flood of Warnings/Errors gets printed to the TTY (and seen via KVM).
Rebooting the KVM & IPMI/BMC can sometimes Help.
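For reference, the BMC cold reset can also be triggered from the OS if ipmitool is available (a sketch; the same option is usually found in the IPMI web UI as well):
ipmitool mc reset cold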
auto lo
iface lo inet loopback
auto enp68s0f0
iface enp68s0f0 inet manual
#10G PCIE - UPLINK BOND
auto enp68s0f1
iface enp68s0f1 inet manual
#10G PCIE - UPLINK BOND
auto enp129s0f0
iface enp129s0f0 inet static
mtu 9000
#10G PCIE - CEPH - local routing
auto enp129s0f1
iface enp129s0f1 inet static
mtu 9000
#10G PCIE - CEPH - local routing
auto bond0
iface bond0 inet manual
bond-slaves enp68s0f0 enp68s0f1
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer2+3
offload-rx-vlan-filter off
auto vmbr0
iface vmbr0 inet manual
bridge-ports bond0
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes
bridge-vids 100 200 300
offload-rx-vlan-filter off
#VM BRIDGE
auto vmbr0.100
iface vmbr0.100 inet static
address 130.83.167.138/24
gateway 130.83.167.254
offload-rx-vlan-filter off
#ACCESS VLAN 100
#for local routing ceph
post-up /usr/bin/systemctl restart frr.service
Do you have monitoring via e.g. Uptime Kuma, Zabbix, Gatus, Monit maybe, ..., of your LAN Connection from another Host ? Just to check that it's not a NIC Issue only. Did you try pinging/monitoring the Host on another NIC as well, to see if it's not "only" just the NIC that failed ? To Mind comes a case with a Mellanox NIC shutting down due to Overtemperature (and there was something in dmesg about that). I'm NOT saying it's your Case, but if you only have that NIC monitored via Ping etc, that could be ONE Possibility.
I'd just like that you are 100% sure that it IS a Hardware Freeze, because in that case:
- You could setup a Watchdog to automatically reboot the System
- There should definitively be some Indication somewhere (if not you'll need to setup & test Serial Debugging using Null Modem Cable and/or Netconsole, but make sure you validate that they are working in "normal Operation" before "counting" on them to log the Error)
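Netconsole is the lighter-weight option if no null modem cable is at hand. A minimal sketch for a quick manual test, using placeholder addresses, interface name and MAC (this does not survive a reboot and, like serial logging, should be verified during normal operation first):
# on the affected host: source-port@source-ip/interface,target-port@target-ip/target-mac
modprobe netconsole netconsole=6665@192.0.2.10/eno1,6666@192.0.2.20/aa:bb:cc:dd:ee:ff
# on the receiving host, listen for the UDP stream:
nc -u -l 6666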
Out of Curiosity, did you try Kernel 6.5.x and see if it crashes/freezes with that ? I agree it's NOT ideal but when you are out of Ideas.
Personally, when the Proxmox Kernel failed to boot for whatever Reason, I installed the Debian Stock Backports Kernel (probably NOT needed with PVE 9 / Debian Trixie, I am talking about Bookworm), although of course that will make a bit of a Mess in terms of ZFS and you'll need Debian's zfs-dkms Package & the entire Build Toolchain for building a Kernel Module (plus the Kernel Headers). I'd try Kernel 6.5.x first if I were you.
No Problem. I'm about as lost as you are. I also had a few Quirks on other Systems and unfortunately there is no easy Way around it.
It doesn't matter how unlikely you think an Issue is, once you went through everything that is "common", whatever is left must be your Issue.
But "common" Issues can be a very very very long list, from bent CPU/Motherboard Pins, to glitchy/faulty PSU, to Transient Phenomenon, to "broken" CPU Cores that trigger in some Circumstances but not all the Time, to quircky Firmware of the Motherboard, NIC, HBA, Hardware Faults in one Component that due to a Firmware Bug in another System instead of failing "gracefully" trigger a complete Breakdown, etc