Random kernel panic/crashes/reboot

In my initial post I said I had tested with kernel 5.15 and that it also crashed. It looks like I made a mistake in that test, because I can now run the latest 5.15 kernel without crashes.

I upgraded to the Basic plan and am working with Proxmox support on this.
 
Sadly no. Next I will either use an external SATA controller or set up a new system with identical hardware to allow for better diagnosis.
 
Hi, does the Proxmox team have anything to say? I've tried everything, and still no luck. I'm considering complete server replacement with different hardware.
 
I'm also experiencing this issue and am a bit shocked not to have seen a single reply from the Proxmox team here. I'm on Proxmox 8.2.2 with the newest updates and getting the exact same reboot issue as about 5 other people here.
 
I'm facing a similar issue where the machine itself shuts down abruptly without leaving any crash messages in the systemd journal logs.

My setup:
Kernel version: Linux 6.8.4-3-pve
Intel Core i3 9100F
MSI Z390A Pro
RTX 3070
56 GB DDR4 RAM (16+16+16+8)
Proxmox installed on a 500 GB Samsung 980 Pro NVMe
5x 4 TB WD SATA and 1x 256 GB SSD passed through to a VM (used as NAS)
6 VMs in total, no LXCs

Things I've tried:
1. Disabled all power-saving features of the CPU in the BIOS. (Didn't work, Proxmox crashed without any error logs within 24 hours)
2. Added kernel params intel_idle.max_cstate=1 libata.force=noncq to /etc/kernel/cmdline to disable power saving for the CPU and ATA. (Didn't work, crashed within 24 hours)
3. Removed memory and ran the system on just a single memory stick. (Didn't work, crashed within 24 hours)
4. Tried again with different memory. (Didn't work)
5. Replaced the NVMe boot drive with a brand new one. (Didn't work)
6. Disabled TPM 2.0 in the BIOS.

Tried the last option 48 hours ago, and the system has now been up longer than I can remember. I'll report back if it crashes again.

Edit 1:
Crashed today, after 3 days. So disabling TPM 2.0 did not work either; it may have improved stability, but I can't say for sure.

7. Added enable_dc=0 to the boot parameters.
8. Ran ethtool -K eno1 tso off gso off from the command line to fix the hardware unit hang error on the network card.
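
One thing worth ruling out for items 2 and 7: as far as I know, /etc/kernel/cmdline is only used on installs that boot via systemd-boot, and only after running proxmox-boot-tool refresh. GRUB-booted installs take their parameters from /etc/default/grub followed by update-grub. A minimal sketch for checking what the running kernel actually received is below. The helper name missing_params and its file argument are made up for illustration; the only real interface it relies on is /proc/cmdline.

```shell
# Print every expected parameter that is NOT on the given kernel command line.
# $1: file holding the command line (/proc/cmdline on the live system)
# remaining arguments: parameters to look for (hypothetical helper, not a Proxmox tool)
missing_params() {
    cmdline_file=$1
    shift
    for param in "$@"; do
        # Pad with spaces so only whole, space-separated tokens match.
        case " $(tr '\n' ' ' < "$cmdline_file") " in
            *" $param "*) ;;                 # found, nothing to report
            *) printf '%s\n' "$param" ;;     # not active on this boot
        esac
    done
}

# On the Proxmox host (empty output means every parameter is active):
# missing_params /proc/cmdline intel_idle.max_cstate=1 libata.force=noncq
```

If a parameter shows up as missing after a reboot, the change most likely went into the file that the installed bootloader does not read.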

I also tried following this guide to set up a remote log server: https://pve.proxmox.com/wiki/Kernel_Crash_Trace_Log

The guide simply does not work here, because netconsole on the latest Proxmox does not support bonded/bridged interfaces.

Console error:
Code:
netconsole: network logging stopped on interface eno1 as it is joining a master device

With the default eth0 or vmbr0, the error is that the interface is not found. To make netconsole work I would have to install an additional network card, which is not feasible for me right now.
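
The "joining a master device" message fits what netconsole does when its interface gets attached to a bridge or bond: it stops logging on that interface. Below is a rough sketch for listing interfaces that are not attached to anything, assuming the usual sysfs layout in which an enslaved interface has a master link and bridge/bond devices have a bridge or bonding subdirectory. The function name free_interfaces and its optional directory argument are hypothetical, added so the check can be tried against a test directory.

```shell
# List interfaces that are neither enslaved nor themselves a bridge/bond.
# $1: sysfs network directory (defaults to /sys/class/net)
free_interfaces() {
    net_dir=${1:-/sys/class/net}
    for path in "$net_dir"/*; do
        [ -d "$path" ] || continue                 # skip plain files and an unmatched glob
        iface=${path##*/}
        [ "$iface" = "lo" ] && continue            # loopback is never a candidate
        # Bridge and bond devices themselves are skipped.
        [ -d "$path/bridge" ] && continue
        [ -d "$path/bonding" ] && continue
        # An enslaved interface carries a "master" link pointing at its bridge/bond.
        if [ ! -e "$path/master" ] && [ ! -L "$path/master" ]; then
            printf '%s\n' "$iface"
        fi
    done
}

# On the host:
# free_interfaces
```

On a typical single-NIC Proxmox host this prints nothing, since eno1 is a port of vmbr0, which matches the conclusion that a second network card would be needed.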

Will try to figure out other ways to get kernel crash logs.
 
We paused troubleshooting at the point where the next step would have been to bisect the kernel, use a different SATA controller, or set up a new system on identical spare hardware, and I didn't have the time for any of that.

However, on May 31st I decided to test running a VM again. After it did not crash for a day, I started multiple VMs like I used to, and it has been running without any crashes since (10 days now).
I thought one of the latest kernel updates may have solved it.
 
After it ran fine with multiple VMs for 6 weeks, I rebooted yesterday and it is now crashing again. I had applied upgrades during those 6 weeks. I also installed the latest kernel image that was pending, and it still crashes after that.
 
Hello. I have the same issue, also on Z390. The system crashes in a way that cuts power to the motherboard and powers it back on after 1 second.

Kernel: 6.8.12-2-pve
My hardware is based on Z390:

CPU: Intel i9 9900K
M/B: Gigabyte Z390 Aorus Master
RAM: G.Skill 32GB (4x8GB)
1x SATA 60GB drive for the Proxmox base system (Ext4 with GRUB)
2x NVMe drives, 500GB each (WD SN570 + Samsung 970 Evo Plus in LVM-Thin arrangement)

The system literally crashes every ~5 minutes. Sometimes it crashes right as it boots, 10 seconds after it passes the GRUB selection screen. It has crashed 5-10 times in a row without leaving enough time to even log in via SSH.

This is completely nuts and very strange behaviour, as this system runs SUPER SOLID on any other operating system, including a pure Debian Bookworm installation (I did it for testing). I used to have Windows Server 2022 on this system and it ran for months without a single glitch.
 
I just posted this in a new kernel freeze thread... we had sporadic kernel freezes for a long time (happening every few days to every few weeks) on many servers, going back to Proxmox 6 and then 7, using several different kernel versions. We tried many things, but the one thing that ended all freezes was disabling X2APIC both in BIOS/UEFI and in GRUB.

Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet consoleblank=0 nox2apic"
GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/pve-1 boot=zfs consoleblank=0 nox2apic"

We are currently running only dual Intel Xeons (several different generations) with Proxmox 8.2, kernel 6.8.12-1, and so far no unexplainable crashes since disabling X2APIC last year.
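
For anyone wanting to try the same on a GRUB-booted host, here is a minimal sketch of adding the parameter without duplicating it. add_grub_param is a made-up helper, not part of GRUB or Proxmox; it takes the file path as an argument so it can be tried on a copy first, and it assumes a single-line GRUB_CMDLINE_LINUX_DEFAULT="..." entry and GNU sed. It only touches GRUB_CMDLINE_LINUX_DEFAULT, not GRUB_CMDLINE_LINUX.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is already there.
# $1: path to a grub defaults file (try a copy of /etc/default/grub first)
# $2: parameter to add, e.g. nox2apic (assumed to contain no |, & or backslash)
add_grub_param() {
    file=$1
    param=$2
    # Already present as a whole word inside the quoted value? Then leave the file alone.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi
    # Otherwise insert the parameter just before the closing quote.
    sed -i -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"|\1 ${param}\"|" "$file"
}

# Example on a copy:
# cp /etc/default/grub /tmp/grub.test && add_grub_param /tmp/grub.test nox2apic
```

After changing the real /etc/default/grub, update-grub and a reboot are still needed before the parameter shows up in /proc/cmdline.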
 