[SOLVED] Random Crashes/Reboots with Proxmox VE 6.1 on EX62-NVME (Hetzner)

Thanks for the feedback, guys. I am hereby marking this thread as solved. Please report if you still face the same crashes after correctly applying the kernel parameter change repeated below:

Edit the file /etc/default/grub and modify GRUB_CMDLINE_LINUX_DEFAULT. I also set consoleblank=0 so that I can ask Hetzner to connect a KVM after a crash and still see the console if the system is unresponsive.

Code:
GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0 intel_idle.max_cstate=1"
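Note that writing the line as shown replaces whatever GRUB_CMDLINE_LINUX_DEFAULT contained before (on a stock Debian-based install this is usually just "quiet"). The following is a minimal sketch for appending the parameter while keeping the existing options. It assumes a POSIX shell, a sed that accepts -i with a suffix, and that the value is a single double-quoted string on one line. add_kernel_param is a hypothetical helper name, not something shipped with GRUB or Proxmox.

```shell
# Hypothetical helper (not part of GRUB): append PARAM to VAR="..." in FILE
# unless it is already there. When it edits the file, a backup is kept as FILE.bak.
add_kernel_param() {
    file=$1; var=$2; param=$3

    # Give up if the file has no VAR="..." line at all.
    grep -q "^${var}=\"" "$file" || return 1

    # Current value between the double quotes.
    current=$(sed -n "s|^${var}=\"\(.*\)\"\$|\1|p" "$file")

    # Already present as a whole word: leave the file untouched.
    case " $current " in
        *" $param "*) return 0 ;;
    esac

    if [ -z "$current" ]; then
        # Empty value: set it to the parameter alone.
        sed -i.bak "s|^${var}=\"\"\$|${var}=\"${param}\"|" "$file"
    else
        # Non-empty value: keep it and add the parameter at the end.
        sed -i.bak "s|^${var}=\"\(.*\)\"\$|${var}=\"\1 ${param}\"|" "$file"
    fi
}
```

It could be called as add_kernel_param /etc/default/grub GRUB_CMDLINE_LINUX_DEFAULT intel_idle.max_cstate=1. Running it a second time changes nothing, so it is safe to repeat.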


Next apply the grub configuration:

Code:
# update-grub

Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.3.18-1-pve
Found initrd image: /boot/initrd.img-5.3.18-1-pve
Found linux image: /boot/vmlinuz-5.3.13-3-pve
Found initrd image: /boot/initrd.img-5.3.13-3-pve
done


Reboot the machine when done. To double-check that no C-states other than C0 and C1 are used after the reboot, run the following command (turbostat is part of the linux-cpupower package):

Code:
# turbostat -S --debug sleep 10
<...>
usec    Time_Of_Day_Seconds     APIC    X2APIC  Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     POLL    C1      POLL%   C1%     CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp PkgTmp  GFX%rc6 Totl%C0 Any%C0  GFX%C0  CPUGFX% Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 Pkg%pc8 Pkg%pc9 Pk%pc10 PkgWatt CorWatt GFXWatt RAMWatt PKG_%   RAM_%
  535   1581501460.693717       -       -       76      1.63    4663    3600    48083   0       1037    77067   0.00    98.34   98.37   0.00    0.00    0.00    46      47      99.59   25.44   18.26   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    27.37   26.19   0.00    0.00    0.00    0.00


As can be seen, all time not spent in C0 (Busy%) is accounted for in C1%; the deeper core states (CPU%c3, CPU%c6, CPU%c7) stay at 0.00.
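If turbostat is not installed, the same thing can be cross-checked from /proc/cmdline and the cpuidle entries in sysfs. This is only a rough sketch: cmdline_has_param and list_idle_states are hypothetical helper names, and the sysfs path is the usual cpuidle location for cpu0, which I assume is present on these hosts.

```shell
# Hypothetical helper: succeed if PARAM appears as a whole word in the
# kernel command line string passed as the first argument.
cmdline_has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print "<state name> <usage count>" for every idle
# state the kernel exposes under the given cpuidle directory (cpu0 by default).
list_idle_states() {
    base=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for d in "$base"/state*; do
        [ -r "$d/name" ] || continue
        printf '%s %s\n' "$(cat "$d/name")" "$(cat "$d/usage")"
    done
}
```

After the reboot, cmdline_has_param "$(cat /proc/cmdline)" intel_idle.max_cstate=1 should succeed, and list_idle_states should list nothing deeper than C1, in line with the POLL and C1 columns in the turbostat output above.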
 
Do you have a Windows VM there?
Of course.
Interestingly, only two of three identical hosts are subject to sudden reboots. I did not change the kernel options on the third; it already runs stably. Yes, Windows is running on each of them.
 
I guess everyone here with the problem was running a Windows VM. I am running Windows Server 2019 with the latest updates.
(Maybe Windows virtualization uses an unusual instruction that is unstable on the EX systems?)

With Linux VMs I had no problems, even without disabling C-states.
 
No crashes since the change on 12 February. Yes, I use a Windows Server VM (with very low load). Thank you all!
 
I run a Windows 10 1909 VM with the latest virtio drivers as well. I don't think it's related though; I'd guess this is a coincidence. I also had no further issues after my server was last replaced, even without applying the cstate kernel parameter change (still running the Windows 10 VM). Anyway, glad everyone is happy now. Remember, report back if you still experience related issues after applying the change.
 
Hi guys,

We are having the same issue on one of our EX62 servers, bought in October 2019. The other, bought in February 2020, does not have any issues (so far).

We are using CentOS.

Has anyone tried the solution described by chotaire on CentOS? If yes, did it work?

Thanks!
 
Yes, it will work. You can either set it in the boot parameters or just disable C-states in the BIOS.
 
Hmmm... we have now tried disabling C-states, but last night we had a crash again and had to manually restart the server at Hetzner :-( Any ideas?
 
Check if they are actually disabled in the idle state log.
We did this, and they seem to be disabled correctly. We have made some tweaks and are now waiting to see whether the server restarts again. If it does, I guess we need to request a BIOS update / access at Hetzner...
 
Hi,
While searching for a solution I found this topic describing the same issue.
We have 3 x EX62-NVME servers and have experienced the same issue on all 3 of them.
One of them has been OK for 178 days now after a hardware replacement that kept the disk drives, but on the other 2 the issue is still present even after the server was changed.

On all servers we are running CloudLinux 7 (no Proxmox).

I tried to apply the fix recommended here: edit /etc/default/grub and modify GRUB_CMDLINE_LINUX_DEFAULT.
But my /etc/default/grub file does not contain GRUB_CMDLINE_LINUX_DEFAULT.
Do I need to add that line? The post says to modify the line, not to add it, so I'm a little confused because GRUB_CMDLINE_LINUX_DEFAULT is missing from my file.

Here is what my /etc/default/grub file looks like now, without any modification:

Code:
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="biosdevname=0 crashkernel=auto nomodeset rd.auto=1 consoleblank=0"
GRUB_DISABLE_RECOVERY="true"

Any suggestions?
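For reference on this question: GRUB adds GRUB_CMDLINE_LINUX to every Linux menu entry and GRUB_CMDLINE_LINUX_DEFAULT only to the normal (non-recovery) entries, so on a file like the one above, which has only GRUB_CMDLINE_LINUX, appending intel_idle.max_cstate=1 to that existing line should have the same effect. A small sketch of picking the right variable follows; grub_cmdline_var is a hypothetical helper name and has not been tested on CloudLinux.

```shell
# Hypothetical helper: print the name of the variable that carries the kernel
# command line in FILE, preferring GRUB_CMDLINE_LINUX_DEFAULT when it exists.
grub_cmdline_var() {
    file=${1:-/etc/default/grub}
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file"; then
        echo GRUB_CMDLINE_LINUX_DEFAULT
    elif grep -q '^GRUB_CMDLINE_LINUX=' "$file"; then
        echo GRUB_CMDLINE_LINUX
    else
        # Neither variable is defined in this file.
        return 1
    fi
}
```

Also note that update-grub is a Debian-style wrapper. On CentOS-family systems the configuration is normally regenerated with grub2-mkconfig -o /boot/grub2/grub.cfg; that output path is an assumption for BIOS boot and differs on UEFI installs.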
 
We run the same hardware with Proxmox 6.x, and today we also had a crash.
Hetzner offered to update the BIOS.

It's not the first crash though; we will see how it goes.
 
I'm still getting this error after applying this...


CPU(s): 32 x Intel(R) Xeon(R) CPU D-1581 @ 1.80GHz (1 Socket)
Kernel Version: Linux 5.15.126-1-pve #1 SMP PVE 5.15.126-1 (2023-10-03T17:24Z)
PVE Manager Version: pve-manager/7.4-17/513c62be

Please report if you still face the same crashes after correctly applying the kernel parameter change repeated below:

help please..
 
