AMD Ryzen 9 5950X, PVE 8.2.2, kernel 6.8.4-3-pve: crashing/rebooting every 2-3 days

May 21, 2024
Hello

I have 2 identical Hetzner root servers (128GB RAM) in a cluster. One of them (the master) crashes every 2-3 days; the other has been running fine for 9 days. Both run a handful of Linux VMs, nothing special, just web servers.
journalctl doesn't show anything unusual.

# uname -a
Linux n1 6.8.4-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-3 (2024-05-02T11:55Z) x86_64 GNU/Linux
root@n1:/var/crash# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.8.4-3-pve root=ZFS=/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet crashkernel=384M-:128M

I tried to enable kdump, but I'm not sure whether I should follow https://www.cyberciti.biz/faq/how-to-on-enable-kernel-crash-dump-on-debian-linux/ or whether the pve kernel needs anything special.

thanks
Patrick
 
Not NFS but CIFS, I'm mounting Hetzner storagebox for backup
Yeah, no.
I've seen some NFS-related issues on this forum, but CIFS is fine; I'm using the Storage Box too.

Sorry, then I don't know. Maybe someone else does.
 
@poberholzer Have you looked at downgrading the kernel to the 6.5 series, as lots of people are having issues with the 6.8 series?

As a data point, my desktop/workstation is using a 5950X and runs Proxmox as the main OS.

It's currently using kernel 6.5 (because the Nvidia card doesn't like the 6.8 kernel), and has been really stable:

Bash:
# uname -a
Linux home1 6.5.13-5-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.13-5 (2024-04-05T11:03Z) x86_64 GNU/Linux
 
I'm running normally on PVE 8, kernel 6.8, with a 5950X and an Nvidia 3900, and passthrough is working. Can you update the BIOS?
 
The BIOS is rather recent: 3606, 03/11/2024.
It's the same motherboard and CPU as the other host, which is fine. I have now done a complete reinstall and moved only non-critical VMs to this node.

Any hints on enabling crash dumps under Proxmox?
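
For the crash dump question, a minimal sketch for checking whether a crash kernel is actually armed on the node. It assumes the stock Debian kdump-tools package is what provides `kdump-config`; whether the pve kernel on a ZFS root needs extra steps is not something this sketch answers. `crashkernel_param` is a helper name made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the crashkernel= value from a kernel
# command line read on stdin (prints nothing if it is not set).
crashkernel_param() {
    tr ' ' '\n' | sed -n 's/^crashkernel=//p'
}

# Memory reservation requested on the running kernel's command line.
echo "crashkernel on cmdline: $(crashkernel_param < /proc/cmdline)"

# 1 means a crash kernel is loaded, 0 means kdump would not fire.
if [ -r /sys/kernel/kexec_crash_loaded ]; then
    echo "kexec_crash_loaded: $(cat /sys/kernel/kexec_crash_loaded)"
fi

# kdump-config ships with Debian's kdump-tools (assumed to be installed).
if command -v kdump-config >/dev/null 2>&1; then
    kdump-config show
fi
```

If `kexec_crash_loaded` reads 0 even though `crashkernel=` is on the command line, the reservation alone is not enough and no dump will end up in /var/crash.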
 
Hello,

Please consider either pinning the 6.5 kernel or upgrading to the 6.8.8 kernel, which is, as of today, only available in the no-subscription repository. We have received positive user feedback that it fixes multiple freezes seen with kernel version 6.8.

If you want to pin an older kernel version, we have instructions for it in our documentation [1].

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot_kernel_pin
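
As a rough illustration of the pinning route, the sketch below picks the newest installed kernel of a given series and prints the pin command instead of running it. It assumes `proxmox-boot-tool kernel list` prints one version per line ending in `-pve`; `newest_in_series` is a helper name made up for this example, and the documentation linked in [1] remains the reference.

```shell
#!/bin/sh
# Hypothetical helper: from `proxmox-boot-tool kernel list` output on stdin,
# print the newest installed kernel whose version starts with "$1.".
newest_in_series() {
    awk -v prefix="$1." '
        { gsub(/^[ \t]+|[ \t]+$/, "") }          # trim surrounding whitespace
        index($0, prefix) == 1 && /-pve$/ { print }
    ' | sort -V | tail -n 1
}

# Only meaningful on a Proxmox host; prints a suggestion, changes nothing.
if command -v proxmox-boot-tool >/dev/null 2>&1; then
    target=$(proxmox-boot-tool kernel list | newest_in_series 6.5)
    if [ -n "$target" ]; then
        echo "To pin:   proxmox-boot-tool kernel pin $target"
        echo "To undo:  proxmox-boot-tool kernel unpin"
    else
        echo "No 6.5 kernel is installed on this host."
    fi
fi
```

Printing rather than pinning is deliberate: changing the boot kernel on a remote root server is worth doing by hand, with a fallback in mind.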
 
Hi,
are you using the host CPU type for the VMs, and have you assigned more cores to VMs than the host has? I'm asking because there is a thread where a similar CPU model had issues with that: https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/
That thread also contains references for using kdump with ZFS: https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/post-677771
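
A quick way to answer the CPU question could look like the sketch below. It reads each VM's config with `qm config`, computes sockets x cores, and compares that per VM against the host's thread count from `nproc`. Reading "more cores than the host has" as a per-VM comparison is an assumption on my part, and `vm_vcpus` is a helper name made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print sockets x cores from `qm config` output on stdin.
# A missing key is treated as 1, matching the qm defaults.
vm_vcpus() {
    awk -F': *' '
        $1 == "cores"   { cores = $2 }
        $1 == "sockets" { sockets = $2 }
        END { print (sockets ? sockets : 1) * (cores ? cores : 1) }
    '
}

# Only runs on a Proxmox host where qm is available.
if command -v qm >/dev/null 2>&1; then
    host_threads=$(nproc)
    qm list | awk 'NR > 1 { print $1 }' | while read -r vmid; do
        vcpus=$(qm config "$vmid" | vm_vcpus)
        cpu=$(qm config "$vmid" | awk -F': *' '$1 == "cpu" { print $2 }')
        echo "VM $vmid: vcpus=$vcpus cpu=${cpu:-default} host_threads=$host_threads"
        if [ "$vcpus" -gt "$host_threads" ]; then
            echo "  -> more vCPUs than the host has threads"
        fi
    done
fi
```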
 