[SOLVED] PVE unexpected reboots

c3sro

New Member
Apr 15, 2024
Hi,

a PVE node reboots unexpectedly.

PVE Facts:
  • Kernel: Linux 6.5.13-5-pve
  • Storage: 2x SSD (ZFS RAID 1)
  • CPU: AMD Ryzen 9 7950X3D (16 cores)
  • RAM: 128GB
The crashes started after adding another VM that does some nested virtualization (VirtualBox inside PVE).

VMs on the PVE node:
  • 6x VM with 4 CPUs each (Processor type: host)
  • 1x VM with 8 CPUs doing nested virtualization (Processor type: host)
So all in all the PVE node is overcommitted regarding CPUs (32 vCPUs on a CPU with 16 cores / 32 threads). All other resource consumption (RAM, disk space) is low and not overcommitted.
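For reference, the vCPU arithmetic above can be written out as a small sketch. The VM counts and core/thread numbers are taken from this post; the variable names are only illustrative, and any VirtualBox guests running inside the nested VM are not counted here.

```python
# VM layout as described in the post: (number of VMs, vCPUs per VM)
vm_groups = [(6, 4), (1, 8)]

physical_cores = 16    # AMD Ryzen 9 7950X3D
hardware_threads = 32  # thread count as stated in the post

# Total vCPUs assigned across all VMs on the node
total_vcpus = sum(count * vcpus for count, vcpus in vm_groups)

# Ratios above 1.0 mean more vCPUs are assigned than the resource provides
ratio_per_thread = total_vcpus / hardware_threads
ratio_per_core = total_vcpus / physical_cores

print(f"total vCPUs: {total_vcpus}")
print(f"vCPUs per hardware thread: {ratio_per_thread:.2f}")
print(f"vCPUs per physical core: {ratio_per_core:.2f}")
```

With these numbers the node is at 1:1 against hardware threads and 2:1 against physical cores.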

Somewhere between 5 minutes and 24 hours after boot, the PVE node unexpectedly reboots. There are no log entries in journalctl or /var/log/ regarding the crash (only the boot of the PVE node with filesystem checks).

Steps done to solve the problem:
Even though the workaround of reducing the total number of vCPUs helped to stabilize the system, this still isn't an acceptable solution for me, as a lot of CPU resources are unused and the flexibility of the PVE node is quite limited.

The main problem I'm facing right now is that I can't tell what the root cause of the problem is. A kernel problem? A faulty CPU? A faulty PSU? ...

Has anyone had similar problems and solved them? Any further ideas to find the root cause?
 
Hi,
do you have the latest BIOS updates and CPU microcode installed? Do you have any special (CPU-related) settings enabled in BIOS? The opt-in 6.8 kernel might also be worth giving a shot.
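A rough way to check whether a node is already on the 6.8 kernel series is to compare the release string, as in the sketch below. The helper names are made up for illustration, and the parsing only assumes that the release string starts with "major.minor", as in the "6.5.13-5-pve" string from the first post.

```python
import platform
import re


def kernel_major_minor(release):
    """Extract (major, minor) from a kernel release string such as '6.5.13-5-pve'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_at_least(release, minimum=(6, 8)):
    """Return True if the release is at or above the given (major, minor)."""
    return kernel_major_minor(release) >= minimum


def running_kernel_is_at_least(minimum=(6, 8)):
    """Same check against the kernel of the machine this script runs on."""
    return is_at_least(platform.release(), minimum)


# Kernel from the original post: still on the 6.5 series
print(is_at_least("6.5.13-5-pve"))
```

Calling `running_kernel_is_at_least()` on the node itself performs the same comparison against the booted kernel.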
 
BIOS updates are managed by the provider, so that's nothing I can change (Pro WS 665-ACE, BIOS 1711 10/06/2023).

The server is running on microcode version 0x0a601206, which is the latest available version for this CPU.
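One way to double-check that every core reports the same microcode revision is to parse the "microcode" lines from /proc/cpuinfo. This is only a sketch: the function names are hypothetical, and it assumes the usual x86 Linux format where each processor entry has a line of the form "microcode : 0x...".

```python
import re

EXPECTED = 0x0A601206  # revision quoted in this post


def microcode_revisions(cpuinfo_text):
    """Return the set of distinct microcode revisions found in cpuinfo-style text."""
    pattern = re.compile(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$", re.MULTILINE)
    # Compare as integers so that 0xa601206 and 0x0a601206 count as equal
    return {int(value, 16) for value in pattern.findall(cpuinfo_text)}


def all_cores_match(cpuinfo_text, expected):
    """True only if at least one core is listed and all report `expected`."""
    return microcode_revisions(cpuinfo_text) == {expected}


def check_host(path="/proc/cpuinfo"):
    """Run the check against the local machine (Linux only)."""
    with open(path) as handle:
        return all_cores_match(handle.read(), EXPECTED)
```

Running `check_host()` on the node returns False if any core reports a different revision or if no microcode lines are found at all.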

As far as I know there are no special settings enabled in BIOS (primarily managed by provider again). Are there any specific settings I should check?

The 6.8 kernel might be worth a try. Do you know of any nested virtualization fixes in 6.6 - 6.8?
 
"As far as I know there are no special settings enabled in BIOS (primarily managed by provider again). Are there any specific settings I should check?"
I don't know what issue you have exactly, so unfortunately, I don't have any specific ones in mind.
"The 6.8 kernel might be worth a try. Do you know of any nested virtualization fixes in 6.6 - 6.8?"
Again, I don't know any specific ones.
 
Update: The provider replaced the whole server (new CPU, memory, PSU, mainboard, ...). Unfortunately the server still crashes regularly.

As some people report similar problems (https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/page-2), I think the reason isn't faulty hardware (mainboard, memory, ...) or a BIOS misconfiguration. A problem with the AMD Ryzen 9 7950X3D CPU seems much more likely.

It could be a CPU bug, a microcode bug or just a problem of the kernel. I've got no idea how to get to the root cause of the random crashes.
 
We also have this issue on 5 servers running the Ryzen 9 7950X3D. Each one randomly locks up, and the Proxmox login console is frozen.
 
We have had several reboots and lockups on Intel NUCs with i5 and i7 CPUs.

We'll upgrade to 8.2 (Kernel 6.8) in a few weeks and hope the problem will be gone ...
 
