PVE unexpected reboots

c3sro

New Member
Apr 15, 2024
Hi,

a PVE node reboots unexpectedly.

PVE Facts:
  • Kernel: Linux 6.5.13-5-pve
  • Storage: 2xSSD (ZFS Raid 1)
  • CPU: AMD Ryzen 9 7950X3D (16Core)
  • RAM: 128GB
The crashes started after adding another VM that does some nested virtualization (VirtualBox inside PVE).

VMs on the PVE node:
  • 6x VM with 4 CPUs each (Processor type: host)
  • 1x VM with 8 CPUs doing nested virtualization (Processor type: host)
So all in all, the PVE node is overcommitted regarding CPUs (32 vCPUs on a 16-core, 32-thread CPU). All other resource consumption (RAM, disk space) is low and not overcommitted.
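
The vCPU arithmetic above can be sketched as follows. The VM counts and host figures are the ones from this post; the variable names are only illustrative.

```shell
#!/bin/sh
# VM layout from the post: 6 VMs with 4 vCPUs each, 1 VM with 8 vCPUs.
total_vcpus=$((6 * 4 + 8))
host_cores=16     # physical cores of the Ryzen 9 7950X3D
host_threads=32   # SMT threads

# Shell arithmetic is integer-only, which is enough for these ratios.
echo "vCPUs: $total_vcpus"
echo "vCPUs per physical core: $((total_vcpus / host_cores))"
echo "vCPUs per thread: $((total_vcpus / host_threads))"
```

So the node is 2:1 against physical cores and 1:1 against threads.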

Somewhere between 5 Minutes and 24 hours the PVE node unexpectedly reboots. There are no log entries in journalctl or /var/log/ regarding the crash (only boot of PVE node with filesystem checks).

Steps done to solve the problem:
Even though the workaround of reducing the total number of vCPUs helped to stabilize the system, this still isn't an acceptable solution for me, as a lot of CPU resources are unused and the flexibility of the PVE node is quite limited.

The main problem I'm facing right now is that I can't tell what the root cause of the problem is. A kernel problem? A faulty CPU? A faulty PSU? ...

Has anyone had similar problems and solved them? Any further ideas to find the root cause?
 
Hi,
do you have the latest BIOS updates and CPU microcode installed? Do you have any special (CPU-related) settings enabled in BIOS? The opt-in 6.8 kernel might also be worth giving a shot.
 
BIOS updates are managed by the provider, so that's nothing I can change (Pro WS 665-ACE, BIOS 1711 10/06/2023).

The server is running on microcode version 0x0a601206, which is the latest available version for this CPU.
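
For anyone who wants to compare revisions, here is a minimal sketch of reading the loaded microcode revision on the host. `microcode_rev` is a hypothetical helper name, and the sketch assumes the usual `microcode : 0x...` line that x86 kernels expose in /proc/cpuinfo.

```shell
#!/bin/sh
# Print the microcode revision of the first CPU listed.
# Reads /proc/cpuinfo by default; a different file can be passed for testing.
microcode_rev() {
    awk -F': ' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Usage on the PVE host:
#   microcode_rev
```

The kernel may print the value without leading zeros, so compare it numerically rather than as a string against a vendor's release notes.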

As far as I know there are no special settings enabled in BIOS (primarily managed by provider again). Are there any specific settings I should check?

The 6.8 kernel might be worth a try. Do you know of any nested virtualization fixes in 6.6 - 6.8?
 
"As far as I know there are no special settings enabled in BIOS (primarily managed by provider again). Are there any specific settings I should check?"
I don't know what issue you have exactly, so unfortunately, I don't have any specific ones in mind.
"The 6.8 kernel might be worth a try. Do you know of any nested virtualization fixes in 6.6 - 6.8?"
Again, I don't know any specific ones.
 
Update: The provider replaced the whole server (new CPU, memory, PSU, mainboard, ...). Unfortunately the server still crashes regularly.

As some people report similar problems (https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/page-2) I think the reason isn't some faulty hardware (mainboard, memory, ...) or BIOS misconfiguration. A problem of the AMD Ryzen 9 7950X3D CPU seems much more likely.

It could be a CPU bug, a microcode bug, or just a kernel problem. I've got no idea how to get to the root cause of the random crashes.
 
We also have this issue on 5 servers running the Ryzen 9 7950X3D. Each of them randomly locks up, and the Proxmox login console is frozen.
 
We have had several reboots and lockups on Intel NUCs with i5 and i7 CPUs.

We'll upgrade to 8.2 (Kernel 6.8) in a few weeks and hope the problem will be gone ...
 
Well, the problem still exists...
There was an uptime of 14 days and then the server crashed again. About 15 minutes after that first crash on PVE 8.2, it crashed a second time.

So all in all, the update to PVE 8.2 didn't solve the problem. I've got no idea why there was no crash for two weeks, but the second crash right after the first one shows that the problem still exists :(
 
Our i5-based 7nd-gen NUC reboots almost every 24 hours with PVE 8.2.

The remote kernel log we set up using the netconsole module did not provide any useful information.
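
For reference, a sketch of how a netconsole setup like this is typically loaded. The interface name `eno1` and the receiver address 192.0.2.10 are placeholders, 6666 is the module's default target port, and `netconsole_param` is a hypothetical helper, not part of any tool.

```shell
#!/bin/sh
# Build the netconsole= module parameter.
# Kernel format: [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# Fields left empty fall back to the module defaults.
netconsole_param() {
    dev=$1                  # sending interface (placeholder: eno1)
    target_ip=$2            # log receiver (placeholder address)
    target_port=${3:-6666}  # UDP port on the receiver
    printf 'netconsole=@/%s,%s@%s/\n' "$dev" "$target_port" "$target_ip"
}

# On the crashing node (as root):
#   modprobe netconsole "$(netconsole_param eno1 192.0.2.10)"
#   dmesg -n 8    # raise the console log level so all messages are sent
# On the receiver, any UDP listener on that port will do.
```

Note that netconsole only forwards what the kernel manages to print, so a hard reset or power loss can still leave the remote log empty.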

Code:
journalctl -b -1 -n 50
and similar commands might provide hints:

Code:
May 21 01:55:20 pvehq02 systemd[1]: Stopping user@0.service - User Manager for UID 0...
May 21 01:55:20 pvehq02 systemd[1737042]: Activating special unit exit.target...
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target default.target - Main User Target.
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target basic.target - Basic System.
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target paths.target - Paths.
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target sockets.target - Sockets.
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target timers.target - Timers.
May 21 01:55:20 pvehq02 systemd[1737042]: Closed dirmngr.socket - GnuPG network certificate management daemon.
May 21 01:55:20 pvehq02 systemd[1737042]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
May 21 01:55:20 pvehq02 systemd[1737042]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
May 21 01:55:20 pvehq02 systemd[1737042]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
May 21 01:55:20 pvehq02 systemd[1737042]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
May 21 01:55:20 pvehq02 systemd[1737042]: Removed slice app.slice - User Application Slice.
May 21 01:55:20 pvehq02 systemd[1737042]: Reached target shutdown.target - Shutdown.
May 21 01:55:20 pvehq02 systemd[1737042]: Finished systemd-exit.service - Exit the Session.
May 21 01:55:20 pvehq02 systemd[1737042]: Reached target exit.target - Exit the Session.
May 21 01:55:20 pvehq02 systemd[1]: user@0.service: Deactivated successfully.
May 21 01:55:20 pvehq02 systemd[1]: Stopped user@0.service - User Manager for UID 0.
May 21 01:55:20 pvehq02 systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
May 21 01:55:20 pvehq02 systemd[1]: run-user-0.mount: Deactivated successfully.
May 21 01:55:20 pvehq02 systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
May 21 01:55:20 pvehq02 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
May 21 01:55:20 pvehq02 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
May 21 01:55:20 pvehq02 systemd[1]: user-0.slice: Consumed 2.979s CPU time.
May 21 01:57:15 pvehq02 pmxcfs[971]: [status] notice: received log

Does something reboot the system intentionally?
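
One rough way to check this is to see whether PID 1 itself logged a shutdown or reboot target at the end of the previous boot. In the excerpt above, the shutdown.target line comes from the user manager (systemd[1737042]), not from systemd[1]. The sketch below is a heuristic: `classify_shutdown` is a hypothetical helper, and the exact message text it looks for is an assumption about what systemd logs on an orderly shutdown.

```shell
#!/bin/sh
# Read journal text on stdin and print "orderly" if PID 1 (systemd[1])
# logged a shutdown-related target, otherwise "abrupt".
# Lines from per-user managers such as systemd[1737042] are ignored.
classify_shutdown() {
    if grep -Eq 'systemd\[1\]: .*(shutdown|reboot|poweroff|halt)\.target'; then
        echo orderly
    else
        echo abrupt
    fi
}

# Usage on the affected node (tail of the previous boot):
#   journalctl -b -1 -n 200 --no-pager | classify_shutdown
```

"abrupt" only means that no such marker was found in the lines supplied. It points towards a reset or power loss but does not prove one.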

Because of other reports about Intel e1000 cards, we tried
Code:
ethtool -K eno1 tx off rx off
That didn't help, but the last log entries before the reboot changed:

Code:
May 22 00:50:08 pvehq02 sshd[486038]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
May 22 00:50:08 pvehq02 systemd-logind[652]: New session 2571 of user root.
May 22 00:50:08 pvehq02 systemd[1]: Started session-2571.scope - Session 2571 of User root.
May 22 00:50:08 pvehq02 sshd[486038]: pam_env(sshd:session): deprecated reading of user environment enabled
May 22 00:50:09 pvehq02 sshd[486038]: Received disconnect from 192.168.13.17 port 52422:11: disconnected by user
May 22 00:50:09 pvehq02 sshd[486038]: Disconnected from user root 192.168.13.17 port 52422
May 22 00:50:09 pvehq02 sshd[486038]: pam_unix(sshd:session): session closed for user root
May 22 00:50:09 pvehq02 systemd[1]: session-2571.scope: Deactivated successfully.
May 22 00:50:09 pvehq02 systemd-logind[652]: Session 2571 logged out. Waiting for processes to exit.
May 22 00:50:09 pvehq02 systemd-logind[652]: Removed session 2571.
 
