PVE unexpected reboots

c3sro · Apr 15, 2024

Hi,

A PVE node reboots unexpectedly.

PVE Facts:

Kernel: Linux 6.5.13-5-pve
Storage: 2xSSD (ZFS Raid 1)
CPU: AMD Ryzen 9 7950X3D (16Core)
RAM: 128GB

The crashes started after adding another VM that does some nested virtualization (VirtualBox inside PVE).

VMs on the PVE node:

6x VM with 4 CPUs each (Processor type: host)
1x VM with 8 CPUs doing nested virtualization (Processor type: host)

So all in all, the PVE node is overcommitted on CPU relative to its physical cores (32 vCPUs on a CPU with 16 cores / 32 threads). All other resource consumption (RAM, disk space) is low and not overcommitted.

At intervals of anywhere between 5 minutes and 24 hours, the PVE node reboots unexpectedly. There are no log entries in journalctl or /var/log/ regarding the crash (only the boot of the PVE node with filesystem checks).
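The vCPU arithmetic above can be re-checked with a small sketch. The helper name `total_vcpus` is made up for illustration; the VM counts and the 32-thread figure are taken from this post.

```shell
# Sum vCPUs from "<vm count> <vCPUs per VM>" pairs.
# total_vcpus is a hypothetical helper, not a PVE tool.
total_vcpus() {
    total=0
    while [ "$#" -ge 2 ]; do
        total=$((total + $1 * $2))
        shift 2
    done
    echo "$total"
}

host_threads=32                    # 16 cores with SMT, as stated in the post
vcpus=$(total_vcpus 6 4 1 8)       # 6 VMs x 4 vCPUs + 1 VM x 8 vCPUs
echo "vCPUs: $vcpus, host threads: $host_threads"
```

With these numbers the node sits at 1:1 against threads and 2:1 against physical cores.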

Steps done to solve the problem:

Enabled and tested kernel crash dumps: no crash dumps written
Disabled reboot if kernel crash dump can't be written: still automatically reboots
Provider ran a stress test on all components: no problems detected
Added a custom CPU Type (x86-64-v4 with svm flag) (see https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/#post-642039): still crashes / reboots
Reduced the total number of vCPUs to 8 (see https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/post-643308): PVE node doesn't crash anymore
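As a side note on the crash dump step: one way to confirm that a crash kernel is actually loaded is to read `/sys/kernel/kexec_crash_loaded`, which contains `1` when a kdump kernel is in place. The sketch below assumes that file is available on the host; the function name `crash_kernel_loaded` is made up for illustration.

```shell
# Return success if a crash (kdump) kernel is loaded.
# An alternative flag file can be passed as $1, which is handy for testing.
crash_kernel_loaded() {
    flag_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

if crash_kernel_loaded; then
    echo "crash kernel loaded: a kernel panic should trigger a dump"
else
    echo "no crash kernel loaded: a kernel panic would not produce a dump"
fi
```

Even with a crash kernel loaded, a hard reset that never goes through the kernel panic path would leave no dump, which would be consistent with what was observed here.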

Even though reducing the total number of vCPUs stabilized the system, this isn't an acceptable solution for me, as a lot of CPU resources go unused and the flexibility of the PVE node is quite limited.

The main problem I'm facing right now is that I can't tell what the root cause of the problem is. A kernel problem? A faulty CPU? A faulty PSU? ...

Has anyone had similar problems and solved them? Any further ideas to find the root cause?

fiona · Apr 15, 2024

Hi,
do you have the latest BIOS updates and CPU microcode installed? Do you have any special (CPU-related) settings enabled in BIOS? The opt-in 6.8 kernel might also be worth giving a shot.

c3sro · Apr 15, 2024

BIOS updates are managed by the provider, so that's nothing I can change (Pro WS 665-ACE, BIOS 1711 10/06/2023).

The server is running on microcode version 0x0a601206, which is the latest available version for this CPU.
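For anyone comparing versions: the running microcode revision can be read from the `microcode` field in `/proc/cpuinfo` on x86 Linux. A minimal sketch, with `microcode_rev` as a made-up helper name that parses cpuinfo-style text from stdin:

```shell
# Print the first "microcode" value found in cpuinfo-style input.
microcode_rev() {
    awk -F': ' '/^microcode/ { print $2; exit }'
}

# Only read the real file if it exists (it may be missing or lack the
# field on non-x86 systems).
if [ -r /proc/cpuinfo ]; then
    microcode_rev < /proc/cpuinfo
fi
```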

As far as I know there are no special settings enabled in BIOS (primarily managed by provider again). Are there any specific settings I should check?

The 6.8 kernel might be worth a try. Do you know of any nested virtualization fixes in 6.6 - 6.8?

fiona · Apr 15, 2024

c3sro said:
As far as I know there are no special settings enabled in BIOS (primarily managed by provider again). Are there any specific settings I should check?

I don't know what issue you have exactly, so unfortunately, I don't have any specific ones in mind.

c3sro said:
The 6.8 kernel might be worth a try. Do you know of any nested virtualization fixes in 6.6 - 6.8?

Again, I don't know any specific ones.

c3sro · Apr 22, 2024

Update: The provider replaced the whole server (new CPU, memory, PSU, mainboard, ...). Unfortunately the server still crashes regularly.

As some people report similar problems (https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/page-2) I think the reason isn't some faulty hardware (mainboard, memory, ...) or BIOS misconfiguration. A problem of the AMD Ryzen 9 7950X3D CPU seems much more likely.

It could be a CPU bug, a microcode bug, or just a kernel problem. I've got no idea how to get to the root cause of the random crashes.

JerryOH · Apr 26, 2024

We also have this issue on 5 servers running the Ryzen 9 7950X3D. Each one randomly locks up, and the Proxmox login console is frozen.

Christoph Lechleitner · Apr 29, 2024

We have had several reboots and lockups on Intel NUCs with i5 and i7 CPUs.

We'll upgrade to 8.2 (Kernel 6.8) in a few weeks and hope the problem will be gone ...

c3sro · May 3, 2024

Since upgrading from PVE 8.1 to 8.2 the server doesn't crash anymore.
Still have no clue what the root cause of the problem was.

c3sro · May 10, 2024

Well, the problem still exists...
After an uptime of 14 days the server crashed again, the first crash with PVE 8.2. About 15 minutes later it crashed a second time.

So all in all, the update to PVE 8.2 didn't solve the problem. I've got no idea why there was no crash for two weeks, but the second crash right after the first one shows that the problem still exists.

Christoph Lechleitner · Wednesday at 14:33

Our i5-based 7nd-gen NUC reboots almost every 24 hours with PVE 8.2.

The remote kernel log we set up using the netconsole module did not provide any useful information.

Code:

journalctl -b -1 -n 50

and the like might provide hints:

Code:

May 21 01:55:20 pvehq02 systemd[1]: Stopping user@0.service - User Manager for UID 0...
May 21 01:55:20 pvehq02 systemd[1737042]: Activating special unit exit.target...
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target default.target - Main User Target.
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target basic.target - Basic System.
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target paths.target - Paths.
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target sockets.target - Sockets.
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target timers.target - Timers.
May 21 01:55:20 pvehq02 systemd[1737042]: Closed dirmngr.socket - GnuPG network certificate management daemon.
May 21 01:55:20 pvehq02 systemd[1737042]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
May 21 01:55:20 pvehq02 systemd[1737042]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
May 21 01:55:20 pvehq02 systemd[1737042]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
May 21 01:55:20 pvehq02 systemd[1737042]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
May 21 01:55:20 pvehq02 systemd[1737042]: Removed slice app.slice - User Application Slice.
May 21 01:55:20 pvehq02 systemd[1737042]: Reached target shutdown.target - Shutdown.
May 21 01:55:20 pvehq02 systemd[1737042]: Finished systemd-exit.service - Exit the Session.
May 21 01:55:20 pvehq02 systemd[1737042]: Reached target exit.target - Exit the Session.
May 21 01:55:20 pvehq02 systemd[1]: user@0.service: Deactivated successfully.
May 21 01:55:20 pvehq02 systemd[1]: Stopped user@0.service - User Manager for UID 0.
May 21 01:55:20 pvehq02 systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
May 21 01:55:20 pvehq02 systemd[1]: run-user-0.mount: Deactivated successfully.
May 21 01:55:20 pvehq02 systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
May 21 01:55:20 pvehq02 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
May 21 01:55:20 pvehq02 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
May 21 01:55:20 pvehq02 systemd[1]: user-0.slice: Consumed 2.979s CPU time.
May 21 01:57:15 pvehq02 pmxcfs[971]: [status] notice: received log

Does something reboot the system intentionally?

Because of other reports about Intel e1000 cards we tried

Code:

ethtool -K eno1 tx off rx off

That didn't help, but the last log entries before the reboot changed:

Code:

May 22 00:50:08 pvehq02 sshd[486038]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
May 22 00:50:08 pvehq02 systemd-logind[652]: New session 2571 of user root.
May 22 00:50:08 pvehq02 systemd[1]: Started session-2571.scope - Session 2571 of User root.
May 22 00:50:08 pvehq02 sshd[486038]: pam_env(sshd:session): deprecated reading of user environment enabled
May 22 00:50:09 pvehq02 sshd[486038]: Received disconnect from 192.168.13.17 port 52422:11: disconnected by user
May 22 00:50:09 pvehq02 sshd[486038]: Disconnected from user root 192.168.13.17 port 52422
May 22 00:50:09 pvehq02 sshd[486038]: pam_unix(sshd:session): session closed for user root
May 22 00:50:09 pvehq02 systemd[1]: session-2571.scope: Deactivated successfully.
May 22 00:50:09 pvehq02 systemd-logind[652]: Session 2571 logged out. Waiting for processes to exit.
May 22 00:50:09 pvehq02 systemd-logind[652]: Removed session 2571.

c3sro · Wednesday at 15:21

I think the bug affecting the system from this thread is caused by a CPU-specific problem (see https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/page-3#post-664460) that leads to a hard crash.

You seem to have a different problem, because you don't have this specific CPU and your system was shut down cleanly (even if unintentionally).

Christoph Lechleitner said:
...

Code:

... May 21 01:55:20 pvehq02 systemd[1737042]: Reached target shutdown.target - Shutdown. ...

...
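One caveat worth checking before concluding that the shutdown was clean: in the quoted excerpt, the `shutdown.target` line comes from `systemd[1737042]`, which is the user manager for UID 0 (`user@0.service`), not from PID 1 (`systemd[1]`). A user manager reaches its own `shutdown.target` whenever the user's last session ends. The sketch below filters for the PID 1 variant only; the helper name `system_shutdown_logged` and the exact pattern are assumptions for illustration.

```shell
# Succeed only if PID 1 (systemd[1]) logged reaching a shutdown-type target.
# Reads journal text from stdin; the pattern is an assumption about the
# message format, not an official interface.
system_shutdown_logged() {
    grep -Eq 'systemd\[1\]: Reached target (shutdown|reboot|poweroff)\.target'
}

# Sample line taken from the log excerpt in this thread.
sample='May 21 01:55:20 pvehq02 systemd[1737042]: Reached target shutdown.target - Shutdown.'
if printf '%s\n' "$sample" | system_shutdown_logged; then
    echo "system-level shutdown logged"
else
    echo "no system-level shutdown logged"
fi
```

On a real host the function could be fed with something like `journalctl -b -1 -n 200 | system_shutdown_logged`. If it finds nothing, the last boot may well have ended in a hard reset rather than an orderly shutdown.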
