PVE unexpected reboots

c3sro

Apr 15, 2024
Hi,

one of my PVE nodes reboots unexpectedly.

PVE Facts:
  • Kernel: Linux 6.5.13-5-pve
  • Storage: 2x SSD (ZFS RAID 1)
  • CPU: AMD Ryzen 9 7950X3D (16 cores / 32 threads)
  • RAM: 128 GB
The crashes started after adding another VM that does some nested virtualization (VirtualBox inside PVE).

VMs on the PVE node:
  • 6x VM with 4 CPUs each (Processor type: host)
  • 1x VM with 8 CPUs doing nested virtualization (Processor type: host)
So all in all the PVE node is overcommitted regarding CPUs (32 vCPUs on a 16-core/32-thread CPU). All other resource consumption (RAM, disk space) is low and not overcommitted.

Somewhere between 5 minutes and 24 hours after boot, the PVE node unexpectedly reboots. There are no log entries in journalctl or /var/log/ regarding the crash (only the boot of the PVE node with filesystem checks).

Steps done to solve the problem: reducing the total amount of vCPUs.
Even though this workaround helped to stabilize the system, it still isn't an acceptable solution for me, as a lot of CPU resources stay unused and the flexibility of the PVE node is quite limited.
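For reference, the per-VM CPU allocation can be reduced with qm (a sketch; the VMID 101 and the values are just examples):

Code:
# reduce the number of cores assigned to VM 101
qm set 101 --cores 2
# or cap its CPU time while keeping the core count
qm set 101 --cpulimit 2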

The main problem I'm facing right now is that I can't tell what the root cause is. A kernel problem? A faulty CPU? A faulty PSU? ...

Has anyone had similar problems and solved them? Any further ideas to find the root cause?
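For anyone hitting the same and wanting pre-crash logs: journald only keeps messages from before a crash if its storage is persistent, which recent PVE installs usually have already. A sketch to enable it:

Code:
# switch journald to persistent storage
mkdir -p /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
systemctl restart systemd-journald
# afterwards the last messages of the previous boot survive:
journalctl -b -1 -n 100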
 
Hi,
do you have the latest BIOS updates and CPU microcode installed? Do you have any special (CPU-related) settings enabled in BIOS? The opt-in 6.8 kernel might also be worth giving a shot.
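Microcode and kernel are quick to check and try from a shell (a sketch, assuming PVE 8 with the current repositories):

Code:
# currently loaded microcode revision
grep -m1 microcode /proc/cpuinfo
# microcode messages from early boot
journalctl -k | grep -i microcode
# install the opt-in 6.8 kernel, then reboot into it
apt install proxmox-kernel-6.8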
 
BIOS updates are managed by the provider, so that's nothing I can change (Pro WS 665-ACE, BIOS 1711, 10/06/2023).

The server is running on microcode version 0x0a601206, which is the latest available version for this CPU.

As far as I know there are no special settings enabled in BIOS (primarily managed by provider again). Are there any specific settings I should check?

The 6.8 kernel might be worth a try. Do you know of any nested virtualization fixes in 6.6 - 6.8?
 
As far as I know there are no special settings enabled in BIOS (primarily managed by provider again). Are there any specific settings I should check?
I don't know what issue you have exactly, so unfortunately, I don't have any specific ones in mind.
The 6.8 kernel might be worth a try. Do you know of any nested virtualization fixes in 6.6 - 6.8?
Again, I don't know any specific ones.
 
Update: The provider replaced the whole server (new CPU, memory, PSU, mainboard, ...). Unfortunately the server still crashes regularly.

As some people report similar problems (https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/page-2) I think the reason isn't some faulty hardware (mainboard, memory, ...) or BIOS misconfiguration. A problem of the AMD Ryzen 9 7950X3D CPU seems much more likely.

It could be some CPU bug, a microcode bug or just a problem of the kernel. I've got no idea how to get to the root cause of the random crashes.
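If the reboots are actually kernel panics (rather than a hardware reset, which leaves no trace), a crash dump would show it. A sketch using kdump-tools; the package prompts to reserve crash-kernel memory and needs a reboot to arm:

Code:
apt install kdump-tools
# after the reboot, verify the crash kernel is configured
kdump-config show
# dumps are written below /var/crash by default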
 
We also have this issue on 5 servers running the Ryzen 9 7950X3D; each randomly locks up and the Proxmox login console is frozen.
 
We have had several reboots and lockups on Intel NUCs with i5 and i7 CPUs.

We'll upgrade to 8.2 (Kernel 6.8) in a few weeks and hope the problem will be gone ...
 
Well, the problem still exists...
After an uptime of 14 days the server crashed again, and about 15 minutes after that first crash on PVE 8.2 there was another one.

So all in all, the update to PVE 8.2 didn't solve the problem. I've got no idea why there was no crash for two weeks, but especially the second crash right after the first one proves that the problem still exists :(
 
Our i5-based 7th-gen NUC reboots almost every 24 hours, with PVE 8.2.

The remote kernel log we set up using the netconsole module did not provide any useful information.
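For anyone who wants to replicate that setup: a minimal netconsole invocation looks roughly like the following (a sketch; all IPs, the interface name and the MAC are placeholders for your own network):

Code:
# stream kernel messages via UDP to a collector host
modprobe netconsole netconsole=6665@192.168.1.10/eno1,6666@192.168.1.20/aa:bb:cc:dd:ee:ff
# on the collector, receive them with e.g.:
# nc -u -l 6666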

Code:
journalctl -b -1 -n 50
and the like might provide hints:

Code:
May 21 01:55:20 pvehq02 systemd[1]: Stopping user@0.service - User Manager for UID 0...
May 21 01:55:20 pvehq02 systemd[1737042]: Activating special unit exit.target...
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target default.target - Main User Target.
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target basic.target - Basic System.
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target paths.target - Paths.
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target sockets.target - Sockets.
May 21 01:55:20 pvehq02 systemd[1737042]: Stopped target timers.target - Timers.
May 21 01:55:20 pvehq02 systemd[1737042]: Closed dirmngr.socket - GnuPG network certificate management daemon.
May 21 01:55:20 pvehq02 systemd[1737042]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
May 21 01:55:20 pvehq02 systemd[1737042]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
May 21 01:55:20 pvehq02 systemd[1737042]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
May 21 01:55:20 pvehq02 systemd[1737042]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
May 21 01:55:20 pvehq02 systemd[1737042]: Removed slice app.slice - User Application Slice.
May 21 01:55:20 pvehq02 systemd[1737042]: Reached target shutdown.target - Shutdown.
May 21 01:55:20 pvehq02 systemd[1737042]: Finished systemd-exit.service - Exit the Session.
May 21 01:55:20 pvehq02 systemd[1737042]: Reached target exit.target - Exit the Session.
May 21 01:55:20 pvehq02 systemd[1]: user@0.service: Deactivated successfully.
May 21 01:55:20 pvehq02 systemd[1]: Stopped user@0.service - User Manager for UID 0.
May 21 01:55:20 pvehq02 systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
May 21 01:55:20 pvehq02 systemd[1]: run-user-0.mount: Deactivated successfully.
May 21 01:55:20 pvehq02 systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
May 21 01:55:20 pvehq02 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
May 21 01:55:20 pvehq02 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
May 21 01:55:20 pvehq02 systemd[1]: user-0.slice: Consumed 2.979s CPU time.
May 21 01:57:15 pvehq02 pmxcfs[971]: [status] notice: received log

Does something reboot the system intentionally?

Because of other reports around Intel e1000 cards we tried
Code:
ethtool -K eno1 tx off rx off
That didn't help, but the last log before the crash changed:

Code:
May 22 00:50:08 pvehq02 sshd[486038]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
May 22 00:50:08 pvehq02 systemd-logind[652]: New session 2571 of user root.
May 22 00:50:08 pvehq02 systemd[1]: Started session-2571.scope - Session 2571 of User root.
May 22 00:50:08 pvehq02 sshd[486038]: pam_env(sshd:session): deprecated reading of user environment enabled
May 22 00:50:09 pvehq02 sshd[486038]: Received disconnect from 192.168.13.17 port 52422:11: disconnected by user
May 22 00:50:09 pvehq02 sshd[486038]: Disconnected from user root 192.168.13.17 port 52422
May 22 00:50:09 pvehq02 sshd[486038]: pam_unix(sshd:session): session closed for user root
May 22 00:50:09 pvehq02 systemd[1]: session-2571.scope: Deactivated successfully.
May 22 00:50:09 pvehq02 systemd-logind[652]: Session 2571 logged out. Waiting for processes to exit.
May 22 00:50:09 pvehq02 systemd-logind[652]: Removed session 2571.
 
Did anyone discover anything recently with this issue? I just noticed that my box restarted on me today while I was using it; it's also a 7950X3D build...

Code:
Dec 12 00:00:15 pve systemd[1]: Starting dpkg-db-backup.service - Daily dpkg database backup service...
Dec 12 00:00:15 pve systemd[1]: Starting logrotate.service - Rotate log files...
Dec 12 00:00:15 pve systemd[1]: dpkg-db-backup.service: Deactivated successfully.
Dec 12 00:00:15 pve systemd[1]: Finished dpkg-db-backup.service - Daily dpkg database backup service.
Dec 12 00:00:15 pve systemd[1]: Reloading pveproxy.service - PVE API Proxy Server...
Dec 12 00:00:16 pve pveproxy[281830]: send HUP to 1355
Dec 12 00:00:16 pve pveproxy[1355]: received signal HUP
Dec 12 00:00:16 pve pveproxy[1355]: server closing
Dec 12 00:00:16 pve pveproxy[1355]: server shutdown (restart)
 
Hi,
Did anyone discover anything recently with this issue? I just noticed that my box restarted on me today while I was using it; it's also a 7950X3D build...
The original issue is resolved in newer kernel versions: https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/post-718198
Code:
Dec 12 00:00:15 pve systemd[1]: Starting dpkg-db-backup.service - Daily dpkg database backup service...
Dec 12 00:00:15 pve systemd[1]: Starting logrotate.service - Rotate log files...
Dec 12 00:00:15 pve systemd[1]: dpkg-db-backup.service: Deactivated successfully.
Dec 12 00:00:15 pve systemd[1]: Finished dpkg-db-backup.service - Daily dpkg database backup service.
Dec 12 00:00:15 pve systemd[1]: Reloading pveproxy.service - PVE API Proxy Server...
Dec 12 00:00:16 pve pveproxy[281830]: send HUP to 1355
Dec 12 00:00:16 pve pveproxy[1355]: received signal HUP
Dec 12 00:00:16 pve pveproxy[1355]: server closing
Dec 12 00:00:16 pve pveproxy[1355]: server shutdown (restart)
That doesn't look like the system hard rebooted. Just that the pveproxy service was reloaded. That happens e.g. during an upgrade.
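A quick way to tell whether the host actually rebooted (a sketch):

Code:
# one entry per boot recorded in the journal
journalctl --list-boots
# wtmp-based reboot history and current uptime
last reboot | head
uptime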
 
Hi,

The original issue is resolved in newer kernel versions: https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/post-718198

That doesn't look like the system hard rebooted. Just that the pveproxy service was reloaded. That happens e.g. during an upgrade.
My mistake, you are correct... I seem to have an issue with a single guest and thought the whole box was rebooting. For some reason this single Win10 guest with PCIe GPU passthrough keeps rebooting seemingly randomly and I don't know why. Ballooning is off, and I see "bulk start VMs" in Proxmox's task log immediately after it happens for some reason... I'm still trying to find time to look into it... any ideas or suggestions on where I can look are appreciated!

thanks for reading
 
My mistake, you are correct... I seem to have an issue with a single guest and thought the whole box was rebooting. For some reason this single Win10 guest with PCIe GPU passthrough keeps rebooting seemingly randomly and I don't know why. Ballooning is off, and I see "bulk start VMs" in Proxmox's task log immediately after it happens for some reason... I'm still trying to find time to look into it... any ideas or suggestions on where I can look are appreciated!
Finding out what triggers the bulk start task might be a good first step. System Logs (in guest and on the host) and Task History are good places to start looking at. Do you have any third party (monitoring) scripts installed? Feel free to share the full logs, more eyes don't hurt when looking for such things.
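The task history is also available on the CLI (a sketch; replace <UPID> with one from the list):

Code:
# recent tasks on this node
pvenode task list --limit 50
# full log of a specific task
pvenode task log <UPID>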
 
Finding out what triggers the bulk start task might be a good first step. System Logs (in guest and on the host) and Task History are good places to start looking at. Do you have any third party (monitoring) scripts installed? Feel free to share the full logs, more eyes don't hurt when looking for such things.
the guest has the sudden power loss event:
The system has rebooted without cleanly shutting down first.

the host log shows "-- Reboot --", which I honestly can't explain... and it seems to be right after a root auth via PAM in the web UI... no reboot command was ever entered via CLI... unsure what triggers it...

Code:
Dec 17 03:17:01 pve CRON[373676]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Dec 17 03:17:01 pve CRON[373677]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Dec 17 03:17:01 pve CRON[373676]: pam_unix(cron:session): session closed for user root
Dec 17 03:17:32 pve pveproxy[364133]: worker exit
Dec 17 03:17:32 pve pveproxy[1361]: worker 364133 finished
Dec 17 03:17:32 pve pveproxy[1361]: starting 1 worker(s)
Dec 17 03:17:32 pve pveproxy[1361]: worker 373761 started
Dec 17 03:19:48 pve pveproxy[373334]: Clearing outdated entries from certificate cache
Dec 17 03:20:12 pve pveproxy[367291]: worker exit
Dec 17 03:20:12 pve pveproxy[1361]: worker 367291 finished
Dec 17 03:20:12 pve pveproxy[1361]: starting 1 worker(s)
Dec 17 03:20:12 pve pveproxy[1361]: worker 374179 started
Dec 17 03:23:16 pve pveproxy[373761]: Clearing outdated entries from certificate cache
Dec 17 03:26:15 pve pvedaemon[1351]: <root@pam> successful auth for user 'root@pam'
-- Reboot --

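One place worth checking is the on-disk task index, which survives reboots and records what the web UI kicked off (a sketch; the path is the standard PVE location):

Code:
# each line is one finished task: UPID (with hex start time), end time, status
tail -n 20 /var/log/pve/tasks/index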

any ideas/advice appreciated!
thanks!
 
FYI: There is a 100% issue with PROXMOX and the Ryzen 9 7950X3D processor.

We have several Proxmox setups and FOUR of our servers are running the "Ryzen 9 7950X3D" processor and we have had nothing but issues.

We have tried EVERYTHING you could think of: replaced motherboards, replaced power supplies, new memory (different brand), changed out the M.2, replaced the Ethernet board (a few different brands, even Intel's). Of course we made sure to have the latest updates, the latest firmware on the M.2, the latest firmware on the Broadcom NIC when we tried one of those, and we always kept Proxmox updated.

In EVERY case we waited until we had a failure before replacing or trying the next thing, so we could be sure whether a change actually fixed it.

The problem happens on FOUR different machines, and the only component they all share is the "Ryzen 9 7950X3D" processor. We are now trying different processors.

I'm so pissed because NO ONE wants to accept any responsibility for this, and we have over 100 hours of IT time trying to resolve it.
 
On a wild hunch, how many memory slots are populated and at what speeds is the memory configured?

I don't have that particular Ryzen CPU, but had similar issues with all 4 RAM slots populated and an XMP profile enabled. Only after setting the memory to the base speed without any XMP profile did the machine become stable.

If you check the specs (connectivity), it lists the following max memory speeds depending on how many slots are populated:
2x1R DDR5-5200
2x2R DDR5-5200
4x1R DDR5-3600
4x2R DDR5-3600
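What the memory is actually running at can be read out on the host (a sketch; needs the dmidecode package):

Code:
# per-DIMM rated speed vs. the speed it is configured to run at
dmidecode -t memory | grep -Ei 'locator|size|speed'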
 
Interesting info, thank you for your response and ANY additional ADVICE you can give. We will try anything.

Currently using 96 GB total in slots DDR5-A1 & DDR5-A2 (G.Skill Flare X5).
Before replacing the memory we had Corsair DDR5-6400, and of course that didn't work, so we replaced it with the above.

QUESTION: Our hardware guy told us that 5600 memory would downclock if the spec is 5200, and the BIOS indeed shows the memory as DDR5-5200.
 
Just out of personal experience: filling all 4 slots with DIMMs and setting a faster XMP profile (default is 3600; everything above that is overclocking via the profile) would cause one host with an AMD 7900X to behave glitchy and reboot occasionally out of the blue. Setting the RAM speed slower, to 3600 or maybe even slightly below, made the machine run stable.

Another similarly built machine can handle the higher RAM speed. My assumption is that especially when you fill all DIMM slots and run a fast memory speed, the quality of the memory controller in the CPU matters: some can handle more, some can't.

Also, I am not sure (it's been a while) whether a memtest at the faster memory speed / XMP profile produced errors back then. But if you haven't done so yet, running a memtest might also be something to consider.
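If taking the host offline for memtest86+ is inconvenient, a rough in-OS check is possible with the memtester package (a sketch; the size and pass count are examples, and it can only test memory that is currently free):

Code:
apt install memtester
# lock and stress-test 8 GiB of RAM for 3 passes
memtester 8G 3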