Random reboots after upgrade from 8.4.5 to 9.0.10

Jacco

Member
Jul 4, 2023
Upgraded my Proxmox VE environment, consisting of 7 HP Elitedesk/Prodesk mini servers, a few days ago.
Was running 8.x.x for ages without any issues; it is now on 9.0.10 (latest and greatest).
Since the upgrade one of the servers (HP Elitedesk 800 G4 mini 65W) started rebooting randomly, sometimes more than once an hour.

Updated the BIOS to latest version.
Server was running for a day without issues (with DisplayPort monitor and keyboard attached). So I thought the BIOS update fixed it and disconnected the monitor and keyboard (hey, it's a server not a desktop). And ... the system started rebooting again randomly.
Then I remembered an issue with i915 drivers causing kernel panics when no DP display is attached. See https://forum.proxmox.com/threads/p...ues-with-hardware-transcoding-in-plex.132187/

So, I added 'i915.enable_dc=0' to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and executed "update-grub" + reboot.
Since that modification there have been no more reboots.
Perhaps a regression with the i915 driver?
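
For anyone who wants to script the change described above, here is a minimal sketch. The helper name add_kernel_param is hypothetical and not part of any Proxmox tooling; it assumes GNU sed and a GRUB_CMDLINE_LINUX_DEFAULT line with a double-quoted value. Hosts that boot via systemd-boot (typically ZFS root on UEFI) keep the command line in /etc/kernel/cmdline and apply it with proxmox-boot-tool refresh instead, so this would not apply there.

```shell
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT
# in the given file, unless the parameter is already present on that line.
# Intended for simple parameters without '|', '&' or quote characters.
add_kernel_param() {
    param="$1"
    file="$2"
    line=$(grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file")
    case "$line" in
        *"$param"*) return 0 ;;   # already set, nothing to do
    esac
    # Insert the parameter just before the closing double quote.
    sed -i -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"|\1 ${param}\"|" "$file"
}

# Usage on the host (as root), keeping a backup of the original file:
#   cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param i915.enable_dc=0 /etc/default/grub
#   update-grub && reboot
# After the reboot, check that the parameter is active:
#   grep -o 'i915.enable_dc=[0-9]*' /proc/cmdline
```

Checking /proc/cmdline after the reboot is worthwhile, because a typo in /etc/default/grub would otherwise go unnoticed until the next random reboot.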
 
Same issue now with HP Prodesk 400 G5 mini (same cluster, same versions), while restoring a VM from Proxmox Backup Server.
After adding "i915.enable_dc=0" restore went fine.
 
Same issues here. I have random restarts of the whole Proxmox VE environment v9.0.10 with kernel 6.14.11-3-pve. I believe I have had this for a couple of weeks now.

However, I do NOT have HP. I just have a self-built server/computer using an AMD Ryzen 9 5950X, Corsair Vengeance LPX CMK128GX4M4Z3600C18 memory and an MSI MEG X570 UNIFY mobo. The OS is running on a Crucial MX500 SSD and the VM data on a WD_Black SN850 NVMe.

I am using an AMD Radeon video card, so the i915 driver is NOT loaded either. So that can't be the issue here.

Problem is, I cannot find any coredump or any other errors in the kernel logs of the previous boots. Nothing!? HELP!
 
Same issue after upgrading to 9.x
I am having issues on a ThreadRipper system, a 9950x and an AMD 8945HS

I did an in-place upgrade and thought something may have gone wrong with the upgrade, so I re-installed all the nodes, but I am still having the same issue.

The strange thing is that on the same host, for example the ThreadRipper, the same 3 or 4 Windows 11 VMs would reboot multiple times per day, while other Windows 11 VMs on the same host remain active.
 
This issue is about rebooting/crashing of the whole PVE instance, not just some VMs. If the whole PVE node crashes, all VMs are gone.
 
If you cannot find any relevant logging, it might be that logging isn't written to disk anymore when the issue pops up.
The following trick might help you get additional logging, as suggested in another post:
It's likely that more errors are logged, but that they cannot be synced to disk before the host hangs completely.
One easy thing to try is to connect from another Linux system to the problem host via SSH and run journalctl -f there. Sometimes the network stack keeps working, at least for longer than the regular disk sync interval, so one might manage to see an actual error there.
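
A minimal sketch of that SSH approach, meant to be run on the second machine. The helper name follow_remote_journal, the host root@pve1 and the log file name are placeholders for illustration. The SSH variable is only there so the ssh command can be swapped out for a dry run.

```shell
# Hypothetical helper: follow the journal of a remote host over SSH and keep
# a local copy, so the last lines survive if the remote host goes down.
follow_remote_journal() {
    host="$1"
    logfile="$2"
    # -f follows new entries; -o short-precise prints microsecond timestamps.
    # tee -a appends everything to the local log file while also showing it.
    ${SSH:-ssh} "$host" journalctl -f -o short-precise | tee -a "$logfile"
}

# Usage (placeholder host and file name):
#   follow_remote_journal root@pve1 pve1-journal.log
```

When the host reboots, the SSH session drops and the tail of the local log file holds whatever was logged last, even if it never reached the disk on the problem host.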
Or you can use netconsole to get additional logging, but that is more complex to set up.
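
A rough sketch of what a netconsole setup could look like. The helper netconsole_param is hypothetical and only assembles the module parameter in the format the kernel's netconsole module expects (source port@source IP/device, target port@target IP/target MAC). All addresses, the interface name eno1 and the MAC are placeholders, and the receiver uses socat as one possible listener.

```shell
# Hypothetical helper: build the netconsole= module parameter.
# Arguments: source IP, network device, target IP, target MAC,
# optional source port (default 6665), optional target port (default 6666).
netconsole_param() {
    src_ip="$1"; dev="$2"; tgt_ip="$3"; tgt_mac="$4"
    src_port="${5:-6665}"; tgt_port="${6:-6666}"
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}

# On the crashing host (as root); every value below is a placeholder:
#   modprobe netconsole "$(netconsole_param 192.0.2.10 eno1 192.0.2.20 aa:bb:cc:dd:ee:ff)"
# On the receiving machine, capture the UDP stream, for example with socat:
#   socat -u UDP-RECV:6666 STDOUT | tee netconsole.log
```

Because netconsole sends kernel messages as UDP packets straight from the kernel, it can still deliver a panic message when userspace, including an SSH session, has already stopped responding.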
 
I just recently thought of external factors as a cause. No, I do not mean power issues, since power was still up and fine, no spikes.

But I mean the motherboard can trigger reboots automatically or create instability. Things like "C-state control" in the BIOS or "Watchdog time-outs" or many other weird things. Maybe even "Precision Boost Overdrive", who knows? And no, I do not overclock my memory (it is running at stock frequency).

This is really hard to debug, especially if it doesn't reboot that often. That being said, I never had these instabilities before. It seems to have started only recently, since Proxmox Virtual Environment v9.

Thanks, both following the journal with journalctl over SSH and using netconsole are good ideas.
 