Proxmox Mystery Random Reboots

Same problem here: latest PVE 8.1.4 on a new machine, Ryzen 7900X3D with 192GB RAM. The DDR5 memory test ran OK for 24 hours. In the BIOS I tried loading all defaults, disabling all power management, and reducing the RAM speed. I swapped the RAM, the motherboard, and the power supply... nothing!!!

It reboots randomly, anywhere from 1 hour to 1 day apart... no logs.

At this point it's certainly not a hardware problem; I think it's a kernel problem in how it manages the APIC or power saving.
 
I have the same problem: Proxmox 8.1.4 fully updated, Ryzen 7840HS, 64GB DDR5 RAM. I have tested the disk, RAM, and CPU, and they are all OK.
The PC reboots randomly without any warning and without any trace in the logs...

On the same PC I installed Windows 11 with Hyper-V, with more or less the same virtual machines, and it works. Then I tried Ubuntu 23.10 (kernel 6.5.0-25) with VirtualBox and similar virtual machines: no problems there either.

Now I have moved the Proxmox server and VMs to an older PC with a Ryzen 2600X / 64GB DDR4 RAM, and it works without problems.

So I think the problem is with Proxmox/Debian and the Ryzen 7840HS.

Manlio
 
I am able to reproduce random host reboots by running the Aida64 stress test in 3+ VMs at the same time. Changing the guest CPU type to something other than host resolves it. My experience here.

Check if you can reproduce the reboot by running the Aida64 stress test in 3+ Windows VMs at the same time. Then change the CPU type to something meaningful other than host and check whether the host reboots disappear.

A drawback of a non-host CPU type is the loss of svm, i.e. nested virtualization (needed for WSL2 etc).

Let me know if you know of a CPU type with nested virtualization support that is not of type host for the Ryzen 7950X!
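
To find which guests are still on the host CPU type before switching them, something like this could work. It's a rough sketch: `vms_with_host_cpu` is a made-up helper name, it assumes the standard /etc/pve/qemu-server config directory and the usual `cpu:` line format, and snapshot sections inside a config file may match too. The VM ID 100 and the choice of x86-64-v2-AES (one of the generic models offered in PVE 8) are only placeholders.

```shell
# List VM config files whose CPU type is "host".
# The default directory is where PVE keeps the local node's VM configs;
# pass another directory as $1 to override it.
vms_with_host_cpu() {
    dir="${1:-/etc/pve/qemu-server}"
    grep -lE '^cpu: *(cputype=)?host(,|$)' "$dir"/*.conf 2>/dev/null
}

# Usage on the host:
#   vms_with_host_cpu
# Then, per VM (100 is a placeholder ID):
#   qm set 100 --cpu x86-64-v2-AES
```

A changed CPU type only takes effect once the VM has been fully stopped and started again, not after a reboot from inside the guest.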
 
I have this mobo in my server: X570D4U-2L2T
https://www.asrockrack.com/general/productdetail.asp?Model=X570D4U-2L2T#Specifications

Ran memtest: OK
SMART for the disks + NVMe looks fine
Tried moving to new hardware of the same server type

and

the reboots still happen occasionally, roughly every 3 months:

2022-12-08 09:40:25
2023-04-05 14:51:34
2023-07-14 10:11:30
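
For reference, the gaps between those three reboots can be worked out with a small shell helper. This is only a sketch: `reboot_gaps_days` is a name made up for illustration, and it assumes GNU date (as shipped with Debian/Proxmox) and treats the timestamps as UTC.

```shell
# Print the whole days between consecutive timestamps read from stdin,
# one "YYYY-MM-DD HH:MM:SS" per line, oldest first.
reboot_gaps_days() {
    prev=""
    while IFS= read -r ts; do
        [ -n "$ts" ] || continue
        now=$(date -u -d "$ts" +%s) || return 1
        if [ -n "$prev" ]; then
            echo $(( (now - prev) / 86400 ))
        fi
        prev=$now
    done
}

printf '%s\n' \
    "2022-12-08 09:40:25" \
    "2023-04-05 14:51:34" \
    "2023-07-14 10:11:30" | reboot_gaps_days
# prints 118 and 99 (one per line)
```

So the gaps are between three and four months, in line with the ~3mo estimate.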

But we have lots of these servers (10+) and this is the one doing nasty things. The latest firmware looked fine until now, when the reboots happened again. I'll try these cmdlines.

Proxmox runs a mix of Windows and Linux servers:
Code:
     11 ostype: l26
      7 ostype: win10
      1 ostype: wxp
Hi, are your problems still solved? We have some X570D4U-2L2T boards out there and had reboots even after 1.70. Interestingly, X470D4U mainboards are stable with the same hardware/OS. We also run Asus Pro WS 565-ACE with the same issues. BTW: we run XCP-NG, so maybe it's not limited to Proxmox...
 
I'm also having this problem, running the latest Proxmox. It wasn't an issue with earlier versions, so I might just downgrade to try to fix this.

Has anybody found anything definitive as to why? ESXi had reams of information whenever a random reboot happened; Proxmox just says 'reboot' lol.
The whole log is here, nothing prior to it, just the word reboot:

-- Reboot --
May 14 19:36:37 pve kernel: Linux version 6.8.4-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) ()
May 14 19:36:37 pve kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=UUID=xxxx ro ro quiet amd_iommu=on iommu=pt initcall_blacklist=sysfb_init
May 14 19:36:37 pve kernel: KERNEL supported cpus:
May 14 19:36:37 pve kernel: Intel GenuineIntel
May 14 19:36:37 pve kernel: AMD AuthenticAMD
May 14 19:36:37 pve kernel: Hygon HygonGenuine
May 14 19:36:37 pve kernel: Centaur CentaurHauls
May 14 19:36:37 pve kernel: zhaoxin Shanghai
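
When the journal shows nothing before the `-- Reboot --` marker, one extra check is whether a shutdown was recorded at all. A minimal sketch, assuming `last -x` from util-linux is available and prints the newest entries first; `unclean_boots` is a hypothetical helper name, not a standard tool.

```shell
# Flag boots that were not preceded by a clean shutdown, reading `last -x`
# output (newest entry first) on stdin. A "reboot" record whose next-older
# record is another "reboot" means the machine went down without logging
# a shutdown in between.
unclean_boots() {
    awk '
        $1 == "reboot" || $1 == "shutdown" {
            if (newer_type == "reboot" && $1 == "reboot") print newer_line
            newer_type = $1
            newer_line = $0
        }
    '
}

# Usage on the host:
#   last -x reboot shutdown | unclean_boots
#   journalctl --list-boots              # boots the journal still holds
#   journalctl -b -1 -n 50 --no-pager    # last 50 lines of the previous boot
```

If the previous boot's journal simply stops mid-activity with no shutdown messages, that points to a hard reset or power loss rather than a reboot the OS initiated.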
 
To the question above of whether my problems are still solved: probably yes, the server now has an uptime of over 40 weeks.

Try this

GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=assign-busses apicmaintimer idle=poll reboot=cold,hard"

and update grub
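
For anyone unsure how to apply this: on a GRUB-booted host the line lives in /etc/default/grub. The sketch below is one way to do it; `add_kernel_params` is a made-up helper, and it assumes GNU sed and an existing, double-quoted GRUB_CMDLINE_LINUX_DEFAULT line. Hosts that boot via systemd-boot (for example ZFS root on UEFI) use /etc/kernel/cmdline and `proxmox-boot-tool refresh` instead.

```shell
# Append extra parameters inside the quotes of GRUB_CMDLINE_LINUX_DEFAULT.
# $1 = config file (normally /etc/default/grub), $2 = parameters to append.
# The parameters must not contain "|" or "&", which are special to this sed call.
add_kernel_params() {
    file="$1"
    params="$2"
    cp "$file" "$file.bak" || return 1   # keep a backup of the original
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $params\"|" "$file"
}

# Usage on the host (as root), followed by a reboot:
#   add_kernel_params /etc/default/grub "pci=assign-busses apicmaintimer idle=poll reboot=cold,hard"
#   update-grub
#   cat /proc/cmdline    # after the reboot, confirm the parameters are active
```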
 
THANK YOU!!! One million times thank you!!!

I was having random reboots on my HP Elitedesk and I've spent hours trying to figure out the issue. Since I modified the GRUB options as per your post, it's been rock solid.
 
Cross posting here from https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/
We have the exact same issue (Proxmox server running fine, then suddenly rebooting without explanation).
We tested all components separately and ran memtest and prime95; everything is stable on its own.

We use ASRock Rack B650D4U motherboards and Ryzen 7900 / 7950X3D CPUs.

So far we have tested multiple kernels, Proxmox v8.1 and 8.2, the GRUB CMDLINE fix, and the microcode update. All guest CPUs have been switched to kvm64.
We have 10 machines with nearly the same hardware: 6 of them have 100+ days of uptime, and 4 of them reboot randomly after 6, 18, or 32 hours, even without any VM running on the host.

Has anyone in this thread found a solution?
 
I was having this issue, popped up out of the blue after running smoothly for a month or so.

I updated PVE, didn’t fix it. So I turned off all my VMs, and it still was rebooting.

I tore the MB down to nothing, and rebuilt it (essentially reseating all components). Still had the issue.

Finally I pulled all the PCIe and NVMe cards. Then I had no issues.

After slowly introducing them back in, I’ve narrowed my issue down to my HBA card, and it’s likely overheating.

All that to say: have you done this process yet? Reseat all components, pull all “extra” components leaving you with a bare minimum system? I’d recommend not running any VMs to keep them from throwing false flags.
 
I've had problems with this mobo too, but with NVMe: after GRUB boots, one NVMe drive goes down and dmesg shows a message about setting the latency. Probably something to do with ASPM. Give these kernel params a shot. They probably won't help, but ¯\_(ツ)_/¯
Code:
pcie_port_pm=off pcie_aspm.policy=performance nvme_core.default_ps_max_latency_us=0
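
After rebooting, it's worth confirming that the parameters actually took effect. A small sketch: `has_kernel_param` is a hypothetical helper, and the sysfs paths assume the pcie_aspm and nvme_core parameters are exposed on your kernel.

```shell
# Succeed if the exact token (e.g. "pcie_port_pm=off") is on the kernel command line.
# $2 optionally points at a file to read instead of /proc/cmdline.
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Usage on the host:
#   for p in pcie_port_pm=off pcie_aspm.policy=performance nvme_core.default_ps_max_latency_us=0; do
#       has_kernel_param "$p" && echo "active:  $p" || echo "missing: $p"
#   done
#   cat /sys/module/pcie_aspm/parameters/policy    # the active policy is shown in [brackets]
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```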

PS: ASRock mobos are fine, but I'd consider the possibility of the RAM failing after some amount of time.
 
My own solution was to scrap the HBA card (I even bought a new one and had a desk fan constantly blowing on it) and just switch to SATA drives.
 
This is weird! Since last Friday my mini PC with an i3 N305 has been rebooting randomly after a kernel and firmware upgrade. Same logs as described above... nothing, just "reboot". Reverting to old kernels did not help.

The time between reboots became shorter and shorter, and in the end it rebooted about every 10 minutes.

I have ordered new memory and a new barebone box... but as delivery from China takes 3 weeks, I started troubleshooting as well.
I added the string "pci=assign-busses apicmaintimer idle=poll reboot=cold,hard" to the UEFI boot options and uptime is now 2 hours without any reboot... Currently only running Proxmox standalone without any peripheral devices connected and only one Ethernet port dedicated to the pfSense VM.

I don't have an AMD CPU, so I thought this might not work... but so far it looks much better. I was previously focused on this being a HW issue, most likely the motherboard.
 
Just FYI, I think idle=poll is the key here. Your CPU (AMD, I'm assuming, from the previous posts) does not handle C-states properly. This has long been a known issue with Linux on AMD and is a firmware "bug". AMD has attempted to fix it many times, but it is an inherent Zen architectural issue that also crops up on Windows occasionally; Microsoft/AMD have "fixed" it there by disabling certain transitions between power saving modes. The best option is to disable C-states in the UEFI/BIOS instead of using idle=poll, which has other side effects.

Some newer Intel desktop CPUs (13th and 14th gen) have recently had similar instability issues, but according to Intel that is a manufacturing or code defect. Just update your firmware (update the BIOS and install the Intel CPU microcode package, intel-microcode, in Proxmox), or if you have damaged the CPU, test it and get a replacement from Intel - https://www.intel.com/content/www/us/en/download/15951/intel-processor-diagnostic-tool.html
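
To see which microcode revision is actually loaded before and after installing the package, a quick check along these lines may help. `microcode_rev` is a made-up helper; it assumes the x86 /proc/cpuinfo layout with a `microcode` field. On Debian 12 based PVE the packages are intel-microcode and amd64-microcode, from the non-free-firmware component.

```shell
# Print the microcode revision of the first CPU listed.
# $1 optionally points at a file to read instead of /proc/cpuinfo.
microcode_rev() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Usage on the host:
#   microcode_rev
#   journalctl -k | grep -i microcode    # kernel messages about microcode loading
```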
 
Thank you for the clarification. The CPU is not really replaceable, as it's fixed to a motherboard small enough to fit in the palm of my hand... ;-)
I will contact Hunsn about an AMI BIOS upgrade if possible, and will certainly install the Intel package as suggested.
If it happens again, I'll create a bootable Windows USB, run the Intel CPU test from there, and if it's broken, scrap the hardware or run something else on it...

The N305 is indeed a low-power CPU, but I never put it into sleep mode. I will need to find more info about "idle=poll" and the "side effects" you mention, and will ask professor Google. ;)
 
Well, if the chip is damaged, it is damaged, and no OS will function properly. idle=poll is basically the kernel telling the CPU: when you finish a task and ask me for more work, and there is none, here are some instructions to spin on so you don't drop into a sleep state. So basically, you're always at "100%" CPU usage (it's a lightweight loop).

Normally, if there is no work, the CPU would just idle and drop into lower power states. The problem is that the polling loop keeps every core busy: it burns power and runs hot, and the sibling hyperthread has to share its core with the loop, so you may see performance issues. It is really intended for debugging, not for production use.
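
To see which idle states the kernel is currently using, and how often they are entered, the cpuidle sysfs tree can be read. A sketch, assuming the standard /sys/devices/system/cpu/cpu0/cpuidle layout; `list_cstates` is a hypothetical helper, and with idle=poll active the directory may be missing or empty.

```shell
# Print "name<TAB>usage count" for each idle state of one CPU.
# $1 optionally overrides the cpuidle directory (defaults to cpu0).
list_cstates() {
    base="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for d in "$base"/state*; do
        [ -r "$d/name" ] || continue
        printf '%s\t%s\n' "$(cat "$d/name")" "$(cat "$d/usage" 2>/dev/null)"
    done
}

# Usage on the host:
#   list_cstates
#   cat /sys/devices/system/cpu/cpuidle/current_driver   # idle driver in use, if any
```

Comparing the usage counts a few minutes apart shows whether the deeper states are actually being entered on an otherwise idle host.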
 