Proxmox Mystery Random Reboots

blackangel84 · Mar 7, 2024

Same problem here: latest PVE 8.1.4 on a new machine (Ryzen 7900X3D, 192GB DDR5 RAM). A memory test ran clean for 24 hours. In the BIOS I tried loading all defaults, disabling all power management, and reducing the RAM speed. I swapped the RAM, the motherboard, and the power supply... nothing!!!

It reboots randomly, anywhere from 1 hour to 1 day apart, with no logs.

At this point I'm certain it's not a hardware problem; I think it's a kernel problem in how it manages the APIC or power saving.

blotti · Mar 10, 2024

I have the same problem: Proxmox 8.1.4 fully updated, Ryzen 7840HS, 64GB DDR5 RAM. I have tested the disk, RAM, and CPU; they are all OK.
The PC reboots randomly without any warning and without any trace in the logs...

On the same PC I installed Windows 11 with Hyper-V and more or less the same virtual machines, and it works. Then I tried Ubuntu 23.10 (kernel 6.5.0-25) with VirtualBox and similar virtual machines: no problems there either.

Now I have moved the Proxmox server and VMs to an older PC with a Ryzen 2600X and 64GB DDR4 RAM, and it works without problems.

So I think the problem is with Proxmox/Debian on the Ryzen 7840HS.

Manlio

ProxyUser · Mar 11, 2024

I am able to reproduce random host reboots by running the Aida64 stress test in 3+ VMs at the same time. It is resolved by changing the guest CPU type to something other than host. My experience here.

Check if you can reproduce the reboot by running the Aida64 stress test in 3+ Windows VMs at the same time. Then change the CPU type to something meaningful other than host and check whether the host reboots disappear.

A drawback of a non-host CPU type is the loss of svm, i.e. nested virtualization (needed for WSL etc.).

Let me know if you know of a CPU type with nested virtualization support that is not of type host for the Ryzen 7950X!
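
For anyone who wants to try the CPU type change from the shell rather than the GUI, here is a minimal sketch. It only prints the `qm set` commands so they can be reviewed before running them. The `print_cpu_switch` helper and the VM IDs (101, 102) are hypothetical, and `x86-64-v2-AES` is an assumed choice of model; `kvm64` is another non-host type.

```shell
#!/bin/sh
# Print (do not run) the commands that would move VMs off the "host" CPU type.
# Usage: print_cpu_switch <cpu-type> <vmid>...
print_cpu_switch() {
    cpu_type="$1"; shift
    for vmid in "$@"; do
        echo "qm set $vmid --cpu $cpu_type"
    done
}

# Hypothetical VM IDs; x86-64-v2-AES is an assumed choice of model.
print_cpu_switch x86-64-v2-AES 101 102
```

The new CPU type only takes effect after the VM has been fully shut down and started again; a reboot from inside the guest is not enough.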

blotti · Mar 12, 2024

Changing the CPU type fixed the problem! I will try this to enable nested virtualization.
Thanks.

itsdaveit · May 14, 2024

pschonmann said:
I have this mobo in a server: X570D4U-2L2T
https://www.asrockrack.com/general/productdetail.asp?Model=X570D4U-2L2T#Specifications

Tried memtest - OK
SMART on the disks + NVMe looks fine
Tried switching to new hardware of the same server type

and

Reboots happen occasionally, every ~3 months

2022-12-08 09:40:25
2023-04-05 14:51:34
2023-07-14 10:11:30

But we have lots of these servers (10+) and this one is doing nasty things. The latest firmware was looking fine until now, when the reboots happened again. I'll try these cmdlines.

Proxmox runs a mix of Windows + Linux servers

Code:

11 ostype: l26
7 ostype: win10
1 ostype: wxp

Hi, are your problems still solved? We have some X570D4U-2L2T boards out there and had reboots even after 1.70. Interestingly, X470D4U mainboards are stable with the same hardware/OS. We also run Asus Pro WS 565-ACE boards with the same issues. BTW: we run XCP-NG, so maybe it's not limited to Proxmox...

mrpops2ko · May 14, 2024

I'm also having this problem, running the latest Proxmox. It wasn't an issue with earlier versions, so I might just downgrade to try to fix this.

Has anybody found anything definitive as to why? ESXi had reams of information whenever a random reboot happened; Proxmox just says 'reboot' lol
The whole log is here; there is nothing prior to it, just the word reboot

-- Reboot --
May 14 19:36:37 pve kernel: Linux version 6.8.4-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) ()
May 14 19:36:37 pve kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=UUID=xxxx ro ro quiet amd_iommu=on iommu=pt initcall_blacklist=sysfb_init
May 14 19:36:37 pve kernel: KERNEL supported cpus:
May 14 19:36:37 pve kernel: Intel GenuineIntel
May 14 19:36:37 pve kernel: AMD AuthenticAMD
May 14 19:36:37 pve kernel: Hygon HygonGenuine
May 14 19:36:37 pve kernel: Centaur CentaurHauls
May 14 19:36:37 pve kernel: zhaoxin Shanghai

pschonmann · May 15, 2024

itsdaveit said:
Hi, are your problems still solved? We have some X570D4U-2L2T boards out there and had reboots even after 1.70. Interestingly, X470D4U mainboards are stable with the same hardware/OS. We also run Asus Pro WS 565-ACE boards with the same issues. BTW: we run XCP-NG, so maybe it's not limited to Proxmox...

Probably yes, the server now has an uptime of over 40 weeks.

Try this

GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=assign-busses apicmaintimer idle=poll reboot=cold,hard"

and then run update-grub
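
A small sketch of adding these parameters without clobbering what is already on the command line. The `add_params` helper is hypothetical; it just merges new options into an existing string and skips duplicates. The file locations in the comments are the usual ones on a PVE host, but check which bootloader your install actually uses.

```shell
#!/bin/sh
# Append kernel parameters to an existing command line, skipping any already present.
# Usage: add_params "<current cmdline>" <param>...
add_params() {
    current="$1"; shift
    for p in "$@"; do
        case " $current " in
            *" $p "*) ;;                  # already there, leave it alone
            *) current="$current $p" ;;
        esac
    done
    echo "$current"
}

add_params "quiet" pci=assign-busses apicmaintimer idle=poll reboot=cold,hard
# GRUB: put the result in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub,
#       then run: update-grub
# systemd-boot (e.g. ZFS root on UEFI): put it in /etc/kernel/cmdline,
#       then run: proxmox-boot-tool refresh
```

After the next boot, `cat /proc/cmdline` shows whether the parameters were actually picked up.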

sonalita · Jun 8, 2024

andre.lackmann said:
```
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=assign-busses apicmaintimer idle=poll reboot=cold,hard"
```

THANK YOU!!! One million times thank you!!!

I was having random reboots on my HP Elitedesk and I've spent hours trying to figure out the issue. Since I modified the GRUB options as per your post, it's been rock solid.

pschonmann · Aug 7, 2024

Is there any way to debug that? When the server reboots, there is nothing in syslog, kern.log, or even IPMI. Is this a PVE kernel problem or something else? Any ideas?
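
One cheap first check is whether the previous boot ended cleanly at all. The sketch below is a heuristic, not a definitive test: it assumes a clean shutdown normally ends the journal with systemd-journald's "Journal stopped" line, while a hard reset or power cut just stops mid-stream. The `classify_boot_end` helper is hypothetical.

```shell
#!/bin/sh
# Classify how a boot ended, based on the tail of its journal read from stdin.
# Heuristic: a clean shutdown leaves a "Journal stopped" line near the end.
classify_boot_end() {
    if grep -q 'Journal stopped'; then
        echo "clean shutdown"
    else
        echo "unclean stop (reset, power loss or hang)"
    fi
}

# On the host (needs a persistent journal, i.e. /var/log/journal exists):
#   journalctl --list-boots
#   journalctl -b -1 -n 20 --no-pager | classify_boot_end
#   ls /sys/fs/pstore    # files here, if any, are crash records kept across the reset

# Demo with a made-up last line from a boot that simply stopped:
printf '%s\n' 'pve kernel: some last message' | classify_boot_end
```

An unclean stop with nothing in the journal and nothing in pstore points away from a kernel panic and towards something below the OS (power, firmware, hardware reset).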

netswitch · Aug 17, 2024

Cross posting here from https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/
We have the exact same issue (Proxmox server running fine, then suddenly rebooting without explanation).
We tested all components separately and ran memtest and prime95; everything is stable on its own.

We use ASRock Rack B650D4U motherboards and Ryzen 7900 / 7950X3D CPUs.

So far we have tested multiple kernels, Proxmox 8.1 and 8.2, the GRUB CMDLINE fix, and the microcode update. All guest CPUs have been switched to kvm64.
We have 10 machines with nearly the same hardware. 6 of them have 100+ days of uptime; 4 of them reboot randomly after 6, 18, or 32 hours, even without any VM running on the host.

Has anyone in this thread found a solution?

chrsgrhmgrhm · Aug 22, 2024

netswitch said:
Cross posting here from https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/
We have the exact same issue (Proxmox server running fine, then suddenly rebooting without explanation).
We tested all components separately and ran memtest and prime95; everything is stable on its own.

We use ASRock Rack B650D4U motherboards and Ryzen 7900 / 7950X3D CPUs.

So far we have tested multiple kernels, Proxmox 8.1 and 8.2, the GRUB CMDLINE fix, and the microcode update. All guest CPUs have been switched to kvm64.
We have 10 machines with nearly the same hardware. 6 of them have 100+ days of uptime; 4 of them reboot randomly after 6, 18, or 32 hours, even without any VM running on the host.

Has anyone in this thread found a solution?

I was having this issue, popped up out of the blue after running smoothly for a month or so.

I updated PVE, didn’t fix it. So I turned off all my VMs, and it still was rebooting.

I tore the MB down to nothing, and rebuilt it (essentially reseating all components). Still had the issue.

Finally I pulled all the PCIe and NVME cards. Then I had no issues.

After slowly introducing them back in, I’ve narrowed my issue down to my HBA card, and it’s likely overheating.

All that to say: have you done this process yet? Reseat all components, pull all “extra” components leaving you with a bare minimum system? I’d recommend not running any VMs to keep them from throwing false flags.

pschonmann · Sep 10, 2024

netswitch said:
Cross posting here from https://forum.proxmox.com/threads/sudden-bulk-stop-of-all-vms.139500/
We have the exact same issue (Proxmox server running fine, then suddenly rebooting without explanation).
We tested all components separately and ran memtest and prime95; everything is stable on its own.

We use ASRock Rack B650D4U motherboards and Ryzen 7900 / 7950X3D CPUs.

So far we have tested multiple kernels, Proxmox 8.1 and 8.2, the GRUB CMDLINE fix, and the microcode update. All guest CPUs have been switched to kvm64.
We have 10 machines with nearly the same hardware. 6 of them have 100+ days of uptime; 4 of them reboot randomly after 6, 18, or 32 hours, even without any VM running on the host.

Has anyone in this thread found a solution?

I've had problems with this mobo too, but with NVMe: after the GRUB boot, one NVMe drive went down and dmesg showed a message about setting the latency. Probably something to do with ASPM. Give these kernel params a shot. They probably won't help, but ¯\_(ツ)_/¯

Code:

pcie_port_pm=off pcie_aspm.policy=performance nvme_core.default_ps_max_latency_us=0

PS: ASRock mobos are fine, but I'd consider RAM failing after some amount of time.
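
To confirm that parameters like the ones above actually took effect after a reboot, the values can be read back from `/proc` and `/sys`. This is a minimal sketch; the `show` helper is hypothetical, and the two sysfs paths are only present when the corresponding kernel code is built in or loaded.

```shell
#!/bin/sh
# Print a file's content if it is readable, otherwise say it is not available.
show() {
    if [ -r "$1" ]; then
        printf '%s: %s\n' "$1" "$(cat "$1")"
    else
        printf '%s: not available\n' "$1"
    fi
}

show /proc/cmdline
# The active ASPM policy is the one shown in [brackets].
show /sys/module/pcie_aspm/parameters/policy
# 0 here means NVMe autonomous power state transitions are disabled.
show /sys/module/nvme_core/parameters/default_ps_max_latency_us
```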

chrsgrhmgrhm · Sep 10, 2024

chrsgrhmgrhm said:
I was having this issue, popped up out of the blue after running smoothly for a month or so.

I updated PVE, didn’t fix it. So I turned off all my VMs, and it still was rebooting.

I tore the MB down to nothing, and rebuilt it (essentially reseating all components). Still had the issue.

Finally I pulled all the PCIe and NVME cards. Then I had no issues.

After slowly introducing them back in, I’ve narrowed my issue down to my HBA card, and it’s likely overheating.

All that to say: have you done this process yet? Reseat all components, pull all “extra” components leaving you with a bare minimum system? I’d recommend not running any VMs to keep them from throwing false flags.

My own solution was to scrap the HBA card (I even bought a new one and had a desk fan constantly blowing on it) and just switch to SATA drives.

Ancoor · Oct 1, 2024

This is weird! Since last Friday my mini PC with an i3 N305 has been rebooting randomly, starting after a kernel and firmware upgrade. Same logs as described above... Nothing, just "reboot". Reverting to older kernels did not help.

The time between reboots became shorter and shorter, and in the end it rebooted about every 10 minutes.

I have ordered new memory and a new barebone box... but as delivery from China takes 3 weeks, I started troubleshooting as well.
I added the string "pci=assign-busses apicmaintimer idle=poll reboot=cold,hard" to the UEFI boot options and uptime is now 2 hours without any reboot... Currently I'm only running Proxmox standalone without any peripheral devices connected and with only one Ethernet port dedicated to the pfSense VM.

I don't have an AMD CPU, so I thought this might not work... but so far it looks much better. I was previously focused on this being a HW issue, most likely the motherboard.

guruevi · Oct 1, 2024

Just FYI, I think idle=poll is the key here. Your CPU (AMD, I'm assuming, in the previous posts) does not support proper C-state handling. This is a well-known issue with Linux on AMD and is a firmware "bug". AMD has attempted to fix it many times, but it is an inherent Zen architectural issue which also crops up on Windows occasionally; Microsoft/AMD have "fixed" this by disabling certain transitions between power-saving modes. The best approach is to disable C-states in UEFI/BIOS instead of using idle=poll, which has other side effects.

Some newer Intel desktop CPUs (12th-14th gen) have recently had similar instability issues, but that seems to be a manufacturing or code defect. According to Intel, just update your firmware (update the BIOS and install the Intel CPU firmware package in Proxmox) or, if the CPU has been damaged, test it and get a replacement from Intel - https://www.intel.com/content/www/us/en/download/15951/intel-processor-diagnostic-tool.html

Ancoor · Oct 1, 2024

guruevi said:
Just FYI, I think idle=poll is the key here. Your CPU (AMD, I'm assuming, in the previous posts) does not support proper C-state handling. This is a well-known issue with Linux on AMD and is a firmware "bug". AMD has attempted to fix it many times, but it is an inherent Zen architectural issue which also crops up on Windows occasionally; Microsoft/AMD have "fixed" this by disabling certain transitions between power-saving modes. The best approach is to disable C-states in UEFI/BIOS instead of using idle=poll, which has other side effects.

Some newer Intel desktop CPUs (12th-14th gen) have recently had similar instability issues, but that seems to be a manufacturing or code defect. According to Intel, just update your firmware (update the BIOS and install the Intel CPU firmware package in Proxmox) or, if the CPU has been damaged, test it and get a replacement from Intel - https://www.intel.com/content/www/us/en/download/15951/intel-processor-diagnostic-tool.html

Thank you for the clarification. The CPU is not really replaceable, as it's fixed to a motherboard small enough to fit in the palm of my hand... ;-)
I will contact Hunsn about an upgrade of the AMI BIOS if possible, and will certainly install the Intel package as suggested.
If it happens again, I'll create a bootable Windows USB and run Intel's CPU test from there, and if the CPU is broken, scrap the hardware or run something else on it....

The N305 is indeed a low-power CPU, but I never put it into sleep mode. I will need to find more info about "idle=poll" and the "side effects" you mention, and will ask Professor Google.

guruevi · Oct 1, 2024

Well, if the chip is damaged, it is damaged, and no OS will function properly. idle=poll is basically the kernel telling the CPU: when you're done executing and ask me for more work, and there is no work, here are some instructions so you don't spin down into a sleep state. So basically, you're always at "100%" CPU usage (they are light threads).

Normally, if there is no work, the CPU would just idle and go into lower power states. The problem is that while that polling code is executing, actual 'real' threads can't be scheduled, and if it's the same thread, your CPU will have switched to its new "do nothing" task and flushed caches, so you may see performance issues; hyperthreading etc. will collapse. It is really intended for debugging, not for production use.
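
To see which idle states the kernel exposes on a given host, and roughly how often each one is entered, the cpuidle entries in sysfs can be listed. This is a minimal sketch with a hypothetical `list_cstates` helper; it assumes the standard cpuidle sysfs layout and prints nothing when no cpuidle driver is active.

```shell
#!/bin/sh
# List the idle states for one CPU along with their entry counters.
# Usage: list_cstates [cpuidle-directory]
list_cstates() {
    base="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for s in "$base"/state*; do
        [ -r "$s/name" ] || continue      # no such state (or no cpuidle at all)
        printf '%s %s usage=%s\n' "${s##*/}" "$(cat "$s/name")" "$(cat "$s/usage")"
    done
}

list_cstates
```

Comparing the output before and after a BIOS C-state change or a kernel parameter change shows whether the deeper states are still being entered.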

Ancoor · Oct 9, 2024

Actually, idle=poll + the intel-microcode package only worked for a day or two. But I found this thread saying that spontaneous reboots/crashes can be related to (in my case) kernel 6.8.12-2... Basically, there are thousands of hosts out there that need a kernel upgrade to 6.10 or 6.11.
https://forum.proxmox.com/threads/proxmox-kernel-6-8-12-2-freezes-again.154875/
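
To check quickly whether a host is still on a kernel older than the 6.10 mentioned above, a version comparison with `sort -V` (GNU coreutils, so available on Debian/PVE) is enough. This is a minimal sketch; the `kernel_older_than` helper is hypothetical.

```shell
#!/bin/sh
# Succeed if version $1 sorts strictly before version $2.
kernel_older_than() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

running="$(uname -r)"
if kernel_older_than "$running" 6.10; then
    echo "$running is older than 6.10"
else
    echo "$running is 6.10 or newer"
fi
```

On a PVE host, `proxmox-boot-tool kernel list` shows the installed kernels and `proxmox-boot-tool kernel pin <version>` selects which one to boot, which makes it easier to test one kernel at a time.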

pschonmann · Jan 8, 2025

Is there a way to debug randomly rebooting machines, some tool that catches why a server reboots? Set and forget, but when the server randomly reboots, you know what happened?

Example

We have new machines from ASRock (1U2S-B650) that reboot randomly; the mobo is a B650D4U with FW @ 10.15
Kernel and IPMI logs show nothing interesting, the server just goes down

/proc/cmdline

BOOT_IMAGE=/boot/vmlinuz-6.8.8-3-pve root=UUID=9342a4c5-b779-486d-b9c0-c42184f02c5b ro quiet pcie_port_pm=off pcie_aspm.policy=performance nvme_core.default_ps_max_latency_us=0

Tried memtest, no errors; tried Hiren's BootCD with the prime95 blend test, it is still running with nothing wrong

I don't know, the dealer doesn't know, and ASRock tech support probably doesn't know (they are horrible)

Do you have any examples/stories of what you use to debug these server fuckups?
Thanks
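
One "set and forget" option is netconsole, which streams kernel messages over UDP to another machine, so the last lines survive even when nothing reaches the local disk. The sketch below only builds and prints the module parameter; the `netconsole_param` helper is hypothetical and every address, interface name and MAC in it is a placeholder.

```shell
#!/bin/sh
# Build the netconsole module parameter in the form
#   <src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
# Usage: netconsole_param <src-ip> <dev> <dst-ip> <dst-mac> [port]
netconsole_param() {
    src_ip="$1"; dev="$2"; dst_ip="$3"; dst_mac="$4"; port="${5:-6666}"
    echo "netconsole=${port}@${src_ip}/${dev},${port}@${dst_ip}/${dst_mac}"
}

# Placeholder addresses: replace with the PVE host and the log receiver.
param="$(netconsole_param 192.0.2.10 eno1 192.0.2.20 00:11:22:33:44:55)"
echo "modprobe netconsole $param"
# On the receiving machine, listen for the UDP stream, e.g.: nc -u -l 6666
```

This only helps if the kernel gets a chance to print something (a panic, oops or machine check). If the board resets or loses power instantly, the receiver will simply show the stream stopping, which is itself a hint that the cause sits below the OS.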

mrpops2ko · Jan 8, 2025

pschonmann said:
Is there a way to debug randomly rebooting machines, some tool that catches why a server reboots? Set and forget, but when the server randomly reboots, you know what happened?

Example

We have new machines from ASRock (1U2S-B650) that reboot randomly; the mobo is a B650D4U with FW @ 10.15
Kernel and IPMI logs show nothing interesting, the server just goes down

/proc/cmdline

BOOT_IMAGE=/boot/vmlinuz-6.8.8-3-pve root=UUID=9342a4c5-b779-486d-b9c0-c42184f02c5b ro quiet pcie_port_pm=off pcie_aspm.policy=performance nvme_core.default_ps_max_latency_us=0

Tried memtest, no errors; tried Hiren's BootCD with the prime95 blend test, it is still running with nothing wrong

I don't know, the dealer doesn't know, and ASRock tech support probably doesn't know (they are horrible)

Do you have any examples/stories of what you use to debug these server fuckups?
Thanks

I don't, and the responses I've seen from Proxmox staff haven't been very good either.

Some say 'try to SSH in and catch it when it happens'.

It has to be super low level because we have no logs at all. I'm approaching the magic reset time: I'm at almost 27 days of uptime, so within the next 3 I will end up crashing / rebooting.
