Proxmox Mystery Random Reboots

peterbrockie

New Member
Mar 28, 2023
I've had a Proxmox install running rock solid for months and all of a sudden it's randomly rebooting. It can run for 5 minutes or 5 hours. The weird thing is, I've basically used up all of the collective posts on the subject and still can't track it down. I'm now at the point where I really just want to know what the cause is for my own sanity. haha

I can't find anything in syslog. I literally watched it reboot and it just goes right from nothing to --REBOOT--.

Although it seems random, it does seem to do it more while I'm around, so it might be tied to load. However, it still reboots without the fast NICs installed and only a 1 Gb web interface going.

Ryzen 5650G Pro
4x Timetec 16GB 2666 MHz ECC
ASRock X570 Steel Legend motherboard
Aquantia 10 GbE NIC
LSI SAS controller (passed through to a VM - system crashes without the VM running though)
Mellanox ConnectX-4 25 GbE NIC (10 and 25 GbE NICs are bridged in Proxmox to act as a switch)
Misc SSDs for booting and VM storage, all mirrored
Originally an 850W Seasonic PSU, now an 860W Fractal
Not that it matters - Rosewill 4U server case
All the PCIe cards have a fan directly blowing on them since they all tend to get toasty.

Things I have tried already:
Updated Proxmox to newest version. apt reinstalled the Proxmox specific packages.
Firmware on all the NICs and SAS card are up to date.
BIOS is up to date.
Tried both stock 5.x kernel and the optional 6.2 kernel.
Tested RAM. No issues.
Replaced PSU. (seems to be the most commonly suggested issue)
Power is rock solid and is being fed by a line-interactive UPS.
Removed NICs
Restarts even if no VMs are running.
Changed the watchdog settings to 0 in... whatever config has them - I don't even remember anymore. :D
Motherboard doesn't have a built in watchdog reset timer from what I can tell.
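For anyone checking the watchdog angle on their own box, a quick sketch of how to see whether any watchdog driver is even loaded (the `/etc/default/pve-ha-manager` path is where Proxmox's softdog is configured, if I recall correctly - treat it as an assumption):

```shell
# Check whether any watchdog driver is loaded at all:
lsmod | grep -i -E 'wdt|watchdog' || echo "no watchdog modules loaded"
# Proxmox HA's softdog settings (WATCHDOG_MODULE etc.) should live here:
cat /etc/default/pve-ha-manager 2>/dev/null || echo "file not present"
```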


I can provide any logs or outputs anyone can think of to try and figure this out. The last resort is to reinstall Proxmox, but I think that's avoiding the problem (which may still exist if it's somehow related to hardware).
 
You could try removing all PCIe cards and then see if the reboots stop. It could be that one of them is faulty and crashing your system.
For example, last week one of my LSI SAS HBAs started to fail. Sometimes it would work for some hours and then stop working; sometimes it wouldn't even be recognized at boot. It didn't crash the server, as the system disks aren't on that HBA, but you never know how hardware will behave when it is failing. It might even be shorting something and triggering the short-circuit protection of the PSU, causing a reboot.

In case you aren't running HDDs, you could shake the server while it's running to see if a bad cable, socket connection, or microfissures are causing trouble.
 
Sorry for the delay in the reply - stability testing is never fast. :D

I thought I'd try the "test all hardware at once method" by just installing Proxmox 7.3 on a flash drive and running it with the same cards installed, along with the same network bridge. No crashes.

I updated to 7.4 and still no crashes.

Which is damn weird, seeing as it was crashing with zero VMs running. SMART data on the mirrored boot SSDs looks good. I think what I'm really after at this point is a way to pinpoint in the logs what's crashing it when booted from the normal drives - I'm not a super Linux user, so I'm not quite sure where it could be hiding in the logs. I know syslog shows nothing.
 
Removed all PCIe cards. Still restarting.
I cloned my boot drives to new SSDs. Still restarting. One note about cloning: I did manage to get it to crash in Clonezilla during the clone, but not when I ran the drive in another system, implying it was either the controller or the cable.
Moved my boot SSDs to a cheap PCIe SATA card and started getting lots of SATA errors. Granted, it could be the crap card, but it might be that the on-board SATA wasn't reporting the errors correctly.
Replaced the SATA cables for the boot drives and went back to the on-board and it still rebooted.

Going to pop in an X299 motherboard and CPU and see if that stops the issues. At this point it's only the CPU/RAM/mobo left (and the RAM and CPU are pretty well verified as working by memtest).
 
@peterbrockie I'm having a similar issue since upgrading to 7.3+. The more I upgrade the worse it seems to get.

I notice we're both using Ryzen CPUs and x570 motherboards. Coincidence? https://forum.proxmox.com/threads/proxmox-restarting-regularly-since-7-3-7-4-upgrade.125499/

I'm going to test a downgrade to 7.2 shortly and see if this makes any difference.
I'm inclined to agree. It does seem to be related to the motherboard and/or CPU. I swapped the board for an X299 + 7980XE combo and it has been solid for 12 hours now with all the VMs running.

What really bothers me about all this is that there doesn't seem to be any logging of the reboots. No panics or weird messages. Just reboots.
 
Well, I don't have a spare mobo and Intel CPU, so I couldn't do that swap. I DID do some further reading/googling and found a few posts from Ubuntu and Debian users having issues with X570 and/or Ryzen CPUs (here and here), and these have been helpful!

I've since done the following:

I edited `/etc/default/grub`, replacing the default CMDLINE as follows:

```
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=assign-busses apicmaintimer idle=poll reboot=cold,hard"
```

and then rebuilt the GRUB config by running `update-grub`. I'm no expert, but at a pinch I'd say the `apicmaintimer` directive (more here) is doing most of the heavy lifting. Here's what ChatGPT tells me each option does:

  • pci=assign-busses: This option instructs the kernel to assign bus numbers to PCI devices in a deterministic manner. This can be useful for ensuring that device enumeration is consistent across reboots.
  • apicmaintimer: This option specifies the use of the Advanced Programmable Interrupt Controller (APIC) timer for system timing. This can provide more accurate and reliable timing information than other timers.
  • idle=poll: This option specifies the use of the CPU's polling mechanism for idle tasks. This can improve performance in some cases, but may increase power consumption.
  • reboot=cold,hard: This option specifies the behavior of the system when a reboot command is issued. The cold option specifies a full system reset, while the hard option specifies a hard reset of the system. This can be useful for troubleshooting system stability issues.
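For anyone who wants to sanity-check the edit before touching the real file, here's the same change sketched on a sample file (the sample contents are an assumption about a stock install - on the live system the target is `/etc/default/grub` followed by `update-grub`):

```shell
# Work on a sample file first; the real target is /etc/default/grub.
# (These sample contents assume a stock Proxmox install.)
cat > /tmp/grub.sample <<'EOF'
GRUB_DEFAULT=0
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
GRUB_CMDLINE_LINUX=""
EOF
OPTS='pci=assign-busses apicmaintimer idle=poll reboot=cold,hard'
sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"quiet\)\"/\1 $OPTS\"/" /tmp/grub.sample
grep '^GRUB_CMDLINE_LINUX_DEFAULT' /tmp/grub.sample
# On the live system: apply the same edit to /etc/default/grub, run update-grub,
# reboot, then verify the options took effect with: cat /proc/cmdline
```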


The good news is - after a reboot, I've not had a single crash. I'm at 12 hrs and counting, where before I couldn't stay stable for more than 2 hours for days. I'm now running the most recent opt-in kernel `6.2.9-1-pve` as well, though I'm not sure whether that's required.

It would be interesting to hear if the same changes fix your issues too. Hope this is helpful.
 
I have my pulled X570 board. I installed a fresh Proxmox 7.4 and fully updated. Nothing is installed other than the SATA SSD for booting. I let it run for a day with a dummy Win 11 VM running and verified it's rebooting. I'm going to make the changes and let you know if it changes anything.

UPDATE: Made the changes and no crashes in over a day. Looks like the fix works.

Note: If you're using UEFI like I am, you probably don't have GRUB installed and are using systemd-boot. To update your boot settings for that simply run:

Run `nano /etc/kernel/cmdline` and add `quiet pci=assign-busses apicmaintimer idle=poll reboot=cold,hard` to the end of the first line.

Then run `proxmox-boot-tool refresh` and reboot.
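The same one-liner edit, sketched on a sample file (the `root=` line here is made up; your real `/etc/kernel/cmdline` will differ - after editing the real file, `proxmox-boot-tool refresh` copies it out to the ESP):

```shell
# Append the options to the single line of a (sample) /etc/kernel/cmdline.
printf 'root=ZFS=rpool/ROOT/pve-1 boot=zfs\n' > /tmp/cmdline.sample
sed -i '1 s/$/ quiet pci=assign-busses apicmaintimer idle=poll reboot=cold,hard/' /tmp/cmdline.sample
cat /tmp/cmdline.sample
```

Note that systemd-boot expects everything on one line; the `1 s/$/.../` sed keeps it that way.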
 
That's great. I'm at almost 4 days uptime now, so I'm counting that as fixed too. Not sure why this started causing issues in 7.3+, but it's obviously some kernel and CPU bug. Best of luck with it all.
 
What really bothers me about all this is that there doesn't seem to be any logging of the reboots. No panics or weird messages. Just reboots.
Same thing here as everyone else, it seems: magic reboots out of nowhere.
But I think this started for me around the time I switched to the new opt-in kernel 6.2 (I don't think I had the problem under Proxmox 7.4). And what puzzles me too is the lack of logs. This is the most verbose I could get:
-- Boot feaaeea8ab5340ca9f7f4624b47f748a --
Maybe that is just journalctl being shy or something.

Unfortunately, the fix (I've tried pci=assign-busses apicmaintimer idle=poll) is causing another issue on my machine: k10temp-pci-00c3 suddenly rises to a consistent 68 to 75 °C without the CPU doing anything heavy (it's usually 48 to 55 °C idle). I'm guessing idle=poll is the culprit here.
Did you guys notice similar behavior?
For now I've reverted back to 5.15. If there's another unexpected reboot, I suppose I should try the GRUB options again.
 
Have you tried disabling C-state 6 (and lower)? There seems to be a running issue with random restarts using AMD systems and Linux 5+ kernel. All of them open and describe similar symptoms. Disabling C-state 6 is suggested in few of them as stabilizing their systems.

https://bugzilla.kernel.org/show_bug.cgi?id=206487 and a few older ones.

Add the following to your kernel boot command: "processor.max_cstate=5 intel_idle.max_cstate=0" and see if it stabilizes.

I have a system that showed similar symptoms (running a Ryzen 5825U) and it has been running rock solid since making this change. The change didn't materially affect power consumption, since C6 is a pretty low-power state and it really won't be missed.
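A quick way to double-check that the options actually made it into the running kernel after a reboot - sketched here against a sample string standing in for `/proc/cmdline` (on the live host you'd read the real file instead):

```shell
# Sample command line standing in for the contents of /proc/cmdline:
CMDLINE='BOOT_IMAGE=/boot/vmlinuz-6.2.9-1-pve root=/dev/pve/root ro quiet processor.max_cstate=5 intel_idle.max_cstate=0'
# One option per line, keeping only the c-state knobs:
echo "$CMDLINE" | tr ' ' '\n' | grep max_cstate
```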
 
Thank you for your reply, PigLover
Have you tried disabling C-state 6 (and lower)? There seems to be a running issue with random restarts using AMD systems and Linux 5+ kernel. All of them open and describe similar symptoms. Disabling C-state 6 is suggested in few of them as stabilizing their systems.
No, I didn't try disabling C-states, but I don't think I'm affected by this bug (still, good to know about it). If that were the case, I would have had this issue a long time ago, whereas I've "only" been seeing these unexpected reboots for about two weeks (rough estimate), which is also about when I switched to the 6.2 kernel.
It's still a bit too soon to declare myself safe, but Proxmox has been running the 5.15 kernel for the last 13 straight hours, which is a record of sorts.
 
@Nuke Bloodaxe
Nope, I'm good with my rollback, thanks. That's 25 hours now running without a reboot. I think (hopefully) my case is closed. Maybe it will need to be addressed again when I switch back to the 6.2 kernel.
Either way, thank you for the tip about the RAM speed.
 
I have this MoBo in a server - X570D4U-2L2T
https://www.asrockrack.com/general/productdetail.asp?Model=X570D4U-2L2T#Specifications

Ran memtest - OK
SMART of the disks + NVMe looks fine
Tried switching to new hardware of the same server type

Reboots still occur occasionally, roughly every ~3 months:

2022-12-08 09:40:25
2023-04-05 14:51:34
2023-07-14 10:11:30

But we have lots of these servers (10+) and only this one is doing nasty things. The latest firmware looked fine, until now - reboots happened again. I'll try these cmdlines.

Proxmox runs a mix of Windows + Linux servers:
Code:
     11 ostype: l26
      7 ostype: win10
      1 ostype: wxp
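For what it's worth, the gaps between those three timestamps can be computed with GNU `date`, and they come out at roughly three to four months apart:

```shell
# Whole days between consecutive logged reboots (GNU date's -d parsing).
t1='2022-12-08 09:40:25'; t2='2023-04-05 14:51:34'; t3='2023-07-14 10:11:30'
days() { echo $(( ( $(date -d "$2" +%s) - $(date -d "$1" +%s) ) / 86400 )); }
days "$t1" "$t2"   # -> 118
days "$t2" "$t3"   # -> 99
```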
 
Hi, did you find any solution? I'm experiencing the same issue with an ASUS motherboard.
 
I'm having the same issue. Hardware unchanged for 13 months. 11 months zero problems but recently it's randomly rebooting after hours or days. Logs of course show nothing.

I just now updated to pvetest repo because this is becoming a serious production issue.

I'm using a 13th-gen i7 with 128 GB ECC.
 
Running into the same issue for a few days now on an updated 8.1.4 PVE.
I haven't checked the hardware yet, but going to do that soon.
 
