Proxmox system reboots after between 19 and 20 hours.

Oats182 · Aug 20, 2025

Hi, First time poster and am not 100% sure what infomation is needed, please advise. I've attached a log of the on-time period. Open to all suggestions please and thanks! Basically I'm at my limit. These are all new system parts (including the parts that were replaced, listed below). I'm going to be overly thorough -- this is the weirdest thing I've ever seen!

Brand new system was built with:

Asrock Rack B650D4u
EVGA 1000 GQ GOLD 1000w
2 x NEMIX RAM 32GB (1X32GB) DDR5 4800MHZ PC5-38400 2Rx8 1.1V CL40 288-PIN ECC UDIMM Unbuffered Memory Compatible with Kingston KSM48E40BD8KI-32HA, --KSM48E40BD8KM-32HM totaling 64GB installed (this Kingston type is on the approved list from Asrock -- Nemix: i've heard.. never again, but not seemingly the problem)
AMD Ryzen 9 9900x
Seagate Ironwolf Pro 16TB ST16000NTZ01 (proxmox backup snapshots)
Segate Ironwolf Pro 8TB ST80000NTZ01(vm storage - Ubuntu 24.04 as a BackupPC server of approx 6TB)
Kingston DC600M 3840GB 2.5" (vm storage - Ubuntu 24.04 as a file server of approx 3.2TB)
Kingston DC600M 480GB 2.5" Proxmox boot drive
Samsung 9100 Pro NVMe (vm storage 2x instances of Windows 11 Pro each of approx 400 GB size)

Other Notes:

Only using 1 of the onboard nics for Proxmox AND one BMC nic cable is attached. Both are plugged into the same switch in test scenarios.
Started with a fresh ISO last week of Proxmox 8.4 with its default kernel.
Full reinstall (wipe drive and start fresh) with Proxmox 9 iso default kernel.
Had to use nomodset param in grub for install in both instances.
Proxmox install on the asrock detected my network IPs. proxmox 9 on the supermicro doesn't but manual entery works fine.
BOTH motherboards happen to have i210 intel nics on board . (I was reading this can be an issue but a hard reboot on a schedule??)
Log has no errors out of the norm that I can see. (zfs kernel taint but I read that is to be excpected if not using??)
All drives have been formatted with EXT4. Other than Proxmox's default boot drive LVM, there are no LVM's in use, no raid. Nice and simple.
Temps are never high. Noctua cooler seems to work fine. Drive temps are never hot. System is basically idle.
I've tried grub with both 0 and 1 for options processor.max_cstate=0 amd_idle.max_cstate=0
I've tried cstate 0 and 1 in bios (based on what I had set in grub) and disabling anything power-savings wise that Supermicro gives access to in the bios.
Can't remember the exact syntax but also had disabled some power savings via grub for nvme and pcie_aspm that I read about somewhere. Stil reboots at 19 - 20 hr. this was removed.

TESTING AND CHANGES (NONE of which made a difference-- ie. The system either hard locked or spontaneously rebooted between 19 and 20(+/-5) minutes :

Asrock IPMI reports voltage issues in the coin battery out of box - replaced it - IPMI error cleared.
Asrock support provided a beta bios and bmc to try. It made no difference.
Memory was put through 4 passes of memtest86 - passed no errors.
CPU was replaced with a new AMD Ryzen 7 9700x. made no difference.
RAM was replaced with new TEAMGROUP T-Create Classic 10L DDR5 32GB Kit (2 x 16GB) 5600MHz (PC5-44800) CL46 Desktop Memory Module Ram, Supports Both Intel & AMD - CTCCD532G5600HC46DC01 (booted fine) Memtest done with 4 passes. Same problem after 19ish hours. RAM booted fine BTW if any one is looking for a cheap option for the Asrock B650D4U's
Boot drive was replaced with a 1 TB older Samsung Gen2 NVMe I had kicking around. Full proxmox 8.4 ISO I grabbed from last week. Problem still exists. Kingston 480GB boot was not connected during this period.
reverted back to Nemix RAM, CPU, Kinston DC600M 480GB 2.5" and installed the newly released Proxmox 9 (so much hope I had - good times)
did 10 more passes of memtest86 on the nemix - no errors - passed.
Bought a brand new Supermicro H13SAE-MF motherboard - bios understood the Ryzen 9 9000. It's at release 2.3. There is a 2.4 on the SuperMicro site but the a-holes don't show a change log anywhere. Haven't updated to the 2.4 bios as I'm not thinking it's the issue.
changed ALL SATA cables.
Did a test with only the Windows VMs (no linux). Same problem.
Removed the Samsung 9100 Pro NVMe for a test day with Windows VMs on Kingston 3840GB. Same problem.
installed without error and f-around with xcp-ng. Found out after several hours you can't create a VM of more than 2GB (like in the 90's) without LVM magic - that's annoying, so screw them. Much prefer to take an EXT drive from a system and move it if needed. Another fresh install of Proxmox 9. (sadly, I didn't wait 20 hours to see how XCP-NG played).Might be a new OS install in my testing future since I jumped the gun on removing XCP-NG
Noticed in the logs that nomodset was a thing on the running Proxmox box. Found an "installation.cfg" passing that in /etc/defaults/grub.d Removed that and now am seeing the AMD CPU items loading fine with no "nomodeset warnings" in journalctl. 20 hours later - still rebooted on schedule.
Replaced power supply with Superflower 850Watt Gold. Same issue.
Physically moved the system in my house to directly connect it (same ports LAN1 plugged in, LAN2 not, BMC plugged in with new cables). Into the Unifi 24-port switch. Originally, it's been on a 5-port Unifi switch to this point. I also replaced the power cord. Connected to an APC battery in this test. Ran it headless. No change - reboots after just shy of 20 hours.

The server has never crashed while in use. I had no problem passing a USB drive onto the Linux VM and rsyncing 700GB of files while doing the chaff removal from WinVMs.

Linux VMs have always had x86-64-v2-AES, 4 cores, OVMF UEFI, Q35 qcow. One has ballooning on with 1 to 4GB balloon RAM; the other same, except 4 to 8 GB RAM.

Win VMs have been tried with and without ballooning. 1 has 8GB. 1 has 16GB. OVMF UEFI, Q35 qcow and Win11 picked during install for the TPM/secure boot items. Both have 4 cores. For CPU types, I started with X86-64-v2-AES. Switched CPU to host and test. Now (active during previous crash) I've got them set to x86-64-v3. Same problem.
In one case, after wiping it, I quickly did 2 Linux and 2 Windows VM installations all simultaneously. This hardware is a champ!! (but only for 20 hours).

During the above tests, I completely wiped the VMs and started fresh installs. Ubuntu Server 24.04 and Win1124h2engX64 ISO are recent (last couple of weeks). virtio-win-0.1.271.iso re-download and installed in win11vms.

There is no PCI video card - no GPU pass-through in use or needed.

I have updated my UniFi equipment and rebooted my pfSense dedicated box. Same deal at around 19+ hours.

What's left?? Everything seems to be a stretch but I've built literally hundreds of systems over 2 decades! I've seen / done some pretty crazy and or stupid things. Not sure what I'm missing. Do I move the tower to the clients office (tplink omada gear there) in case of something weird on my network?? My current test with VMs off and drives unplugged will be completed in about 16 hours and ~50 minutes. (It's like knowing when you're going to die.) This last test will be a replacement of every single pc of hardware if your keeping score (except for the CPU cooler.)

I need to get this to a site and it's taking a tole on my mojo to have to make a change and wait for the fail in 20hrs.
Does anyone have any suggestions?? Please and thanks for your time or maybe related crazy story!

dcsapak · Aug 20, 2025

Hi

this sounds indeed very weird and unusual. Just to clarify do you mean it reboots every 19-20hours or every day between 19:00 and 20:00 ?
Where in the logs did the reboot occur ? (or was that only between reboots?)

Sorry if i have missed something, but it seems you replaced basically every part? So there is not a single thing that did not change between your tests?
The only things that would come to mind are
* the shared power grid/supply of your house (?)
Depending on what kind of APC battery you used, a weird glitch in the phase/frequency might propagate through?
* the same vm configurations
It could be some behavior trigger inside the vms that trigger some weird/unknown bug somehow? (it doesn't sound like it, but if you have the time, letting the system run without vms/containers to test if it still reboots/crashes would be interesting)

other than that, did you do anything custom? any scripts/customization/automation that was not done with the default gui?

Oats182 · Aug 20, 2025

dcsapak said:
Hi

this sounds indeed very weird and unusual. Just to clarify do you mean it reboots every 19-20hours or every day between 19:00 and 20:00 ?
Where in the logs did the reboot occur ? (or was that only between reboots?)

Sorry if i have missed something, but it seems you replaced basically every part? So there is not a single thing that did not change between your tests?
The only things that would come to mind are
* the shared power grid/supply of your house (?)
Depending on what kind of APC battery you used, a weird glitch in the phase/frequency might propagate through?
* the same vm configurations
It could be some behavior trigger inside the vms that trigger some weird/unknown bug somehow? (it doesn't sound like it, but if you have the time, letting the system run without vms/containers to test if it still reboots/crashes would be interesting)

other than that, did you do anything custom? any scripts/customization/automation that was not done with the default gui?

Hi dcspak, thank you so much for the reply. It's up time of the actual Proxmox OS. I get to work on it 4 hours earlier every day since the hardware was purchased July 23ish. It even crossed my mind it was a time zone thing. I'm in EST currently, 4 hours behind. BIOS is seemingly GMT, but TZ is set to EST in Proxmox, Windows, and Linux VMs.

The log is from the time it is booted until the crash. I was likely fiddling and rebooting so it isn't necessarily a log from crash - into a disk cleaning - into a crash if that makes sense.

It was running on just a power bar I have on my workbench for the first several weeks. The last test was on an APC. No change.
Home power seems weird to be the culprit... I have 2 personal use Proxmox servers running on the network (non-cluster) and 2 Linux servers, a couple of Windows machines - no problems for years. I've tested and installed, since December, 5, roughly similar Asrock Rack B650D4U motherboards for a few very small businesses since December - no problems. I always hit them with memtest and let them run for several days at a minimum while setting up. No problems there. I checked last night, and some of those were Proxmox 8.3 and still are. The 8.4 Proxmox of those all work fine, but in case not a coincidence, each had PCIe 10GB nics added in, and those would be configured with a 1GB mobo bmc connection and a DAC to the 10GB. Up time on those I checked last night, and all are running as expected.
VM configurations are very vanilla and nothing I've had to fiddle with on previous versions. (I'm not that adventurous) Ubuntu 24.40 install - nothing on it yet, but a folder that hasn't got a shared directory on it even configured yet. Ubuntu 24.04 with BackupPC installed. 2x Windows 11 24h2 Pro out of the box that hasn't been activated yet (I like to the activation after I know the system is stable to save several phone calls to MS explaining the hardware change triggers an activation problem). Updates are all installed on the Windows machines, but the first thing I do when I get a console for the Windows machines is to install the qemu guest utils and do a shutdown of the VM. Linux is the same, but the qemu-guest-tools are installed. Turn on QEMU in Proxmox, boot back up. So - nothing custom at all about the setup - no scripts - same things I've been doing on past builds - very vanilla.

I should mention on 8.4 it was a hard lockup minutes after 19 hours. Had to do a reboot by power switch or from the IPMI. On 9.x proxmox it reboots closer to the 20 hour mark. Maybe not relevant but who knows. IPMI always stable in both scenarios.

The IPMI never falters. I've watched the Proxmox OS just spontaneously reboot during the 19-20 hours of Proxmox on time while looking at the IPMI remote console. The IPMI I've had on for several hours before booting the Proxmox OS as well. I keep an eye on it to get the times and list of processes with htop in a terminal too to see the "Uptime" has stopped and see what processes are running and CPU load/mem levels -- Nothing consistent here - CPU is always not under load but it also has many hours of not-under-load before the 19 hour mark.

I've always had the Windows VMs running in tests, but have had the 2 Linux VMs off. Never tested with the opposite - Linux VMs running and Windows off.

The last 3 parts that have never been disconnected are unplugged for my latest test. That's 2x Seagate 7200RPMs and 1x Kingston Enterprise 3840GB. All on SATA. All VMs are off. All drives with the exception of the Proxmox boot drive (it doesn't have VMs on it) are unmounted as well (I left the NVME in the computer but not mounted.

Up time at the time of writing is 18hrs 01 mins. We shall see. I'll connect a term with bmon as well to see if there's any out-of-band shenanigans.

Thanks again,
Shawn.

pmt_cnq · Aug 20, 2025

Some tests you can do:
1- Reboot on a USB Live Ubuntu to isolate if the issue is related to the installation of Proxmox, or hardware. That way you will bypass the installed Proxmox.
2- If the issue is still present on a "live OS" from USB, then the issue is certainly hardware related.
3- Test with only 1 stick of ram.
3.1- If the issue is still present, bump up the RAM voltage.
4- Check if you can bump up the SoC voltage too.
5- Try to use a locked CPU frequency with fixed voltage instead of stock variable boost with dynamic voltage.

Oats182 · Aug 20, 2025

pmt_cnq said:
Some tests you can do:
1- Reboot on a USB Live Ubuntu to isolate if the issue is related to the installation of Proxmox, or hardware. That way you will bypass the installed Proxmox.
2- If the issue is still present on a "live OS" from USB, then the issue is certainly hardware related.
3- Test with only 1 stick of ram.
3.1- If the issue is still present, bump up the RAM voltage.
4- Check if you can bump up the SoC voltage too.
5- Try to use a locked CPU frequency with fixed voltage instead of stock variable boost with dynamic voltage.

Thanks - will researh your bios suggestions next.

Oats182 · Aug 20, 2025

It's rebooted again. @ 19hrs and 59 minutes this time. Last thing triggered on htop was pve-firewall but that hasn't been consitent - usually when running it's one of the VMs bouncing around at the top. Attaching screenshot of bmon - nothing out of the ordinary here but some dropped packets that were there for last couple of hours. Almost no load in HTOP view - screenshot attached. No VMs running at all. New log attached. Last thing registered in the log was ~6 minutes before the crash.
Will install vanilla ubuntu on the drive and test again.

Oats182 · Aug 20, 2025

I've just installed on bare metal Ubuntu server 22.04. Updated it. Same drive proxmox was booting from (but wiped). The 20-hour timer started.

UPDATE: it failed again - longest time yet @ ~22.5 hours (first time I've seen more than a minute or two past 20 hours). That was with Ubuntu Server 22.04LTS with its default Kernel 5.15.

Next test is IPMI set to not failover, add-on PCIe NIC as main Proxmox 9.0 interface (clean install again). So - avoiding onboard i210 nics except BMC.
I'm seeing in the BMC a fair bit of "Dedicated LAN link Up - Assertion and Dedicated LAN Link Down - Assertion.
Does not relate to the time of crash exactly but it's in the range.
Also, leaving off BIOS option "power on after power outage" to see if it shuts off completely vs the reset/reboot it's been doing. Not sure how that helps but It's something to look forward to. Good times.

All that said - obvoiusly this isn't a problem specific to Proxmox. I'll post back here if I find a solution. Thanks for everyone's input! If someone comes across this and has a solution, I'll keep an eye.

QUESTION - why is there an installer.cfg in /etc/default/grub.d
It's there again after the fresh install of proxmox 9.0-1. Maybe that's normal
It has the nomodeset I had to use again in the installer. This time I left the monitor unplugged (install via ipmi), but I still had a white screen in the installer in the IPMI, so I did the edit the Proxmox 9 graphical installer and added nomodeset.
Maybe I've doing it wrong? Do I erase installer.cfg from OR
#GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX nomodeset" (comment out) OR
make it look like this "GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX"
What's best practice? I've never noticed this in the past.

pmt_cnq · Aug 22, 2025

Great, now we've managed to isolate the issue and we know it's hardware related. I don't think it's related to Proxmox anyhow at this point.
You should try the suggested fix and tests I've mentioned earlier:
"3- Test with only 1 stick of ram.
3.1- If the issue is still present, bump up the RAM voltage.
4- Check if you can bump up the SoC voltage too.
4.1 If the issue is still present with 1 stick of RAM and higher voltage, then the problem is mostly with the CPU or the motherboard.
5- Try to use a locked CPU frequency with fixed voltage instead of stock variable boost with dynamic voltage. Start at low frequency to see if the system stabilize, then go higher until it crashes. "

I highly suspect a faulty CPU that is boosting higher than it should be. But you need to do the tests before we can confirm.
Let us know, i'm curious to see the outcome.

Oats182 · Aug 22, 2025

Another crash - this time at 19:55 minutes.
This test was at the client's site with: IPMI-no failover nic. (Only one Ethernet cable is plugged into the motherboard sockets.)
An extra NIC (Realtek) PCIe was installed in the system, and Proxmox 9 was freshly installed with it. Works perfectly until the 19+ hour mark.

Conclusion - nothing to do with onboard i210 nics or BMC that I can think of.

This time, I was actually working in the IPMI when the crash occurred at the time mentioned - the IPMI didn't become unresponsive or log me out.
The system ran like a champ while on last night - I did a bunch of scripts to copy data from the existing server and other misc setup options to give it some load. Very fast system to be sure. No problems at all until the 19hr+ on time.

I'll try as pmt_cnq suggests (appreciate your help and interest). These bios tweaks are NOT my strong suit.
I don't think I can adjust RAM Voltage or SOC voltage in the SuperMicro BIOS. I've attached screenshots of the only 2 relevant pages that I can see.

I read about Linux not playing well with AMD fTPM. Any reason to think that would cause an issue like this? Seems a stretch.

I'm going to go pick up the system to bring back to my bench. I'm thinking of trying another, more simplistic power supply that doesn't have the ECO settings that both the test EVGA 1000 GQ Gold and Superflower 850 Gold have (**TBD if this is done for next test today-will repurpose in some win desktops I have to do soon). I've sent an email to Supermicro support to see if they have any available "change log" for the 2.4 bios version, or if they've seen anything remotely close to this fiasco. Again - grasping at straws. I'll try going down to 1 stick of RAM for a "cycle" but I'm not sure how that helps (again, not at all proficient in BIOS RAM timing settings, so excuse my ignorance, please and thanks) since I've already test this on two motherboards with brand new ECC and non-ECC RAM from different mfgs. Same with the CPU test going to a brand new Ryzen 7 9700x.

pmt_cnq · Aug 25, 2025

At this point I think it falls outside of Proxmox support scope. But from my research, I think your motherboard is defective. It seems to be a very "problematic board" in general from what I've found.
Did you manage to do the tests?
Also, in the CPU configuration, you can disable the "Core performance boost", just to see how it goes. Let us know.

Oats182 · Aug 25, 2025

Agree - outside Proxmox. Your interest is appreciated. Current test is replacing the Noctua cooler!!??? CPU cooler is the only original part that I started with a month ago. I'm completely at a loss on this!
Last test was to put in a Asus B650 CSM desktop motherboard . Same issue. 19 hrs and 55 minutes - hard crash like on the Asrock Rack (not just the reboot like on the Supermicro.)
Previous test was to disconnect everything to do with the case it is mounted in. Same time range (missed exact time but close to 20 hr mark).
Before that, a dumbed-down Gigabyte 650W Gold PSU. Supermicro BMC fan speed settings set to max. Same thing 19 hours 55 minutes.
Day before that was disabling Ryzen AMDgpu and blacklist it in Linux and add an old Nvidia GPU PCIe - 19hrs 56mins reboot.
Ran burn-in tests on Windows machines again several hours after on time - runs like a champ temps good, fans appropriate ramp up as needed. Leave the system sit and it goes down around 20 hours with no load. Not sure why it's now 19 hours and 55 as the average - when this new build started a month ago it would vary a little more in the closer to 19 hour range. Crazy.

xmesaj2 · Sep 3, 2025

I had similar issues with my AMD system (not AsRock Rack/Supermicro but desktop regular consumer parts), the reboots could occur every 5mins, I tried a lot of solutions with chatGPT and such, also using AsRock B450M Pro4 and swapped to B500M itx AC for the same issues.
Some solutions helped for little longer uptime, but eventually failure was imminent.

I changed
- PSU (Corsair RMx750W to some random old Delta PSU 400W and PicoPSU 120W (my total draw is 80W Peak)) ,
- CPU (4300GE Ryzen PRO to 4650G PRO - both have ECC support per Ryzen PRO series)
- NVME drive (some small 128GB SiliconPower to 1TB WD Red),
- CPU fan,
- Took out RAM stick to run a single (was slightly better), ran memtest86+ for hours, 1 stick each separately, 2 sticks - no errors
- played with BIOS - disabling Cstate control, disabling CPPC, PSP, IOMMU, disabled ECC, even asked AsRock support for newest AGESA 1.2.0.F BIOS patch that they sent me via support request
- installed LSI HBA
- installed aqc107 10G NIC and Mellanoc ConnectX3, switched back to onboard realtek NIC,
- install fresh Proxmox9, even tried installing TrueNAS
- swapped to kernels 6.1, 6.8, 6.12 from Proxmox, 6.16 zabbly, 6.15 psycachy
- blacklisted amdgpu, added NONCQ to libata, set maxcstate=1, disabled watchdog etc
- played with direct_io on my VMs, LXC and host, this slightly improved things

but after all nothing !! changed. My sanity.

Lastly, my problem ended up to be a RAM, I use server ECC memory, it does not play well with Proxmox (and Linux overall),
I put the same 2x 32GB UDIMM ECC sticks into my personal gaming PC (B550M Pro4 motherboard, Ryzen 5950X) and moved my 2x 16GB non-ECC RAM into Proxmox server.

Results:
Proxmox runs 2d uptime now with non ECC "gaming" sticks
My personal Windows PC even with heavy gaming has no issues with the same ECC sticks that my Proxmox server(!) used at first. (the RAM is not listed as compatible officially, but detected and running)
If I recall correctly, no such issues been going with Proxmox 7, I ignored it when it started already with PVE8.

I have checked your log and there is no panic and not much, same used to be with mine, it just hang. Crashkernel does not catch anything at all.

I see you are running BIOS 2.3, there is 2.5 update available for your Supermicro.
https://www.supermicro.com/en/support/resources/downloadcenter/firmware/MBD-H13SAE-MF/BIOS

Oats182 · Sep 3, 2025

Thanks! Ended up being the same thing. RAM. Problem solved - still some mysteries. Very strange as I swapped it previously on the Asrock Rack board*.

Reboots over the previous days before changing RAM were at 19hrs 55min, next day at 19hr56min, day after that 19hr 57mins. Thought I was going nuts. I was literally dreaming about it!

Finally!! After 30+ days of a 19-20 hour reboot or hard lockup cycle, try something else, retest after failure... it's stable! I even tried another 9 9700x in the system - this led me back to RAM. 3 CPUs and 4 motherboards -- what else could it be...

I put different desktop/gaming RAM, same but new from the previous tests, Team Group T-Create 5600 DDR5 32 GB Kit RAM, back in the Supermicro and reset the BIOS (made so many changes I thought it pertinent). The motherboard this time rebooted itself several times (didn't before on the Asrock Rack) and acted like it was re-training on the RAM (long bios load time, reboot, long bios load time reboot, then a normal speed bios load -- maybe it wasn't retraining, but that's what happened, and Proxmox booted. During this last test, I installed kdump, put grub back to its defaults, and changed the hostname (again grasping at straws in case of something weird with networking). And it worked! I'm assuming the hostname and kdump had nothing to do with it. Shut it down and filled all 4 dimms with the same RAM, still fine. I left it that way, and it's been running fine ever since! Obviously, the speed of the RAM is down per Supermicro/filling-all-dimms drops to 3600MHz, but that isn't a concern for this build and what the system is used for, and as a bonus, I'm not having wet dreams about being at a Linux conference in Germany focused on ECC in recent kernels - I don't even speak ECC. WTF.

Lessons learned: Never assume RAM is fine just because the memtestx86 says PASS I guess?? Still, some unanswered questions on why the "gaming" RAM didn't retrain on the Asrock (if that was even the problem - is a bios reset required? It hasn't been on desktops in general, I've found...) and why the non-ECC still caused the reboot issue when replaced early on in troubleshooting at the same uptime interval.

Thanks to everyone for their input! It's been a ride!

Search

Search

Proxmox system reboots after between 19 and 20 hours.

Oats182

New Member

Attachments

dcsapak

Proxmox Staff Member

Oats182

New Member

pmt_cnq

New Member

Oats182

New Member

Oats182

New Member

Attachments

Oats182

New Member

pmt_cnq

New Member

Oats182

New Member

Attachments

pmt_cnq

New Member

Oats182

New Member

xmesaj2

Member

Oats182

New Member

We value your privacy