Hi, first-time poster here, and I'm not 100% sure what information is needed, so please advise. I've attached a log of the on-time period. Open to all suggestions, please and thanks! Basically, I'm at my limit. These are all new system parts (including the parts that were replaced, listed below). I'm going to be overly thorough -- this is the weirdest thing I've ever seen!
Brand new system was built with:
- ASRock Rack B650D4U
- EVGA 1000 GQ Gold 1000W
- 2 x NEMIX RAM 32GB (1x32GB) DDR5 4800MHz PC5-38400 2Rx8 1.1V CL40 288-pin ECC UDIMM, compatible with Kingston KSM48E40BD8KI-32HA / KSM48E40BD8KM-32HM, totaling 64GB installed (this Kingston type is on the approved list from ASRock -- NEMIX, I've heard, is a "never again" brand, but it doesn't seem to be the problem here)
- AMD Ryzen 9 9900X
- Seagate IronWolf Pro 16TB ST16000NTZ01 (Proxmox backup snapshots)
- Seagate IronWolf Pro 8TB ST80000NTZ01 (VM storage - Ubuntu 24.04 as a BackupPC server of approx 6TB)
- Kingston DC600M 3840GB 2.5" (VM storage - Ubuntu 24.04 as a file server of approx 3.2TB)
- Kingston DC600M 480GB 2.5" (Proxmox boot drive)
- Samsung 9100 Pro NVMe (VM storage - 2 instances of Windows 11 Pro, each approx 400GB)
- Only using one of the onboard NICs for Proxmox, AND one BMC NIC cable is attached. Both are plugged into the same switch in test scenarios.
- Started with a fresh Proxmox 8.4 ISO last week, with its default kernel.
- Full reinstall (wipe the drive and start fresh) with the Proxmox 9 ISO and its default kernel.
- Had to use the nomodeset param in GRUB for the install in both instances.
- The Proxmox install on the ASRock detected my network IPs. Proxmox 9 on the Supermicro doesn't, but manual entry works fine.
- BOTH motherboards happen to have Intel i210 NICs on board. (I was reading this can be an issue, but a hard reboot on a schedule??)
- The log has no errors out of the norm that I can see (ZFS kernel taint, but I read that's to be expected if you're not using ZFS??). The journalctl commands after this list are roughly what I've been using to check.
- All drives have been formatted with EXT4. Other than Proxmox's default boot-drive LVM, there are no LVMs in use, no RAID. Nice and simple.
- Temps are never high. Noctua cooler seems to work fine. Drive temps are never hot. System is basically idle.
- I've tried GRUB with both 0 and 1 for the processor.max_cstate and amd_idle.max_cstate options (see the example GRUB lines after this list).
- I've tried C-state 0 and 1 in the BIOS (matching what I had set in GRUB) and disabling anything power-savings-wise that Supermicro gives access to in the BIOS.
- Can't remember the exact syntax, but I had also disabled some power savings via GRUB for NVMe and pcie_aspm that I read about somewhere. Still rebooted at 19-20 hr. This has since been removed.
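For reference, the GRUB change looked roughly like this in /etc/default/grub (the C-state options are the ones listed above; the NVMe/ASPM line shown commented out is only my best guess at what I had, since I can't remember the exact syntax):

GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=0 amd_idle.max_cstate=0"
# power-savings tweaks, later removed - approximate, not the exact flags I used:
# GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off nvme_core.default_ps_max_latency_us=0"

Followed by update-grub, a reboot, and a cat /proc/cmdline to confirm the parameters were actually active.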
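And these are roughly the journalctl checks I've been running against the boot that died (paraphrasing; the exact flags vary):

journalctl --list-boots                  # list previous boots and when logging stopped
journalctl -b -1 -p warning --no-pager   # warnings and errors from the previous (crashed) boot
journalctl -b -1 -k -n 200 --no-pager    # last ~200 kernel messages before the reset

Nothing out of the norm that I can see, as mentioned above.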
TESTING AND CHANGES (NONE of which made a difference -- i.e., the system either hard locked or spontaneously rebooted between 19 and 20 hours, give or take 5 minutes):
- ASRock IPMI reported voltage issues with the coin battery out of the box - replaced it - the IPMI error cleared (see the ipmitool note after this list).
- ASRock support provided a beta BIOS and BMC firmware to try. It made no difference.
- Memory was put through 4 passes of MemTest86 - passed with no errors.
- CPU was replaced with a new AMD Ryzen 7 9700X. Made no difference.
- RAM was replaced with a new TEAMGROUP T-Create Classic 10L DDR5 32GB kit (2 x 16GB) 5600MHz (PC5-44800) CL46 - CTCCD532G5600HC46DC01. It booted fine, and MemTest86 was done with 4 passes. Same problem after 19-ish hours. (The RAM booted fine, BTW, if anyone is looking for a cheap option for the ASRock B650D4U.)
- Boot drive was replaced with an older 1TB Samsung Gen2 NVMe I had kicking around, running a fresh Proxmox 8.4 ISO I grabbed last week. Problem still exists. The Kingston 480GB boot drive was not connected during this period.
- Reverted back to the NEMIX RAM, the original CPU, and the Kingston DC600M 480GB 2.5", and installed the newly released Proxmox 9 (so much hope I had - good times).
- Did 10 more passes of MemTest86 on the NEMIX - no errors - passed.
- Bought a brand new Supermicro H13SAE-MF motherboard - its BIOS recognized the Ryzen 9 9000 series. It's at release 2.3. There is a 2.4 on the Supermicro site, but the a-holes don't show a changelog anywhere. I haven't updated to the 2.4 BIOS as I don't think it's the issue.
- Changed ALL SATA cables.
- Did a test with only the Windows VMs (no Linux). Same problem.
- Removed the Samsung 9100 Pro NVMe for a test day with the Windows VMs on the Kingston 3840GB. Same problem.
- Installed XCP-ng without error and messed around with it. Found out after several hours that you can't create a VM of more than 2GB (like in the 90's) without LVM magic - that's annoying, so screw them. I much prefer being able to take an EXT4 drive from a system and move it if needed. Then another fresh install of Proxmox 9. (Sadly, I didn't wait 20 hours to see how XCP-ng behaved.) A new OS install might be in my testing future, since I jumped the gun on removing XCP-ng.
- Noticed in the logs that nomodeset was still in play on the running Proxmox box. Found an "installation.cfg" passing it in /etc/default/grub.d. Removed that, and now the AMD CPU items load fine with no nomodeset warnings in journalctl (see the quick check after this list). 20 hours later - still rebooted on schedule.
- Replaced the power supply with a Super Flower 850W Gold. Same issue.
- Physically moved the system in my house to connect it directly to the UniFi 24-port switch (same ports: LAN1 plugged in, LAN2 not, BMC plugged in, all with new cables). Up to this point it had been on a 5-port UniFi switch. I also replaced the power cord and connected it to an APC battery for this test. Ran it headless. No change - reboots after just shy of 20 hours.
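For the nomodeset item: this is roughly how I confirmed what the kernel actually booted with and where the parameter was coming from (standard Debian/Proxmox paths; paraphrasing the exact commands):

cat /proc/cmdline                                         # what the running kernel was actually booted with
grep -r nomodeset /etc/default/grub /etc/default/grub.d/  # find the file still passing it
update-grub                                               # regenerate grub.cfg after removing the offending file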
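On the BMC side: besides the web UI (where that coin-battery warning showed up), the event log and sensor readings can also be pulled from the host with ipmitool (IPMI kernel modules loaded, or remotely with -I lanplus against the BMC IP), which is another place to look for power/thermal/watchdog events around the reset time:

ipmitool sel elist   # BMC system event log - power, thermal, and reset/watchdog events
ipmitool sdr         # current sensor readings (voltages, temps, fans)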
The Linux VMs have always had x86-64-v2-AES, 4 cores, OVMF UEFI, Q35, and qcow2 disks. One has ballooning on with 1 to 4GB of balloon RAM; the other is the same, except 4 to 8GB of RAM.
The Windows VMs have been tried with and without ballooning. One has 8GB; one has 16GB. OVMF UEFI, Q35, qcow2, with Win11 picked during install for the TPM/Secure Boot items. Both have 4 cores. For CPU type, I started with x86-64-v2-AES, then switched the CPU to host and tested. Now (active during the previous crash) I've got them set to x86-64-v3. Same problem. A rough sketch of the configs is below.
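For anyone who wants the exact settings, one of the Linux VM configs looks roughly like this (VMID, storage name, MAC, and disk sizes are placeholders, not my exact ones; the CPU/machine/BIOS/balloon lines are the settings described above):

# /etc/pve/qemu-server/101.conf  (101 is just an example VMID)
bios: ovmf
machine: q35
cores: 4
cpu: x86-64-v2-AES
memory: 4096
balloon: 1024
ostype: l26
scsihw: virtio-scsi-single
efidisk0: vmstore:101/vm-101-disk-0.qcow2,efitype=4m,size=528K
scsi0: vmstore:101/vm-101-disk-1.qcow2,size=64G
net0: virtio=BC:24:11:00:00:01,bridge=vmbr0

The Windows ones are the same shape, plus the TPM state disk and ostype: win11.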
In one case, after wiping it, I quickly did 2 Linux and 2 Windows VM installations all simultaneously. This hardware is a champ!! (but only for 20 hours).
During the above tests, I completely wiped the VMs and started fresh installs. The Ubuntu Server 24.04 and Win11 24H2 EngX64 ISOs are recent (last couple of weeks). virtio-win-0.1.271.iso was re-downloaded and installed in the Win11 VMs.
There is no PCI video card - no GPU pass-through in use or needed.
I have updated my UniFi equipment and rebooted my pfSense dedicated box. Same deal at around 19+ hours.
What's left?? Everything seems like a stretch, but I've built literally hundreds of systems over two decades! I've seen / done some pretty crazy and/or stupid things. Not sure what I'm missing. Do I move the tower to the client's office (TP-Link Omada gear there) in case of something weird on my network?? My current test, with the VMs off and the drives unplugged, will be complete in about 16 hours and ~50 minutes. (It's like knowing when you're going to die.) This last test will mean every single piece of hardware has been replaced, if you're keeping score (except for the CPU cooler).
I need to get this to a site, and it's taking a toll on my mojo to have to make a change and then wait 20 hours for the failure.
Does anyone have any suggestions?? Please and thanks for your time, or maybe a related crazy story!