Proxmox randomly crashing multiple times a day

ryan.kistler

Aug 3, 2020
My Proxmox host has been crashing multiple times a day. The web GUI becomes unresponsive, and the Proxmox shell shows what's in the picture. I ran memtest on it last night, and it showed 0 errors. The logs don't show anything; they just stop at the moment the crash occurs. I really have no idea what is going on. I've updated everything to the latest from the no-subscription repository. I will respond quickly to any help you guys have to offer.

CaptureScreen (1).jpeg

I've also included a WEBM that I was able to capture of the Proxmox CLI showing the crash. It crashes right at 3:33 in the WEBM.

https://webmshare.com/play/VragP

Hardware is:
Ryzen 9 3950X
X570D4I-2T
128GB DDR4-2666 ECC RAM (though Proxmox doesn't seem to think it's working as ECC, despite it being enabled in the BIOS)
4x 2TB PCIe 4.0 NVMe drives on an ASUS Hyper M.2 riser card in a ZFS RAID 10
1x 256GB Samsung 970 Evo as the Proxmox boot drive
500W Athena PSU
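
On the ECC question: when an EDAC driver is loaded, Linux exposes per-memory-controller error counters under /sys/devices/system/edac/mc. If that directory has no mc0, mc1, ... entries, the kernel is not reporting ECC at all. Below is a minimal Python sketch for checking this, assuming the standard EDAC sysfs layout. The function name read_edac_counts is made up for illustration, and this is only one way to look, not a definitive ECC test.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    An empty dict means no EDAC memory controllers were found, i.e. the
    kernel is not reporting ECC events (assumption: standard sysfs layout).
    """
    root = Path(edac_root)
    results = {}
    if not root.is_dir():
        return results
    # Memory controllers show up as mc0, mc1, ...
    for mc in sorted(root.glob("mc[0-9]*")):
        counts = {}
        for name in ("ce_count", "ue_count"):  # corrected / uncorrected errors
            try:
                counts[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                counts[name] = None  # counter missing or unreadable
        results[mc.name] = counts
    return results


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found; the kernel is not reporting ECC.")
    for mc, c in found.items():
        print(f"{mc}: corrected={c['ce_count']} uncorrected={c['ue_count']}")
```

A non-zero corrected count that keeps climbing would point at the RAM itself. An empty result would match the observation that Proxmox doesn't think ECC is active.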

I'm seeing page fault errors in that WEBM, which suggests to me that it's a memory issue. I manually set the memory clock to 2133 MHz instead of the default 2400. So far it's been about an hour and no crash yet, but that doesn't really mean much.
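
If the console output can be captured as text (for example over a serial console, or from the kernel log of the previous boot when the journal is persistent), the crash signatures can be tallied rather than read off a video. Below is a minimal Python sketch, assuming the usual Linux kernel wording for page faults, general protection faults and machine-check messages. The pattern list and the name classify_lines are illustrative and certainly not exhaustive.

```python
import re

# Regexes for common kernel crash signatures (illustrative, not exhaustive).
PATTERNS = {
    "page_fault": re.compile(
        r"BUG: unable to handle (kernel )?page fault"
        r"|unable to handle kernel paging request",
        re.IGNORECASE,
    ),
    "general_protection": re.compile(r"general protection fault", re.IGNORECASE),
    "machine_check": re.compile(r"mce: \[Hardware Error\]|machine check", re.IGNORECASE),
    # Stack frames that belong to the KVM modules are tagged like "[kvm]" or "[kvm_amd]".
    "kvm_frame": re.compile(r"\[kvm(_amd|_intel)?\]"),
}


def classify_lines(lines):
    """Count how many lines match each crash signature."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

For example, the output of `journalctl -k -b -1` saved to a file could be passed in as `classify_lines(open("previous-boot.log"))`, where the file name is hypothetical. Page faults with no machine-check lines don't rule out hardware, but machine-check lines would be strong evidence for it.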
 
Hm, according to the WEBM the crash happens in KVM code. Does the issue also occur if no VMs are running on the machine?

Other than that I agree with your assessment that on the surface it does look like a hardware error...
 
I got general stability once I lowered the memory clock to 2133 MHz. The memory is NEMIX brand, which I'm not really familiar with, but very few brands make 32GB DDR4 SODIMMs, so I was limited in what I could buy. I ran memtest86 overnight and it does still eventually throw errors. It made one full pass through the entire test; on the second pass it got 1 error, then a couple of seconds after that it showed >400,000 errors and immediately froze. So to me, there's some sort of hard fault that may be temperature- or power-related.
 
I've been having the same issue. I'm curious to know whether changing your memory clocks, and potentially making changes to power, made a difference.
 
Just as an update, the memory clock change DID NOT fix my issue. I'm still getting crashes every couple of days, and worse, today I experienced 3 crashes in a row. All were some sort of page fault. I'm now running MemTest86 on every stick individually. I guess it could theoretically be the PSU as well, since when it fails in MemTest86 it seems to fail hard, randomly going from clean to hundreds of errors to a freeze within 10 seconds. I also found that crashes seem to happen more frequently in bad weather, despite the machine being hooked up to a UPS. I've already ordered a beefier 850W PSU that will fit, so I'm hoping that will solve the issue.
 
I am getting the same issue, and it seems that setting the memory back to auto and not using the XMP profile let me boot and access the web GUI.
 