Hello all! So I've had an issue with Proxmox starting about 2 weeks ago. My Ryzen 9 3950x / Gigabyte x570 UD MB Proxmox setup (7.3) was crashing, kernel panicking, and just being super unstable out of no where. I have since then replaced (in order) Motherboard with a ASUS Prime x570 pro (Latest BIOS), swapped out my 9265-8i raid card with a 9300-8i, swapped my dual SFP+ Ethernet card with a single SFP+, swapped the 64 gigs of RAM with 128 gigs ECC from NEMIX, and then got a new Ryzen 9 5950x. I was getting issues like USB would stop working when trying to backup my VMs to an external HDD, Memory errors (ran Memtest for 6+ hours and 2 passes, no errors found), and messed with enabling / disabling IOMMU, Re-Bar, C states, and have hit a wall. At this point, 99% of the errors are gone with one crash I had on the all new hardware and I am getting MCE / Memory error still even with replaced ram that tests good. Looking for thoughts / ideas that I have maybe missed, attached is a DMESG from the 3950x showing most of the issues on the new ASUS PRIME MB and also this is the error I've been still getting with the new processor. Thank you in advance! _EDIT ADDED SYSLOG FOR 5950X_
Code:
Dec 04 09:34:16 vmhost rasdaemon[6869]: rasdaemon: mce_record store: 0x55b26aabc488
Dec 04 09:34:16 vmhost rasdaemon[6869]: rasdaemon: register inserted at db
Dec 04 09:34:16 vmhost rasdaemon[6869]: <...>-3393569 [000] 0.007939: mce_record: 2022-12-04 09:34:16 -0500 Unified Memory Controller (bank=18), status= 9c2040000000011b, Corrected error, no action required., mci=CECC, mca= DRAM ECC error.
Dec 04 09:34:16 vmhost rasdaemon[6869]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=1,csrow=3, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, misc= d01a001101000000, addr= c78fb7a00, synd= 40004000a801203, ipid= 9600150f00, mcgstatus=0, mcgcap= 11c, apicid= 0
Dec 04 09:34:16 vmhost kernel: mce: [Hardware Error]: Machine check events logged
Attachments
Last edited: