VM Instability and KP's

jasonsansone · Jun 3, 2021

Initial Disclaimer and Apology: The Proxmox forum isn't really the place for this, but I am casting a wide net hoping anyone has any ideas. I don't believe there is anything in any way wrong with the Proxmox software or my installation at a software level. I apologize if this is long and overly detailed, but I am at my wits' end trying to make my (new to me) X10DRi-LN4+ function properly.

I have three Supermicro chassis (CSE-826BTS-R920LPBP-1). I have attached PDF printouts of the FRU information and hardware information (at some point the FRU on Chassis 1 got nuked, but fortunately the issues herein aren’t related to that motherboard). All three systems are spec’d identically with 2x Intel E5-2697 v4 processors and 12x Micron 16GB DDR4 2400 modules. The memory modules' part no. is MTA18ASF2G72PDZ-2G3, which is the same as Supermicro’s MEM-DR416L-CL01-ER24, so the modules should be on the HCL. The PCIe card configuration and BIOS settings of the three chassis are identical. The chassis are in a Proxmox / Ceph cluster running Proxmox 6.4-8 with Ceph 15.2.11. Chassis 1 and 2 work absolutely flawlessly. The issues prompting this plea for help relate to Chassis 3.

All three chassis were purchased used from the same seller. Chassis 1 and 2 already had BIOS 3.x+ and supported E5 v4 processors when I received them. The third chassis shipped with BIOS 1.0a installed. It only supported E5 v3 processors, and I do not have access to any E5 v3 processors. All three chassis already had a Supermicro OOB key installed when we received them, so the systems supported updating the BIOS via BMC/IPMI. I updated the BIOS to 3.3 using the BMC/IPMI and cleared the CMOS by using the jumper pin and removing the CMOS battery. On initial boot the motherboard “seems” to support the E5-2697 v4 processors; however, I receive this error: “Failing DIMM: DIMM location (Uncorrectable memory component found) P2-DIMME1”. There are also stability issues I will detail below. The memory has been tested in other chassis and is known good. Also, rearranging the modules does not change the location of the error. It remains at slot P2-DIMME1. The slot has been blown out and inspected for any bent pins. None are visible. Curiously, the error will clear when power is pulled from the system, requiring the BMC to cold boot. However, the error returns if you reflash the BIOS. I have tested using only two DIMM modules (without any modules located on Socket 2) and I have swapped the CPUs between the sockets. I have also cleaned the CPUs and inspected them for bent pins. I haven’t found any visible defects, but that doesn’t mean I didn’t miss something. All six CPUs are from the same seller. I have not yet been able to test known good CPUs from Chassis 1 or 2 in Chassis 3 because that will require taking the entire cluster out of production.

I am uncertain if this is related to the upgrade from 1.x to 3.x BIOS without a v3 processor. See here - https://www.supermicro.com/support/faqs/faq.cfm?faq=25325 and here https://forums.servethehome.com/index.php?threads/issue-with-dimm-slots-on-x10drc-ln4.12518/. Some FAQ entries indicate this error is related to having the wrong CPU installed. However, other FAQ entries indicate the BIOS can be updated successfully via the BMC without a CPU or with a v4 CPU installed. I have also attempted a clean flash of the BIOS from EFI without any change. I have probably reflashed the BIOS over a dozen times while hunting my gremlin. The BIOS flash instructions state that you do not need to adjust motherboard jumper JPME2 (which puts Intel ME in manufacture mode), but I have tested that way as well. None of the BIOS release notes or instructions indicate that you can’t go directly from 1.x to 3.x, nor do they state that you must have a v3 CPU installed in order to successfully flash to 2.x+.

On the surface, Chassis 3 works fine except for the DIMM error on boot. Proxmox boots without errors. Ceph does not complain. LXC containers and VM’s will live migrate around the cluster. However, I have CPU hash calculation and stability problems. Prime95 complains about “FATAL ERROR: Rounding was 0.4999794644, expected less than 0.4”. The error occurs in seconds, not after a long stress test generating heat. This error does not occur when testing Chassis 1 and 2. Google searches all turn up overclocking threads, which suggest the CPUs are unstable and/or need more voltage. I am uncertain if this is related to the v3 / v4 upgrade or any BIOS / microcode. Proxmox backups to or from this chassis all fail with SSL/TLS checksum mismatches. See here https://forum.proxmox.com/threads/restore-failed-detected-chunk-with-wrong-digest.89402/. Some VM’s kernel panic (KP) when started on this chassis but work on the other two. A live migration of those VM’s to this chassis will cause crashes inside the VM. LXC containers all seem to function properly.
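For readers unfamiliar with the Prime95 message, here is a simplified illustration of what a "rounding" check measures. Prime95 multiplies very large numbers with floating-point FFTs; every output value should land almost exactly on an integer, and the distance to the nearest integer is the rounding error. The sketch below is not Prime95's actual algorithm: it uses a plain recursive FFT on base-10 digits, the function names are hypothetical, and the 0.4 limit is taken only from the error text quoted above.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_multiply(a, b):
    """Multiply two non-negative integers via FFT.

    Returns (product, worst_rounding_error), where the error is the largest
    distance between a raw floating-point result and the integer it is
    rounded to.
    """
    a_digits = [int(c) for c in reversed(str(a))]  # little-endian digits
    b_digits = [int(c) for c in reversed(str(b))]
    size = 1
    while size < len(a_digits) + len(b_digits):
        size *= 2
    fa = fft([complex(d) for d in a_digits] + [0j] * (size - len(a_digits)))
    fb = fft([complex(d) for d in b_digits] + [0j] * (size - len(b_digits)))
    raw = fft([x * y for x, y in zip(fa, fb)], invert=True)

    product, worst = 0, 0.0
    for position, value in enumerate(raw):
        real = value.real / size          # unnormalized inverse FFT, so scale
        nearest = round(real)
        worst = max(worst, abs(real - nearest))
        product += nearest * 10 ** position
    return product, worst


if __name__ == "__main__":
    x = 123456789012345678901234567890
    y = 987654321098765432109876543210
    result, error = fft_multiply(x, y)
    print("matches exact product:", result == x * y)
    print("worst rounding error:", error)
    print("below the 0.4 limit from the error message:", error < 0.4)
```

On sound hardware the error for inputs this small should be many orders of magnitude below 0.4. A value near 0.5 means the floating-point result could just as well round to either neighboring integer, so the arithmetic cannot be trusted.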

Nothing I have attempted has allowed Prime95 to run without error, nor have I been able to bring stability to the VM’s on Chassis 3. It is possible one or both of the CPUs in Chassis 3 have some defect or that the motherboard has some defect. It is also possible the error is in some way related to the original 1.x BIOS and the board has never been properly upgraded to support the v4 processors. I am completely lost, but any insight would be appreciated.

entilza · Jun 4, 2021

Wow, well... If these are in production I really don't think anyone can help you until you get maybe chassis 2 & 3 out of production so you can do some hardware testing... perhaps swapping a CPU from chassis 2 -> 3 and testing.

Can you pull one CPU out of Chassis 3 and just run 1 CPU?

Good luck!

jasonsansone · Jun 4, 2021

Thank you for your help. I actually did exactly that - I tested the third chassis with one CPU in Socket 1. I was able to narrow down the issue to the 2nd CPU. It has some defect. Sadly, it will need to be replaced. Thank you for helping me find this little gremlin!
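For anyone chasing a similar fault who cannot easily pull a CPU, a rough software-side sketch of the same comparative idea follows. It is not the method used in this thread (that was Prime95 plus a single-socket boot). It pins the process to each logical CPU in turn, hashes the same data repeatedly, and flags any core whose digests disagree with the majority. The function names are hypothetical, the pinning calls are Linux-only, and the socket lookup assumes the usual sysfs topology file is present.

```python
import hashlib
import os
from collections import Counter


def available_cpus():
    """Logical CPUs this process may run on (falls back to [0] off Linux)."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return [0]


def socket_of(cpu):
    """Physical package (socket) id for a logical CPU, or None if unknown."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/physical_package_id"
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def digest_on_cpu(cpu, payload, rounds=20):
    """Hash payload `rounds` times while pinned to one CPU.

    Returns the set of distinct digests seen; a healthy core yields one.
    """
    can_pin = hasattr(os, "sched_setaffinity")
    previous = os.sched_getaffinity(0) if can_pin else None
    try:
        if can_pin:
            os.sched_setaffinity(0, {cpu})
        return {hashlib.sha256(payload).hexdigest() for _ in range(rounds)}
    finally:
        if can_pin:
            os.sched_setaffinity(0, previous)  # restore original affinity


def find_suspect_cpus(cpus, payload, rounds=20):
    """Return CPUs whose digests differ from the majority digest."""
    seen = {cpu: digest_on_cpu(cpu, payload, rounds) for cpu in cpus}
    if not seen:
        return []
    votes = Counter(d for digests in seen.values() for d in digests)
    majority = votes.most_common(1)[0][0]
    return [cpu for cpu, digests in seen.items() if digests != {majority}]


if __name__ == "__main__":
    data = bytes(range(256)) * 1024  # 256 KiB of fixed, repeatable input
    for cpu in find_suspect_cpus(available_cpus(), data):
        print(f"cpu {cpu} (socket {socket_of(cpu)}) produced inconsistent digests")
```

A clean run proves very little: SHA-256 does not exercise the floating-point and vector units the way Prime95 does, so a marginal CPU can pass this and still fail a real stress test. Mismatches clustered on one socket, on the other hand, would point the same direction as the single-CPU test did here.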

entilza · Jun 4, 2021

Cool! Yeah, comparative analysis is a good debugging tool. Good luck getting a new CPU; hopefully that is not too hard to obtain.

jasonsansone · Jun 4, 2021

It isn't hard to obtain, but it also isn't inexpensive.

entilza · Jun 4, 2021

jasonsansone said:
It isn't hard to obtain, but it also isn't inexpensive.

Nice. Can you run with the one CPU in the meantime?

jasonsansone · Jun 4, 2021

Unfortunately not. The PCIe slots are completely full and require both sockets to be populated. With only CPU0 (first socket) populated, I would be forced to choose between networking cards and NVMe drives. Can't really operate an HCI infrastructure without storage or networking. However, I have migrated all containers and VM's off that node and can limp along with it being used only for Ceph. I have enough slack compute and memory capacity to get by with everything shoved off on Chassis 1 and 2 for now. Once the defective CPU is replaced, I can change my HA config to rebalance the cluster.
