Dell R820 Issues

Greatsamps

We are having what appear to be two issues with our Dell R820 servers.



We are in the process of moving from Hyper-V to Proxmox, and as we migrate we are updating the existing Dell hardware to the latest (2.7) BIOS and installing Proxmox. So far we have migrated 2 servers from Windows and have an additional 5 servers in the cluster.



The first issue we see is that a lot of the time after a reboot, the server becomes stuck after the "Initializing iDRAC" screen. The screen is blank with just a flashing cursor in the top-left corner. The only way to get past this is with a (warm) power reset. The server will then boot, but a CPU machine-check error is logged in the iDRAC logs.



This could of course be a hardware issue, but we have to look at the bigger picture. This is happening on every single server that has been updated to BIOS 2.7 and has had Proxmox installed, and only since then. Specifically, the two machines that have been migrated from Windows never showed this issue before the migration, and the two remaining Windows ones have never shown it either.



There have been no hardware changes, and the machines have not been physically moved; they are located in a Tier 4 data centre. I find it implausible that all of a sudden just these machines have all developed hardware issues. There is a little more info on this fault in the logs:



Detailed Description:

System event log and OS logs may indicate that the exception is external to the processor.

Recommended Action:

1) Check system and operating system logs for exceptions. If no exceptions are found, continue.
2) Turn system off and remove input power for one minute. Re-apply input power and turn system on.
3) Make sure the processor is seated correctly.
4) If the issue still persists, contact technical support. Refer to the product documentation to choose a convenient contact method.



On one of the servers that had just shown this problem we ran a multi-core memtest86 for 12 hours, and it rebooted cleanly afterwards; yet after running Proxmox for as little as 15 minutes, the issue sometimes presents itself on restart.



I am by no means an expert on CPUs; are there any error flags that the OS can set on the CPU? For example, can Proxmox set a flag on the CPU to indicate that there is an error, which is then being picked up by the BIOS when it restarts? I have not tried it, but I would all but guarantee that if I were to do a cold boot I would not see this issue.
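
In the meantime, it is easy to check whether the OS itself recorded any machine-check events. Below is a minimal sketch, assuming journalctl is available (it is on a stock Proxmox VE host); rasdaemon or mcelog would give properly decoded records instead.

Code:
#!/usr/bin/env python3
"""Scan the current boot's kernel log for machine-check entries.

Minimal sketch: assumes systemd's journalctl is present and simply
filters kernel messages for MCE-related keywords.
"""
import subprocess

KEYWORDS = ("machine check", "mce:", "hardware error")

# -k: kernel messages only, -b: current boot, --no-pager: plain output
log = subprocess.run(["journalctl", "-k", "-b", "--no-pager"],
                     capture_output=True, text=True, check=True).stdout

hits = [l for l in log.splitlines()
        if any(k in l.lower() for k in KEYWORDS)]
print(f"{len(hits)} machine-check related line(s) this boot")
print("\n".join(hits))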



The second issue is that at times the Proxmox hosts are completely locking up. We lose all connectivity to them, and it is impossible to type on the console. At first we suspected this could be a kernel panic and accordingly set up netconsole on it.
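
Since netconsole just streams kernel messages to a UDP port on another box, it is worth verifying that the receiving side actually gets anything. A bare-bones listener sketch (the port assumes netconsole's conventional default of 6666; adjust to match your configuration):

Code:
#!/usr/bin/env python3
"""Bare-bones netconsole receiver: print every UDP datagram that arrives.

Sketch only: netconsole sends raw kernel log lines over UDP, so a plain
socket is enough to verify messages actually reach this host.
"""
import socket

PORT = 6666  # must match the remote host's netconsole target port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", PORT))
print(f"listening for netconsole messages on udp/{PORT} ...")

while True:
    data, (addr, _port) = sock.recvfrom(65535)
    print(addr, data.decode("utf-8", errors="replace"), end="")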



One server died yesterday and nothing was logged to the netconsole or the physical console. The only way to bring it back was with a power reset. What we did notice, however, was that while it was locked up the server was drawing double the power it had been previously. In some testing where we maxed every CPU core out at 100%, the power consumption matched, so our thinking is that this is some rogue process consuming 100% of the server's resources and effectively creating a denial-of-service condition.
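
For anyone wanting to reproduce the power-draw comparison, pinning every logical core at 100% takes only a few lines; a throwaway sketch (stress or stress-ng would do the same job):

Code:
#!/usr/bin/env python3
"""Pin every logical CPU at 100% to compare power draw against a hang.

Throwaway sketch: one busy-looping worker per logical core, stopped
with Ctrl-C. stress-ng does this properly; this avoids installing
anything extra.
"""
import multiprocessing as mp

def burn():
    while True:
        pass  # pure busy loop

if __name__ == "__main__":
    workers = [mp.Process(target=burn, daemon=True)
               for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()
    print(f"spinning {len(workers)} workers; Ctrl-C to stop")
    try:
        for w in workers:
            w.join()
    except KeyboardInterrupt:
        pass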



Where should we go with this?



Lastly, I would be interested to hear from anyone who is running Proxmox 6.3 on Dell R820s, along with the current revision of your BIOS.
 
Is there a correlation with uptime of the system?
I experience a strange issue once my server reaches 25 days uptime. It is a rather old Opteron-based system...
The R720 is quite old as well...
I have not yet found the source on my end, but it is related to my ZFS pool getting suspended, at which point the whole system locks up.
If I reboot at 24 days uptime all is fine.
As we speak of it. It is time (again) ;)
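
For what it's worth, that workaround is easy to automate; a sketch for a cron job that warns when uptime nears the 24-day mark or when zpool status -x reports a problem (the threshold simply mirrors the anecdote above):

Code:
#!/usr/bin/env python3
"""Warn when uptime nears the 24-day mark or a ZFS pool is unhealthy.

Sketch for a cron job; the 24-day threshold mirrors the workaround
above. `zpool status -x` prints "all pools are healthy" when there
is nothing to report.
"""
import subprocess

THRESHOLD_DAYS = 24

with open("/proc/uptime") as f:
    up_days = float(f.read().split()[0]) / 86400

zpool = subprocess.run(["zpool", "status", "-x"],
                       capture_output=True, text=True).stdout.strip()

if up_days >= THRESHOLD_DAYS:
    print(f"uptime {up_days:.1f} days - time to reboot")
if zpool and "all pools are healthy" not in zpool:
    print("zpool reports a problem:\n" + zpool)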
 
Not really. All 7 nodes were rebooted on the 15th, and only 1 of them decided to die yesterday. I just restarted it again after it had been up for 2 hours, and it rebooted cleanly...

I am wondering if newer or less complex hardware such as Supermicro might be a better option?
 
There is no such thing as "less complex hardware" imho. In the end it all boils down to the same chips and stuff.
The board layout may provide different capabilities, but in the end we are talking about Intel (or AMD) x86 technology.
My experience is that at a certain point in time a modern OS will likely show some strange side effects on older gear. That is simply because development and testing are done against recent systems.
I personally have great experience with Supermicro and would recommend the gear.

Can you find anything in the logs right before the system crashes?
Have you updated other gear (aside from the mainboard) to a recent level?
Different components may have cross-compatibility requirements...
 
I see your point; the thing with the Dell servers, though, is that there is much more to them than just a BIOS: you have iDRAC firmware, Lifecycle Controller firmware, and at one point even PSUs had firmware!

I am considering acquiring a Supermicro of a more modern generation to see if that is less prone to problems. I was looking at this spec:

1 x SuperMicro CSE-819U X10DRU-i - BPN-SAS-815TQ, 4-Port 10GBase-T RJ45
2 x Intel Xeon E5-2690 V4 - 14-Core 2.60GHz (20MB Cache, 9.60GTs, 85W)
8 x 32GB - DDR4 2133MHz (PC4-17000LR, 4Rx4)
1 x Adaptec 6805T 512MB (SAS/SATA) RAID Kit - 0/1/5/6/10/50/60/JBOD
2 x Intel Hot-Swap PSU 750W

Do you have any thoughts on this? I know it is not the latest and greatest, but given there are potentially 9 of these to replace, we have to balance cost etc.
 
They seem to be decent servers.
I personally do not like Adaptec RAID cards and prefer LSI-based cards.

But don't be surprised - there is a BMC and IPMI and other stuff on these boards as well. ;)
 
Just out of interest here.

Do you think any of these issues could be caused by the guest CPU type? We have set this to Sandy Bridge on all guests, as that is a CPU type common to all hosts; however, it's not the default of kvm64.

Also, would I be correct that adding a CPU of a newer generation would be compatible with the older Sandy Bridge guest CPU type, or will I run into issues?
 
Do you think any of these issues could be caused by the guest CPU type?
No. Don't think so. This setting only controls the CPU features exposed to the guest to ensure migration compatibility, AFAIK.

it's not the default of kvm64.
The default provides the most compatibility but the least performance. As I said, I don't think this relates, but of course you can try.


or will I run into issues?
Again AFAIK this is only to ensure compatibility. But no guarantees ;)
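
One way to sanity-check the choice anyway is to intersect the CPU feature flags of every node; a guest CPU type whose features fall within that intersection should migrate cleanly. A sketch, with hypothetical node names and assuming passwordless SSH between nodes:

Code:
#!/usr/bin/env python3
"""Intersect CPU feature flags across cluster nodes.

Sketch only: host names are placeholders and passwordless SSH to each
node is assumed. A guest CPU type should be migration-safe when its
features are a subset of the printed intersection.
"""
import subprocess

HOSTS = ["pve1", "pve2", "pve3"]  # hypothetical node names

def flags(host):
    info = subprocess.run(["ssh", host, "cat", "/proc/cpuinfo"],
                          capture_output=True, text=True, check=True).stdout
    for line in info.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

common = set.intersection(*(flags(h) for h in HOSTS))
print(f"{len(common)} flags common to all hosts:")
print(" ".join(sorted(common)))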
 
I own over 200 Dell servers, from R210 to R730XD. Lots and lots of 610s, 620s, and 630s, a few 420s, and one 2950 (they are tanks). The issue you are seeing is related to the iDRAC. Proxmox is not doing anything to it. Check that your firmware is updated; if it is a removable iDRAC card you can try getting a new one. iDRACs can be a little finicky. I have a great company in Irvine CA that specializes in Dell hardware I could recommend. I love Supermicro for the price point and what you can get, but I also work with about 90 of those that sometimes get stuck in the POST screen upon reboots, which is irritating. These are smaller servers, but running Kaby Lake, Coffee Lake, and Skylake CPUs. So fairly more modern than my Dells in some cases.
 
Thanks for your detailed reply, this was just the info I was looking for.

The issue with the 820s is that the iDRAC is part of the system board, so it can't be replaced. It is also strange that it is now happening on all 7 servers that I have installed Proxmox on, and has never happened on the remaining 2. That said, I have also updated the BIOS and iDRAC firmware to the latest on those 7, whereas the other 2 are on older versions.

I am due to migrate another one over in the coming days, so perhaps I will leave the firmware as it is to see the outcome.

Getting stuck on reboots, as you say, is annoying but something I can live with. The main issue I can't live with is the whole server getting CPU-starved to the point that even the console does not work and it needs a reboot, taking any running VMs with it. This is not guest-CPU related; it has happened on servers that were previously running next to nothing.

I don't see how the iDRAC could be doing anything here, so my previous theory was that it was due to the age of the CPUs. Your servers are of a similar or older age, so that theory is now not looking quite so strong. Other than the server's standard components, the only other element is the Mellanox ConnectX-4 network card. I have however spoken with someone else who is running this card on a newer Supermicro without issue.
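
For comparing notes across hosts, a small sketch that prints the NIC driver and firmware plus the Proxmox release (the interface name is a placeholder; ethtool -i and pveversion are the underlying commands):

Code:
#!/usr/bin/env python3
"""Print NIC driver/firmware details plus the Proxmox release.

Sketch: the interface name is a placeholder for the ConnectX-4 port;
`ethtool -i` reports the driver and firmware-version fields and
`pveversion` the PVE release.
"""
import subprocess

IFACE = "enp5s0"  # hypothetical interface name; adjust to your NIC

for cmd in (["ethtool", "-i", IFACE], ["pveversion"]):
    out = subprocess.run(cmd, capture_output=True, text=True)
    print("$", " ".join(cmd))
    print(out.stdout or out.stderr)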

Would you mind sharing what network card(s) you use and what version of Proxmox you are on?

Thanks once again.
 
Sorry, I know what I said was kind of a blurt.
**edit** Accidentally closed this window, I am the king of clicking the one "X" on a browser across four screens. Kudos to the Proxmox forum for autosave! **edit**

So here are the details of 6 Dell servers I have running Proxmox successfully.

3x Dell R610
CPU(s); 16 x Intel(R) Xeon(R) CPU E5540 @ 2.53GHz (2 Sockets)
Memory; 192GB DDR3
RAID Controller; LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (IT mode), ZFS RAID 10
--6x Sata 7200 RPM 2.5 500GB
Network Card; Broadcom Limited NetXtreme II BCM5709 Gigabit Ethernet

1x Dell R620
CPU(s); 32 x Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz (2 Sockets)
Memory 192GB DDR3
RAID Controller; LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05) (hardware RAID 10)
--6 or 8x 1.6 TB Intel S3500 SSD I think
Network Card; Broadcom Limited NetXtreme BCM5720 Gigabit Ethernet PCIe

1x Dell R810 II (I have another box not on right now)
CPU(s); 80 x Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz (4 Sockets)
Memory; 128GB
RAID Controller; LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05), RAID 10, 6x 1TB (Samsung??) Not sure
Bios: 2.9.0
Life Cycle: 1.7.5.4
iDRAC6
Network Card 1; Broadcom Limited NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
Network Card 2; Intel Corporation 82575GB Gigabit Network Connection (rev 02)

1x Dell R610
CPU(s) 24 x Intel(R) Xeon(R) CPU X5670 @ 2.93GHz (2 Sockets)
Memory; 192 GB DDR3
RAID Controller; LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)
Network Card; Broadcom Limited NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

Now, with the exception of the 3x R610s running a special RAID card so I could do IT mode for ZFS, most of these servers should be running some variety of Dell H700, H710 or whatever with 512MB - 1GB cache on the card. It's been a while since I have rebooted these boxes, one of them having a 500+ day uptime, so I can't remember. 5 of these are at my colo in LAX ("One Wilshire"); one of them is in my Dallas DFW colo. I hope this helps. I can say that my 810s can be a little weird as far as CPU halts, rare but on occasion. Maybe due to only one PSU being in use, as they are in LAX and power is at a premium there in my rack.
 
