We are having what appears to be 2 issues with our Dell R820 servers.
We are in the process of moving from Hyper-V to Proxmox, and as we migrate we are updating the existing Dell hardware to the latest BIOS (2.7) and installing Proxmox. So far we have migrated 2 servers from Windows and have an additional 5 servers in the cluster.
The first issue is that, quite often after a reboot, the server gets stuck after the "Initializing iDRAC" screen. The screen is blank with just a flashing cursor in the top-left corner. The only way to get past this is a warm power reset. The server will then boot, but a CPU machine check error is logged in the iDRAC logs.
This could of course be a hardware issue, but we have to look at the bigger picture. This is happening on every single server that has been updated to BIOS 2.7 and has had Proxmox installed on it, and only since then. Specifically, the 2 machines that have been migrated from Windows never showed this issue before the migration, and the 2 remaining Windows ones have never shown it either.
There have been no hardware changes, and the machines have not been physically moved; they are located in a T4 data centre. I find it implausible that just these machines have all suddenly developed hardware issues. There is a little more info on this fault in the logs:
Detailed Description:
System event log and OS logs may indicate that the exception is external to the processor.
Recommended Action:
1) Check system and operating system logs for exceptions. If no exceptions are found continue. 2) Turn system off and remove input power for one minute. Re-apply input power and turn system on. 3) Make sure the processor is seated correctly. 4) If the issue still persists, contact technical support. Refer to the product documentation to choose a convenient contact method.
On one of the servers that has just shown this problem, we ran a multi-core memtest86 for 12 hours and it rebooted cleanly afterwards. Yet after as little as 15 minutes of running Proxmox, the issue can present itself on the next restart.
I am by no means an expert on CPUs. Are there any error flags that the OS can set on the CPU? For example, can Proxmox set a flag on the CPU to indicate that there is an error, which the BIOS then picks up when it restarts? I have not tried it, but I would all but guarantee that a cold boot would not show this issue.
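From what I have read so far (happy to be corrected), the machine-check state lives in per-bank IA32_MCi_STATUS registers, and the Intel SDM describes the flag layout. To make it easier to compare notes if anyone can pull a raw value from their logs, here is a minimal Python sketch of decoding one. The function name is my own and the sample value at the bottom is made up purely for illustration; it is not taken from our servers.

```python
# Bit positions of the architectural flags in an IA32_MCi_STATUS value,
# as described in the machine-check architecture chapter of the Intel SDM.
MCI_STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error record
    "OVER": 62,   # an earlier error record was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting for this error was enabled
    "MISCV": 59,  # MCi_MISC holds additional information
    "ADDRV": 58,  # MCi_ADDR holds an address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its flags and error codes."""
    flags = {
        name: bool((status >> bit) & 1)
        for name, bit in MCI_STATUS_FLAGS.items()
    }
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # Hypothetical value, for illustration only.
    sample = 0xBE00000000800400
    decoded = decode_mci_status(sample)
    for name, is_set in decoded["flags"].items():
        print(f"{name:6s} {'set' if is_set else 'clear'}")
    print(f"MCA error code:      0x{decoded['mca_error_code']:04x}")
    print(f"Model-specific code: 0x{decoded['model_specific_code']:04x}")
```

If a decoded record shows VAL set, that would at least tell us whether the BIOS is reading a leftover record from before the reset or logging a new one.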
The second issue is that at times the Proxmox hosts lock up completely. We lose all connectivity to them, and it is impossible to type on the console. At first we suspected a kernel panic and accordingly set up netconsole on one of them.
One server died yesterday and nothing was logged to the netconsole or the physical console. The only way to bring it back was with a power reset. What we did notice, however, was that while it was locked up the server was drawing double the power it had been previously. In testing where we drove every CPU core to 100%, the power consumption matched, so our thinking is that some rogue process is consuming 100% of the server’s resources and effectively creating a denial-of-service condition.
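One thing we are considering is sampling per-core utilisation at short intervals and shipping it off the box, so that the last sample before a lockup shows whether the cores really were pegged. Below is a minimal Python sketch of the sampling part, assuming the standard Linux /proc/stat layout. The function names are our own, the 5-second interval is arbitrary, and sending the result to a remote host is left out.

```python
import time


def parse_proc_stat(text: str) -> dict:
    """Return {cpu_name: (busy_ticks, total_ticks)} for each per-CPU line."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-CPU lines look like "cpu0 ..."; skip the aggregate "cpu" line.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        # Fields: user nice system idle iowait irq softirq steal.
        # guest and guest_nice are already counted in user and nice,
        # so only the first eight fields are summed.
        values = [int(v) for v in parts[1:9]]
        total = sum(values)
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        result[parts[0]] = (total - idle, total)
    return result


def busy_percent(before: dict, after: dict) -> dict:
    """Compute per-CPU busy percentage between two snapshots."""
    usage = {}
    for cpu, (busy_after, total_after) in after.items():
        busy_before, total_before = before.get(cpu, (0, 0))
        delta_total = total_after - total_before
        if delta_total > 0:
            usage[cpu] = 100.0 * (busy_after - busy_before) / delta_total
    return usage


def read_snapshot(path: str = "/proc/stat") -> dict:
    with open(path) as handle:
        return parse_proc_stat(handle.read())


if __name__ == "__main__":
    first = read_snapshot()
    time.sleep(5)  # arbitrary sampling interval
    second = read_snapshot()
    for cpu, pct in sorted(busy_percent(first, second).items()):
        print(f"{cpu}: {pct:.1f}% busy")
```

This only helps if the host is still able to run userspace shortly before it dies, which we do not know yet.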
Where should we go with this?
Lastly, I would be interested to hear from anyone who is running Proxmox 6.3 on Dell R820s, along with the BIOS revision you are currently on.