We repurposed some old hardware to set up a proper sandbox environment. The cluster has 4 Dell R620 servers and a relatively old Intel S5520HC system with Intel E5620 (Westmere) CPUs. This last node isn't going to be used for virtual machines; it primarily serves as a dedicated Ceph storage node.
The system had more than 1.5 years of uptime running RHEL5, but we couldn't get it to stay up for longer than 2 hours when booting Proxmox, Debian 9, 10 or 11, or CentOS 7. This was with the system simply booted into a rescue environment and left sitting there.
System event logs (ipmi-sel) would report the following error:
Code:
3 | Sep-18-2019 | 07:17:14 | Pwr Unit Status | Power Unit | Power Unit Failure detected
4 | Sep-18-2019 | 07:17:19 | Pwr Unit Status | Power Unit | Power Off/Power Down
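If you want to watch for this pattern across several boots, a minimal Python sketch like the one below can pick the power unit failures out of the event log. It assumes the six pipe-separated columns seen in the two lines above (record ID, date, time, sensor name, sensor type, event); the names `SelEvent`, `parse_sel_line` and `power_unit_failures` are made up for this illustration and are not part of ipmi-sel.

```python
from typing import List, NamedTuple, Optional


class SelEvent(NamedTuple):
    """One event log line, split into the six columns seen above."""
    record_id: int
    date: str
    time: str
    sensor: str
    sensor_type: str
    event: str


def parse_sel_line(line: str) -> Optional[SelEvent]:
    """Parse one pipe-separated line; return None for headers or junk."""
    fields = [field.strip() for field in line.split("|")]
    # Assumption: exactly six columns, the first being a numeric record ID.
    if len(fields) != 6 or not fields[0].isdigit():
        return None
    return SelEvent(int(fields[0]), *fields[1:])


def power_unit_failures(lines: List[str]) -> List[SelEvent]:
    """Return only the events that report a power unit failure."""
    events = (parse_sel_line(line) for line in lines)
    return [
        event for event in events
        if event is not None
        and event.sensor_type == "Power Unit"
        and "Failure" in event.event
    ]


# The two lines from the log above, used as sample input.
SAMPLE = """\
3 | Sep-18-2019 | 07:17:14 | Pwr Unit Status | Power Unit | Power Unit Failure detected
4 | Sep-18-2019 | 07:17:19 | Pwr Unit Status | Power Unit | Power Off/Power Down
"""

failures = power_unit_failures(SAMPLE.splitlines())
```

Feeding it the saved output of ipmi-sel instead of `SAMPLE` should work the same way, as long as the column layout matches.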
The physical symptoms were the system resetting and beeping a couple of times until it restarted itself again.
It turns out that a lot of (all?) Nehalem and early Westmere Intel Xeon CPUs have a physical design flaw that shows up when the CPUs switch into various low-power C-states. We carefully recorded the current BIOS settings, reset them to defaults and compared the defaults against the values we had recorded. This showed that the Processor options had previously been altered to disable the C6 power state, whilst C3 is disabled by default.
I assume Linux kernels newer than the one in RHEL5 initialise CPUs more directly, possibly ignoring or bypassing the BIOS, and that they subsequently ignore C-states having been disabled there. The solution was to pass the following options to the kernel at boot, and the system has been stable ever since!
/etc/default/grub
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=1"
PS: We always run with 'intel_pstate' disabled; the two cstate options are what we added for this fix.
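Changes to /etc/default/grub only take effect once the GRUB configuration has been regenerated (update-grub on Debian and Proxmox) and the machine has been rebooted, so it is worth confirming that the running kernel really received the options. Below is a rough Python sketch of that check. The option values are the ones from the line above; the helper names (`parse_cmdline`, `missing_options`, `read_cmdline`) are invented for this example.

```python
from typing import Dict, Optional

# The options from the GRUB line above and the values we expect to see.
EXPECTED = {
    "intel_pstate": "disable",
    "intel_idle.max_cstate": "0",
    "processor.max_cstate": "1",
}


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into key/value pairs.

    Flags without '=' (such as 'quiet') map to None.
    Quoted values containing spaces are not handled in this sketch.
    """
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_options(cmdline: str,
                    expected: Dict[str, str] = EXPECTED) -> Dict[str, str]:
    """Return the expected options that are absent or set differently."""
    params = parse_cmdline(cmdline)
    return {key: value for key, value in expected.items()
            if params.get(key) != value}


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()
```

On the booted node, `missing_options(read_cmdline())` should come back as an empty dict if all three options made it through; anything it returns is an option that still needs attention.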