Hey Gang,
We have a number of Dell R820 servers running Proxmox 6.2. They are all near-identical in spec: 4 x 8-core CPUs, 256GB RAM, a 512GB (ish) boot SSD, and a Mellanox ConnectX-4 NIC.
We are seeing periodic issues on all of the servers where they just freeze and need a hard reboot to bring them back to life. When connecting via the iDRAC, there are no errors on the screen, just a login prompt, but you can't type anything, which suggests a kernel panic to me.
Once the machine has rebooted, there is nothing in the syslog or kernel logs other than a gap.
Given that this is happening to all of them, I think it's safe to rule out hardware issues, especially considering some of these machines ran for 2.5 years on Windows Server 2016 without even a reboot!
The Dell hardware is running the latest firmware, as are the Mellanox cards.
The most recent machine to die, earlier today, had only 2 VMs on it, with a load average of 10 and CPU usage of 15% when it died. My initial suspicion is an issue with the network driver/firmware, but in the absence of any logs I need to know where to start.
Any ideas?