Node Crash/Freezing

Greatsamps

Hey Gang,

We have a number of Dell R820 servers running Proxmox 6.2. They are near-identical in spec: 4 x 8-core CPUs, 256GB RAM, a 512GB (ish) boot SSD, and a Mellanox ConnectX-4 NIC.

We are seeing periodic issues on all of the servers where they just freeze and require a hard reboot to bring them back to life. When connecting to the iDRAC, there are no errors on the screen, just a login prompt, but you can't type anything either, which suggests to me a kernel panic.

When the machine has booted, there is nothing in the Syslog or Kernel logs other than a gap.

Given that this is happening to all of them, I think it's safe to rule out hardware issues, especially considering some of these machines ran for 2.5 years without even a reboot on Windows Server 2016!

The Dell hardware is running the latest firmware, as are the Mellanox cards.

The most recent machine to die, today, had only 2 VMs on it, with a load average of 10 and CPU usage of 15% at the time it died. My initial thought is an issue with the network driver/firmware, but in the absence of any logs I need to know where to start.

Any ideas?
 
Hey,

What kernel are you currently using? Can you try upgrading to the latest package updates?

When the machine has booted, there is nothing in the Syslog or Kernel logs other than a gap.
You could try opening an SSH session to such a server and running dmesg -wT in it. Sometimes one is lucky and can see some messages there shortly before the system hangs completely and stops syncing logs to disk. At least that helped me in the past to catch a message that left no trace anywhere else.
Without any such messages we cannot really guess anything.
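
Since the freeze may happen while nobody is watching the terminal, the output of that session can be written to a file on a second machine as it arrives. Below is a minimal Python sketch of that idea. The function name stream_to_file and the paths in the usage note are made up for illustration and are not part of any Proxmox tooling.

```python
import os
import subprocess


def stream_to_file(cmd, log_path):
    """Run cmd and append each line of its output to log_path as it arrives.

    Every line is flushed and fsync'ed immediately, so the last messages
    printed before the remote side stops responding are already on the
    local disk. Returns the number of lines written.
    """
    count = 0
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep ssh errors in the same log
        text=True,
        errors="replace",          # tolerate odd bytes in kernel messages
        bufsize=1,                 # line-buffered pipe
    ) as proc, open(log_path, "a", encoding="utf-8") as log:
        for line in proc.stdout:
            log.write(line)
            log.flush()
            os.fsync(log.fileno())
            count += 1
    return count
```

It could be run from a monitoring host as, for example, stream_to_file(["ssh", "root@node1", "dmesg", "-wT"], "node1-dmesg.log"), where the host name and file name are placeholders. The call only returns once the SSH session ends, so after a freeze the tail of the file shows the last lines that made it across.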
 
Hi,

Thanks for the response, below are specific version numbers.

Kernel Version Linux 5.4.65-1-pve #1 SMP PVE 5.4.65-1 (Mon, 21 Sep 2020 15:40:22 +0200)
PVE Manager Version pve-manager/6.2-12/b287dd27

Is there a more recent Kernel version?

I could try the dmesg approach, but the chances are it will happen when no one is around; most of the time we only notice a node has died because the monitoring picks it up as dead.

Our plan is to hook up a netconsole stream to see what that can record.
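
For reference, netconsole sends kernel messages as UDP packets to a target host, so something has to be listening on the receiving side. A minimal Python sketch of such a receiver is below. It is only an illustration: the function names are made up, and port 6666 is an assumption that must match whatever target port netconsole is configured with on the nodes.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Open a UDP socket for incoming netconsole packets (port is an assumption)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_netconsole(sock, log_path, max_packets=None):
    """Append each received packet to log_path, prefixed with the sender's IP.

    Runs until max_packets packets have been handled, or forever when
    max_packets is None. Returns the number of packets written.
    """
    received = 0
    with open(log_path, "a", encoding="utf-8") as log:
        while max_packets is None or received < max_packets:
            data, (sender_ip, _sender_port) = sock.recvfrom(65535)
            text = data.decode("utf-8", errors="replace")
            if not text.endswith("\n"):
                text += "\n"
            # The sender IP tells the nodes apart when several log to one host.
            log.write(f"{sender_ip} {text}")
            log.flush()
            received += 1
    return received
```

Calling receive_netconsole(open_listener(), "netconsole.log") on the log host would keep collecting until interrupted. Because each entry carries the sending node's address, one listener can cover all of the R820s.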
 
I have updated all nodes to:

Kernel Version Linux 5.4.78-2-pve #1 SMP PVE 5.4.78-2 (Thu, 03 Dec 2020 14:26:17 +0100)
PVE Manager Version pve-manager/6.3-3/eee5f901

I have also installed the Mellanox drivers directly from their APT repository, which applied some kernel patches.
 
