Issues with PM cluster

Sep 16, 2019
Hello,

We are having some strange issues in our Proxmox cluster. Our nodes seem to be kernel panicking at random, with no rhyme or reason. We can't find anything in the logs about why this is happening (if there are any non-standard log locations, we'd love to know about them!).

Our storage back end is Ceph, and we currently run Proxmox VE 5.4-13 with 8 nodes. There doesn't seem to be a pattern, as 3-4 of the nodes have had the issue. A reboot brings the affected node back into the cluster without issues. We've run memtests on the nodes, and memory doesn't seem to be the problem. Resources seem sufficient (under 65% utilization on each node).

Any help is appreciated, as we're kind of at a loss here with nodes just randomly dying. We had two nodes die yesterday, one around 8:25 AM CST and the other around 4:10 PM CST.
 
Have you been able to attach a console to the server and see the kernel panic?

Most kernel panics won't be in the logs, since it's the kernel itself that has crashed.
 
We haven't actually seen the kernel panic happen; by the time a node crashes, we're scrambling to get VMs migrated and restored, and the login screen is just frozen at the console. We could probably log into each console to catch the kernel panic as it happens, but we do not stay logged in at the console as a security practice.
 

Just to confirm: it's the node itself that is kernel panicking? Not just the VMs?

What I meant was physically connecting a screen, or going via iLO, to see what the physical server is doing.

For example, it could be a new driver or kernel update that is causing the crash/panic.

You really need to catch the kernel panic (if it's on the physical node) to at least have an idea of what the issue may be.
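If catching the panic live proves difficult, netconsole can stream kernel messages, including the panic trace, over UDP to another machine as the node dies. A minimal sketch, assuming made-up addresses, a hypothetical interface name `eno1`, and the receiver's MAC address:

```
# /etc/modprobe.d/netconsole.conf
# syntax: netconsole=<local-port>@<local-ip>/<iface>,<remote-port>@<remote-ip>/<remote-mac>
options netconsole netconsole=6665@192.168.0.21/eno1,6666@192.168.0.250/aa:bb:cc:dd:ee:ff

# /etc/modules -- add one line so the module loads at boot
netconsole
```

On the receiving host, something like `nc -u -l 6666` (or a syslog daemon listening on that port) records whatever the dying kernel manages to send. kdump is the heavier alternative if you want full crash dumps rather than just the console output.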
 
Yeah, it's a physical node. None of the VMs panic. OK, we will try to catch the kernel panic through the console directly and post results when it happens. Thanks so much for your prompt reply!
 

Maybe you could use the IPMI management port to get a remote console, so you don't have to leave a physical one unattended.
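For what it's worth, a serial-over-LAN session via ipmitool works well for this. A sketch, with a placeholder BMC address and user (note the node's kernel command line also needs something like `console=ttyS1,115200` for panic output to reach the serial console; the exact ttyS device depends on the board):

```shell
# Sketch: compose the ipmitool serial-over-LAN command.
# BMC_HOST and BMC_USER are placeholders -- use your own IPMI credentials.
BMC_HOST=10.0.0.21
BMC_USER=admin
# -I lanplus selects IPMI v2.0 over LAN; -a prompts for the password.
CMD="ipmitool -I lanplus -H $BMC_HOST -U $BMC_USER -a sol activate"
echo "$CMD"   # run the printed command interactively to attach the console
```

The session stays attached until you send the escape sequence (`~.` by default), so you can leave it running in a screen/tmux session on a management box and scroll back after a crash.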
 
We are using PRTG to track resource utilization. The beginning of the gap is when the machine stopped responding completely. 32 processors; each line is one processor.
 

Attachments

  • imgpsh_mobile_save.jpg (112.4 KB)
We run Supermicro nodes (specifically R309.v6) for our Proxmox hosts. The back-end storage connects to an external Ceph cluster.
We also have NFS mounts for our backups, served from FreeNAS.

It seems to be random proxmox hosts, not one specific node.
 
We run updates monthly, and we are on the enterprise repo.
#1 SMP PVE 4.15.18-46 (Thu, 8 Aug 2019 10:42:06 +0200)

This is just my 2 cents. We typically never update all of our nodes in a cluster. In most situations, we update only 1-2 nodes and let them run for 1-2 weeks. If all is well, we update the rest. For us, this has proven to save us a number of times over the years.

For example, we just recently did a round of updates and ran into this bad boy.

https://bugzilla.proxmox.com/show_bug.cgi?id=2354

Personally, I am sticking with 4.15.18-18. 4.15.18-21 seems solid, but I am going to wait until it's in the enterprise repos before making my move. 4.15.18-20, with the above bug, scares me enough not to run it in production.

Maybe try moving back to 4.15.18-18 on half the nodes, or maybe even all of them.
 
Have you been able to successfully downgrade your kernel and uninstall the latest version? We ran into some issues attempting to uninstall the -20 kernel. I'm not sure uninstalling is even necessary, provided we can get the GRUB configuration updated so that -18 is the default boot option.
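In case it helps: rather than uninstalling, you can just point GRUB's default at the older kernel. A sketch, assuming the stock Debian-style GRUB that PVE 5.x ships; the kernel version and menu titles below are assumptions, so confirm the exact menuentry title on your node first (e.g. with `grep menuentry /boot/grub/grub.cfg`):

```shell
# Sketch: compose the GRUB_DEFAULT line for booting an older PVE kernel.
# KVER and the menu titles are assumptions -- verify against your grub.cfg.
KVER="4.15.18-18-pve"
ENTRY="Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux ${KVER}"
echo "GRUB_DEFAULT=\"${ENTRY}\""
# Put the printed line into /etc/default/grub, then run update-grub and reboot.
```

If you still want the -20 kernel gone afterwards, `dpkg -l | grep pve-kernel` will show the exact package names to pass to `apt remove`.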
 
