Issues with PM cluster

Sep 16, 2019
Hello,

We are having some strange issues in our Proxmox cluster. Our nodes seem to be kernel panicking at random, with no rhyme or reason. We can't find anything in the logs about why this is happening (if there are any non-standard log locations, we'd love to know about them!).

Our storage back end is Ceph, and we currently run Proxmox VE 5.4-13 with 8 nodes. There doesn't seem to be a pattern, as 3-4 of the nodes have had the issue. A reboot brings the affected node back into the cluster without issues. We've run memtests on the nodes, and memory doesn't seem to be the problem. Resources seem sufficient (under 65% utilization on each node).

Any help is appreciated, as we're kind of at a loss here with nodes just randomly dying. We had two nodes die yesterday, one around 8:25 AM CST and the other around 4:10 PM CST.
 
Have you been able to attach a console to the server and see the kernel panic?

Most kernel panics won't be in the logs, since it's the kernel itself that has crashed.
 
We haven't actually seen the kernel panic happen; by the time a node crashes, we're scrambling to get VMs migrated and restored, and the login screen is just frozen at the console. We could probably log into each console to catch the kernel panic as it happens, but we do not stay logged in at the console as a security practice.
 

Just to confirm: it's the node itself that is kernel panicking? Not just the VMs?

What I meant was physically connecting a screen, or going via iLO, to see what the physical server is doing.

For example, it could be a new driver or kernel update that is causing the crash/panic.

You really need to catch the kernel panic (if it's on the physical node) to at least have an idea of what the issue may be.
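If catching the panic live proves difficult, netconsole can stream kernel messages, including the panic trace, over UDP to another machine as the node dies. A minimal sketch, assuming made-up addresses, a hypothetical interface name `eno1`, and the receiver's MAC address:

```
# /etc/modprobe.d/netconsole.conf
# syntax: netconsole=<local-port>@<local-ip>/<iface>,<remote-port>@<remote-ip>/<remote-mac>
options netconsole netconsole=6665@192.168.0.21/eno1,6666@192.168.0.250/aa:bb:cc:dd:ee:ff

# /etc/modules -- add one line so the module loads at boot
netconsole
```

On the receiving host, something like `nc -u -l 6666` (or a syslog daemon listening on that port) records whatever the dying kernel manages to send. kdump is the heavier alternative if you want full crash dumps rather than just the console output.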
 
Yeah, it's a physical node. None of the VMs panic. OK, we will try to catch the kernel panic through the console directly and post results when it happens. Thanks so much for your prompt reply!
 

Maybe you could use the IPMI management port to get a remote console, so you don't have to leave a physical one unattended.
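For what it's worth, a serial-over-LAN session via ipmitool works well for this. A sketch, with a placeholder BMC address and user (note the node's kernel command line also needs something like `console=ttyS1,115200` for panic output to reach the serial console; the exact ttyS device depends on the board):

```shell
# Sketch: compose the ipmitool serial-over-LAN command.
# BMC_HOST and BMC_USER are placeholders -- use your own IPMI credentials.
BMC_HOST=10.0.0.21
BMC_USER=admin
# -I lanplus selects IPMI v2.0 over LAN; -a prompts for the password.
CMD="ipmitool -I lanplus -H $BMC_HOST -U $BMC_USER -a sol activate"
echo "$CMD"   # run the printed command interactively to attach the console
```

The session stays attached until you send the escape sequence (`~.` by default), so you can leave it running in a screen/tmux session on a management box and scroll back after a crash.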
 
We are using PRTG to track resource utilization. The beginning of the gap is when the machine stopped responding completely. 32 processors; each line is one processor.
 

Attachments

  • imgpsh_mobile_save.jpg (112.4 KB)
We run Supermicro nodes (specifically R309.v6) for our Proxmox hosts. The back-end storage connects to an external Ceph cluster.
We also have NFS mounts for our backups, served from FreeNAS.

It seems to be random proxmox hosts, not one specific node.
 
We run updates monthly, and we are on the enterprise repo.
#1 SMP PVE 4.15.18-46 (Thu, 8 Aug 2019 10:42:06 +0200)

This is just my 2 cents. We typically never update all of our nodes in a cluster. In most situations, we update only 1-2 nodes and let them run for 1-2 weeks. If all is well, we update the rest. For us, this has proven to save us a number of times over the years.

For example, we just recently did a round of updates and ran into this bad boy.

https://bugzilla.proxmox.com/show_bug.cgi?id=2354

Personally, I am sticking with 4.15.18-18. 4.15.18-21 seems solid, but I am going to wait until it's in the enterprise repos before making my move. 4.15.18-20, with the above bug, scares me enough not to run it in production.

Maybe try moving back to 4.15.18-18 on half the nodes, or maybe even all of them.
 
Have you been able to successfully downgrade your kernel and uninstall the latest version? We ran into some issues attempting to uninstall the -20 kernel. I'm not sure uninstalling is even necessary, provided we can get the GRUB configuration updated so that -18 is the default boot option.
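In case it helps: rather than uninstalling, you can just point GRUB's default at the older kernel. A sketch, assuming the stock Debian-style GRUB that PVE 5.x ships; the kernel version and menu titles below are assumptions, so confirm the exact menuentry title on your node first (e.g. with `grep menuentry /boot/grub/grub.cfg`):

```shell
# Sketch: compose the GRUB_DEFAULT line for booting an older PVE kernel.
# KVER and the menu titles are assumptions -- verify against your grub.cfg.
KVER="4.15.18-18-pve"
ENTRY="Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux ${KVER}"
echo "GRUB_DEFAULT=\"${ENTRY}\""
# Put the printed line into /etc/default/grub, then run update-grub and reboot.
```

If you still want the -20 kernel gone afterwards, `dpkg -l | grep pve-kernel` will show the exact package names to pass to `apt remove`.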
 
