[SOLVED] cluster nodes reboot when one node fails

It is not one network, the /25 mask splits that into two subnets:

172.25.251.33/25 means the network is 172.25.251.0/25, the hosts go from 172.25.251.1 to .126, and the broadcast is 172.25.251.127
172.25.251.233/25 means the network is 172.25.251.128/25, the hosts go from 172.25.251.129 to .254, and the broadcast is 172.25.251.255
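
If you want to double-check it quickly, here is a small sanity check using Python's standard ipaddress module (any subnet calculator gives the same result):

python3 -c 'import ipaddress; print(ipaddress.ip_interface("172.25.251.33/25").network)'
# 172.25.251.0/25
python3 -c 'import ipaddress; print(ipaddress.ip_interface("172.25.251.233/25").network)'
# 172.25.251.128/25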

My bad, I must be intellectually challenged tonight. I did not realise that 33 < 128 and that 233 is exactly that one bit off.

From the corosync help pages, I have never seen it put the mask in the IP. You simply tell it which IP to use for communications, as configured on the interfaces.
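
Something like this is all a nodelist entry in /etc/pve/corosync.conf carries (purely illustrative: node name, ID and the second link are made up, I just reused the two addresses from above as placeholders):

nodelist {
  node {
    name: n1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.25.251.33
    ring1_addr: 172.25.251.233
  }
}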

Yes, it should be fine. Thanks for clarification. I believe this is not the problem then, an early red herring courtesy of me and myself only...
 
Thanks. I am looking through HA and softdog now. The thing is, if I put a node into maintenance, I can safely turn it off and the cluster is fine.

But in this case, where it suddenly crashed while hosting HA resources, it caused problems for the cluster. So I suspect that the problem lies there.
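
(For reference, by "put a node into maintenance" I mean the usual command, assuming a recent enough pve-ha-manager; the node name is just an example:

ha-manager crm-command node-maintenance enable n2
# ...do the work, then...
ha-manager crm-command node-maintenance disable n2
)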

Anyway, thanks again for looking through the logs and the config!
 
I will still have a look at it again; it was just low-hanging fruit to get out of the way first. The n2 should not be rebooting, basically. I will have a look at why the watchdog-mux stops the updates. Someone else might come along and have an idea too. I came across a couple of these "mysterious" reboots within the HA stack and would be glad to nail it down further, but most often it is "just" some misconfiguration.

So far the only distinguishing thing with n2 is that it was the CRM.
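
If you want to look along on n2 itself, the interesting bits are usually in these units; -b -1 looks at the journal from the boot before the reset:

journalctl -b -1 -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm -u corosync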

If you are wondering how exactly the watchdog works with HA on, you may get a bit of an idea in this post (also to understand the log messages more precisely):
https://forum.proxmox.com/threads/getting-rid-of-watchdog-emergency-node-reboot.136789/#post-635602
 
Hey! Just wanted to ping you to say that I have not forgotten this thread; I have been going back over some past corosync bugs to perhaps make a better guess.

Two questions:

1) Is this a setup where you could afford to turn on extra debugging for a period (it will produce quite an amount of logs; sketched below)?
2) Would it be easy for you to reproduce (does it happen often enough) to capture it with the extra logs on?
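
By extra debugging I mean essentially switching debug on in the logging section of /etc/pve/corosync.conf (a sketch only; remember to bump config_version so the change propagates), then watching journalctl -u corosync:

logging {
  debug: on
  to_syslog: yes
}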
 
Hello. Thanks for your message.

This is actually a production cluster, so I am trying to find less disruptive ways of figuring out the problem. From what I have seen in the logs, KNET showed links that were going down intermittently, well before the reboot issue occurred.

Going through other posts, I saw some recommendations to:
1. turn off spanning tree on the switch
2. disable the Intel(R) Management Engine Interface kernel module, in my case mei and mei_me

So I am going that route first (module blacklisting sketched below), then I will monitor the corosync logs and see.
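
The module part is just a plain modprobe blacklist (a sketch; the file name is my own choice):

echo "blacklist mei" >> /etc/modprobe.d/blacklist-mei.conf
echo "blacklist mei_me" >> /etc/modprobe.d/blacklist-mei.conf
update-initramfs -u -k all
# unload right away without a reboot (mei_me first, it depends on mei)
modprobe -r mei_me mei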

If that fails, I will see about the extra logs when I get a chance to schedule a maintenance window over a weekend, and I will get back to you.

Thanks again.
 
Sure, no worries. You may also want to check that flow control is off and whether the NIC seems to be doing well with ethtool -S ethX | grep -e flow -e pause.
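
To look at and change the pause settings themselves (ethX being the corosync-facing interface):

ethtool -a ethX                  # show autoneg/rx/tx pause settings
ethtool -A ethX rx off tx off    # turn flow control off, if the NIC supports it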
 
Disabling the mei and mei_me kernel modules solved this issue for me. The KNET links are now stable and I no longer see connection drops on either ring. I am able to put nodes into maintenance and reboot them with no problem; the cluster stays quorate as expected and continues working.
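
In case it helps anyone else, these are the two commands I kept an eye on afterwards to confirm the links stay up:

corosync-cfgtool -s    # per-link status as corosync sees it
pvecm status           # quorum and membership overview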

Thanks to everyone who chimed in, and especially to @esi_y for taking the time to check out the logs.
 
I've had a similar issue lately after moving from v2 -> v4 CPUs. I think I solved it with a full reinstall, but if it continues I will also test disabling the mei module and report back.
 
