[SOLVED] cluster nodes reboot when one node fails

It is not one network, the /25 mask splits that into two subnets:

172.25.251.33/25 means the network is 172.25.251.0/25, the hosts go from 172.25.251.1 to .126, and the broadcast is 172.25.251.127
172.25.251.233/25 means the network is 172.25.251.128/25, the hosts go from 172.25.251.129 to .254, and the broadcast is 172.25.251.255
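
If you want to double-check it quickly, here is a small sanity check using Python's standard ipaddress module (any subnet calculator gives the same result):

python3 -c 'import ipaddress; print(ipaddress.ip_interface("172.25.251.33/25").network)'
# 172.25.251.0/25
python3 -c 'import ipaddress; print(ipaddress.ip_interface("172.25.251.233/25").network)'
# 172.25.251.128/25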

My bad, I must be intellectually challenged tonight. I did not realise that 33 < 128 and that 233 is exactly that one bit off.

From the corosync help pages, I have never seen it put the mask in the IP. You simply tell it which IP to use for communications, as configured on the interfaces.
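
Something like this is all a nodelist entry in /etc/pve/corosync.conf carries (purely illustrative: node name, ID and the second link are made up, I just reused the two addresses from above as placeholders):

nodelist {
  node {
    name: n1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.25.251.33
    ring1_addr: 172.25.251.233
  }
}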

Yes, it should be fine. Thanks for clarification. I believe this is not the problem then, an early red herring courtesy of me and myself only...
 
Thanks. I am looking through HA and softdog now. The thing is, if I put a node into maintenance, I can safely turn it off and the cluster is fine.

But in this case, where it suddenly crashed while hosting HA resources, it caused problems for the cluster. So I suspect that the problem lies there.
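
(For reference, by "put a node into maintenance" I mean the usual command, assuming a recent enough pve-ha-manager; the node name is just an example:

ha-manager crm-command node-maintenance enable n2
# ...do the work, then...
ha-manager crm-command node-maintenance disable n2
)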

Anyway, thanks again for looking through the logs and the config!
 
I will still have a look at it again; it was just low-hanging fruit to get out of the way first. The n2 should not be rebooting, basically. I will have a look at why the watchdog-mux stops the updates. Someone else might come along and have an idea too. I came across a couple of these "mysterious" reboots within the HA stack and would be glad to nail it down further, but most often it is "just" some misconfiguration.

So far the only distinguishing thing with n2 is that it was the CRM.
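
If you want to look along on n2 itself, the interesting bits are usually in these units; -b -1 looks at the journal from the boot before the reset:

journalctl -b -1 -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm -u corosync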

If you are wondering how exactly the watchdog works with HA on, you may get a bit of an idea in this post (also to understand the log messages more precisely):
https://forum.proxmox.com/threads/getting-rid-of-watchdog-emergency-node-reboot.136789/#post-635602
 
Hey! Just wanted to ping you to say that I have not forgotten this thread; I have been going back over some past corosync bugs to perhaps make a better guess.

Two questions:

1) Is this a setup where you could afford to turn on extra debugging for a period (it will produce quite an amount of logs; sketched below)?
2) Would it be easy for you to reproduce (does it happen often enough) to capture it with the extra logs on?
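
By extra debugging I mean essentially switching debug on in the logging section of /etc/pve/corosync.conf (a sketch only; remember to bump config_version so the change propagates), then watching journalctl -u corosync:

logging {
  debug: on
  to_syslog: yes
}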
 
Hello. Thanks for your message.

This is actually a production cluster, so I am trying to find less disruptive ways of figuring out the problem. From what I have seen in the logs, KNET showed links that were going down intermittently, well before the reboot issue occurred.

Going through other posts, I saw some recommendations to:
1. turn off spanning tree on the switch
2. disable the Intel(R) Management Engine Interface kernel module, in my case mei and mei_me

So I am going that route first (module blacklisting sketched below), then I will monitor the corosync logs and see.
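
The module part is just a plain modprobe blacklist (a sketch; the file name is my own choice):

echo "blacklist mei" >> /etc/modprobe.d/blacklist-mei.conf
echo "blacklist mei_me" >> /etc/modprobe.d/blacklist-mei.conf
update-initramfs -u -k all
# unload right away without a reboot (mei_me first, it depends on mei)
modprobe -r mei_me mei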

If that fails, I will see about the extra logs when I get a chance to schedule a maintenance window over a weekend, and I will get back to you.

Thanks again.
 
Sure, no worries. You may also want to check that flow control is off and whether the NIC seems to be doing well with ethtool -S ethX | grep -e flow -e pause.
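
To look at and change the pause settings themselves (ethX being the corosync-facing interface):

ethtool -a ethX                  # show autoneg/rx/tx pause settings
ethtool -A ethX rx off tx off    # turn flow control off, if the NIC supports it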
 
Disabling the mei and mei_me kernel modules solved this issue for me. The KNET links are now stable and I no longer see connection drops on either ring. I am able to put nodes into maintenance and reboot them with no problem; the cluster stays quorate as expected and continues working.
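
In case it helps anyone else, these are the two commands I kept an eye on afterwards to confirm the links stay up:

corosync-cfgtool -s    # per-link status as corosync sees it
pvecm status           # quorum and membership overview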

Thanks to everyone who chimed in, and especially to @esi_y for taking the time to check out the logs.
 
I've had a similar issue lately after moving from v2 -> v4 CPUs. I think I solved it with a full reinstall, but if it continues I will also test disabling the mei module and report back.
 
