All servers in a cluster restart

BoringASK

New Member
Aug 9, 2018
12
0
1
24
Paris, France
Hello,

Since today I have had a problem on a cluster that had been running for 3 months without any major issues.

Today all the servers in my cluster restarted, and I can't find anything wrong with the network.

How can I diagnose this problem? Why would Proxmox reboot healthy servers?
 

BoringASK

I'll add a few more details:

- I have an 8-node compute configuration. Storage is on a separate, dedicated network.
- The cluster is a two-ring corosync setup, on dedicated networks and separate switches.

After investigation, it seems that my node 004 is the one that crashed first. I'm guessing this because its last log entry is 5 minutes older than the last log entry of the other servers. The other servers logged some corosync messages before their crash, but this one did not.
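The "which node went silent first" comparison can be scripted instead of done by eye. Here is a minimal Python sketch, assuming the logs from all nodes have been gathered as plain syslog-style lines in the same format as the excerpt below. The function names and the `year` parameter are my own (syslog timestamps carry no year, so 2018 is just an assumed default).

```python
import re
from datetime import datetime

# Matches the syslog prefix seen in the excerpt: "Sep 24 20:36:00 compute007 ..."
SYSLOG_PREFIX = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) ")


def last_entry_per_host(lines, year=2018):
    """Return {hostname: datetime of the latest log line seen for that host}."""
    latest = {}
    for line in lines:
        match = SYSLOG_PREFIX.match(line)
        if not match:
            continue  # skip anything that is not a syslog-style line
        stamp, host = match.groups()
        # Collapse the double space syslog uses for single-digit days ("Sep  4").
        stamp = " ".join(stamp.split())
        # The year is an assumption supplied by the caller, not read from the log.
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if host not in latest or when > latest[host]:
            latest[host] = when
    return latest


def first_silent_host(lines, year=2018):
    """Return the host whose last log line is the oldest, or None if no lines match."""
    latest = last_entry_per_host(lines, year)
    return min(latest, key=latest.get) if latest else None
```

Feeding it the concatenated syslog of all eight nodes should point at the node that stopped logging first. It only compares timestamps, so it assumes the node clocks are reasonably in sync.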

This node wasn't the "master" node of corosync. All the other nodes seem to have acknowledged the crash of node 004, but for an unknown reason the watchdog still triggered and rebooted the whole cluster (7 healthy nodes).

Code:
Sep 24 20:36:00 compute007 corosync[2309]: notice  [TOTEM ] A processor failed, forming new configuration.
Sep 24 20:36:00 compute007 corosync[2309]:  [TOTEM ] A processor failed, forming new configuration.
Sep 24 20:36:08 compute007 corosync[2309]: notice  [TOTEM ] A new membership (10.3.16.11:1708) was formed. Members left: 1
Sep 24 20:36:08 compute007 corosync[2309]: notice  [TOTEM ] Failed to receive the leave message. failed: 1
Sep 24 20:36:08 compute007 corosync[2309]:  [TOTEM ] A new membership (10.3.16.11:1708) was formed. Members left: 1
Sep 24 20:36:08 compute007 corosync[2309]:  [TOTEM ] Failed to receive the leave message. failed: 1
Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]: notice  [QUORUM] Members[7]: 4 5 6 2 3 7 8
Sep 24 20:36:08 compute007 corosync[2309]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
Sep 24 20:36:08 compute007 corosync[2309]:  [QUORUM] Members[7]: 4 5 6 2 3 7 8
Sep 24 20:36:08 compute007 corosync[2309]:  [MAIN  ] Completed service synchronization, ready to provide service.
I also have the following line: "watchdog-mux[1264]: client watchdog expired - disable watchdog updates". It looks like the watchdog daemon is responsible for all the reboots, but it shouldn't have been triggered while corosync was OK.
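To make the "corosync was OK" reasoning explicit: with simple majority quorum, 8 nodes need 5 votes, so the 7 remaining members should still have been quorate. Here is a small Python sketch of that arithmetic, assuming one vote per node and no special votequorum options (I haven't checked either against the actual corosync.conf). The helper names are made up for illustration.

```python
import re

# Matches the membership summary corosync prints, e.g. "[QUORUM] Members[7]: 4 5 6 2 3 7 8"
QUORUM_LINE = re.compile(r"\[QUORUM\] Members\[(\d+)\]: ([\d ]+)")


def parse_members(line):
    """Return the list of node IDs from a [QUORUM] Members line, or None if absent."""
    match = QUORUM_LINE.search(line)
    if match is None:
        return None
    return [int(token) for token in match.group(2).split()]


def quorum_threshold(expected_votes):
    """Simple majority: more than half of the expected votes (assumes 1 vote per node)."""
    return expected_votes // 2 + 1


def is_quorate(active_votes, expected_votes):
    """True if the surviving partition holds at least a majority of the votes."""
    return active_votes >= quorum_threshold(expected_votes)


line = ("Sep 24 20:36:08 compute007 corosync[2309]: notice  "
        "[QUORUM] Members[7]: 4 5 6 2 3 7 8")
members = parse_members(line)
print(len(members), quorum_threshold(8), is_quorate(len(members), 8))
```

If that assumption holds, the partition seen by compute007 at 20:36:08 had 7 of 8 votes, well above the threshold of 5. So on paper the watchdog expiry is not explained by loss of quorum at that moment.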

For now, I have removed node 004 from production, but I'd like to understand why this single node failure took down the whole cluster.
 
