All servers in a cluster restart

Discussion in 'Proxmox VE: Installation and configuration' started by BoringASK, Sep 24, 2018.

  1. BoringASK

    BoringASK New Member

    Joined:
    Aug 9, 2018
    Messages:
    12
    Likes Received:
    0
    Hello,

    I have a problem that started today on a cluster that has been running for 3 months without any major issues.

    Today all the servers in my cluster restarted at the same time, and I can't find anything wrong with the network.

    How can I diagnose this problem? Why does Proxmox reboot healthy servers?
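
    So far I have mostly been looking at the journal around the reboot time. Roughly what I ran (the time window is just the one I used; journalctl -b -1 only works with a persistent journal):

    Code:
    # previous-boot logs from the cluster/HA related services (needs persistent journald)
    journalctl -b -1 -u corosync -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux

    # everything around the time of the reboot
    journalctl --since "2018-09-24 20:30" --until "2018-09-24 20:45"

    # current quorum / membership
    pvecm status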
     
  2. BoringASK

    BoringASK New Member

    Joined:
    Aug 9, 2018
    Messages:
    12
    Likes Received:
    0
    I'll add a bit more detail:

    - I have 8 compute nodes. Storage is separated onto a dedicated network.
    - The cluster is a 2-ring corosync setup, on a dedicated network with separate switches (a rough sketch of what that looks like in corosync.conf is below).
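
    For context, a minimal sketch of a 2-ring corosync 2.x configuration as it would appear in /etc/pve/corosync.conf; the cluster name, subnets and addresses are placeholders, not my real values:

    Code:
    totem {
      version: 2
      cluster_name: mycluster
      rrp_mode: passive
      interface {
        ringnumber: 0
        bindnetaddr: 10.3.16.0
      }
      interface {
        ringnumber: 1
        bindnetaddr: 10.3.17.0
      }
    }
    nodelist {
      node {
        name: compute004
        nodeid: 1
        ring0_addr: 10.3.16.14
        ring1_addr: 10.3.17.14
      }
      # ... one node block per server
    }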

    After investigation, it seems that node 004 is the one that crashed first. I'm guessing this because its last log entry is 5 minutes older than the last entries on the other servers. The other servers logged some corosync messages before their crash, but this one did not.

    This node wasn't the corosync "master" node. All the other nodes seem to have acknowledged the crash of node 004, but for an unknown reason the watchdog still triggered and rebooted the whole cluster (7 healthy nodes).

    Code:
    Sep 24 20:36:00 compute007 corosync[2309]: notice  [TOTEM ] A processor failed, forming new configuration.
    Sep 24 20:36:00 compute007 corosync[2309]:  [TOTEM ] A processor failed, forming new configuration.
    Sep 24 20:36:08 compute007 corosync[2309]: notice  [TOTEM ] A new membership (10.3.16.11:1708) was formed. Members left: 1
    Sep 24 20:36:08 compute007 corosync[2309]: notice  [TOTEM ] Failed to receive the leave message. failed: 1
    Sep 24 20:36:08 compute007 corosync[2309]:  [TOTEM ] A new membership (10.3.16.11:1708) was formed. Members left: 1
    Sep 24 20:36:08 compute007 corosync[2309]:  [TOTEM ] Failed to receive the leave message. failed: 1
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: notice  [QUORUM] Members[7]: 4 5 6 2 3 7 8
    Sep 24 20:36:08 compute007 corosync[2309]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [QUORUM] Members[7]: 4 5 6 2 3 7 8
    Sep 24 20:36:08 compute007 corosync[2309]:  [MAIN  ] Completed service synchronization, ready to provide service.
    
    I also have the following line: "watchdog-mux[1264]: client watchdog expired - disable watchdog updates". It looks like the watchdog daemon is responsible for all the reboots, but it shouldn't have been triggered since corosync was OK.
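
    For anyone wanting to check the same things, this is roughly how I have been looking at the watchdog/HA side on the surviving nodes (standard PVE tools, nothing specific to my setup):

    Code:
    # which watchdog device is in use (softdog by default unless a hardware one is configured)
    lsmod | grep -e softdog -e wdt

    # state of the watchdog multiplexer and the HA services that feed it
    systemctl status watchdog-mux pve-ha-lrm pve-ha-crm

    # HA manager view: current HA master, managed resources
    ha-manager status
    cat /etc/pve/ha/resources.cfg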

    For now I have removed node 004 from production, but I'd like to understand why this single-node failure took down the whole cluster.
     
  3. guletz

    guletz Active Member

    Joined:
    Apr 19, 2017
    Messages:
    888
    Likes Received:
    121
    Hi,

    your first step is to set up a dedicated host (outside of pmx) that receives ALL of your pmx nodes' logs, so you still have the logs from the moment of the crash even after a node reboots.
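
    A minimal rsyslog sketch for that, assuming a log host reachable at 10.3.16.250 (placeholder address) and plain UDP transport:

    Code:
    # on the log host, in /etc/rsyslog.conf: accept remote messages on UDP port 514
    module(load="imudp")
    input(type="imudp" port="514")

    # on every pmx node, e.g. in /etc/rsyslog.d/90-remote.conf:
    # forward everything; a single @ means UDP, use @@ for TCP
    *.* @10.3.16.250:514

    # restart rsyslog on each host afterwards
    systemctl restart rsyslog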
     