All servers in a cluster restart

Discussion in 'Proxmox VE: Installation and configuration' started by BoringASK, Sep 24, 2018.

  1. BoringASK

    BoringASK New Member

    Joined:
    Aug 9, 2018
    Messages:
    12
    Likes Received:
    0
    Hello,

    I have a problem that started today on a cluster that has been running for 3 months without any major issues.

    Today all the servers in my cluster restarted at the same time, and I can't find anything wrong with the network.

    How can I diagnose this problem? Why does Proxmox reboot healthy servers?
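
    So far I have mostly been looking at the journal around the reboot time. Roughly what I ran (the time window is just the one I used; journalctl -b -1 only works with a persistent journal):

    Code:
    # previous-boot logs from the cluster/HA related services (needs persistent journald)
    journalctl -b -1 -u corosync -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux

    # everything around the time of the reboot
    journalctl --since "2018-09-24 20:30" --until "2018-09-24 20:45"

    # current quorum / membership
    pvecm status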
     
  2. BoringASK

    BoringASK New Member

    Joined:
    Aug 9, 2018
    Messages:
    12
    Likes Received:
    0
    I'll add a bit more detail:

    - I have 8 compute nodes. Storage is separated onto a dedicated network.
    - The cluster is a 2-ring corosync setup, on a dedicated network with separate switches (a rough sketch of what that looks like in corosync.conf is below).
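
    For context, a minimal sketch of a 2-ring corosync 2.x configuration as it would appear in /etc/pve/corosync.conf; the cluster name, subnets and addresses are placeholders, not my real values:

    Code:
    totem {
      version: 2
      cluster_name: mycluster
      rrp_mode: passive
      interface {
        ringnumber: 0
        bindnetaddr: 10.3.16.0
      }
      interface {
        ringnumber: 1
        bindnetaddr: 10.3.17.0
      }
    }
    nodelist {
      node {
        name: compute004
        nodeid: 1
        ring0_addr: 10.3.16.14
        ring1_addr: 10.3.17.14
      }
      # ... one node block per server
    }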

    After investigation, it seems that node 004 is the one that crashed first. I'm guessing this because its last log entry is 5 minutes older than the last entries on the other servers. The other servers logged some corosync messages before their crash, but this one did not.

    This node wasn't the corosync "master" node. All the other nodes seem to have acknowledged the crash of node 004, but for an unknown reason the watchdog still triggered and rebooted the whole cluster (7 healthy nodes).

    Code:
    Sep 24 20:36:00 compute007 corosync[2309]: notice  [TOTEM ] A processor failed, forming new configuration.
    Sep 24 20:36:00 compute007 corosync[2309]:  [TOTEM ] A processor failed, forming new configuration.
    Sep 24 20:36:08 compute007 corosync[2309]: notice  [TOTEM ] A new membership (10.3.16.11:1708) was formed. Members left: 1
    Sep 24 20:36:08 compute007 corosync[2309]: notice  [TOTEM ] Failed to receive the leave message. failed: 1
    Sep 24 20:36:08 compute007 corosync[2309]:  [TOTEM ] A new membership (10.3.16.11:1708) was formed. Members left: 1
    Sep 24 20:36:08 compute007 corosync[2309]:  [TOTEM ] Failed to receive the leave message. failed: 1
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: warning [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]: notice  [QUORUM] Members[7]: 4 5 6 2 3 7 8
    Sep 24 20:36:08 compute007 corosync[2309]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [CPG   ] downlist left_list: 1 received
    Sep 24 20:36:08 compute007 corosync[2309]:  [QUORUM] Members[7]: 4 5 6 2 3 7 8
    Sep 24 20:36:08 compute007 corosync[2309]:  [MAIN  ] Completed service synchronization, ready to provide service.
    
    I also have the following line: "watchdog-mux[1264]: client watchdog expired - disable watchdog updates". It looks like the watchdog daemon is responsible for all the reboots, but it shouldn't have been triggered since corosync was OK.
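
    For anyone wanting to check the same things, this is roughly how I have been looking at the watchdog/HA side on the surviving nodes (standard PVE tools, nothing specific to my setup):

    Code:
    # which watchdog device is in use (softdog by default unless a hardware one is configured)
    lsmod | grep -e softdog -e wdt

    # state of the watchdog multiplexer and the HA services that feed it
    systemctl status watchdog-mux pve-ha-lrm pve-ha-crm

    # HA manager view: current HA master, managed resources
    ha-manager status
    cat /etc/pve/ha/resources.cfg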

    For now I have removed node 004 from production, but I'd like to understand why this single-node failure took down the whole cluster.
     
  3. guletz

    guletz Active Member

    Joined:
    Apr 19, 2017
    Messages:
    888
    Likes Received:
    121
    Hi,

    your first step is to set up a dedicated host (outside of pmx) that receives ALL of your pmx nodes' logs, so you still have the logs from the moment of the crash even after a node reboots.
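
    A minimal rsyslog sketch for that, assuming a log host reachable at 10.3.16.250 (placeholder address) and plain UDP transport:

    Code:
    # on the log host, in /etc/rsyslog.conf: accept remote messages on UDP port 514
    module(load="imudp")
    input(type="imudp" port="514")

    # on every pmx node, e.g. in /etc/rsyslog.d/90-remote.conf:
    # forward everything; a single @ means UDP, use @@ for TCP
    *.* @10.3.16.250:514

    # restart rsyslog on each host afterwards
    systemctl restart rsyslog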
     