"Quorum dissolved" after short network outage due to STP topology change

jampy

Three Proxmox 3.4 nodes (all enterprise updates applied) in an HA setup.

All have three Ethernet ports:
port 1 is connected to the WAN gateway
port 2 is connected to a Netgear GS724T managed switch ("master")
port 3 is connected to another Netgear switch (same model, "slave")

The Netgear switches are also connected together directly using a patch cable.

On each Proxmox node, ports 2 and 3 are configured as a virtual Linux bridge (vmbr0). The Netgear switches have a low STP priority value (and thus the first Netgear switch usually becomes root).

In theory, this gives me a fully redundant, self-configuring mesh LAN.
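For reference, the relevant part of /etc/network/interfaces on each node looks roughly like this (excerpt from memory, so exact addresses and interface names may differ slightly):

Code:
auto vmbr0
iface vmbr0 inet static
        address  192.168.100.1      # .2 / .3 on the other nodes
        netmask  255.255.255.0
        bridge_ports eth1 eth2      # eth1 -> "master" Netgear, eth2 -> "slave" Netgear
        bridge_stp on               # let the bridge participate in spanning tree
        bridge_fd 15                # forward delay in seconds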

Problem: When the root switch goes offline (I tested this by rebooting it), it takes a few seconds until the slave Netgear switch becomes the new root, so the whole network is effectively down for ~10 seconds.

Proxmox notices this and says "Quorum dissolved", shutting down VMs (strangely, no fencing of nodes is started, though).

This leaves the cluster in an unacceptable state, meaning that the switch is effectively a single point of failure.

Even worse, the nodes can't do a normal reboot because of all kinds of problems with rgmanager, so a hard reset is required.

What can I do to avoid such a situation? Is there a way to extend the heartbeat timeouts, and would that be wise?
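(If extending the timeout is an option, I assume it would mean adding a <totem> element to cluster.conf, something like the sketch below, but I have no idea whether that is supported or sensible on Proxmox 3.4, so please correct me:)

Code:
<cluster config_version="27" name="indunet-cluster">   <!-- bump config_version when editing -->
  <totem token="30000"/>                               <!-- token timeout in milliseconds, just a guess -->
  ...
</cluster>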


PS: The Netgear switches support RSTP and have it enabled; however, the Linux bridges can only do classic STP.
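(In case it matters, this is how I check the bridge STP state and timers on the nodes; lowering the forward delay is just an idea, I don't know whether it's advisable:)

Code:
# brctl showstp vmbr0     # shows root bridge, port states and STP timers
# brctl setfd vmbr0 4     # example: lower the forward delay to 4 seconds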
 
Can you please post your cluster.conf file?

Here it is:

Code:
# cat /etc/pve/cluster.conf
<?xml version="1.0"?>
<cluster config_version="26" name="indunet-cluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_idrac8" cmd_prompt="admin1-&gt;" ipaddr="xxx" login="fencing_user" name="metal1-drac" passwd="xxx" secure="1"/>
    <fencedevice agent="fence_idrac8" cmd_prompt="admin1-&gt;" ipaddr="xxx" login="fencing_user" name="metal2-drac" passwd="xxx" secure="1"/>
    <fencedevice agent="fence_idrac8" cmd_prompt="admin1-&gt;" ipaddr="xxx" login="fencing_user" name="metal3-drac" passwd="xxx" secure="1"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="metal1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="metal1-drac"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="metal2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="metal2-drac"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="metal3" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="metal3-drac"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="105"/>
    <pvevm autostart="1" vmid="104"/>
    <pvevm autostart="1" vmid="108"/>
    <pvevm autostart="1" vmid="110"/>
    <pvevm autostart="1" vmid="111"/>
    <pvevm autostart="1" vmid="101"/>
    <pvevm autostart="1" vmid="112"/>
    <pvevm autostart="1" vmid="115"/>
    <pvevm autostart="1" vmid="102"/>
    <pvevm autostart="1" vmid="116"/>
  </rm>
</cluster>
 
Your config looks quite normal.

Proxmox notices this and says "Quorum dissolved", shutting down VMs (strangely, no fencing of nodes is started, though).

I don't understand why/who shuts down those VMs. A 10-second network failure should not result in such behavior.
Can you see (in syslog) that you get quorum again after those 10 seconds?
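A quick way to check is something like this (log path may differ depending on your setup):

Code:
# grep -i quorum /var/log/syslog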
 
I don't understand why/who shuts down those VMs. A 10-second network failure should not result in such behavior.
Can you see (in syslog) that you get quorum again after those 10 seconds?

I thought that Proxmox is (deliberately) sensitive to network outages (somewhere in the Proxmox docs I read that a highly reliable network is very important).

Isn't a node expected to stop all HA services when it is out of quorum?

Anyway, here is an annotated excerpt of syslog (node "metal1", 192.168.100.1):

Code:
Aug  8 11:54:16 metal1 rgmanager[346129]: [pvevm] VM 105 is running
Aug  8 11:54:16 metal1 rgmanager[346149]: [pvevm] VM 110 is running
Aug  8 11:54:28 metal1 kernel: tg3 0000:01:00.1: eth1: Link is down           <===== master switch reboots, network temporarily down due to STP topology reformation
Aug  8 11:54:28 metal1 kernel: vmbr0: port 1(eth1) entering disabled state
Aug  8 11:54:28 metal1 kernel: vmbr0: topology change detected, propagating
Aug  8 11:54:36 metal1 rgmanager[346223]: [pvevm] VM 105 is running
Aug  8 11:54:37 metal1 corosync[3627]:   [TOTEM ] A processor failed, forming new configuration.
Aug  8 11:54:47 metal1 kernel: vmbr0: neighbor 4000.08:bd:43:b3:e1:b4 lost on port 2(eth2)
Aug  8 11:54:47 metal1 kernel: vmbr0: topology change detected, propagating
Aug  8 11:54:49 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:54:49 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 11:54:49 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:54:49 metal1 pmxcfs[3396]: [status] notice: node lost quorum
Aug  8 11:54:49 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 11:54:49 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:54:49 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:54:49 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 11:54:49 metal1 corosync[3627]:   [QUORUM] Members[2]: 1 3
Aug  8 11:54:49 metal1 corosync[3627]:   [CMAN  ] quorum lost, blocking activity
Aug  8 11:54:49 metal1 corosync[3627]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug  8 11:54:49 metal1 corosync[3627]:   [QUORUM] Members[1]: 1
Aug  8 11:54:49 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:54:49 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 11:54:49 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:54:49 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 11:54:49 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 11:54:49 metal1 corosync[3627]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:54:49 metal1 rgmanager[4149]: #1: Quorum Dissolved     <=================
Aug  8 11:54:49 metal1 corosync[3627]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:3 left:2)
Aug  8 11:54:49 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 3/3423
Aug  8 11:54:49 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 11:54:49 metal1 corosync[3627]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:54:49 metal1 kernel: dlm: closing connection to node 2
Aug  8 11:54:49 metal1 kernel: dlm: closing connection to node 3
Aug  8 11:54:49 metal1 pmxcfs[3396]: [status] notice: cpg_send_message retried 1 times
Aug  8 11:54:49 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 3/3423
Aug  8 11:54:49 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 11:54:49 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396
Aug  8 11:54:49 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 11:54:49 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396
Aug  8 11:54:49 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 11:54:50 metal1 rgmanager[346278]: [pvevm] VM 104 is already stopped
Aug  8 11:54:50 metal1 rgmanager[346300]: [pvevm] VM 108 is already stopped
Aug  8 11:54:50 metal1 rgmanager[346321]: [pvevm] VM 111 is already stopped
Aug  8 11:54:50 metal1 rgmanager[346341]: [pvevm] VM 101 is already stopped
Aug  8 11:54:50 metal1 rgmanager[346361]: [pvevm] VM 112 is already stopped
Aug  8 11:54:50 metal1 rgmanager[346381]: [pvevm] VM 115 is already stopped
Aug  8 11:54:50 metal1 rgmanager[346401]: [pvevm] VM 102 is already stopped
Aug  8 11:54:50 metal1 rgmanager[346421]: [pvevm] VM 116 is already stopped
Aug  8 11:55:17 metal1 glusterfs: [2015-08-08 09:55:17.664791] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-systems-client-2: server 192.168.100.3:49152 has not responded in the last 42 seconds, disconnecting.
Aug  8 11:55:17 metal1 glusterfs: [2015-08-08 09:55:17.689880] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-systems-client-1: server 192.168.100.2:49152 has not responded in the last 42 seconds, disconnecting.

<== the secondary switch (temporarily the new root switch) *probably* entered forwarding state around here ==>
<== I can't tell exactly since there are no logs ==>

Aug  8 11:55:26 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:55:26 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 11:55:26 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:55:26 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 11:55:26 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 11:55:26 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:55:26 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 11:55:26 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:55:26 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:55:26 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 11:55:26 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 11:55:26 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:55:26 metal1 corosync[3627]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:55:26 metal1 corosync[3627]:   [CMAN  ] quorum regained, resuming activity
Aug  8 11:55:26 metal1 corosync[3627]:   [QUORUM] This node is within the primary component and will provide service.
Aug  8 11:55:26 metal1 corosync[3627]:   [QUORUM] Members[2]: 1 2
Aug  8 11:55:26 metal1 corosync[3627]:   [QUORUM] Members[2]: 1 2
Aug  8 11:55:26 metal1 pmxcfs[3396]: [status] notice: node has quorum
Aug  8 11:55:26 metal1 corosync[3627]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:1 left:0)
Aug  8 11:55:26 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 2/3373
Aug  8 11:55:26 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 11:55:26 metal1 corosync[3627]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:55:26 metal1 fenced[3837]: receive_start 2:4 add node with started_count 1
Aug  8 11:55:26 metal1 pmxcfs[3396]: [status] notice: cpg_send_message retried 1 times
Aug  8 11:55:26 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 2/3373
Aug  8 11:55:26 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 11:55:26 metal1 pmxcfs[3396]: [dcdb] notice: received sync request (epoch 1/3396/00000008)
Aug  8 11:55:26 metal1 pmxcfs[3396]: [dcdb] notice: received sync request (epoch 1/3396/00000008)
Aug  8 11:55:26 metal1 pmxcfs[3396]: [dcdb] notice: received all states
Aug  8 11:55:26 metal1 pmxcfs[3396]: [dcdb] notice: leader is 1/3396
Aug  8 11:55:26 metal1 pmxcfs[3396]: [dcdb] notice: synced members: 1/3396, 2/3373
Aug  8 11:55:26 metal1 pmxcfs[3396]: [dcdb] notice: start sending inode updates
Aug  8 11:55:26 metal1 pmxcfs[3396]: [dcdb] notice: sent all (0) updates
Aug  8 11:55:26 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 11:55:26 metal1 pmxcfs[3396]: [dcdb] notice: received all states
Aug  8 11:55:26 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 11:55:27 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:55:27 metal1 corosync[3627]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:55:27 metal1 corosync[3627]:   [QUORUM] Members[3]: 1 2 3
Aug  8 11:55:27 metal1 corosync[3627]:   [QUORUM] Members[3]: 1 2 3
Aug  8 11:55:27 metal1 corosync[3627]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:2 left:0)
Aug  8 11:55:27 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 2/3373, 3/3423
Aug  8 11:55:27 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 11:55:27 metal1 corosync[3627]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:55:27 metal1 fenced[3837]: receive_start 2:5 add node with started_count 1
Aug  8 11:55:27 metal1 fenced[3837]: receive_start 3:11 add node with started_count 7
Aug  8 11:55:27 metal1 pmxcfs[3396]: [status] notice: cpg_send_message retried 1 times
Aug  8 11:55:27 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 2/3373, 3/3423
Aug  8 11:55:27 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 11:55:27 metal1 pmxcfs[3396]: [dcdb] notice: received sync request (epoch 1/3396/00000009)
Aug  8 11:55:27 metal1 pmxcfs[3396]: [dcdb] notice: received sync request (epoch 1/3396/00000009)
Aug  8 11:55:27 metal1 pmxcfs[3396]: [dcdb] notice: received all states
Aug  8 11:55:27 metal1 pmxcfs[3396]: [dcdb] notice: leader is 1/3396
Aug  8 11:55:27 metal1 pmxcfs[3396]: [dcdb] notice: synced members: 1/3396, 2/3373, 3/3423
Aug  8 11:55:27 metal1 pmxcfs[3396]: [dcdb] notice: start sending inode updates
Aug  8 11:55:27 metal1 pmxcfs[3396]: [dcdb] notice: sent all (0) updates
Aug  8 11:55:27 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 11:55:27 metal1 pmxcfs[3396]: [dcdb] notice: received all states
Aug  8 11:55:27 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 11:55:49 metal1 pmxcfs[3396]: [status] notice: received log
Aug  8 11:56:09 metal1 kernel: tg3 0000:01:00.1: eth1: Link is up at 1000 Mbps, full duplex      <====== master switch up again after rebooting and thus triggering again STP topology reformation!
Aug  8 11:56:09 metal1 kernel: tg3 0000:01:00.1: eth1: Flow control is off for TX and off for RX
Aug  8 11:56:09 metal1 kernel: tg3 0000:01:00.1: eth1: EEE is disabled
Aug  8 11:56:09 metal1 kernel: vmbr0: port 1(eth1) entering listening state
Aug  8 11:56:19 metal1 corosync[3627]:   [TOTEM ] A processor failed, forming new configuration.
Aug  8 11:56:24 metal1 kernel: vmbr0: port 1(eth1) entering learning state
Aug  8 11:56:31 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:56:31 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 11:56:31 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:56:31 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 11:56:31 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:56:31 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:56:31 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 11:56:31 metal1 pmxcfs[3396]: [status] notice: node lost quorum
Aug  8 11:56:31 metal1 corosync[3627]:   [QUORUM] Members[2]: 1 3
Aug  8 11:56:31 metal1 corosync[3627]:   [CMAN  ] quorum lost, blocking activity
Aug  8 11:56:31 metal1 corosync[3627]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug  8 11:56:31 metal1 corosync[3627]:   [QUORUM] Members[1]: 1
Aug  8 11:56:31 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:56:31 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 11:56:31 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:56:31 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 11:56:31 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 11:56:31 metal1 corosync[3627]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:56:31 metal1 corosync[3627]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:3 left:2)
Aug  8 11:56:31 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 3/3423
Aug  8 11:56:31 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 11:56:31 metal1 corosync[3627]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:56:31 metal1 kernel: dlm: closing connection to node 2
Aug  8 11:56:31 metal1 kernel: dlm: closing connection to node 3
Aug  8 11:56:31 metal1 pmxcfs[3396]: [status] notice: cpg_send_message retried 1 times
Aug  8 11:56:31 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396
Aug  8 11:56:31 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 11:56:31 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 3/3423
Aug  8 11:56:31 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 11:56:31 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396
Aug  8 11:56:31 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 11:56:39 metal1 kernel: vmbr0: topology change detected, sending tcn bpdu
Aug  8 11:56:39 metal1 kernel: vmbr0: port 1(eth1) entering forwarding state
Aug  8 11:56:51 metal1 glusterfs: [2015-08-08 09:56:51.698654] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-systems-client-2: server 192.168.100.3:49152 has not responded in the last 42 seconds, disconnecting.
Aug  8 11:56:52 metal1 glusterfs: [2015-08-08 09:56:52.699166] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-systems-client-1: server 192.168.100.2:49152 has not responded in the last 42 seconds, disconnecting.
Aug  8 11:56:52 metal1 pvestatd[4864]: status update time (27.475 seconds)
Aug  8 11:58:46 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:58:46 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 11:58:46 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:58:46 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 11:58:46 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 11:58:46 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:58:46 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 11:58:46 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:58:46 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:58:46 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 11:58:46 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 11:58:46 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:58:46 metal1 corosync[3627]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:58:46 metal1 corosync[3627]:   [CMAN  ] quorum regained, resuming activity
Aug  8 11:58:46 metal1 corosync[3627]:   [QUORUM] This node is within the primary component and will provide service.
Aug  8 11:58:46 metal1 corosync[3627]:   [QUORUM] Members[2]: 1 2
Aug  8 11:58:46 metal1 pmxcfs[3396]: [status] notice: node has quorum
Aug  8 11:58:46 metal1 corosync[3627]:   [QUORUM] Members[2]: 1 2
Aug  8 11:58:46 metal1 corosync[3627]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:1 left:0)
Aug  8 11:58:46 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 2/3373
Aug  8 11:58:46 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 11:58:46 metal1 corosync[3627]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:58:46 metal1 fenced[3837]: receive_start 2:8 add node with started_count 1
Aug  8 11:58:46 metal1 pmxcfs[3396]: [status] notice: cpg_send_message retried 1 times
Aug  8 11:58:46 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 2/3373
Aug  8 11:58:46 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 11:58:46 metal1 pmxcfs[3396]: [dcdb] notice: received sync request (epoch 1/3396/0000000C)
Aug  8 11:58:46 metal1 pmxcfs[3396]: [dcdb] notice: received sync request (epoch 1/3396/0000000C)
Aug  8 11:58:46 metal1 pmxcfs[3396]: [dcdb] notice: received all states
Aug  8 11:58:46 metal1 pmxcfs[3396]: [dcdb] notice: leader is 1/3396
Aug  8 11:58:46 metal1 pmxcfs[3396]: [dcdb] notice: synced members: 1/3396, 2/3373
Aug  8 11:58:46 metal1 pmxcfs[3396]: [dcdb] notice: start sending inode updates
Aug  8 11:58:46 metal1 pmxcfs[3396]: [dcdb] notice: sent all (0) updates
Aug  8 11:58:46 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 11:58:46 metal1 pmxcfs[3396]: [dcdb] notice: received all states
Aug  8 11:58:46 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 11:58:47 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:58:47 metal1 corosync[3627]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:58:47 metal1 corosync[3627]:   [QUORUM] Members[3]: 1 2 3
Aug  8 11:58:47 metal1 corosync[3627]:   [QUORUM] Members[3]: 1 2 3
Aug  8 11:58:47 metal1 corosync[3627]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:2 left:0)
Aug  8 11:58:47 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 2/3373, 3/3423
Aug  8 11:58:47 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 11:58:47 metal1 corosync[3627]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:58:47 metal1 fenced[3837]: receive_start 2:9 add node with started_count 1
Aug  8 11:58:47 metal1 fenced[3837]: receive_start 3:15 add node with started_count 7
Aug  8 11:58:47 metal1 pmxcfs[3396]: [status] notice: cpg_send_message retried 1 times
Aug  8 11:58:47 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 2/3373, 3/3423
Aug  8 11:58:47 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 11:58:47 metal1 pmxcfs[3396]: [dcdb] notice: received sync request (epoch 1/3396/0000000D)
Aug  8 11:58:47 metal1 pmxcfs[3396]: [dcdb] notice: received sync request (epoch 1/3396/0000000D)
Aug  8 11:58:47 metal1 pmxcfs[3396]: [dcdb] notice: received all states
Aug  8 11:58:47 metal1 pmxcfs[3396]: [dcdb] notice: leader is 1/3396
Aug  8 11:58:47 metal1 pmxcfs[3396]: [dcdb] notice: synced members: 1/3396, 2/3373, 3/3423
Aug  8 11:58:47 metal1 pmxcfs[3396]: [dcdb] notice: start sending inode updates
Aug  8 11:58:47 metal1 pmxcfs[3396]: [dcdb] notice: sent all (0) updates
Aug  8 11:58:47 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 11:58:47 metal1 pmxcfs[3396]: [dcdb] notice: received all states
Aug  8 11:58:47 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 12:03:39 metal1 pmxcfs[3396]: [status] notice: received log
Aug  8 12:03:39 metal1 pmxcfs[3396]: [status] notice: received log
Aug  8 12:06:26 metal1 corosync[3627]:   [TOTEM ] A processor failed, forming new configuration.
Aug  8 12:06:38 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 12:06:38 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 12:06:38 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 12:06:38 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 12:06:38 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 12:06:38 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 12:06:38 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 12:06:38 metal1 corosync[3627]:   [QUORUM] Members[2]: 1 3
Aug  8 12:06:38 metal1 corosync[3627]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 12:06:38 metal1 corosync[3627]:   [CLM   ] New Configuration:
Aug  8 12:06:38 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 12:06:38 metal1 corosync[3627]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 12:06:38 metal1 corosync[3627]:   [CLM   ] Members Left:
Aug  8 12:06:38 metal1 corosync[3627]:   [CLM   ] Members Joined:
Aug  8 12:06:38 metal1 corosync[3627]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 12:06:38 metal1 kernel: dlm: closing connection to node 2
Aug  8 12:06:38 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 3/3423
Aug  8 12:06:38 metal1 corosync[3627]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:3 left:1)
Aug  8 12:06:38 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 12:06:38 metal1 corosync[3627]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 12:06:38 metal1 fenced[3837]: receive_start 3:16 add node with started_count 7
Aug  8 12:06:38 metal1 pmxcfs[3396]: [status] notice: cpg_send_message retried 1 times
Aug  8 12:06:38 metal1 pmxcfs[3396]: [dcdb] notice: members: 1/3396, 3/3423
Aug  8 12:06:38 metal1 pmxcfs[3396]: [dcdb] notice: starting data syncronisation
Aug  8 12:06:38 metal1 pmxcfs[3396]: [dcdb] notice: received sync request (epoch 1/3396/0000000E)
Aug  8 12:06:38 metal1 pmxcfs[3396]: [dcdb] notice: received all states
Aug  8 12:06:38 metal1 pmxcfs[3396]: [dcdb] notice: leader is 1/3396
Aug  8 12:06:38 metal1 pmxcfs[3396]: [dcdb] notice: synced members: 1/3396, 3/3423
Aug  8 12:06:38 metal1 pmxcfs[3396]: [dcdb] notice: start sending inode updates
Aug  8 12:06:38 metal1 pmxcfs[3396]: [dcdb] notice: sent all (0) updates
Aug  8 12:06:38 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 12:06:38 metal1 pmxcfs[3396]: [dcdb] notice: received sync request (epoch 1/3396/0000000E)
Aug  8 12:06:38 metal1 pmxcfs[3396]: [dcdb] notice: received all states
Aug  8 12:06:38 metal1 pmxcfs[3396]: [status] notice: all data is up to date
Aug  8 12:06:38 metal1 pmxcfs[3396]: [status] notice: dfsm_deliver_queue: queue length 32

Maybe it was more than 10 seconds? Which timeout is the critical one?

Note that after rebooting all physical nodes (that didn't help) I had to manually start cman and rgmanager on each node to get the HA cluster working again, since rgmanager wasn't running.
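(Concretely, what I had to run on each node was roughly:)

Code:
# service cman start
# service rgmanager start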

Can you find anything relevant in the log above?


Thanks
 
There are strange leave/join messages - the network does not really look stable. That may also be the reason why fencing does not work.

Note that each change of the root bridge causes a network outage due to how STP works. This means that my test (rebooting the master switch) caused the network to go down, up, down again, and up again. Could that explain the "strange leave/join messages"?

Fencing does not need the local LAN, as it contacts the iDRAC web service via the WAN port (which never went down). I can't find anything in the logs that would suggest a fence attempt. Also, a few minutes after the network was up again I could successfully fence a node via "fence_node".
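(The manual fence test was simply something along the lines of:)

Code:
# fence_node metal3     # example node name; triggers the configured iDRAC fence device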

Please give me some hints so that I have a chance to analyze the situation better.
 
Any further hints in /var/log/cluster/* ?

Everything in that directory is also included in the "syslog" file.

I could do another test (reboot that switch again) and see what happens.
In that case, please let me know what I should watch (syslog? ICMP pings between the nodes? some multicast test? ...?)
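(For the multicast test I was thinking of something like omping, run simultaneously on all three nodes, if that's the right tool:)

Code:
# omping -c 60 -i 1 192.168.100.1 192.168.100.2 192.168.100.3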
 
You sent the log from one node - what about the syslogs from the other nodes?


Good point!

Syslog of node #1: link

Here is node #2:

Code:
Aug  8 11:54:28 metal2 kernel: tg3 0000:01:00.1: eth1: Link is down                       <========= master switch reboots
Aug  8 11:54:28 metal2 kernel: vmbr0: port 1(eth1) entering disabled state
Aug  8 11:54:28 metal2 kernel: vmbr0: topology change detected, propagating
Aug  8 11:54:37 metal2 corosync[3653]:   [TOTEM ] A processor failed, forming new configuration.
Aug  8 11:54:47 metal2 kernel: vmbr0: neighbor 4000.08:bd:43:b3:e1:b4 lost on port 2(eth2)
Aug  8 11:54:47 metal2 kernel: vmbr0: topology change detected, propagating
Aug  8 11:54:49 metal2 corosync[3653]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:54:49 metal2 corosync[3653]:   [CLM   ] New Configuration:
Aug  8 11:54:49 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:54:49 metal2 pmxcfs[3373]: [status] notice: node lost quorum
Aug  8 11:54:49 metal2 corosync[3653]:   [CLM   ] Members Left:
Aug  8 11:54:49 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:54:49 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:54:49 metal2 corosync[3653]:   [CLM   ] Members Joined:
Aug  8 11:54:49 metal2 corosync[3653]:   [QUORUM] Members[2]: 2 3
Aug  8 11:54:49 metal2 corosync[3653]:   [CMAN  ] quorum lost, blocking activity
Aug  8 11:54:49 metal2 corosync[3653]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug  8 11:54:49 metal2 corosync[3653]:   [QUORUM] Members[1]: 2
Aug  8 11:54:49 metal2 corosync[3653]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:54:49 metal2 corosync[3653]:   [CLM   ] New Configuration:
Aug  8 11:54:49 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:54:49 metal2 corosync[3653]:   [CLM   ] Members Left:
Aug  8 11:54:49 metal2 corosync[3653]:   [CLM   ] Members Joined:
Aug  8 11:54:49 metal2 corosync[3653]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:54:49 metal2 rgmanager[4135]: #1: Quorum Dissolved                             <=============
Aug  8 11:54:49 metal2 kernel: dlm: closing connection to node 1
Aug  8 11:54:49 metal2 kernel: dlm: closing connection to node 3
Aug  8 11:54:49 metal2 corosync[3653]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.2) ; members(old:3 left:2)
Aug  8 11:54:49 metal2 pmxcfs[3373]: [dcdb] notice: members: 2/3373, 3/3423
Aug  8 11:54:49 metal2 pmxcfs[3373]: [status] notice: starting data syncronisation
Aug  8 11:54:49 metal2 corosync[3653]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:54:49 metal2 pmxcfs[3373]: [status] notice: cpg_send_message retried 1 times
Aug  8 11:54:49 metal2 pmxcfs[3373]: [dcdb] notice: members: 2/3373
Aug  8 11:54:49 metal2 pmxcfs[3373]: [dcdb] notice: all data is up to date
Aug  8 11:54:49 metal2 pmxcfs[3373]: [dcdb] notice: members: 2/3373, 3/3423
Aug  8 11:54:49 metal2 pmxcfs[3373]: [status] notice: starting data syncronisation
Aug  8 11:54:49 metal2 pmxcfs[3373]: [dcdb] notice: members: 2/3373
Aug  8 11:54:49 metal2 pmxcfs[3373]: [dcdb] notice: all data is up to date
Aug  8 11:54:50 metal2 rgmanager[6836]: [pvevm] VM 116 is already stopped
Aug  8 11:54:50 metal2 rgmanager[6835]: [pvevm] VM 108 is already stopped
Aug  8 11:54:50 metal2 rgmanager[6834]: [pvevm] VM 104 is already stopped
Aug  8 11:54:50 metal2 rgmanager[6837]: [pvevm] VM 102 is already stopped
Aug  8 11:54:50 metal2 rgmanager[6838]: [pvevm] VM 101 is already stopped
Aug  8 11:54:50 metal2 rgmanager[6853]: [pvevm] VM 110 is already stopped
Aug  8 11:54:50 metal2 rgmanager[6861]: [pvevm] VM 111 is already stopped
Aug  8 11:54:50 metal2 rgmanager[6870]: [pvevm] VM 115 is already stopped
Aug  8 11:54:50 metal2 rgmanager[6876]: [pvevm] VM 112 is already stopped
Aug  8 11:54:50 metal2 rgmanager[6887]: [pvevm] VM 105 is already stopped

<== the secondary switch (temporarily the new root switch) *probably* entered forwarding state around here ==>
<== I can't tell exactly since there are no logs ==>

Aug  8 11:55:26 metal2 corosync[3653]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:55:26 metal2 corosync[3653]:   [CLM   ] New Configuration:
Aug  8 11:55:26 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:55:26 metal2 corosync[3653]:   [CLM   ] Members Left:
Aug  8 11:55:26 metal2 corosync[3653]:   [CLM   ] Members Joined:
Aug  8 11:55:26 metal2 corosync[3653]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:55:26 metal2 corosync[3653]:   [CLM   ] New Configuration:
Aug  8 11:55:26 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:55:26 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:55:26 metal2 corosync[3653]:   [CLM   ] Members Left:
Aug  8 11:55:26 metal2 corosync[3653]:   [CLM   ] Members Joined:
Aug  8 11:55:26 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:55:26 metal2 corosync[3653]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:55:26 metal2 corosync[3653]:   [CMAN  ] quorum regained, resuming activity
Aug  8 11:55:26 metal2 corosync[3653]:   [QUORUM] This node is within the primary component and will provide service.
Aug  8 11:55:26 metal2 corosync[3653]:   [QUORUM] Members[2]: 1 2
Aug  8 11:55:26 metal2 pmxcfs[3373]: [status] notice: node has quorum
Aug  8 11:55:26 metal2 corosync[3653]:   [QUORUM] Members[2]: 1 2
Aug  8 11:55:26 metal2 rgmanager[4135]: Quorum Regained
Aug  8 11:55:26 metal2 corosync[3653]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:1 left:0)
Aug  8 11:55:26 metal2 pmxcfs[3373]: [dcdb] notice: members: 1/3396, 2/3373
Aug  8 11:55:26 metal2 pmxcfs[3373]: [status] notice: starting data syncronisation
Aug  8 11:55:26 metal2 pmxcfs[3373]: [dcdb] notice: members: 1/3396, 2/3373
Aug  8 11:55:26 metal2 pmxcfs[3373]: [status] notice: starting data syncronisation
Aug  8 11:55:26 metal2 corosync[3653]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:55:26 metal2 fenced[3830]: receive_start 1:8 add node with started_count 5
Aug  8 11:55:26 metal2 rgmanager[4135]: State change: Local UP
Aug  8 11:55:26 metal2 rgmanager[4135]: State change: metal1 UP
Aug  8 11:55:26 metal2 rgmanager[4135]: Loading Service Data
Aug  8 11:55:26 metal2 pmxcfs[3373]: [dcdb] notice: received sync request (epoch 1/3396/00000008)
Aug  8 11:55:26 metal2 pmxcfs[3373]: [dcdb] notice: received sync request (epoch 1/3396/00000008)
Aug  8 11:55:26 metal2 pmxcfs[3373]: [dcdb] notice: received all states
Aug  8 11:55:26 metal2 pmxcfs[3373]: [dcdb] notice: leader is 1/3396
Aug  8 11:55:26 metal2 pmxcfs[3373]: [dcdb] notice: synced members: 1/3396, 2/3373
Aug  8 11:55:26 metal2 pmxcfs[3373]: [dcdb] notice: all data is up to date
Aug  8 11:55:26 metal2 pmxcfs[3373]: [dcdb] notice: received all states
Aug  8 11:55:26 metal2 pmxcfs[3373]: [dcdb] notice: all data is up to date
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] New Configuration:
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] Members Left:
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] Members Joined:
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] New Configuration:
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] Members Left:
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] Members Joined:
Aug  8 11:55:27 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:55:27 metal2 corosync[3653]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:55:27 metal2 corosync[3653]:   [QUORUM] Members[3]: 1 2 3
Aug  8 11:55:27 metal2 corosync[3653]:   [QUORUM] Members[3]: 1 2 3
Aug  8 11:55:27 metal2 corosync[3653]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:2 left:0)
Aug  8 11:55:27 metal2 pmxcfs[3373]: [dcdb] notice: members: 1/3396, 2/3373, 3/3423
Aug  8 11:55:27 metal2 pmxcfs[3373]: [status] notice: starting data syncronisation
Aug  8 11:55:27 metal2 pmxcfs[3373]: [dcdb] notice: members: 1/3396, 2/3373, 3/3423
Aug  8 11:55:27 metal2 pmxcfs[3373]: [status] notice: starting data syncronisation
Aug  8 11:55:27 metal2 corosync[3653]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:55:27 metal2 fenced[3830]: receive_start 1:9 add node with started_count 5
Aug  8 11:55:27 metal2 fenced[3830]: receive_start 3:11 add node with started_count 7
Aug  8 11:55:27 metal2 rgmanager[4135]: State change: metal3 UP
Aug  8 11:55:27 metal2 pmxcfs[3373]: [dcdb] notice: received sync request (epoch 1/3396/00000009)
Aug  8 11:55:27 metal2 pmxcfs[3373]: [dcdb] notice: received sync request (epoch 1/3396/00000009)
Aug  8 11:55:27 metal2 pmxcfs[3373]: [dcdb] notice: received all states
Aug  8 11:55:27 metal2 pmxcfs[3373]: [dcdb] notice: leader is 1/3396
Aug  8 11:55:27 metal2 pmxcfs[3373]: [dcdb] notice: synced members: 1/3396, 2/3373, 3/3423
Aug  8 11:55:27 metal2 pmxcfs[3373]: [dcdb] notice: all data is up to date
Aug  8 11:55:27 metal2 pmxcfs[3373]: [dcdb] notice: received all states
Aug  8 11:55:27 metal2 pmxcfs[3373]: [dcdb] notice: all data is up to date
Aug  8 11:55:27 metal2 rgmanager[4135]: Skipping stop-before-start: overridden by administrator
Aug  8 11:55:49 metal2 pmxcfs[3373]: [status] notice: received log
Aug  8 11:56:09 metal2 kernel: tg3 0000:01:00.1: eth1: Link is up at 1000 Mbps, full duplex           <====== master switch up again after rebooting and thus triggering again STP topology reformation!
Aug  8 11:56:09 metal2 kernel: tg3 0000:01:00.1: eth1: Flow control is off for TX and off for RX
Aug  8 11:56:09 metal2 kernel: tg3 0000:01:00.1: eth1: EEE is disabled
Aug  8 11:56:09 metal2 kernel: vmbr0: port 1(eth1) entering listening state
Aug  8 11:56:19 metal2 corosync[3653]:   [TOTEM ] A processor failed, forming new configuration.
Aug  8 11:56:24 metal2 kernel: vmbr0: port 1(eth1) entering learning state
Aug  8 11:56:31 metal2 corosync[3653]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:56:31 metal2 corosync[3653]:   [CLM   ] New Configuration:
Aug  8 11:56:31 metal2 pmxcfs[3373]: [status] notice: node lost quorum
Aug  8 11:56:31 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:56:31 metal2 corosync[3653]:   [CLM   ] Members Left:
Aug  8 11:56:31 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:56:31 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:56:31 metal2 corosync[3653]:   [CLM   ] Members Joined:
Aug  8 11:56:31 metal2 corosync[3653]:   [QUORUM] Members[2]: 2 3
Aug  8 11:56:31 metal2 corosync[3653]:   [CMAN  ] quorum lost, blocking activity
Aug  8 11:56:31 metal2 corosync[3653]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug  8 11:56:31 metal2 corosync[3653]:   [QUORUM] Members[1]: 2
Aug  8 11:56:31 metal2 corosync[3653]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:56:31 metal2 corosync[3653]:   [CLM   ] New Configuration:
Aug  8 11:56:31 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:56:31 metal2 corosync[3653]:   [CLM   ] Members Left:
Aug  8 11:56:31 metal2 corosync[3653]:   [CLM   ] Members Joined:
Aug  8 11:56:31 metal2 corosync[3653]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:56:31 metal2 rgmanager[4135]: #1: Quorum Dissolved                    <=======================================================
Aug  8 11:56:31 metal2 corosync[3653]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.2) ; members(old:3 left:2)
Aug  8 11:56:31 metal2 pmxcfs[3373]: [dcdb] notice: members: 2/3373, 3/3423
Aug  8 11:56:31 metal2 pmxcfs[3373]: [status] notice: starting data syncronisation
Aug  8 11:56:31 metal2 corosync[3653]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:56:31 metal2 kernel: dlm: closing connection to node 1
Aug  8 11:56:31 metal2 kernel: dlm: closing connection to node 3
Aug  8 11:56:31 metal2 pmxcfs[3373]: [status] notice: cpg_send_message retried 1 times
Aug  8 11:56:31 metal2 pmxcfs[3373]: [dcdb] notice: members: 2/3373
Aug  8 11:56:31 metal2 pmxcfs[3373]: [dcdb] notice: all data is up to date
Aug  8 11:56:31 metal2 pmxcfs[3373]: [dcdb] notice: members: 2/3373, 3/3423
Aug  8 11:56:31 metal2 pmxcfs[3373]: [status] notice: starting data syncronisation
Aug  8 11:56:31 metal2 pmxcfs[3373]: [dcdb] notice: members: 2/3373
Aug  8 11:56:31 metal2 pmxcfs[3373]: [dcdb] notice: all data is up to date
Aug  8 11:56:31 metal2 rgmanager[8065]: [pvevm] VM 105 is already stopped
Aug  8 11:56:31 metal2 rgmanager[8085]: [pvevm] VM 104 is already stopped
Aug  8 11:56:31 metal2 rgmanager[8105]: [pvevm] VM 108 is already stopped
Aug  8 11:56:31 metal2 rgmanager[8125]: [pvevm] VM 110 is already stopped
Aug  8 11:56:31 metal2 rgmanager[8145]: [pvevm] VM 111 is already stopped
Aug  8 11:56:31 metal2 rgmanager[8150]: [pvevm] VM 101 is already stopped
Aug  8 11:56:31 metal2 rgmanager[8182]: [pvevm] VM 115 is already stopped
Aug  8 11:56:31 metal2 rgmanager[8200]: [pvevm] VM 112 is already stopped
Aug  8 11:56:31 metal2 rgmanager[8201]: [pvevm] VM 102 is already stopped
Aug  8 11:56:31 metal2 rgmanager[8217]: [pvevm] VM 116 is already stopped
Aug  8 11:56:39 metal2 kernel: vmbr0: topology change detected, sending tcn bpdu
Aug  8 11:56:39 metal2 kernel: vmbr0: port 1(eth1) entering forwarding state
Aug  8 11:58:21 metal2 kernel: INFO: task rgmanager:7098 blocked for more than 120 seconds.                   <===============
Aug  8 11:58:21 metal2 kernel:      Tainted: G           ---------------  T 2.6.32-40-pve #1
Aug  8 11:58:21 metal2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  8 11:58:21 metal2 kernel: rgmanager     D ffff88106fb4d4c0     0  7098   4134    0 0x00000000
Aug  8 11:58:21 metal2 kernel: ffff881065fc3ca0 0000000000000086 ffff881000000000 ffff881065fc3c08
Aug  8 11:58:21 metal2 kernel: ffffffff81139f07 ffff881072f67900 ffff881033fcbae8 ffff881065fc3ca8
Aug  8 11:58:21 metal2 kernel: ffffffff81168b79 ffffc90016f2f0c8 0000000100094bdf 0000000000000001
Aug  8 11:58:21 metal2 kernel: Call Trace:
Aug  8 11:58:21 metal2 kernel: [<ffffffff81139f07>] ? unlock_page+0x27/0x30
Aug  8 11:58:21 metal2 kernel: [<ffffffff81168b79>] ? __do_fault+0x4d9/0x5d0
Aug  8 11:58:21 metal2 kernel: [<ffffffff815642f5>] rwsem_down_failed_common+0x95/0x1e0
Aug  8 11:58:21 metal2 kernel: [<ffffffff8105e123>] ? enqueue_boosted_entity+0x43/0x60
Aug  8 11:58:21 metal2 kernel: [<ffffffff81564496>] rwsem_down_read_failed+0x26/0x30
Aug  8 11:58:21 metal2 kernel: [<ffffffff812a07b4>] call_rwsem_down_read_failed+0x14/0x30
Aug  8 11:58:21 metal2 kernel: [<ffffffff81563b74>] ? down_read+0x24/0x2b
Aug  8 11:58:21 metal2 kernel: [<ffffffffa055c033>] dlm_user_request+0x43/0x1d0 [dlm]
Aug  8 11:58:21 metal2 kernel: [<ffffffff8106dfc0>] ? wake_up_state+0x10/0x20
Aug  8 11:58:21 metal2 kernel: [<ffffffff810c71a6>] ? wake_futex+0x66/0x80
Aug  8 11:58:21 metal2 kernel: [<ffffffff810c9f41>] ? do_futex+0x8b1/0xb60
Aug  8 11:58:21 metal2 kernel: [<ffffffff8104b37c>] ? __do_page_fault+0x26c/0x4c0
Aug  8 11:58:21 metal2 kernel: [<ffffffff81198727>] ? kmem_cache_alloc_trace+0x1a7/0x1b0
Aug  8 11:58:21 metal2 kernel: [<ffffffffa0566741>] device_write+0x5b1/0x710 [dlm]
Aug  8 11:58:21 metal2 kernel: [<ffffffff811ae001>] vfs_write+0xa1/0x190
Aug  8 11:58:21 metal2 kernel: [<ffffffff811ae35a>] sys_write+0x4a/0x90
Aug  8 11:58:21 metal2 kernel: [<ffffffff8100b162>] system_call_fastpath+0x16/0x1b
Aug  8 11:58:46 metal2 corosync[3653]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:58:46 metal2 corosync[3653]:   [CLM   ] New Configuration:
Aug  8 11:58:46 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:58:46 metal2 corosync[3653]:   [CLM   ] Members Left:
Aug  8 11:58:46 metal2 corosync[3653]:   [CLM   ] Members Joined:
Aug  8 11:58:46 metal2 corosync[3653]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:58:46 metal2 corosync[3653]:   [CLM   ] New Configuration:
Aug  8 11:58:46 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:58:46 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:58:46 metal2 corosync[3653]:   [CLM   ] Members Left:
Aug  8 11:58:46 metal2 corosync[3653]:   [CLM   ] Members Joined:
Aug  8 11:58:46 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:58:46 metal2 corosync[3653]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:58:46 metal2 corosync[3653]:   [CMAN  ] quorum regained, resuming activity
Aug  8 11:58:46 metal2 corosync[3653]:   [QUORUM] This node is within the primary component and will provide service.
Aug  8 11:58:46 metal2 corosync[3653]:   [QUORUM] Members[2]: 1 2
Aug  8 11:58:46 metal2 corosync[3653]:   [QUORUM] Members[2]: 1 2
Aug  8 11:58:46 metal2 pmxcfs[3373]: [status] notice: node has quorum
Aug  8 11:58:46 metal2 corosync[3653]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:1 left:0)
Aug  8 11:58:46 metal2 pmxcfs[3373]: [dcdb] notice: members: 1/3396, 2/3373
Aug  8 11:58:46 metal2 pmxcfs[3373]: [status] notice: starting data syncronisation
Aug  8 11:58:46 metal2 pmxcfs[3373]: [dcdb] notice: members: 1/3396, 2/3373
Aug  8 11:58:46 metal2 pmxcfs[3373]: [status] notice: starting data syncronisation
Aug  8 11:58:46 metal2 corosync[3653]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:58:46 metal2 fenced[3830]: receive_start 1:12 add node with started_count 5
Aug  8 11:58:46 metal2 pmxcfs[3373]: [dcdb] notice: received sync request (epoch 1/3396/0000000C)
Aug  8 11:58:46 metal2 pmxcfs[3373]: [dcdb] notice: received sync request (epoch 1/3396/0000000C)
Aug  8 11:58:46 metal2 pmxcfs[3373]: [dcdb] notice: received all states
Aug  8 11:58:46 metal2 pmxcfs[3373]: [dcdb] notice: leader is 1/3396
Aug  8 11:58:46 metal2 pmxcfs[3373]: [dcdb] notice: synced members: 1/3396, 2/3373
Aug  8 11:58:46 metal2 pmxcfs[3373]: [dcdb] notice: all data is up to date
Aug  8 11:58:46 metal2 pmxcfs[3373]: [dcdb] notice: received all states
Aug  8 11:58:46 metal2 pmxcfs[3373]: [dcdb] notice: all data is up to date
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] New Configuration:
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] Members Left:
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] Members Joined:
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] New Configuration:
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] Members Left:
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] Members Joined:
Aug  8 11:58:47 metal2 corosync[3653]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:58:47 metal2 corosync[3653]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:58:47 metal2 corosync[3653]:   [QUORUM] Members[3]: 1 2 3
Aug  8 11:58:47 metal2 corosync[3653]:   [QUORUM] Members[3]: 1 2 3
Aug  8 11:58:47 metal2 corosync[3653]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:2 left:0)
Aug  8 11:58:47 metal2 pmxcfs[3373]: [dcdb] notice: members: 1/3396, 2/3373, 3/3423
Aug  8 11:58:47 metal2 pmxcfs[3373]: [status] notice: starting data syncronisation
Aug  8 11:58:47 metal2 pmxcfs[3373]: [dcdb] notice: members: 1/3396, 2/3373, 3/3423
Aug  8 11:58:47 metal2 pmxcfs[3373]: [status] notice: starting data syncronisation
Aug  8 11:58:47 metal2 corosync[3653]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:58:47 metal2 fenced[3830]: receive_start 1:13 add node with started_count 5
Aug  8 11:58:47 metal2 fenced[3830]: receive_start 3:15 add node with started_count 7
Aug  8 11:58:47 metal2 pmxcfs[3373]: [dcdb] notice: received sync request (epoch 1/3396/0000000D)
Aug  8 11:58:47 metal2 pmxcfs[3373]: [dcdb] notice: received sync request (epoch 1/3396/0000000D)
Aug  8 11:58:47 metal2 pmxcfs[3373]: [dcdb] notice: received all states
Aug  8 11:58:47 metal2 pmxcfs[3373]: [dcdb] notice: leader is 1/3396
Aug  8 11:58:47 metal2 pmxcfs[3373]: [dcdb] notice: synced members: 1/3396, 2/3373, 3/3423
Aug  8 11:58:47 metal2 pmxcfs[3373]: [dcdb] notice: all data is up to date
Aug  8 11:58:47 metal2 pmxcfs[3373]: [dcdb] notice: received all states
Aug  8 11:58:47 metal2 pmxcfs[3373]: [dcdb] notice: all data is up to date
Aug  8 12:00:21 metal2 kernel: INFO: task rgmanager:7098 blocked for more than 120 seconds.
Aug  8 12:00:21 metal2 kernel:      Tainted: G           ---------------  T 2.6.32-40-pve #1
Aug  8 12:00:21 metal2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  8 12:00:21 metal2 kernel: rgmanager     D ffff88106fb4d4c0     0  7098   4134    0 0x00000000
Aug  8 12:00:21 metal2 kernel: ffff881065fc3ca0 0000000000000086 ffff881000000000 ffff881065fc3c08
Aug  8 12:00:21 metal2 kernel: ffffffff81139f07 ffff881072f67900 ffff881033fcbae8 ffff881065fc3ca8
Aug  8 12:00:21 metal2 kernel: ffffffff81168b79 ffffc90016f2f0c8 0000000100094bdf 0000000000000001
Aug  8 12:00:21 metal2 kernel: Call Trace:
<trace removed due to forum post limits>
Aug  8 12:02:21 metal2 kernel: INFO: task rgmanager:7098 blocked for more than 120 seconds.
Aug  8 12:02:21 metal2 kernel:      Tainted: G           ---------------  T 2.6.32-40-pve #1
Aug  8 12:02:21 metal2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  8 12:02:21 metal2 kernel: rgmanager     D ffff88106fb4d4c0     0  7098   4134    0 0x00000000
Aug  8 12:02:21 metal2 kernel: ffff881065fc3ca0 0000000000000086 ffff881000000000 ffff881065fc3c08
Aug  8 12:02:21 metal2 kernel: ffffffff81139f07 ffff881072f67900 ffff881033fcbae8 ffff881065fc3ca8
Aug  8 12:02:21 metal2 kernel: ffffffff81168b79 ffffc90016f2f0c8 0000000100094bdf 0000000000000001
Aug  8 12:02:21 metal2 kernel: Call Trace:
Aug  8 12:02:21 metal2 kernel: [<ffffffff81139f07>] ? unlock_page+0x27/0x30
Aug  8 12:02:21 metal2 kernel: [<ffffffff81168b79>] ? __do_fault+0x4d9/0x5d0
Aug  8 12:02:21 metal2 kernel: [<ffffffff815642f5>] rwsem_down_failed_common+0x95/0x1e0
Aug  8 12:02:21 metal2 kernel: [<ffffffff8105e123>] ? enqueue_boosted_entity+0x43/0x60
Aug  8 12:02:21 metal2 kernel: [<ffffffff81564496>] rwsem_down_read_failed+0x26/0x30
Aug  8 12:02:21 metal2 kernel: [<ffffffff812a07b4>] call_rwsem_down_read_failed+0x14/0x30
Aug  8 12:02:21 metal2 kernel: [<ffffffff81563b74>] ? down_read+0x24/0x2b
Aug  8 12:02:21 metal2 kernel: [<ffffffffa055c033>] dlm_user_request+0x43/0x1d0 [dlm]
Aug  8 12:02:21 metal2 kernel: [<ffffffff8106dfc0>] ? wake_up_state+0x10/0x20
Aug  8 12:02:21 metal2 kernel: [<ffffffff810c71a6>] ? wake_futex+0x66/0x80
Aug  8 12:02:21 metal2 kernel: [<ffffffff810c9f41>] ? do_futex+0x8b1/0xb60
Aug  8 12:02:21 metal2 kernel: [<ffffffff8104b37c>] ? __do_page_fault+0x26c/0x4c0
Aug  8 12:02:21 metal2 kernel: [<ffffffff81198727>] ? kmem_cache_alloc_trace+0x1a7/0x1b0
Aug  8 12:02:21 metal2 kernel: [<ffffffffa0566741>] device_write+0x5b1/0x710 [dlm]
Aug  8 12:02:21 metal2 kernel: [<ffffffff811ae001>] vfs_write+0xa1/0x190
Aug  8 12:02:21 metal2 kernel: [<ffffffff811ae35a>] sys_write+0x4a/0x90
Aug  8 12:02:21 metal2 kernel: [<ffffffff8100b162>] system_call_fastpath+0x16/0x1b
Aug  8 12:03:37 metal2 shutdown[9015]: shutting down for system reboot
Aug  8 12:03:37 metal2 init: Switching to runlevel: 6
Aug  8 12:03:39 metal2 rrdcached[3336]: caught SIGTERM
Aug  8 12:03:39 metal2 rrdcached[3336]: starting shutdown
Aug  8 12:03:39 metal2 haveged: haveged stopping due to signal 15
Aug  8 12:03:39 metal2 /etc/init.d/logstash-forwarder: Attempting 'stop' on logstash-forwarder
Aug  8 12:03:39 metal2 /etc/init.d/logstash-forwarder: Killing logstash-forwarder (pid 3177) with SIGTERM
Aug  8 12:03:39 metal2 /etc/init.d/logstash-forwarder: Waiting logstash-forwarder (pid 3177) to die...
Aug  8 12:03:39 metal2 postfix/master[3404]: terminating on signal 15
Aug  8 12:03:39 metal2 pve-firewall[4101]: received signal TERM
Aug  8 12:03:39 metal2 pve-firewall[4101]: server closing
Aug  8 12:03:39 metal2 pve-firewall[4101]: clear firewall rules
Aug  8 12:03:39 metal2 pve-firewall[4101]: server stopped
Aug  8 12:03:39 metal2 rrdcached[3336]: clean shutdown; all RRDs flushed
Aug  8 12:03:39 metal2 rrdcached[3336]: removing journals
Aug  8 12:03:39 metal2 rrdcached[3336]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1439026892.775317
Aug  8 12:03:39 metal2 rrdcached[3336]: goodbye
Aug  8 12:03:39 metal2 spiceproxy[5115]: received signal TERM
Aug  8 12:03:39 metal2 spiceproxy[5115]: server closing
Aug  8 12:03:39 metal2 spiceproxy[5118]: worker exit
Aug  8 12:03:39 metal2 spiceproxy[5115]: worker 5118 finished
Aug  8 12:03:39 metal2 spiceproxy[5115]: server stopped
Aug  8 12:03:39 metal2 pvesh: <root@pam> starting task UPID:metal2:000023CF:000222F5:55C5D3FB:stopall::root@pam:
Aug  8 12:03:39 metal2 pvesh: <root@pam> end task UPID:metal2:000023CF:000222F5:55C5D3FB:stopall::root@pam: OK
Aug  8 12:03:40 metal2 pvestatd[4869]: received signal TERM
Aug  8 12:03:40 metal2 pvestatd[4869]: server closing
Aug  8 12:03:40 metal2 pvestatd[4869]: server stopped
Aug  8 12:03:40 metal2 /etc/init.d/logstash-forwarder: Waiting logstash-forwarder (pid 3177) to die...
Aug  8 12:03:40 metal2 /etc/init.d/logstash-forwarder: logstash-forwarder stopped.
Aug  8 12:03:40 metal2 pvepw-logger[3973]: received terminate request (signal)
Aug  8 12:03:40 metal2 pvepw-logger[3973]: stopping pvefw logger
Aug  8 12:03:40 metal2 pmxcfs[3373]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/metal3: -3
Aug  8 12:03:41 metal2 pveproxy[4966]: received signal TERM
Aug  8 12:03:41 metal2 pveproxy[4966]: server closing
Aug  8 12:03:41 metal2 pveproxy[4970]: worker exit
Aug  8 12:03:41 metal2 pveproxy[4969]: worker exit
Aug  8 12:03:41 metal2 pveproxy[4972]: worker exit
Aug  8 12:03:41 metal2 pveproxy[4966]: worker 4972 finished
Aug  8 12:03:41 metal2 pveproxy[4966]: worker 4970 finished
Aug  8 12:03:41 metal2 pveproxy[4966]: worker 4969 finished
Aug  8 12:03:41 metal2 pveproxy[4966]: server stopped
Aug  8 12:03:42 metal2 pvedaemon[4658]: received signal TERM
Aug  8 12:03:42 metal2 pvedaemon[4658]: server closing
Aug  8 12:03:42 metal2 pvedaemon[4663]: worker exit
Aug  8 12:03:42 metal2 pvedaemon[4662]: worker exit
Aug  8 12:03:42 metal2 pvedaemon[4661]: worker exit
Aug  8 12:03:42 metal2 pvedaemon[4658]: worker 4662 finished
Aug  8 12:03:42 metal2 pvedaemon[4658]: worker 4661 finished
Aug  8 12:03:42 metal2 pvedaemon[4658]: worker 4663 finished
Aug  8 12:03:42 metal2 pvedaemon[4658]: server stopped
Aug  8 12:04:21 metal2 kernel: INFO: task rgmanager:7098 blocked for more than 120 seconds.
Aug  8 12:04:21 metal2 kernel:      Tainted: G           ---------------  T 2.6.32-40-pve #1
Aug  8 12:04:21 metal2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  8 12:04:21 metal2 kernel: rgmanager     D ffff88106fb4d4c0     0  7098   4134    0 0x00000000
Aug  8 12:04:21 metal2 kernel: ffff881065fc3ca0 0000000000000086 ffff881000000000 ffff881065fc3c08
Aug  8 12:04:21 metal2 kernel: ffffffff81139f07 ffff881072f67900 ffff881033fcbae8 ffff881065fc3ca8
Aug  8 12:04:21 metal2 kernel: ffffffff81168b79 ffffc90016f2f0c8 0000000100094bdf 0000000000000001
Aug  8 12:04:21 metal2 kernel: Call Trace:
<trace removed to shorten forum post>
Aug  8 12:10:28 metal2 rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="3093" x-info="http://www.rsyslog.com"] start

What's different about node 2 is that the kernel reports "INFO: task rgmanager:7098 blocked for more than 120 seconds." several times.
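(Note to self for the next test: assuming sysrq is enabled on the node, dumping the blocked tasks to the kernel log should show what rgmanager is actually stuck on - something like:)

Code:
# dump all blocked (D-state) tasks, rgmanager included, to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 100     # the call traces show what rgmanager is waiting on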

(node 3 follows in next post)
 
Syslog of node #3 follows:

Code:
Aug  8 11:54:28 metal3 kernel: tg3 0000:01:00.1: eth1: Link is  down                       <========= master switch reboots
Aug  8 11:54:28 metal3 kernel: vmbr0: port 1(eth1) entering disabled state
Aug  8 11:54:28 metal3 kernel: vmbr0: topology change detected, propagating
Aug  8 11:54:29 metal3 kernel: vmbr0: received tcn bpdu on port 2(eth2)
Aug  8 11:54:29 metal3 kernel: vmbr0: topology change detected, propagating
Aug  8 11:54:37 metal3 corosync[3672]:   [TOTEM ] A processor failed, forming new configuration.
Aug  8 11:54:49 metal3 corosync[3672]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:54:49 metal3 corosync[3672]:   [CLM   ] New Configuration:
Aug  8 11:54:49 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:54:49 metal3 corosync[3672]:   [CLM   ] Members Left:
Aug  8 11:54:49 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:54:49 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:54:49 metal3 corosync[3672]:   [CLM   ] Members Joined:
Aug  8 11:54:49 metal3 corosync[3672]:   [QUORUM] Members[2]: 2 3
Aug  8 11:54:49 metal3 corosync[3672]:   [CMAN  ] quorum lost, blocking activity
Aug  8 11:54:49 metal3 corosync[3672]:   [QUORUM] This node is within  the non-primary component and will NOT provide any services.
Aug  8 11:54:49 metal3 corosync[3672]:   [QUORUM] Members[1]: 3
Aug  8 11:54:49 metal3 corosync[3672]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:54:49 metal3 pmxcfs[3423]: [status] notice: node lost quorum
Aug  8 11:54:49 metal3 corosync[3672]:   [CLM   ] New Configuration:
Aug  8 11:54:49 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:54:49 metal3 corosync[3672]:   [CLM   ] Members Left:
Aug  8 11:54:49 metal3 corosync[3672]:   [CLM   ] Members Joined:
Aug  8 11:54:49 metal3 corosync[3672]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:54:49 metal3 rgmanager[4141]: #1: Quorum Dissolved                             <================
Aug  8 11:54:49 metal3 corosync[3672]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.3) ; members(old:3 left:2)
Aug  8 11:54:49 metal3 pmxcfs[3423]: [dcdb] notice: members: 2/3373, 3/3423
Aug  8 11:54:49 metal3 pmxcfs[3423]: [dcdb] notice: starting data syncronisation
Aug  8 11:54:49 metal3 pmxcfs[3423]: [dcdb] notice: members: 3/3423
Aug  8 11:54:49 metal3 pmxcfs[3423]: [dcdb] notice: all data is up to date
Aug  8 11:54:49 metal3 pmxcfs[3423]: [dcdb] notice: members: 2/3373, 3/3423
Aug  8 11:54:49 metal3 pmxcfs[3423]: [dcdb] notice: starting data syncronisation
Aug  8 11:54:49 metal3 corosync[3672]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:54:49 metal3 pmxcfs[3423]: [dcdb] notice: members: 3/3423
Aug  8 11:54:49 metal3 kernel: dlm: closing connection to node 1
Aug  8 11:54:49 metal3 pmxcfs[3423]: [dcdb] notice: all data is up to date
Aug  8 11:54:49 metal3 kernel: dlm: closing connection to node 2
Aug  8 11:54:50 metal3 rgmanager[18676]: [pvevm] VM 105 is already stopped
Aug  8 11:54:50 metal3 rgmanager[18698]: [pvevm] VM 110 is already stopped
Aug  8 11:54:50 metal3 pvevm: <root@pam> starting task UPID:metal3:0000491F:00876998:55C5D1EA:qmshutdown:111:root@pam:
Aug  8 11:54:50 metal3 task  UPID:metal3:0000491F:00876998:55C5D1EA:qmshutdown:111:root@pam::  shutdown VM 111:  UPID:metal3:0000491F:00876998:55C5D1EA:qmshutdown:111:root@pam:
Aug  8 11:54:50 metal3 pvevm: <root@pam> starting task UPID:metal3:00004920:0087699E:55C5D1EA:qmshutdown:101:root@pam:
Aug  8 11:54:50 metal3 task  UPID:metal3:00004920:0087699E:55C5D1EA:qmshutdown:101:root@pam::  shutdown VM 101:  UPID:metal3:00004920:0087699E:55C5D1EA:qmshutdown:101:root@pam:
Aug  8 11:54:50 metal3 pvevm: <root@pam> starting task UPID:metal3:00004921:008769A3:55C5D1EA:qmshutdown:112:root@pam:
Aug  8 11:54:50 metal3 task  UPID:metal3:00004921:008769A3:55C5D1EA:qmshutdown:112:root@pam::  shutdown VM 112:  UPID:metal3:00004921:008769A3:55C5D1EA:qmshutdown:112:root@pam:
Aug  8 11:54:50 metal3 pvevm: <root@pam> starting task UPID:metal3:00004922:008769A8:55C5D1EA:qmshutdown:115:root@pam:
Aug  8 11:54:50 metal3 task  UPID:metal3:00004922:008769A8:55C5D1EA:qmshutdown:115:root@pam::  shutdown VM 115:  UPID:metal3:00004922:008769A8:55C5D1EA:qmshutdown:115:root@pam:
Aug  8 11:54:50 metal3 pvevm: <root@pam> starting task UPID:metal3:00004923:008769AD:55C5D1EA:qmshutdown:102:root@pam:
Aug  8 11:54:50 metal3 task  UPID:metal3:00004923:008769AD:55C5D1EA:qmshutdown:102:root@pam::  shutdown VM 102:  UPID:metal3:00004923:008769AD:55C5D1EA:qmshutdown:102:root@pam:
Aug  8 11:54:50 metal3 pvevm: <root@pam> starting task UPID:metal3:00004924:008769B2:55C5D1EA:qmshutdown:116:root@pam:
Aug  8 11:54:50 metal3 task  UPID:metal3:00004924:008769B2:55C5D1EA:qmshutdown:116:root@pam::  shutdown VM 116:  UPID:metal3:00004924:008769B2:55C5D1EA:qmshutdown:116:root@pam:

<== the secondary switch (temporarily the new root switch) *probably* entered forwarding state around here ==>
<== I can't tell exactly since there are no logs ==>

Aug  8 11:54:51 metal3 rgmanager[18725]: [pvevm] Task still active, waiting
Aug  8 11:54:51 metal3 rgmanager[18745]: [pvevm] Task still active, waiting
Aug  8 11:54:51 metal3 rgmanager[18765]: [pvevm] Task still active, waiting
Aug  8 11:54:51 metal3 rgmanager[18785]: [pvevm] Task still active, waiting
Aug  8 11:54:51 metal3 rgmanager[18805]: [pvevm] Task still active, waiting
Aug  8 11:54:51 metal3 rgmanager[18825]: [pvevm] Task still active, waiting
Aug  8 11:54:52 metal3 rgmanager[18845]: [pvevm] Task still active, waiting
Aug  8 11:54:52 metal3 rgmanager[18865]: [pvevm] Task still active, waiting
Aug  8 11:54:52 metal3 rgmanager[18895]: [pvevm] Task still active, waiting
Aug  8 11:54:52 metal3 rgmanager[18915]: [pvevm] Task still active, waiting
Aug  8 11:54:52 metal3 rgmanager[18935]: [pvevm] Task still active, waiting
Aug  8 11:54:52 metal3 rgmanager[18955]: [pvevm] Task still active, waiting

...many, many more (removed due to forum limits, but see below) ...

Aug  8 11:54:57 metal3 rgmanager[19498]: [pvevm] Task still active, waiting
Aug  8 11:54:57 metal3 rgmanager[19518]: [pvevm] Task still active, waiting
Aug  8 11:54:57 metal3 rgmanager[19538]: [pvevm] Task still active, waiting
Aug  8 11:54:57 metal3 rgmanager[19558]: [pvevm] Task still active, waiting
Aug  8 11:54:58 metal3 rgmanager[19578]: [pvevm] Task still active, waiting
Aug  8 11:54:58 metal3 pveproxy[1047080]: proxy detected vanished client connection
Aug  8 11:54:58 metal3 rgmanager[19598]: [pvevm] Task still active, waiting
Aug  8 11:54:58 metal3 rgmanager[19618]: [pvevm] Task still active, waiting
Aug  8 11:54:58 metal3 rgmanager[19638]: [pvevm] Task still active, waiting
Aug  8 11:54:58 metal3 rgmanager[19658]: [pvevm] Task still active, waiting
Aug  8 11:54:58 metal3 rgmanager[19687]: [pvevm] Task still active, waiting
Aug  8 11:54:59 metal3 rgmanager[19708]: [pvevm] Task still active, waiting

...many, many more (removed due to forum limits, but see below) ...

Aug  8 11:55:19 metal3 rgmanager[22079]: [pvevm] Task still active, waiting
Aug  8 11:55:19 metal3 rgmanager[22099]: [pvevm] Task still active, waiting
Aug  8 11:55:19 metal3 rgmanager[22119]: [pvevm] Task still active, waiting
Aug  8 11:55:20 metal3 rgmanager[22153]: [pvevm] Task still active, waiting
Aug  8 11:55:20 metal3 pvestatd[4808]: status update time (17.134 seconds)
Aug  8 11:55:20 metal3 rgmanager[22174]: [pvevm] Task still active, waiting
Aug  8 11:55:20 metal3 rgmanager[22198]: [pvevm] Task still active, waiting
Aug  8 11:55:20 metal3 rgmanager[22222]: [pvevm] Task still active, waiting
Aug  8 11:55:20 metal3 rgmanager[22242]: [pvevm] Task still active, waiting
Aug  8 11:55:20 metal3 rgmanager[22262]: [pvevm] Task still active, waiting
Aug  8 11:55:21 metal3 kernel: vmbr0: port 3(tap116i0) entering disabled state
Aug  8 11:55:21 metal3 rgmanager[22288]: [pvevm] Task still active, waiting
Aug  8 11:55:21 metal3 rgmanager[22308]: [pvevm] Task still active, waiting
Aug  8 11:55:21 metal3 rgmanager[22328]: [pvevm] Task still active, waiting
Aug  8 11:55:21 metal3 rgmanager[22348]: [pvevm] Task still active, waiting
Aug  8 11:55:21 metal3 rgmanager[22370]: [pvevm] Task still active, waiting
Aug  8 11:55:21 metal3 rgmanager[22395]: [pvevm] Task still active, waiting
Aug  8 11:55:21 metal3 pvevm: <root@pam> end task UPID:metal3:00004924:008769B2:55C5D1EA:qmshutdown:116:root@pam: OK
Aug  8 11:55:21 metal3 kernel: vmbr0: port 7(tap111i0) entering disabled state
Aug  8 11:55:21 metal3 kernel: vmbr0: port 11(tap101i0) entering disabled state
Aug  8 11:55:21 metal3 kernel: vmbr0: port 8(tap112i0) entering disabled state
Aug  8 11:55:22 metal3 kernel: vmbr2: port 1(tap112i1) entering disabled state
Aug  8 11:55:22 metal3 pvevm: <root@pam> end task UPID:metal3:0000491F:00876998:55C5D1EA:qmshutdown:111:root@pam: OK
Aug  8 11:55:22 metal3 rgmanager[22439]: [pvevm] Task still active, waiting
Aug  8 11:55:22 metal3 pvevm: <root@pam> end task UPID:metal3:00004920:0087699E:55C5D1EA:qmshutdown:101:root@pam: OK
Aug  8 11:55:22 metal3 rgmanager[22459]: [pvevm] Task still active, waiting
Aug  8 11:55:22 metal3 pvevm: <root@pam> end task UPID:metal3:00004921:008769A3:55C5D1EA:qmshutdown:112:root@pam: OK
Aug  8 11:55:22 metal3 rgmanager[22479]: [pvevm] Task still active, waiting
Aug  8 11:55:22 metal3 rgmanager[22509]: [pvevm] Task still active, waiting
Aug  8 11:55:22 metal3 rgmanager[22529]: [pvevm] Task still active, waiting
Aug  8 11:55:23 metal3 kernel: vmbr0: port 4(tap102i0) entering disabled state
Aug  8 11:55:23 metal3 kernel: vmbr0: port 9(tap115i0) entering disabled state
Aug  8 11:55:23 metal3 pvevm: <root@pam> end task UPID:metal3:00004922:008769A8:55C5D1EA:qmshutdown:115:root@pam: OK
Aug  8 11:55:23 metal3 pvevm: <root@pam> end task UPID:metal3:00004923:008769AD:55C5D1EA:qmshutdown:102:root@pam: OK
Aug  8 11:55:25 metal3 ntpd[3301]: Deleting interface #27 tap112i1,  fe80::443a:6ff:fe3e:b400#123, interface stats: received=0, sent=0,  dropped=0, active_time=5247 secs
Aug  8 11:55:25 metal3 ntpd[3301]: Deleting interface #26 tap112i0,  fe80::18dc:59ff:fe3e:b607#123, interface stats: received=0, sent=0,  dropped=0, active_time=5247 secs
Aug  8 11:55:25 metal3 ntpd[3301]: Deleting interface #23 tap102i0,  fe80::4ca6:b5ff:fef9:67ec#123, interface stats: received=0, sent=0,  dropped=0, active_time=86760 secs
Aug  8 11:55:25 metal3 ntpd[3301]: Deleting interface #22 tap101i0,  fe80::2489:cdff:fe51:f021#123, interface stats: received=0, sent=0,  dropped=0, active_time=87733 secs
Aug  8 11:55:25 metal3 ntpd[3301]: Deleting interface #19 tap115i0,  fe80::c0c9:5dff:feb8:d5b8#123, interface stats: received=0, sent=0,  dropped=0, active_time=87744 secs
Aug  8 11:55:25 metal3 ntpd[3301]: Deleting interface #15 tap111i0,  fe80::e888:75ff:fe7d:d1dc#123, interface stats: received=0, sent=0,  dropped=0, active_time=87744 secs
Aug  8 11:55:25 metal3 ntpd[3301]: Deleting interface #13 tap116i0,  fe80::ccc1:3cff:fecf:b3#123, interface stats: received=0, sent=0,  dropped=0, active_time=88086 secs
Aug  8 11:55:25 metal3 ntpd[3301]: peers refreshed
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] New Configuration:
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] Members Left:
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] Members Joined:
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] New Configuration:
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] Members Left:
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] Members Joined:
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:55:27 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:55:27 metal3 corosync[3672]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:55:27 metal3 corosync[3672]:   [CMAN  ] quorum regained, resuming activity
Aug  8 11:55:27 metal3 corosync[3672]:   [QUORUM] This node is within the primary component and will provide service.
Aug  8 11:55:27 metal3 corosync[3672]:   [QUORUM] Members[2]: 2 3
Aug  8 11:55:27 metal3 corosync[3672]:   [QUORUM] Members[2]: 2 3
Aug  8 11:55:27 metal3 pmxcfs[3423]: [status] notice: node has quorum
Aug  8 11:55:27 metal3 corosync[3672]:   [QUORUM] Members[3]: 1 2 3
Aug  8 11:55:27 metal3 corosync[3672]:   [QUORUM] Members[3]: 1 2 3
Aug  8 11:55:27 metal3 corosync[3672]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:2 left:0)
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: members: 1/3396, 3/3423
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: starting data syncronisation
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: members: 1/3396, 2/3373, 3/3423
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: members: 1/3396, 3/3423
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: starting data syncronisation
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: members: 1/3396, 2/3373, 3/3423
Aug  8 11:55:27 metal3 corosync[3672]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:55:27 metal3 fenced[3852]: receive_start 2:5 add node with started_count 1
Aug  8 11:55:27 metal3 fenced[3852]: receive_start 1:9 add node with started_count 5
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: received sync request (epoch 1/3396/00000009)
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: received sync request (epoch 1/3396/00000009)
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: received all states
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: leader is 1/3396
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: synced members: 1/3396, 2/3373, 3/3423
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: all data is up to date
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: received all states
Aug  8 11:55:27 metal3 pmxcfs[3423]: [dcdb] notice: all data is up to date
Aug  8 11:55:49 metal3 pvedaemon[1033327]: <root@pam> successful auth for user 'root@pam'
Aug  8 11:55:52 metal3 pveproxy[2036]: proxy detected vanished client connection
Aug  8 11:56:09 metal3 kernel: tg3 0000:01:00.1: eth1: Link is up at  1000 Mbps, full duplex            <====== master switch up again  after rebooting and thus triggering again STP topology reformation!
Aug  8 11:56:09 metal3 kernel: tg3 0000:01:00.1: eth1: Flow control is off for TX and off for RX
Aug  8 11:56:09 metal3 kernel: tg3 0000:01:00.1: eth1: EEE is disabled
Aug  8 11:56:09 metal3 kernel: vmbr0: port 1(eth1) entering listening state
Aug  8 11:56:19 metal3 corosync[3672]:   [TOTEM ] A processor failed, forming new configuration.
Aug  8 11:56:24 metal3 kernel: vmbr0: port 1(eth1) entering learning state
Aug  8 11:56:31 metal3 corosync[3672]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:56:31 metal3 corosync[3672]:   [CLM   ] New Configuration:
Aug  8 11:56:31 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:56:31 metal3 corosync[3672]:   [CLM   ] Members Left:
Aug  8 11:56:31 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:56:31 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:56:31 metal3 corosync[3672]:   [CLM   ] Members Joined:
Aug  8 11:56:31 metal3 pmxcfs[3423]: [status] notice: node lost quorum
Aug  8 11:56:31 metal3 corosync[3672]:   [QUORUM] Members[2]: 2 3
Aug  8 11:56:31 metal3 corosync[3672]:   [CMAN  ] quorum lost, blocking activity
Aug  8 11:56:31 metal3 corosync[3672]:   [QUORUM] This node is within  the non-primary component and will NOT provide any services.
Aug  8 11:56:31 metal3 corosync[3672]:   [QUORUM] Members[1]: 3
Aug  8 11:56:31 metal3 corosync[3672]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:56:31 metal3 corosync[3672]:   [CLM   ] New Configuration:
Aug  8 11:56:31 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:56:31 metal3 corosync[3672]:   [CLM   ] Members Left:
Aug  8 11:56:31 metal3 corosync[3672]:   [CLM   ] Members Joined:
Aug  8 11:56:31 metal3 corosync[3672]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:56:31 metal3 corosync[3672]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.3) ; members(old:3 left:2)
Aug  8 11:56:31 metal3 pmxcfs[3423]: [dcdb] notice: members: 2/3373, 3/3423
Aug  8 11:56:31 metal3 pmxcfs[3423]: [dcdb] notice: starting data syncronisation
Aug  8 11:56:31 metal3 pmxcfs[3423]: [dcdb] notice: members: 3/3423
Aug  8 11:56:31 metal3 pmxcfs[3423]: [dcdb] notice: all data is up to date
Aug  8 11:56:31 metal3 pmxcfs[3423]: [dcdb] notice: members: 2/3373, 3/3423
Aug  8 11:56:31 metal3 pmxcfs[3423]: [dcdb] notice: starting data syncronisation
Aug  8 11:56:31 metal3 pmxcfs[3423]: [dcdb] notice: members: 3/3423
Aug  8 11:56:31 metal3 pmxcfs[3423]: [dcdb] notice: all data is up to date
Aug  8 11:56:31 metal3 corosync[3672]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:56:31 metal3 kernel: dlm: closing connection to node 1
Aug  8 11:56:31 metal3 kernel: dlm: closing connection to node 2
Aug  8 11:56:31 metal3 pmxcfs[3423]: [status] notice: cpg_send_message retried 1 times
Aug  8 11:56:39 metal3 kernel: vmbr0: topology change detected, sending tcn bpdu
Aug  8 11:56:39 metal3 kernel: vmbr0: port 1(eth1) entering forwarding state
Aug  8 11:57:20 metal3 pvedaemon[1023068]: <root@pam> successful auth for user 'root@pam'
Aug  8 11:57:25 metal3 pveproxy[6658]: proxy detected vanished client connection
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] New Configuration:
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] Members Left:
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] Members Joined:
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] New Configuration:
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] Members Left:
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] Members Joined:
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 11:58:47 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 11:58:47 metal3 corosync[3672]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 11:58:47 metal3 corosync[3672]:   [CMAN  ] quorum regained, resuming activity
Aug  8 11:58:47 metal3 corosync[3672]:   [QUORUM] This node is within the primary component and will provide service.
Aug  8 11:58:47 metal3 corosync[3672]:   [QUORUM] Members[2]: 2 3
Aug  8 11:58:47 metal3 corosync[3672]:   [QUORUM] Members[2]: 2 3
Aug  8 11:58:47 metal3 pmxcfs[3423]: [status] notice: node has quorum
Aug  8 11:58:47 metal3 corosync[3672]:   [QUORUM] Members[3]: 1 2 3
Aug  8 11:58:47 metal3 corosync[3672]:   [QUORUM] Members[3]: 1 2 3
Aug  8 11:58:47 metal3 corosync[3672]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:2 left:0)
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: members: 1/3396, 3/3423
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: starting data syncronisation
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: members: 1/3396, 2/3373, 3/3423
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: members: 1/3396, 3/3423
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: starting data syncronisation
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: members: 1/3396, 2/3373, 3/3423
Aug  8 11:58:47 metal3 corosync[3672]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 11:58:47 metal3 fenced[3852]: receive_start 1:13 add node with started_count 5
Aug  8 11:58:47 metal3 fenced[3852]: receive_start 2:9 add node with started_count 1
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: received sync request (epoch 1/3396/0000000D)
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: received sync request (epoch 1/3396/0000000D)
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: received all states
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: leader is 1/3396
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: synced members: 1/3396, 2/3373, 3/3423
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: all data is up to date
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: received all states
Aug  8 11:58:47 metal3 pmxcfs[3423]: [dcdb] notice: all data is up to date
Aug  8 12:03:39 metal3 pmxcfs[3423]: [status] notice: received log
Aug  8 12:03:39 metal3 pmxcfs[3423]: [status] notice: received log
Aug  8 12:06:26 metal3 corosync[3672]:   [TOTEM ] A processor failed, forming new configuration.
Aug  8 12:06:38 metal3 corosync[3672]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 12:06:38 metal3 corosync[3672]:   [CLM   ] New Configuration:
Aug  8 12:06:38 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 12:06:38 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 12:06:38 metal3 corosync[3672]:   [CLM   ] Members Left:
Aug  8 12:06:38 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.2)
Aug  8 12:06:38 metal3 corosync[3672]:   [CLM   ] Members Joined:
Aug  8 12:06:38 metal3 corosync[3672]:   [QUORUM] Members[2]: 1 3
Aug  8 12:06:38 metal3 corosync[3672]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug  8 12:06:38 metal3 corosync[3672]:   [CLM   ] New Configuration:
Aug  8 12:06:38 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.1)
Aug  8 12:06:38 metal3 corosync[3672]:   [CLM   ] #011r(0) ip(192.168.100.3)
Aug  8 12:06:38 metal3 corosync[3672]:   [CLM   ] Members Left:
Aug  8 12:06:38 metal3 corosync[3672]:   [CLM   ] Members Joined:
Aug  8 12:06:38 metal3 corosync[3672]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  8 12:06:38 metal3 kernel: dlm: closing connection to node 2
Aug  8 12:06:38 metal3 corosync[3672]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.100.1) ; members(old:3 left:1)
Aug  8 12:06:38 metal3 pmxcfs[3423]: [dcdb] notice: members: 1/3396, 3/3423
Aug  8 12:06:38 metal3 pmxcfs[3423]: [dcdb] notice: starting data syncronisation
Aug  8 12:06:38 metal3 pmxcfs[3423]: [dcdb] notice: members: 1/3396, 3/3423
Aug  8 12:06:38 metal3 pmxcfs[3423]: [dcdb] notice: starting data syncronisation
Aug  8 12:06:38 metal3 corosync[3672]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  8 12:06:38 metal3 fenced[3852]: receive_start 1:14 add node with started_count 5
Aug  8 12:06:38 metal3 pmxcfs[3423]: [dcdb] notice: received sync request (epoch 1/3396/0000000E)
Aug  8 12:06:38 metal3 pmxcfs[3423]: [dcdb] notice: received sync request (epoch 1/3396/0000000E)
Aug  8 12:06:38 metal3 pmxcfs[3423]: [dcdb] notice: received all states
Aug  8 12:06:38 metal3 pmxcfs[3423]: [dcdb] notice: leader is 1/3396
Aug  8 12:06:38 metal3 pmxcfs[3423]: [dcdb] notice: synced members: 1/3396, 3/3423
Aug  8 12:06:38 metal3 pmxcfs[3423]: [dcdb] notice: all data is up to date
Aug  8 12:06:38 metal3 pmxcfs[3423]: [dcdb] notice: received all states
Aug  8 12:06:38 metal3 pmxcfs[3423]: [dcdb] notice: all data is up to date
Aug  8 12:06:38 metal3 pmxcfs[3423]: [status] notice: dfsm_deliver_queue: queue length 32
Aug  8 12:07:17 metal3 glusterfs: [2015-08-08 10:07:17.676648] C  [client-handshake.c:127:rpc_client_ping_timer_expired]  0-systems-client-1: server 192.168.100.2:49152 has not responded in the  last 42 seconds, disconnecting.
<snip>

Node 3 had most (if not all) VMs active.

Note that this node noticed "Quorum Dissolved" only once - the other two nodes noticed it twice.
Also there are a number of "rgmanager[x]: [pvevm] Task still active, waiting" messages with changing PIDs.

Hopefully you'll find other interesting information in there. Please let me know what this tells you!

Again, if a new test would help, let me know what to watch out for..

Thanks.

PS: Forum limits didn't allow me to post all text here - see uncensored syslog of node 1, node 2 and node 3.
 
Is glusterfs fully functional after the network failure?

It was functional, after rebooting the nodes at least.

I didn't check right after the network failure since to me missing Quorum was the main problem.

Why?
 
I didn't check right after the network failure since to me missing Quorum was the main problem.

Why?

Just hunting in the dark. This requires step by step analysis on the hosts, with access to all relevant logs.
Also, is it reproducible, or just happened once?
 
Just hunting in the dark. This requires step by step analysis on the hosts, with access to all relevant logs.
Also, is it reproducible, or just happened once?

I tried only once.

As said, I would try it again (it's just a matter of rebooting the master switch), but I'd like to have a plan for what I should monitor/try during that process, besides syslog.
So, if you have any directions or ideas, please let me know and I'll try again, collecting all the information that could be helpful...
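(Roughly what I'd plan to capture on each node during the next attempt, besides syslog - these are the stock PVE 3.x / cman tools, so correct me if there is something better:)

Code:
# run on each node while the switch reboots
watch -n1 'date; pvecm status; echo; clustat'    # quorum/membership view + rgmanager service states
brctl showstp vmbr0                              # STP state of the bridge ports
fence_tool ls                                    # fence domain membership
tail -f /var/log/cluster/rgmanager.log           # rgmanager's own log (path may differ)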
 
Well, I just tried again and it is exactly reproducible.

After the network goes down I get "#1: Quorum Dissolved" on all nodes, and all VMs are shut down.

When the network is up again after two minutes, the cluster is still inoperable.

None of the nodes is able to shut down properly (they block while stopping cman) - I have to do a hard reset.

With 2 nodes up, I have to restart cman and rgmanager manually. "cman" said [OK] for all items, but in the Proxmox GUI I can't see PVECluster or RGManager running. VMs are not being started.

After a few minutes (with node 3 not yet rebooted) the cluster starts to operate and starts the VMs. Don't know why this is delayed - I was writing this post when RGManager came up by itself.

When the third node came up, it had no Quorum. After starting cman and rgmanager on that node, the cluster was fully operational again.
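(For completeness, the manual recovery on the node that came up without quorum was roughly this - stock PVE 3.x init scripts, nothing fancy:)

Code:
service cman start        # rejoin the corosync/cman membership
service rgmanager start   # restart the resource group manager
pvecm status              # check quorum / membership
clustat                   # rgmanager's view of the pvevm services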


Full syslog output (lots of information) can be found at http://indunet.it/temp/proxmox-quorum-fail/ (messages have been synchronized and colorized to easily distinguish the nodes).


PLEASE, give me some hint as to what I could do to solve this problem!? I'm completely stuck right now..
What causes "Quorum Dissolved", and why does Proxmox become so completely unstable?
 
Three Proxmox 3.4 nodes (all enterprise updates applied) in HA setup.
This leaves the Cluster in an unacceptable state, meaning that the switch is effectively a single point of failure.

If your aim is to eliminate the switch as a single point of failure, I would try to create a bond device with your port2 and port3 devices, and set the mode of the bond to be active-backup

You can look at https://www.kernel.org/doc/Documentation/networking/bonding.txt, the keyword is active-backup
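For illustration, a minimal /etc/network/interfaces sketch of that idea (the NIC names eth1/eth2 and the address are just placeholders, adapt them to the actual nodes):

Code:
auto bond0
iface bond0 inet manual
        slaves eth1 eth2
        bond_mode active-backup
        bond_miimon 100
        bond_primary eth1

auto vmbr0
iface vmbr0 inet static
        address 192.168.100.1
        netmask 255.255.255.0
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0

Since the bridge then has only bond0 as its uplink, the bridge itself should no longer need to run STP at all.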
 
If your aim is to eliminate the switch as a single point of failure, I would try to create a bond device with your port2 and port3 devices, and set the mode of the bond to be active-backup

You can look at https://www.kernel.org/doc/Documentation/networking/bonding.txt, the keyword is active-backup


Please correct me if I'm wrong, but IMHO bonding won't help here.

Bonding just selects one of the two interfaces as the active one; it can't understand what is happening behind the switches.

It will only help if one of the switches fails completely (which, yes, matches the scenario of a switch rebooting), but not if a NIC fails or a cable breaks on one node.


Example:

On node 2, the cable between the primary NIC and the primary switch is broken/missing.

Node 2 will then use the secondary NIC since that still has a link - so far so good.

Node 1 (and also node 3), however, still has both NICs in a good state and keeps using the primary NIC - but the switch attached to it can't reach node 2, so the connection between node 2 and the other two nodes is broken.



With multiple switches and STP, the network will understand that for nodes 1 and 2 to communicate the backup switch must be used, while nodes 1 and 3 can still talk to each other over the primary switch. STP will find a path as long as one physically exists - it's just damn slow.
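(For what it's worth, the bridge sits in listening + learning for 2 x forward_delay - 30 seconds with the default of 15 - which matches the eth1 timestamps in the log above. A sketch of how I could shorten that on the Linux bridges, assuming bridge-utils; the value is just an example:)

Code:
brctl showstp vmbr0      # current STP timers and per-port states
brctl setfd vmbr0 4      # lower the forward delay at runtime (seconds)

# to make it persistent, in the vmbr0 stanza of /etc/network/interfaces:
#     bridge_stp on
#     bridge_fd 4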


If I'm missing something, please let me know!
 
Hi

I don't have the hardware at hand to test, but according to http://www.cloudibee.com/network-bonding-modes/:


  • Mode 1 (active-backup)
    This mode places one of the interfaces into a backup state and will only make it active if the link is lost by the active interface. Only one slave in the bond is active at an instance of time. A different slave becomes active only when the active slave fails. This mode provides fault tolerance.

If you pull out the cable or the NIC is defective, you will lose the Ethernet link on the active interface, and that should trigger the failover.
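A quick way to verify that on one node (assuming the bond from the earlier sketch is called bond0 with slaves eth1/eth2) would be:

Code:
cat /proc/net/bonding/bond0   # shows "Currently Active Slave" and per-slave link status
ip link set eth1 down         # simulate losing the active NIC (or just pull its cable)
cat /proc/net/bonding/bond0   # the active slave should now be eth2
ip link set eth1 up           # restore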
 
