Hello all,
I'm running into an odd issue when joining several new nodes to my existing cluster at the same time. Doing so seems to destabilize the cluster and revert all the hosts back to standalone machines. Has anyone run into this before and figured out why it happens?
Observed behavior
========== Phase 1 — transient healthy cluster ==========
Cluster briefly forms with all nodes:
Members[5]: 1 2 3 4 5
node has quorum
pmxcfs: starting data synchronisation
========== Phase 2 — corosync instability ==========
Within seconds:
Token loss:
Token has not been received
Messaging failures:
cpg_send_message failed: CS_ERR_TRY_AGAIN
Rapid membership changes
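For context on the token-loss messages: the timing that governs "Token has not been received" lives in the totem section of /etc/corosync/corosync.conf. I haven't overridden anything from what pvecm generated, so as far as I know I'm on the stock corosync 3 values; the fragment below is just to show which knobs are in play, not my literal config:

```
totem {
  version: 2
  # Token timeout in ms. When the token isn't seen within this window,
  # a new membership is formed (corosync 3 default: 3000).
  token: 3000
  # Extra ms added to the token timeout for each node beyond two,
  # so larger memberships get more slack (default: 650).
  token_coefficient: 650
}
```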
========== Phase 3 — link / membership flapping ==========
Across nodes:
Links dropping:
host X link: down
Repeated membership churn:
Members joined: 1 2 3
Members left: 1 2 3
Failed to receive the leave message
Retransmit List
========== Phase 4 — split / inconsistent state ==========
Nodes disagree on membership:
ignore sync request from wrong member
remove message from non-member
pmxcfs queue buildup and retries:
cpg_send_message retried
dfsm_deliver_queue growing
========== Phase 5 — quorum loss and pmxcfs failure ==========
Cluster partitions:
Members[2]: 4 5
Members left: 1 2 3
node lost quorum
pmxcfs failure:
received write while not quorate
leaving CPG group
quorum_initialize failed
cmap_initialize failed
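The quorum loss at the end is at least self-consistent: votequorum needs a strict majority of votes, so a 2-of-5 partition can never be quorate. A quick sanity check of that arithmetic (plain Python, nothing Proxmox-specific, one vote per node assumed):

```python
def has_quorum(total_votes: int, partition_votes: int) -> bool:
    """Standard majority quorum: strictly more than half of all votes."""
    needed = total_votes // 2 + 1
    return partition_votes >= needed

# 5-node cluster, 1 vote each: the {4, 5} partition holds 2 of 5 votes
print(has_quorum(5, 2))  # False -> matches "node lost quorum"
print(has_quorum(5, 3))  # True  -> the {1, 2, 3} side could stay quorate
```

So whichever side ends up with 3 or more nodes should keep quorum; the flapping just never leaves any partition stable long enough to matter.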