[SOLVED] Corosync/HA logic

mohnewald
Hello,

3-node test cluster. Test case: cut off node10's (the master's) cluster network connection

ha-manager status on node08:
quorum OK
master node10 (old timestamp - dead?, Tue Jan 8 17:12:52 2019)
lrm node08 (active, Tue Jan 8 17:13:31 2019)
lrm node09 (idle, Tue Jan 8 17:13:34 2019)
lrm node10 (old timestamp - dead?, Tue Jan 8 17:12:59 2019)
service vm:100 (node08, started)

ha-manager status on node09:
quorum OK
master node10 (old timestamp - dead?, Tue Jan 8 17:12:52 2019)
lrm node08 (active, Tue Jan 8 17:13:41 2019)
lrm node09 (idle, Tue Jan 8 17:13:40 2019)
lrm node10 (old timestamp - dead?, Tue Jan 8 17:12:59 2019)
service vm:100 (node08, started)

ha-manager status on node10:
quorum No quorum on node 'node10'!
master node10 (old timestamp - dead?, Tue Jan 8 17:12:52 2019)
lrm node08 (old timestamp - dead?, Tue Jan 8 17:12:57 2019)
lrm node09 (old timestamp - dead?, Tue Jan 8 17:12:56 2019)
lrm node10 (old timestamp - dead?, Tue Jan 8 17:12:59 2019)
service vm:100 (node08, started)

=> expected behavior: Since node10 got cut off and lost quorum, it should fence itself


BUT: node08 rebooted itself. Uptime on all nodes a few moments later:
---------------------
node08: 17:18:39 up 0 min, 0 users, load average: 1.77, 0.42, 0.14
node09: 17:18:39 up 24 min, 0 users, load average: 0.92, 0.91, 0.52
node10: 17:18:40 up 24 min, 0 users, load average: 0.26, 0.33, 0.34

Why did node08 reboot, and not node10, which was completely out of quorum?


# pvecm status
Quorum information
------------------
Date: Tue Jan 8 17:29:08 2019
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1/5620
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.0.8 (local)
0x00000002 1 10.10.0.9
0x00000003 1 10.10.0.10
 
3-node test cluster. Test case: cut off node10's (the master's) cluster network connection
How did you do this? ifdown (or similar) on interfaces is not a useful test, as corosync (the cluster communication stack) 2.4 still has some issues with that. It is not a realistic test for an outage anyway; nonetheless, corosync 3.x will have this fixed.
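A more realistic simulation (just a sketch, assuming the default corosync UDP ports 5404/5405 and that only the cluster network should be cut) is to drop corosync traffic on the node under test instead of downing the interface:

# On the node whose cluster link should "fail" (here: node10)
iptables -A INPUT -p udp --dport 5404:5405 -j DROP
iptables -A OUTPUT -p udp --dport 5404:5405 -j DROP
# Remove the rules again to restore cluster communication
iptables -D INPUT -p udp --dport 5404:5405 -j DROP
iptables -D OUTPUT -p udp --dport 5404:5405 -j DROP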
Why did node08 reboot, and not node10, which was completely out of quorum?

Can you please post excerpts from the syslog/journal from all three nodes around the time this happened? Anything mentioning corosync, pmxcfs/pve-cluster and pve-ha-* is relevant.
Otherwise it is hard to tell what the reason was. For now it looks like all nodes lost quorum, but only node08 got fenced, as it was the only node with configured HA services (there is no point in fencing nodes without services).
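Something along these lines should pull out the relevant entries on each node (a sketch; adjust the time window to the timestamps from your status output):

# Run on node08, node09 and node10
journalctl -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm --since "2019-01-08 17:10" --until "2019-01-08 17:20"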
 
1.) I tested it by setting packet loss to 100% in my VMware test environment (a host-level equivalent is sketched below).
2.) Hm... I can't reproduce the random fencing now. Maybe I tested it wrong.
3.) No need to have a look at the logs, since this thread can be closed.
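For reference, roughly the same effect can be produced on a plain Linux host with netem (just a sketch; eth0 is a placeholder for the cluster-facing NIC):

# Drop all traffic on the cluster interface
tc qdisc add dev eth0 root netem loss 100%
# Restore connectivity afterwards
tc qdisc del dev eth0 root netem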

Nice to know: there is no point in fencing nodes without services.
 
