[SOLVED] Corosync/HA logic

mohnewald

Hello,

3 Node Test Cluster. Test Case: Cut off node10 (master) cluster network connection

ha-manager status on node08:
quorum OK
master node10 (old timestamp - dead?, Tue Jan 8 17:12:52 2019)
lrm node08 (active, Tue Jan 8 17:13:31 2019)
lrm node09 (idle, Tue Jan 8 17:13:34 2019)
lrm node10 (old timestamp - dead?, Tue Jan 8 17:12:59 2019)
service vm:100 (node08, started)

ha-manager status on node09:
quorum OK
master node10 (old timestamp - dead?, Tue Jan 8 17:12:52 2019)
lrm node08 (active, Tue Jan 8 17:13:41 2019)
lrm node09 (idle, Tue Jan 8 17:13:40 2019)
lrm node10 (old timestamp - dead?, Tue Jan 8 17:12:59 2019)
service vm:100 (node08, started)

ha-manager status on node10:
quorum No quorum on node 'node10'!
master node10 (old timestamp - dead?, Tue Jan 8 17:12:52 2019)
lrm node08 (old timestamp - dead?, Tue Jan 8 17:12:57 2019)
lrm node09 (old timestamp - dead?, Tue Jan 8 17:12:56 2019)
lrm node10 (old timestamp - dead?, Tue Jan 8 17:12:59 2019)
service vm:100 (node08, started)

=> expected behavior: Since node10 got cut off and lost quorum, it should fence itself
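As I understand it, the self-fencing would be driven by the node watchdog, which the HA services arm. A quick sanity check that the pieces are in place on each node (just a sketch):

systemctl status watchdog-mux           # multiplexes the soft-/hardware watchdog used for self-fencing
systemctl status pve-ha-crm pve-ha-lrm  # these arm the watchdog once they are active on the node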


BUT: node08 rebooted itself. Uptimes a few moments later:
---------------------
node08: 17:18:39 up 0 min, 0 users, load average: 1.77, 0.42, 0.14
node09: 17:18:39 up 24 min, 0 users, load average: 0.92, 0.91, 0.52
node10: 17:18:40 up 24 min, 0 users, load average: 0.26, 0.33, 0.34

Why did node08 reboot and not node10, which was completely out of quorum?


# pvecm status
Quorum information
------------------
Date: Tue Jan 8 17:29:08 2019
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1/5620
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.0.8 (local)
0x00000002 1 10.10.0.9
0x00000003 1 10.10.0.10
 
> 3 Node Test Cluster. Test Case: Cut off node10 (master) cluster network connection
How did you do this? ifdown (or similar) on interfaces is not a useful test, as corosync (the cluster communication stack) 2.4 still has some issues with that (it's not a real test for an outage anyway; nonetheless, corosync 3.x will have this fixed).
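If you want to simulate the outage from inside the node instead, dropping the corosync traffic with iptables comes closer to a real network failure than taking the interface down. Roughly (5404/5405 are the corosync default UDP ports; adjust to your setup):

# block cluster traffic on the node under test
iptables -A INPUT  -p udp --dport 5404:5405 -j DROP
iptables -A OUTPUT -p udp --dport 5404:5405 -j DROP
# remove the rules again after the test
iptables -D INPUT  -p udp --dport 5404:5405 -j DROP
iptables -D OUTPUT -p udp --dport 5404:5405 -j DROP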
> Why did node08 reboot and not node10, which was completely out of quorum?

Can you please post excerpts from the syslog/journal from all three nodes around the time this happened? Anything mentioning corosync, pmxcfs/pve-cluster and pve-ha-* is relevant.
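For example, something like this on each node (adjust the time window to when it happened):

journalctl -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm --since "2019-01-08 17:10" --until "2019-01-08 17:20"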
Otherwise it's hard to tell what the reason was. For now it looks like all nodes lost quorum, but only node08 got fenced, as it was the only node with configured HA services (no point in fencing nodes without services).
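You can check which resources are HA-managed, and where they currently run, with e.g.:

ha-manager config    # lists the configured HA resources (here only vm:100)
ha-manager status    # shows master, per-node LRM state and resource placement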
 
1.) I tested it by setting packet loss to 100% in my VMware test environment.
2.) Hm... I can't reproduce the random fencing now. Maybe I tested it wrong.
3.) No need to look at the logs, since this thread can be closed.

Nice to know: there is no point in fencing nodes without services.
 
