2 systems not in quorum

Zack Coffey

I have two Proxmox 4.3 systems set up and working for the most part. They used to have quorum but now they don't. I just updated and rebooted both systems; the web UI shows the pve1 and pve2 nodes, but they seem to go in and out of being available to each other. One minute both show green, the next minute whichever one is the "other" node shows a red circle. Wait another minute and the green comes back, wait another minute and the red comes back.
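To watch it flap outside the web UI I just keep re-running pvecm status and tailing the corosync log, roughly:

root@pve1:~# watch -n 2 pvecm status
root@pve1:~# journalctl -u corosync -f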

root@pve2:~# pvecm status
Quorum information
------------------
Date: Thu Dec 1 11:38:19 2016
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 2/3714332
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.235.128.56 (local)




root@pve1:~# pvecm status
Quorum information
------------------
Date: Thu Dec 1 11:38:54 2016
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/3714380
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.235.128.49 (local)
 
PVE1

root@pve1:~# service pve-cluster restart
root@pve1:~# service pvedaemon restart
root@pve1:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
1 1 pve1 (local)
root@pve1:~# pvecm status
Quorum information
------------------
Date: Thu Dec 1 11:44:34 2016
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1/3714900
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.235.128.49 (local)
0x00000002 1 10.235.128.56
 
PVE2

root@pve2:~# service pve-cluster restart
root@pve2:~# service pvedaemon restart
root@pve2:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
2 1 pve2 (local)
root@pve2:~# pvecm status
Quorum information
------------------
Date: Thu Dec 1 11:44:30 2016
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000002
Ring ID: 1/3714900
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.235.128.49
0x00000002 1 10.235.128.56 (local)
 
So it looks like multicast is the problem again. It seems very flaky.

I have both of these hosts connected through two different switches for redundancy and congestion mitigation. It appears there's some loss of multicast in IT's core switches, so sometimes the hosts can see each other fine and the rest of the time they can't.

I tried running some omping tests; one time I get 19% loss, another time it's 83%. Maybe I'm just dumb, but multicast seems like a poor way for two systems to talk directly to each other.
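For reference, the omping runs were more or less the invocation from the multicast wiki page, started on both nodes at the same time (hostnames are ours):

root@pve1:~# omping -c 600 -i 1 -q pve1 pve2
root@pve2:~# omping -c 600 -i 1 -q pve1 pve2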

"So just use unicast!" https://pve.proxmox.com/wiki/Multic....29_instead_of_multicast.2C_if_all_else_fails

Unicast documentation could be better...
 
So I added the unicast (udpu) transport line to /etc/pve/corosync.conf and all services restarted OK. Each system can, again, see the other for a moment and then it can't. Here's a tail of syslog:

Dec 1 14:18:27 pve2 corosync[1850]: [TOTEM ] Retransmit List: 182 183 184
Dec 1 14:18:27 pve2 corosync[1850]: [TOTEM ] Retransmit List: 182 183 184
Dec 1 14:18:27 pve2 corosync[1850]: [TOTEM ] Retransmit List: 182 183 184
Dec 1 14:18:27 pve2 corosync[1850]: [TOTEM ] Retransmit List: 182 183 184
Dec 1 14:18:27 pve2 corosync[1850]: [TOTEM ] Retransmit List: 182 183 184
Dec 1 14:18:28 pve2 corosync[1850]: [TOTEM ] A processor failed, forming new configuration.
Dec 1 14:18:29 pve2 corosync[1850]: [TOTEM ] A new membership (10.235.128.56:3725524) was formed. Members left: 1
Dec 1 14:18:29 pve2 corosync[1850]: [TOTEM ] Failed to receive the leave message. failed: 1
Dec 1 14:18:29 pve2 pmxcfs[10855]: [dcdb] notice: members: 2/10855
Dec 1 14:18:29 pve2 pmxcfs[10855]: [status] notice: members: 2/10855
Dec 1 14:18:29 pve2 corosync[1850]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 1 14:18:29 pve2 corosync[1850]: [QUORUM] Members[1]: 2
Dec 1 14:18:29 pve2 corosync[1850]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 14:18:29 pve2 pmxcfs[10855]: [status] notice: node lost quorum
Dec 1 14:19:35 pve2 corosync[1850]: [TOTEM ] A new membership (10.235.128.56:3725528) was formed. Members
Dec 1 14:19:35 pve2 corosync[1850]: [QUORUM] Members[1]: 2
Dec 1 14:19:35 pve2 corosync[1850]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 14:19:36 pve2 corosync[1850]: [TOTEM ] A new membership (10.235.128.56:3725532) was formed. Members
Dec 1 14:19:36 pve2 corosync[1850]: [QUORUM] Members[1]: 2
Dec 1 14:19:36 pve2 corosync[1850]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 14:19:38 pve2 corosync[1850]: [TOTEM ] A new membership (10.235.128.56:3725536) was formed. Members
Dec 1 14:19:38 pve2 corosync[1850]: [QUORUM] Members[1]: 2
Dec 1 14:19:38 pve2 corosync[1850]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 14:19:39 pve2 corosync[1850]: [TOTEM ] A new membership (10.235.128.56:3725540) was formed. Members
Dec 1 14:19:39 pve2 corosync[1850]: [QUORUM] Members[1]: 2
Dec 1 14:19:39 pve2 corosync[1850]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 14:19:40 pve2 corosync[1850]: [TOTEM ] A new membership (10.235.128.56:3725544) was formed. Members
Dec 1 14:19:40 pve2 corosync[1850]: [QUORUM] Members[1]: 2
Dec 1 14:19:40 pve2 corosync[1850]: [MAIN ] Completed service synchronization, ready to provide service.
 
corosync.conf...

bindnetaddr... that shouldn't be the same for both systems, should it? If I change it on one node and restart services, the other host gets the updated file and ends up with a bindnetaddr that's wrong for itself, right?
 
root@pve1:/etc/pve# cat corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: pve2
  }

  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: pve1
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Colonel-Cluster
  config_version: 2
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.235.128.56
    ringnumber: 0
    transport: udpu
  }
}
 
bindnetaddr should be the same on all nodes: since the same corosync.conf is shared across the cluster, it is meant to be the network (subnet) address corosync binds to, not the address of a specific node.
It doesn't seem to matter what I use. I've tried one node's address, the other's, and just the network address.

According to this: https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_installation_terms.html

> As the same Corosync configuration will be used on all nodes, make sure to use a network address as bindnetaddr, not the address of a specific network interface.

So I've tried that and still see the same problems.
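i.e. with our addresses, and assuming the cluster network is a /24, the interface section ended up looking something like this (10.235.128.0 being the network address rather than either node's own IP):

interface {
    # network (subnet) address, not a node IP; the /24 is an assumption here
    bindnetaddr: 10.235.128.0
    ringnumber: 0
    transport: udpu
}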
 
The file has both a version and a config_version. version is the version of the totem configuration format, which currently must be 2, while config_version must be incremented every time the config file is changed.
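So, for example, after an edit the totem section would be bumped along these lines:

totem {
    cluster_name: Colonel-Cluster
    # bumped from 2 to 3 so the other node picks up the edited file
    config_version: 3
    # ... rest of the totem section unchanged ...
}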
 
That didn't seem to make a difference. Even if I don't bump it myself, both sides end up with the right config anyway. But they still keep going in and out of sync.
 
