2 systems not in quorum

Zack Coffey · Dec 1, 2016

I have 2 Proxmox 4.3 systems setup and working for the most part. They used to have a quorum but now they seem not to. I just updated both and rebooted both systems and the webUI shows pve1 and pve2 nodes, but they seem to go in and out of being available to each other. One minute they will show both green, next minute which ever is the "other" one, will be a red circle. Wait another minute and green comes back, wait another minute and red comes back.

root@pve2:~# pvecm status
Quorum information
------------------
Date: Thu Dec 1 11:38:19 2016
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 2/3714332
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.235.128.56 (local)

root@pve1:~# pvecm status
Quorum information
------------------
Date: Thu Dec 1 11:38:54 2016
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/3714380
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.235.128.49 (local)

Zack Coffey · Dec 1, 2016

PVE1

root@pve1:~# service pve-cluster restart
root@pve1:~# service pvedaemon restart
root@pve1:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
1 1 pve1 (local)
root@pve1:~# pvecm status
Quorum information
------------------
Date: Thu Dec 1 11:44:34 2016
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1/3714900
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.235.128.49 (local)
0x00000002 1 10.235.128.56

Zack Coffey · Dec 1, 2016

PVE2

root@pve2:~# service pve-cluster restart
root@pve2:~# service pvedaemon restart
root@pve2:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
2 1 pve2 (local)
root@pve2:~# pvecm status
Quorum information
------------------
Date: Thu Dec 1 11:44:30 2016
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000002
Ring ID: 1/3714900
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.235.128.49
0x00000002 1 10.235.128.56 (local)

Zack Coffey · Dec 1, 2016

This URL seems empty yet referenced from several posts and other wiki pages.

https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quorum_and_cluster_issues

Zack Coffey · Dec 1, 2016

So it looks like multicast is the problem, again. It seems very flaky.

I have both of these hosts connected through 2 different switches for redundancy and congestion mitigation. Well it appears there's some loss of multicast in IT's core switches, so sometimes each host can see each other fine, rest of the time not.

Tried running some ompings and one time I get 19% loss, another time it's 83% loss. Maybe I'm just dumb, but multicast seems a poor way of talking directly between 2 systems.

"So just use unicast!" https://pve.proxmox.com/wiki/Multic....29_instead_of_multicast.2C_if_all_else_fails

Unicast documentation could be better...

Zack Coffey · Dec 1, 2016

So I added the unicast line to /etc/pve/corosync.conf, all services restarted ok. Each system can, again, see each other for a moment and then they can't. Here's a tail of syslog.

Dec 1 14:18:27 pve2 corosync[1850]: [TOTEM ] Retransmit List: 182 183 184
Dec 1 14:18:27 pve2 corosync[1850]: [TOTEM ] Retransmit List: 182 183 184
Dec 1 14:18:27 pve2 corosync[1850]: [TOTEM ] Retransmit List: 182 183 184
Dec 1 14:18:27 pve2 corosync[1850]: [TOTEM ] Retransmit List: 182 183 184
Dec 1 14:18:27 pve2 corosync[1850]: [TOTEM ] Retransmit List: 182 183 184
Dec 1 14:18:28 pve2 corosync[1850]: [TOTEM ] A processor failed, forming new configuration.
Dec 1 14:18:29 pve2 corosync[1850]: [TOTEM ] A new membership (10.235.128.56:3725524) was formed. Members left: 1
Dec 1 14:18:29 pve2 corosync[1850]: [TOTEM ] Failed to receive the leave message. failed: 1
Dec 1 14:18:29 pve2 pmxcfs[10855]: [dcdb] notice: members: 2/10855
Dec 1 14:18:29 pve2 pmxcfs[10855]: [status] notice: members: 2/10855
Dec 1 14:18:29 pve2 corosync[1850]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 1 14:18:29 pve2 corosync[1850]: [QUORUM] Members[1]: 2
Dec 1 14:18:29 pve2 corosync[1850]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 14:18:29 pve2 pmxcfs[10855]: [status] notice: node lost quorum
Dec 1 14:19:35 pve2 corosync[1850]: [TOTEM ] A new membership (10.235.128.56:3725528) was formed. Members
Dec 1 14:19:35 pve2 corosync[1850]: [QUORUM] Members[1]: 2
Dec 1 14:19:35 pve2 corosync[1850]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 14:19:36 pve2 corosync[1850]: [TOTEM ] A new membership (10.235.128.56:3725532) was formed. Members
Dec 1 14:19:36 pve2 corosync[1850]: [QUORUM] Members[1]: 2
Dec 1 14:19:36 pve2 corosync[1850]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 14:19:38 pve2 corosync[1850]: [TOTEM ] A new membership (10.235.128.56:3725536) was formed. Members
Dec 1 14:19:38 pve2 corosync[1850]: [QUORUM] Members[1]: 2
Dec 1 14:19:38 pve2 corosync[1850]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 14:19:39 pve2 corosync[1850]: [TOTEM ] A new membership (10.235.128.56:3725540) was formed. Members
Dec 1 14:19:39 pve2 corosync[1850]: [QUORUM] Members[1]: 2
Dec 1 14:19:39 pve2 corosync[1850]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 1 14:19:40 pve2 corosync[1850]: [TOTEM ] A new membership (10.235.128.56:3725544) was formed. Members
Dec 1 14:19:40 pve2 corosync[1850]: [QUORUM] Members[1]: 2
Dec 1 14:19:40 pve2 corosync[1850]: [MAIN ] Completed service synchronization, ready to provide service.

Zack Coffey · Dec 1, 2016

corosync.conf...

bindnetaddr.... that shouldn't be the same for both systems should it? If I change it on one, restart services, then the other host gets updated and has the same bindnetaddr which is wrong for itself?

Zack Coffey · Dec 1, 2016

root@pve1:/etc/pve# cat corosync.conf
logging {
debug: off
to_syslog: yes
}

nodelist {
node {
name: pve2
nodeid: 2
quorum_votes: 1
ring0_addr: pve2
}

node {
name: pve1
nodeid: 1
quorum_votes: 1
ring0_addr: pve1
}

}

quorum {
provider: corosync_votequorum
}

totem {
cluster_name: Colonel-Cluster
config_version: 2
ip_version: ipv4
secauth: on
version: 2
interface {
bindnetaddr: 10.235.128.56
ringnumber: 0
transport: udpu
}

}

mir · Dec 1, 2016

bindnetaddr should be the same on all nodes since bindnetaddr simply indicates which node is the current leader.

Zack Coffey · Dec 1, 2016

mir said:
bindnetaddr should be the same on all nodes since bindnetaddr simply indicates which node is the current leader.

Doesn't seem to matter what I use. I've tried one, the other and just the network.

According to this: https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_installation_terms.html

> As the same Corosync configuration will be used on all nodes, make sure to use a network address as bindnetaddr, not the address of a specific network interface.

So I've tried that and still see the same problems.

mir · Dec 1, 2016

Do you remember to increment config_version each time you make changes to corosync.conf?

Zack Coffey · Dec 1, 2016

Hmm, I'm guessing config_version is different than corosync.conf version entry?

https://linux.die.net/man/5/corosync.conf

This says "version" should only be "2". So I've left it at 2.

mir · Dec 2, 2016

The file both have a version and a config_version. Version means version of corosync which must be 2 while config_version must change each time the config file is changed.

Zack Coffey · Dec 5, 2016

That didn't seem to make a difference. Even if I don't do that, both sides get updated with the right one. But they still keep going in and out of sync.

Search

Search

2 systems not in quorum

Zack Coffey

New Member

Zack Coffey

New Member

Zack Coffey

New Member

Zack Coffey

New Member

Zack Coffey

New Member

Zack Coffey

New Member

Zack Coffey

New Member

Zack Coffey

New Member

mir

Famous Member

Zack Coffey

New Member

mir

Famous Member

Zack Coffey

New Member

mir

Famous Member

Zack Coffey

New Member