[SOLVED] Members list out of Sync

hk135

Hi There

I am getting a "No Such Cluster node" error when I try to migrate to or from a node called pve3-dh4. I checked /etc/pve/.members on each node and the files are not in sync. One node has

{
  "nodename": "pve1-dh4",
  "version": 11,
  "cluster": { "name": "virtus-v4", "version": 6, "nodes": 6, "quorate": 1 },
  "nodelist": {
    "pve1-dh4": { "id": 2, "online": 1, "ip": "172.16.23.1"},
    "pve4-dh4": { "id": 1, "online": 1, "ip": "172.16.23.4"},
    "pve-bhf-dh4": { "id": 3, "online": 0},
    "pve-archive-dh4": { "id": 4, "online": 1, "ip": "172.16.23.100"},
    "pve-shareddb-dh4": { "id": 5, "online": 1, "ip": "172.16.23.101"},
    "pve3-dh4": { "id": 6, "online": 1, "ip": "172.16.23.3"}
  }
}

Another has

{
  "nodename": "pve-archive-dh4",
  "version": 7,
  "cluster": { "name": "virtus-v4", "version": 5, "nodes": 5, "quorate": 1 },
  "nodelist": {
    "pve1-dh4": { "id": 2, "online": 1, "ip": "172.16.23.1"},
    "pve4-dh4": { "id": 1, "online": 1, "ip": "172.16.23.4"},
    "pve-archive-dh4": { "id": 4, "online": 1, "ip": "172.16.23.100"},
    "pve-shareddb-dh4": { "id": 5, "online": 1, "ip": "172.16.23.101"},
    "pve-bhf-dh4": { "id": 3, "online": 0}
  }
}

and another has

{
  "nodename": "pve-shareddb-dh4",
  "version": 6,
  "cluster": { "name": "virtus-v4", "version": 5, "nodes": 5, "quorate": 1 },
  "nodelist": {
    "pve1-dh4": { "id": 2, "online": 1, "ip": "172.16.23.1"},
    "pve4-dh4": { "id": 1, "online": 1, "ip": "172.16.23.4"},
    "pve-archive-dh4": { "id": 4, "online": 1, "ip": "172.16.23.100"},
    "pve-shareddb-dh4": { "id": 5, "online": 1, "ip": "172.16.23.101"},
    "pve-bhf-dh4": { "id": 3, "online": 0}
  }
}

It seems that this list is not being kept in sync between the nodes. Is there anything I can do to fix this?

Thanks in advance.
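
For anyone comparing /etc/pve/.members by eye as in the post above, the check can be scripted. The following is only a minimal sketch: the function name `missing_nodes` and the trimmed sample strings are illustrative assumptions, not part of any Proxmox tooling, and it assumes the file contents have already been copied from each node.

```python
import json

def missing_nodes(members_by_host):
    """Map each host to the node names that other hosts list but it does not."""
    # Collect the set of node names each host sees in its "nodelist"
    views = {host: set(json.loads(text)["nodelist"])
             for host, text in members_by_host.items()}
    # Union of every node name seen by any host
    all_nodes = set().union(*views.values())
    return {host: sorted(all_nodes - seen)
            for host, seen in views.items() if all_nodes - seen}

# Trimmed copies of two of the snapshots posted above (only ids kept)
snapshots = {
    "pve1-dh4": '{"version": 11, "nodelist": {"pve1-dh4": {"id": 2}, '
                '"pve4-dh4": {"id": 1}, "pve-bhf-dh4": {"id": 3}, '
                '"pve-archive-dh4": {"id": 4}, "pve-shareddb-dh4": {"id": 5}, '
                '"pve3-dh4": {"id": 6}}}',
    "pve-archive-dh4": '{"version": 7, "nodelist": {"pve1-dh4": {"id": 2}, '
                       '"pve4-dh4": {"id": 1}, "pve-archive-dh4": {"id": 4}, '
                       '"pve-shareddb-dh4": {"id": 5}, "pve-bhf-dh4": {"id": 3}}}',
}

print(missing_nodes(snapshots))
```

With the two snapshots above, the result points at pve-archive-dh4 not knowing about pve3-dh4, which matches the migration error.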
 
Hi,

You have ended up in a corosync split brain for some reason. Have you had any failures lately, or made any cluster config changes?

Can you post the output of
Code:
pvecm status
from those nodes?

Also, which Proxmox VE version is this? Please post the output of
Code:
pveversion -v
 
Hi There

In order

root@pve1-dh4:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
2 1 pve1-dh4 (local)
6 1 pve3-dh4
1 1 pve4-dh4
4 1 pve-archive-dh4
5 1 pve-shareddb-dh4

root@pve1-dh4:~# pveversion
pve-manager/4.1-1/2f9650d4 (running kernel: 4.2.6-1-pve)

root@pve-archive-dh4:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
2 1 pve1-dh4
6 1 pve3-dh4.hv.precedenthost.local
1 1 pve4-dh4
4 1 pve-archive-dh4 (local)
5 1 pve-shareddb-dh4

root@pve-archive-dh4:~# pveversion
pve-manager/4.1-1/2f9650d4 (running kernel: 4.2.6-1-pve)

root@pve-shareddb-dh4:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
2 1 pve1-dh4
6 1 pve3-dh4.hv.precedenthost.local
1 1 pve4-dh4
4 1 pve-archive-dh4
5 1 pve-shareddb-dh4 (local)

root@pve-shareddb-dh4:~# pveversion
pve-manager/4.1-1/2f9650d4 (running kernel: 4.2.6-1-pve)

pve-cluster fails sometimes. I usually stop it on all nodes and then start it again, and it's okay afterwards, but there have been no major outages. /etc/pve/corosync.conf seems to show all the nodes okay. It's all Proxmox 4.1.

Thanks
 
I would have preferred a pvecm status output :)

"pve-cluster fails sometimes"
That does not sound good. Do you know the reason? What is in the logs when it fails? Or is the network that carries the cluster communication under heavy load?

Do the corosync.conf files differ between each node?
 
Sorry! Here are the pvecm status outputs

root@pve1-dh4:~# pvecm status
Quorum information
------------------
Date: Fri May 20 15:59:54 2016
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000002
Ring ID: 728
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 5
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.16.23.1 (local)
0x00000006 1 172.16.23.3
0x00000001 1 172.16.23.4
0x00000004 1 172.16.23.100
0x00000005 1 172.16.23.101

root@pve-archive-dh4:~# pvecm status
Quorum information
------------------
Date: Fri May 20 16:00:17 2016
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000004
Ring ID: 728
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 5
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.16.23.1
0x00000006 1 172.16.23.3
0x00000001 1 172.16.23.4
0x00000004 1 172.16.23.100 (local)
0x00000005 1 172.16.23.101

root@pve-shareddb-dh4:~# pvecm status
Quorum information
------------------
Date: Fri May 20 16:01:16 2016
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000005
Ring ID: 728
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 5
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.16.23.1
0x00000006 1 172.16.23.3
0x00000001 1 172.16.23.4
0x00000004 1 172.16.23.100
0x00000005 1 172.16.23.101 (local)

I haven't seen a correlation between network load and pve-cluster losing sync. The logs mostly show retransmit messages (I don't have an example at the moment, I'm afraid). We have other applications using corosync (with different multicast addresses) which don't seem to have the same issue.

The corosync.conf files in /etc/pve are identical across nodes.

Thanks
 
What's with "pve-bhf-dh4"? Is it intentionally offline?

This output looks good, did you just add "pve3-dh4"?

Can you also post the pvecm status output from it? It's a little strange if the corosync.conf on ALL nodes is exactly the same. You could try opening it, increasing the version number in the totem section, and saving the file; this should trigger a resync.
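
To illustrate the version bump suggested above: in a Proxmox corosync.conf the counter in the totem section is normally the `config_version` entry (not the `version: 2` protocol line), so that is what this sketch assumes. The helper name `bump_config_version` and the sample text are hypothetical; the real file has more entries, and the value 6 is only an example. It works on a string, so nothing is written to /etc/pve.

```python
import re

def bump_config_version(conf_text):
    """Return conf_text with the first config_version value increased by one."""
    pattern = re.compile(r"^(\s*config_version:\s*)(\d+)", re.MULTILINE)
    match = pattern.search(conf_text)
    if match is None:
        raise ValueError("no config_version entry found")
    new_value = int(match.group(2)) + 1
    # Replace only the digits, leaving the rest of the file untouched
    return conf_text[:match.start(2)] + str(new_value) + conf_text[match.end(2):]

# Illustrative excerpt only; values are assumptions, not taken from the cluster
sample = """totem {
  cluster_name: virtus-v4
  config_version: 6
  version: 2
}
"""

print(bump_config_version(sample))
```

Editing by hand in an editor does the same thing; the point is simply that only the `config_version` number changes.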
 
pve-bhf-dh4 is non-existent at the moment due to a few logistical issues, there will be one eventually!

pve3-dh4 has been online for about a month, but we have only just started populating it. It was previously a Proxmox 3 box that was working okay.

root@pve3-dh4:~# pvecm status
Quorum information
------------------
Date: Fri May 20 16:31:34 2016
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000006
Ring ID: 728
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 5
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.16.23.1
0x00000006 1 172.16.23.3 (local)
0x00000001 1 172.16.23.4
0x00000004 1 172.16.23.100
0x00000005 1 172.16.23.101
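
As a side note on the numbers in these outputs: with one vote per node, votequorum's default threshold is a simple majority of the expected votes. The sketch below reproduces that arithmetic; the function names are illustrative, and it ignores special votequorum options such as two_node or last_man_standing.

```python
def quorum_threshold(expected_votes):
    """Simple majority: more than half of the expected votes."""
    return expected_votes // 2 + 1

def is_quorate(total_votes, expected_votes):
    """True when the votes present reach the majority threshold."""
    return total_votes >= quorum_threshold(expected_votes)

# Values from the pvecm status outputs: Expected votes 6, Total votes 5
print(quorum_threshold(6))   # 4, matching "Quorum: 4"
print(is_quorate(5, 6))      # True, matching "Quorate: Yes"
```

So the missing sixth vote (pve-bhf-dh4) does not cost the cluster quorum, which is why every node reports Quorate: Yes.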
 
Hi All

Just to finish off this post: I resolved the issue by restarting corosync on all the nodes except pve1-dh4, and it all started working again (members lists updating and so on).

Thanks for the help with this. Much appreciated.
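
For readers who want to repeat the fix, here is a rough sketch of the restart loop. The post does not say which command was used; `systemctl restart corosync` over root SSH is an assumption (Proxmox VE 4.x is systemd-based), and the helper names are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess

def restart_commands(nodes, skip):
    """Build one ssh command per node, leaving out the nodes in skip."""
    return [["ssh", f"root@{node}", "systemctl", "restart", "corosync"]
            for node in nodes if node not in skip]

def run_all(commands, dry_run=True):
    """Print the commands, or execute them one at a time when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)

# Node names as shown by pvecm nodes in this thread
nodes = ["pve1-dh4", "pve3-dh4", "pve4-dh4", "pve-archive-dh4", "pve-shareddb-dh4"]
commands = restart_commands(nodes, skip={"pve1-dh4"})
run_all(commands)  # dry run: prints the four ssh commands
```

Restarting one node at a time and checking pvecm status in between is the cautious way to do this on a live cluster.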
 
