[SOLVED] Members list out of Sync

hk135 · May 20, 2016

Hi There

I am getting a "No Such Cluster node" error when I try to migrate to or from a node called pve3-dh4. I checked /etc/pve/.members on each node and they are not in sync. One node has

{
"nodename": "pve1-dh4",
"version": 11,
"cluster": { "name": "virtus-v4", "version": 6, "nodes": 6, "quorate": 1 },
"nodelist": {
"pve1-dh4": { "id": 2, "online": 1, "ip": "172.16.23.1"},
"pve4-dh4": { "id": 1, "online": 1, "ip": "172.16.23.4"},
"pve-bhf-dh4": { "id": 3, "online": 0},
"pve-archive-dh4": { "id": 4, "online": 1, "ip": "172.16.23.100"},
"pve-shareddb-dh4": { "id": 5, "online": 1, "ip": "172.16.23.101"},
"pve3-dh4": { "id": 6, "online": 1, "ip": "172.16.23.3"}
}
}

Another has

{
"nodename": "pve-archive-dh4",
"version": 7,
"cluster": { "name": "virtus-v4", "version": 5, "nodes": 5, "quorate": 1 },
"nodelist": {
"pve1-dh4": { "id": 2, "online": 1, "ip": "172.16.23.1"},
"pve4-dh4": { "id": 1, "online": 1, "ip": "172.16.23.4"},
"pve-archive-dh4": { "id": 4, "online": 1, "ip": "172.16.23.100"},
"pve-shareddb-dh4": { "id": 5, "online": 1, "ip": "172.16.23.101"},
"pve-bhf-dh4": { "id": 3, "online": 0}
}
}

and another has

{
"nodename": "pve-shareddb-dh4",
"version": 6,
"cluster": { "name": "virtus-v4", "version": 5, "nodes": 5, "quorate": 1 },
"nodelist": {
"pve1-dh4": { "id": 2, "online": 1, "ip": "172.16.23.1"},
"pve4-dh4": { "id": 1, "online": 1, "ip": "172.16.23.4"},
"pve-archive-dh4": { "id": 4, "online": 1, "ip": "172.16.23.100"},
"pve-shareddb-dh4": { "id": 5, "online": 1, "ip": "172.16.23.101"},
"pve-bhf-dh4": { "id": 3, "online": 0}
}
}

It seems that this list is not being kept in sync between the nodes. Is there anything I can do to fix this?

Thanks in advance.
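
For anyone who wants to compare the .members files mechanically rather than by eye, here is a minimal sketch. It assumes the JSON from each node has already been collected by hand (for example by copying /etc/pve/.members off each node); the function name compare_members and the trimmed sample data are illustrative and not part of Proxmox.

```python
import json

def compare_members(members_by_node):
    """Compare parsed .members documents collected from several nodes.

    members_by_node maps the reporting node's name to its parsed JSON.
    For each reporter, return its cluster config version and the node
    names that other reporters list but this one does not.
    """
    all_names = set()
    for doc in members_by_node.values():
        all_names.update(doc["nodelist"])

    report = {}
    for reporter, doc in members_by_node.items():
        report[reporter] = {
            "cluster_version": doc["cluster"]["version"],
            "missing": sorted(all_names - set(doc["nodelist"])),
        }
    return report

# Trimmed copies of two of the files posted above (only the fields used here).
pve1 = json.loads("""
{ "nodename": "pve1-dh4",
  "cluster": { "name": "virtus-v4", "version": 6, "nodes": 6, "quorate": 1 },
  "nodelist": { "pve1-dh4": {"id": 2}, "pve4-dh4": {"id": 1},
                "pve-bhf-dh4": {"id": 3}, "pve-archive-dh4": {"id": 4},
                "pve-shareddb-dh4": {"id": 5}, "pve3-dh4": {"id": 6} } }
""")
archive = json.loads("""
{ "nodename": "pve-archive-dh4",
  "cluster": { "name": "virtus-v4", "version": 5, "nodes": 5, "quorate": 1 },
  "nodelist": { "pve1-dh4": {"id": 2}, "pve4-dh4": {"id": 1},
                "pve-archive-dh4": {"id": 4}, "pve-shareddb-dh4": {"id": 5},
                "pve-bhf-dh4": {"id": 3} } }
""")

report = compare_members({"pve1-dh4": pve1, "pve-archive-dh4": archive})
for reporter, info in report.items():
    print(reporter, info)
```

With the data above, pve-archive-dh4 reports an older cluster version and is missing pve3-dh4 from its list, which matches the migration error.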

t.lamprecht · May 20, 2016

Hi,

You seem to have ended up in a corosync split brain for some reason. Did you have any failures lately, or any cluster config changes?

Can you post the output of

Code:

pvecm status

from each of those nodes?

Also, which Proxmox VE version is this? Please post the output of

Code:

pveversion -v

hk135 · May 20, 2016

Hi There

In order

root@pve1-dh4:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
2 1 pve1-dh4 (local)
6 1 pve3-dh4
1 1 pve4-dh4
4 1 pve-archive-dh4
5 1 pve-shareddb-dh4

root@pve1-dh4:~# pveversion
pve-manager/4.1-1/2f9650d4 (running kernel: 4.2.6-1-pve)

root@pve-archive-dh4:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
2 1 pve1-dh4
6 1 pve3-dh4.hv.precedenthost.local
1 1 pve4-dh4
4 1 pve-archive-dh4 (local)
5 1 pve-shareddb-dh4

root@pve-archive-dh4:~# pveversion
pve-manager/4.1-1/2f9650d4 (running kernel: 4.2.6-1-pve)

root@pve-shareddb-dh4:~# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
2 1 pve1-dh4
6 1 pve3-dh4.hv.precedenthost.local
1 1 pve4-dh4
4 1 pve-archive-dh4
5 1 pve-shareddb-dh4 (local)

root@pve-shareddb-dh4:~# pveversion
pve-manager/4.1-1/2f9650d4 (running kernel: 4.2.6-1-pve)

pve-cluster fails sometimes. I usually stop it on all nodes and then start it again and it's okay, but there have been no major outages. /etc/pve/corosync.conf seems to show all the nodes okay. It's all Proxmox 4.1.

Thanks
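
One detail in the pvecm nodes output above is that node id 6 shows up as pve3-dh4 on pve1-dh4 but as pve3-dh4.hv.precedenthost.local on the other two nodes. The thread does not establish whether this is related to the problem. A rough sketch for spotting such differences automatically follows; the helper names parse_pvecm_nodes and name_mismatches are made up for illustration, and the parser only assumes the table layout shown in the pasted output.

```python
def parse_pvecm_nodes(text):
    """Return {nodeid: name} from the 'Membership information' table."""
    names = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a numeric node id followed by the vote count.
        if len(parts) >= 3 and parts[0].isdigit() and parts[1].isdigit():
            names[int(parts[0])] = parts[2]
    return names

def name_mismatches(views):
    """views maps reporting node -> {nodeid: name}.

    Return the node ids that appear under more than one name.
    """
    seen = {}
    for mapping in views.values():
        for nodeid, name in mapping.items():
            seen.setdefault(nodeid, set()).add(name)
    return {nodeid: sorted(n) for nodeid, n in seen.items() if len(n) > 1}

# Output copied from two of the nodes above.
pve1_output = """
Membership information
----------------------
Nodeid Votes Name
2 1 pve1-dh4 (local)
6 1 pve3-dh4
1 1 pve4-dh4
4 1 pve-archive-dh4
5 1 pve-shareddb-dh4
"""
archive_output = """
Membership information
----------------------
Nodeid Votes Name
2 1 pve1-dh4
6 1 pve3-dh4.hv.precedenthost.local
1 1 pve4-dh4
4 1 pve-archive-dh4 (local)
5 1 pve-shareddb-dh4
"""

views = {
    "pve1-dh4": parse_pvecm_nodes(pve1_output),
    "pve-archive-dh4": parse_pvecm_nodes(archive_output),
}
mismatches = name_mismatches(views)
print(mismatches)
```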

t.lamprecht · May 20, 2016

I would have preferred a pvecm status output.

hk135 said:
pve-cluster fails sometimes

That doesn't sound good. Do you know of any reason for it? What's in the logs when it fails? Or is the network that carries the cluster communication under heavy load?

Do the corosync.conf files differ between each node?

hk135 · May 20, 2016

Sorry! Here are the pvecm status outputs

root@pve1-dh4:~# pvecm status
Quorum information
------------------
Date: Fri May 20 15:59:54 2016
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000002
Ring ID: 728
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 5
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.16.23.1 (local)
0x00000006 1 172.16.23.3
0x00000001 1 172.16.23.4
0x00000004 1 172.16.23.100
0x00000005 1 172.16.23.101

root@pve-archive-dh4:~# pvecm status
Quorum information
------------------
Date: Fri May 20 16:00:17 2016
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000004
Ring ID: 728
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 5
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.16.23.1
0x00000006 1 172.16.23.3
0x00000001 1 172.16.23.4
0x00000004 1 172.16.23.100 (local)
0x00000005 1 172.16.23.101

root@pve-shareddb-dh4:~# pvecm status
Quorum information
------------------
Date: Fri May 20 16:01:16 2016
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000005
Ring ID: 728
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 5
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.16.23.1
0x00000006 1 172.16.23.3
0x00000001 1 172.16.23.4
0x00000004 1 172.16.23.100
0x00000005 1 172.16.23.101 (local)

I haven't seen a correlation between network load and pve-cluster losing sync, and the logs mostly show retransmit messages (I don't have an example at the moment, I'm afraid). We have other applications using corosync running (with different multicast addresses) which don't seem to have the same issue.

The corosync.conf in /etc/pve are identical across nodes.

Thanks
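
As a side note on reading the Votequorum block above: with the default simple-majority rule, the quorum threshold is more than half of the expected votes. The sketch below reproduces the numbers in the output under that assumption; the function names are illustrative only.

```python
def quorum_threshold(expected_votes):
    """Simple majority: more than half of the expected votes."""
    return expected_votes // 2 + 1

def is_quorate(total_votes, expected_votes):
    """True when the votes present reach the majority threshold."""
    return total_votes >= quorum_threshold(expected_votes)

# Values from the pvecm status output above:
# Expected votes: 6, Total votes: 5 (pve-bhf-dh4 is absent).
threshold = quorum_threshold(6)
quorate = is_quorate(5, 6)
print(threshold, quorate)
```

This is why all nodes report "Quorum: 4" and "Quorate: Yes" even though one of the six configured nodes is absent.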

t.lamprecht · May 20, 2016

What's up with "pve-bhf-dh4"? Is it offline on purpose?

This output looks good. Did you just add "pve3-dh4"?

Can you post the pvecm status output from it as well? It's a little strange if the corosync.conf files on ALL nodes are exactly the same. You could try opening the file, increasing the version count in the totem section, and saving it; this should trigger a resync.
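
To illustrate the version bump suggested above: in corosync.conf the counter in the totem section is the config_version key. The sketch below only transforms text and writes nothing to disk. The sample file content is hypothetical (the real corosync.conf was never posted in this thread), and bump_config_version is a made-up helper name.

```python
import re

def bump_config_version(conf_text):
    """Return conf_text with the totem config_version increased by one."""
    pattern = re.compile(
        r"^([ \t]*config_version[ \t]*:[ \t]*)(\d+)[ \t]*$", re.MULTILINE
    )
    match = pattern.search(conf_text)
    if match is None:
        raise ValueError("no config_version line found")
    new_value = int(match.group(2)) + 1
    # Rebuild the text around the matched line, keeping its indentation.
    return (
        conf_text[: match.start()]
        + match.group(1)
        + str(new_value)
        + conf_text[match.end():]
    )

# Hypothetical excerpt, for illustration only.
sample = """totem {
  cluster_name: virtus-v4
  config_version: 6
  version: 2
}
"""

bumped = bump_config_version(sample)
print(bumped)
```

The plain "version: 2" line is the totem protocol version and is deliberately left alone; only config_version changes.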

hk135 · May 20, 2016

pve-bhf-dh4 is non-existent at the moment due to a few logistical issues, there will be one eventually!

pve3-dh4 has been online for about a month, but we only just started populating it. It was previously a Proxmox 3 box that was working okay.

root@pve3-dh4:~# pvecm status
Quorum information
------------------
Date: Fri May 20 16:31:34 2016
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 0x00000006
Ring ID: 728
Quorate: Yes

Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 5
Quorum: 4
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.16.23.1
0x00000006 1 172.16.23.3 (local)
0x00000001 1 172.16.23.4
0x00000004 1 172.16.23.100
0x00000005 1 172.16.23.101

hk135 · May 23, 2016

Hi All

Just to finish off this post, I resolved the issue by restarting corosync on all the nodes but pve1-dh4 and it all started working again (members lists updating and whatnot).

Thanks for the help with this. Much appreciated.
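
The post above does not say exactly how corosync was restarted. As a sketch under assumptions (root SSH access between nodes and a systemd unit named corosync), the following builds the command lines for every node except pve1-dh4 without running anything. The helper name restart_commands is illustrative.

```python
def restart_commands(nodes, skip, service="corosync"):
    """Build one ssh command per node, leaving out the nodes in skip.

    Nothing is executed here; the caller decides whether and how to
    run the returned commands.
    """
    return [
        ["ssh", f"root@{node}", "systemctl", "restart", service]
        for node in nodes
        if node not in skip
    ]

# Node names as they appear in the thread; pve-bhf-dh4 does not exist yet.
nodes = ["pve1-dh4", "pve3-dh4", "pve4-dh4", "pve-archive-dh4", "pve-shareddb-dh4"]
commands = restart_commands(nodes, skip={"pve1-dh4"})
for cmd in commands:
    print(" ".join(cmd))
```

Printing the commands first and running them one node at a time keeps the rest of the cluster quorate while each node rejoins.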
