[SOLVED] Cluster split - no way to join again - Proxmox 5.1

massivescale

Renowned Member
We have a 4-node cluster: 3 nodes with actual VMs (pm3, pm4, pm5) and a 4th (pmtmp) that was just there for quorum (it used to be a 2+1 cluster originally).

After some update, pm4 and pm5 lost connectivity to pm3 and pmtmp, and the cluster split. The nodes couldn't ping each other; it turned out to be a pve-firewall problem, and after stopping pve-firewall, connectivity works again. Tested with ping, omping and passwordless SSH.
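For reference, the connectivity checks looked roughly like this (hostnames are the ones from this cluster; the omping flags are just one sensible choice, and the multicast test should run on all nodes at once):

Code:
# rule out the firewall temporarily
systemctl stop pve-firewall
# basic reachability
ping -c 3 pm3
# multicast check across all four nodes
omping -c 600 -i 1 -q pm3 pm4 pm5 pmtmp
# passwordless SSH check
ssh root@pm3 true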

The cluster, however, is still in a split state with "Activity blocked". If I'm reading tcpdump correctly, pm4 and pm5 make no attempts to contact pm3, and vice versa.

How can I make the cluster work again? Any way to force one of the hosts to try connecting to the others again?

Code:
root@pm3:~# pvecm status
Quorum information
------------------
Date:             Mon May 28 13:15:37 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000004
Ring ID:          3/519028
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      2
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 1.1.2.239
0x00000004          1 1.1.2.252 (local)
root@pm3:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         3          1 pmtmp
         4          1 pm3 (local)
root@pm3:~# pveversion
pve-manager/5.1-43/bdb08029 (running kernel: 4.13.13-5-pve)

root@pm4:~# LC_ALL=C LANG=C pvecm status
Quorum information
------------------
Date:             Mon May 28 14:17:18 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1/6048
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      2
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 1.1.2.228 (local)
0x00000002          1 1.1.2.240
root@pm4:~# LC_ALL=C LANG=C pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pm4 (local)
         2          1 pm5

root@pmtmp:~# LC_ALL=C LANG=C pvecm status
Quorum information
------------------
Date:             Mon May 28 13:16:54 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000003
Ring ID:          3/519152
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      2
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 1.1.2.239 (local)
0x00000004          1 1.1.2.252
root@pmtmp:~# LC_ALL=C LANG=C pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         3          1 pmtmp (local)
         4          1 pm3
root@pmtmp:~# LC_ALL=C LANG=C pveversion
pve-manager/5.1-41/0b958203 (running kernel: 4.13.13-4-pve)

root@pm5:~# LC_ALL=C LANG=C pvecm status
Quorum information
------------------
Date:             Mon May 28 14:17:40 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1/6048
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      2
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 1.1.2.228
0x00000002          1 1.1.2.240 (local)
root@pm5:~# LC_ALL=C LANG=C pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pm4
         2          1 pm5 (local)
 
Restart corosync and check the output in journal/syslog. It should form a new membership group.
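A minimal sketch of that, assuming the systemd journal:

Code:
systemctl restart corosync
journalctl -u corosync -f   # watch for a new membership being formed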
 
I guess HA is not active (else your nodes would have self-fenced by now)

If omping works across all 4 nodes but the cluster does not, try restarting corosync on those nodes with
Code:
systemctl restart corosync
 
Thanks. I tried it yesterday without success. I've just managed to regain quorum on the pm4/pm5 side of the split by executing

Code:
pvecm expected 2

Once pm4/pm5 had quorum back, I could delete the pmtmp and pm3 nodes. I haven't tried rejoining pm3 to the cluster yet; I'm waiting for the backups to complete first.
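For the record, pvecm expected 2 lowers the expected vote count so the two reachable nodes became quorate again. Deleting the dead nodes then works with pvecm delnode (node names from this thread):

Code:
pvecm delnode pmtmp
pvecm delnode pm3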
 
So in case someone needs this information:
  • pm3 had to be reinstalled to get rid of the old cluster settings
  • even after the reinstall, pm3 did not join the cluster
  • switching to unicast communication was the trick that finally allowed pm3 to join the cluster (see the config sketch after this list)
  • following the unicast instructions was straightforward on pm4/pm5, which were in the cluster and had read/write access to /etc/pve
  • to apply the instructions on pm3, I had to:
Code:
systemctl stop pve-cluster
pmxcfs -l    # start the cluster filesystem in local mode so /etc/pve becomes writable
Then I could edit /etc/pve/corosync.conf. I had to make sure that the config "version" on pm3 was the same as on the other two nodes; in general, copying the whole file over from the cluster is a good idea.
  • the fact that the IP address in totem.interface.bindnetaddr was an address of some random node in the cluster, not the local node, was pretty confusing: this is not an error. bindnetaddr identifies the network (corosync masks it with the interface's netmask), so any address on the cluster subnet works and the same value can be copied to every node.
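For illustration, a rough sketch of what the unicast corosync.conf ends up looking like on PVE 5.x. The node names, IPs and nodeids are the ones from this thread; cluster_name and config_version are placeholders you'd take from your own file, and pm3's nodeid may differ after the reinstall:

Code:
totem {
  # cluster_name and config_version are placeholders; keep your own values
  cluster_name: mycluster
  config_version: 15
  version: 2
  secauth: on
  ip_version: ipv4
  # unicast instead of the default multicast
  transport: udpu
  interface {
    ringnumber: 0
    # the network address, not the local node's address
    bindnetaddr: 1.1.2.0
  }
}

nodelist {
  node {
    name: pm4
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 1.1.2.228
  }
  node {
    name: pm5
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 1.1.2.240
  }
  node {
    name: pm3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 1.1.2.252
  }
}

quorum {
  provider: corosync_votequorum
}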
 
