[SOLVED] Cluster split - no way to join again - Proxmox 5.1

massivescale

Renowned Member
We have a 4-node cluster: 3 nodes with actual VMs (pm3, pm4, pm5) and a 4th (pmtmp) that was just there for quorum (it used to be a 2+1 cluster originally).

After some update, pm4 and pm5 lost connectivity to pm3 and pmtmp, and the cluster split. The nodes couldn't ping each other; it turned out to be a pve-firewall problem, and after stopping pve-firewall, connectivity works again. Tested with ping, omping and passwordless SSH.
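For reference, the connectivity checks looked roughly like this (hostnames are the ones from this cluster; the omping flags are just one sensible choice, and the multicast test should run on all nodes at once):

Code:
# rule out the firewall temporarily
systemctl stop pve-firewall
# basic reachability
ping -c 3 pm3
# multicast check across all four nodes
omping -c 600 -i 1 -q pm3 pm4 pm5 pmtmp
# passwordless SSH check
ssh root@pm3 true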

The cluster, however, is still in a split state with "Activity blocked". If I'm reading tcpdump correctly, pm4 and pm5 make no attempts to contact pm3, and vice versa.

How can I make the cluster work again? Any way to force one of the hosts to try connecting to the others again?

Code:
root@pm3:~# pvecm status
Quorum information
------------------
Date:             Mon May 28 13:15:37 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000004
Ring ID:          3/519028
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      2
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 1.1.2.239
0x00000004          1 1.1.2.252 (local)
root@pm3:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         3          1 pmtmp
         4          1 pm3 (local)
root@pm3:~# pveversion
pve-manager/5.1-43/bdb08029 (running kernel: 4.13.13-5-pve)

root@pm4:~# LC_ALL=C LANG=C pvecm status
Quorum information
------------------
Date:             Mon May 28 14:17:18 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1/6048
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      2
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 1.1.2.228 (local)
0x00000002          1 1.1.2.240
root@pm4:~# LC_ALL=C LANG=C pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pm4 (local)
         2          1 pm5

root@pmtmp:~# LC_ALL=C LANG=C pvecm status
Quorum information
------------------
Date:             Mon May 28 13:16:54 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000003
Ring ID:          3/519152
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      2
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 1.1.2.239 (local)
0x00000004          1 1.1.2.252
root@pmtmp:~# LC_ALL=C LANG=C pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         3          1 pmtmp (local)
         4          1 pm3
root@pmtmp:~# LC_ALL=C LANG=C pveversion
pve-manager/5.1-41/0b958203 (running kernel: 4.13.13-4-pve)

root@pm5:~# LC_ALL=C LANG=C pvecm status
Quorum information
------------------
Date:             Mon May 28 14:17:40 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1/6048
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      2
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 1.1.2.228
0x00000002          1 1.1.2.240 (local)
root@pm5:~# LC_ALL=C LANG=C pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 pm4
         2          1 pm5 (local)
 
Restart corosync and check the output in journal/syslog. It should form a new membership group.
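A minimal sketch of that, assuming the systemd journal:

Code:
systemctl restart corosync
journalctl -u corosync -f   # watch for a new membership being formed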
 
I guess HA is not active (else your nodes would have self-fenced by now)

If omping works across all 4 nodes but the cluster does not, try restarting corosync on those nodes with
Code:
systemctl restart corosync
 
Thanks. I tried it yesterday without success. I've just managed to regain quorum on the pm4/pm5 side of the split by executing

Code:
pvecm expected 2

Once pm4/pm5 had quorum back, I could delete the pmtmp and pm3 nodes. I haven't tried rejoining pm3 to the cluster yet; I'm waiting for the backups to complete first.
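For the record, pvecm expected 2 lowers the expected vote count so the two reachable nodes became quorate again. Deleting the dead nodes then works with pvecm delnode (node names from this thread):

Code:
pvecm delnode pmtmp
pvecm delnode pm3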
 
So in case someone needs this information:
  • pm3 had to be reinstalled to get rid of the old cluster settings
  • even after the reinstall, pm3 did not join the cluster
  • switching to unicast communication was the trick that finally allowed pm3 to join the cluster (see the config sketch after this list)
  • following the unicast instructions was straightforward on pm4/pm5, which were in the cluster and had read/write access to /etc/pve
  • to apply the instructions on pm3, I had to:
Code:
systemctl stop pve-cluster
pmxcfs -l    # start the cluster filesystem in local mode so /etc/pve becomes writable
Then I could edit /etc/pve/corosync.conf. I had to make sure that the config "version" on pm3 was the same as on the other two nodes; in general, copying the whole file over from the cluster is a good idea.
  • the fact that the IP address in totem.interface.bindnetaddr was an address of some random node in the cluster, not the local node, was pretty confusing: this is not an error. bindnetaddr identifies the network (corosync masks it with the interface's netmask), so any address on the cluster subnet works and the same value can be copied to every node.
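For illustration, a rough sketch of what the unicast corosync.conf ends up looking like on PVE 5.x. The node names, IPs and nodeids are the ones from this thread; cluster_name and config_version are placeholders you'd take from your own file, and pm3's nodeid may differ after the reinstall:

Code:
totem {
  # cluster_name and config_version are placeholders; keep your own values
  cluster_name: mycluster
  config_version: 15
  version: 2
  secauth: on
  ip_version: ipv4
  # unicast instead of the default multicast
  transport: udpu
  interface {
    ringnumber: 0
    # the network address, not the local node's address
    bindnetaddr: 1.1.2.0
  }
}

nodelist {
  node {
    name: pm4
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 1.1.2.228
  }
  node {
    name: pm5
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 1.1.2.240
  }
  node {
    name: pm3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 1.1.2.252
  }
}

quorum {
  provider: corosync_votequorum
}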
 
