Replacing PVE node in cluster, new node loses quorum

mlanner · Mar 28, 2019

Hi,

We're having some problems replacing a PVE node in a cluster. These are the steps we've taken so far:

Turn off pve03 (out of 3).
From pve01, remove pve03 from the cluster with: pvecm delnode pve03
Unrack the hardware.
Mount a new server to become the new pve03.
Perform a clean install of PVE on the new server with the same IP and hostname as the old node, pve03.
Upgrade new pve03 node.
From pve03, add it to cluster with pvecm add pve01 (or with IP address of pve01).
Done; pve03 is now in the cluster.
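
For reference, the two cluster commands from the steps above can be sketched as a small dry-run script. The `run`, `remove_old_node` and `join_cluster` helpers are made-up names for illustration, and the node names are simply the ones from this post. By default it only prints what it would execute; the first command belongs on pve01 and the second on the freshly installed pve03.

```shell
#!/bin/sh
# Sketch of the cluster commands from the steps above (hypothetical helpers).
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run on a remaining cluster member (pve01) after the old node is powered off.
remove_old_node() {
    run pvecm delnode "$1"
}

# Run on the newly installed and upgraded replacement node.
join_cluster() {
    run pvecm add "$1"
}

remove_old_node pve03
join_cluster pve01
```

Setting `DRY_RUN=0` would execute the commands for real, so only do that on the host each step is meant for.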

The problem we're seeing is that pve03 keeps losing quorum with pve01 and pve02. When we run pvecm status on pve03, we see the following output:

Code:

root@pve03:~# pvecm s
Quorum information
------------------
Date:             Wed Mar 27 12:46:05 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1/843000
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:         

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.20.13 (local)
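
To spell out the numbers in that output: with 3 expected votes, a simple majority is 2, and pve03 only sees its own single vote, hence "Activity blocked". A minimal sketch of that arithmetic follows, assuming the default majority rule with no special votequorum options; the function names are made up for illustration.

```shell
#!/bin/sh
# Majority quorum: more than half of the expected votes.
quorum_needed() {
    # $1 = expected votes
    echo $(( $1 / 2 + 1 ))
}

# Succeeds when the votes currently seen reach the required majority.
is_quorate() {
    # $1 = expected votes, $2 = total votes currently seen
    [ "$2" -ge "$(quorum_needed "$1")" ]
}

expected=3   # "Expected votes" from the output above
total=1      # "Total votes" from the output above
needed=$(quorum_needed "$expected")

if is_quorate "$expected" "$total"; then
    echo "quorate: $total of $needed required votes"
else
    echo "not quorate: $total of $needed required votes"
fi
```

With the values above this reports that the node is not quorate, matching "Quorate: No" and "Quorum: 2".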

Several threads here in the forums suggest that this is likely a unicast vs. multicast issue. But the cluster worked flawlessly for years before the hardware replacement, so I just don't see how multicast could be the problem, especially since we're reusing the exact same switch and switch ports.

The only explanation I can come up with at this point is that reusing the same IP and hostname is somehow creating issues somewhere in the cluster configuration. As far as I can tell, /etc/pve/corosync.conf looks good. I've compared it to that of another cluster we have in a different data center and can't find any meaningful differences.

Does anyone have any ideas? Thanks in advance!

Andrei Bogatsky · Mar 28, 2019

Piggybacking on the above message...

I've tried installing with a different hostname (pve04) and IP (192.168.20.14), and the same thing occurs. `omping` shows that both unicast and multicast are working correctly; however, we still lose quorum after several minutes. Some timeout, I imagine. Restarting the corosync service brings it back for another ~5 minutes, and then we lose it again. Rinse and repeat. Just as `mlanner` stated in the post above, there are no discernible differences between the "new" node's configuration and the existing ones.

Stoiko Ivanov · Mar 29, 2019

Have you also tried running the long-running omping command (10+ minutes) to ensure that a multicast querier is active? Please post the output of both omping commands from https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
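
For anyone landing here later, the two tests from that section of the admin guide look roughly like the sketch below. The omping flags are as I recall them from the guide at the time, so please check them against the linked page. The node names are the ones from this thread and are an assumption about your setup, and the helper function names are made up. The script only prints the commands; each one should then be started on all nodes at about the same time.

```shell
#!/bin/sh
# Node names from this thread; adjust to your own cluster (assumption).
NODES="pve01 pve02 pve03"

# Short burst test: 10000 packets at 1 ms intervals, quiet output.
short_test_cmd() {
    echo "omping -c 10000 -i 0.001 -F -q $*"
}

# Long test: 600 packets at 1 s intervals (about 10 minutes), which is
# long enough to notice multicast dropping out when no querier is active.
long_test_cmd() {
    echo "omping -c 600 -i 1 -q $*"
}

# Print the commands instead of running them.
short_test_cmd $NODES
long_test_cmd $NODES
```

If the short test looks fine but the long one starts losing multicast packets after a few minutes, that would be consistent with the ~5 minute quorum loss described above.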

Andrei Bogatsky · Apr 12, 2019

Ended up being a networking issue. You were right Stoiko. It wasn't speaking multicast.

Stoiko Ivanov · Apr 15, 2019

Glad you found and resolved your issue! Please mark the thread as 'SOLVED' so that others know what to expect. Thanks!
