Unexpected behavior when physically moving nodes

nicof

Member
Jul 23, 2020
Hello,
I would appreciate it if you could shed some light on these issues we experienced recently.

Scenario
Originally we had a 3-node Proxmox cluster (node-1, node-2, node-3) with many VMs in a production environment:
  • Hardware: Dell R740
  • OS: Debian 10.3 Buster
  • Proxmox version: 6.2-4
  • SAN storage.
  • HA groups configured among them.
  • Latency among them is around 0.140 ms
Recently we successfully added 3 more nodes (node-4, node-5, node-6), physically located at a temporary site, resulting in a 6-node cluster. The last 3 nodes are:
  • Hardware: Dell R740
  • OS: Debian 10.3 Buster
  • Proxmox version: 6.2-4
  • Ceph storage cluster installed among them.
  • No HA groups configured yet.
  • Latency among them is around 0.140 ms
As I said, the first group of 3 nodes has many production VMs, but the last 3 nodes don't have any VMs yet. Latency between the two groups of nodes is around 0.200 ms.

Running pvecm status shows the following:
Code:
node-1 ~ # pvecm status
Cluster information
-------------------
Name:             proxmoxname
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 14 12:53:01 2020
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000006
Ring ID:          1.118
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.200.0.1 (local)
0x00000002          1 172.200.0.2
0x00000003          1 172.200.0.3
0x00000004          1 172.200.0.4
0x00000005          1 172.200.0.5
0x00000006          1 172.200.0.6
node-1 ~ # pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 node-1 (local)
         2          1 node-2
         3          1 node-3
         4          1 node-4
         5          1 node-5
         6          1 node-6

Problems
Last weekend we transported the non-production group of nodes to their final destination, and since then we have had to deal with many issues.

First, when we shut down the second group of nodes (node-4, node-5, node-6), cluster activity was blocked because the quorum is 4, so we had to run pvecm expected 3 to restore normal cluster activity with the first three nodes.
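For reference, the recovery was roughly the following, run on node-1 (a sketch from memory, not the exact shell history):
Code:
# Check the current quorum state on one of the remaining nodes
pvecm status

# Temporarily lower the expected votes so the three remaining nodes become quorate again
pvecm expected 3

# Verify that the cluster reports "Quorate: Yes" again
pvecm status | grep -i quorate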

When we physically installed the nodes in their new location and booted them up, the following happened:
  • Only node-5 managed to join the cluster
  • When node-5 joined the cluster, node-2 and node-3 rebooted (by themselves?)
  • On each node of the first group we saw the following HA errors (even though there are no HA groups on the second group of nodes):
Code:
63669 Aug 15 10:52:35 node-1 pmxcfs[2634]: [dcdb] crit: cpg_send_message failed: 6
63670 Aug 15 10:52:35 node-1 pve-ha-lrm[2706]: lost lock 'ha_agent_node-1_lock - cfs lock update failed - Device or resource busy

63917 Aug 15 10:56:11 node-1 pmxcfs[2634]: [dcdb] crit: cpg_send_message failed: 6
63918 Aug 15 10:56:11 node-1 pvesr[9191]: error during cfs-locked 'file-replication_cfg' operation: got lock request timeout
63919 Aug 15 10:56:11 node-1 systemd[1]: pvesr.service: Main process exited, code=exited, status=16/n/a
63920 Aug 15 10:56:11 node-1 systemd[1]: pvesr.service: Failed with result 'exit-code'.
63921 Aug 15 10:56:11 node-1 systemd[1]: Failed to start Proxmox VE replication runner.
  • When node-2 and node-3 finished rebooting, only node-1 and node-5 had joined the cluster, and activity was blocked, so we had to run pvecm expected 1, and all VMs were migrated to node-1
  • We waited several minutes, but node-2 and node-3 never rejoined the cluster. Apparently, the reason was a split brain between those nodes.
  • So we had to disconnect node-4, node-5 and node-6 to keep the VMs working until we find a solution
NOTE: The communication between the two groups of nodes was checked. The new latency is around 0.700 ms.

Thank you in advance.
 
  • When node-5 joined the cluster, node-2 and node-3 rebooted (by themselves?)
Proxmox VE uses corosync for quorum information. HA uses that to decide if it is in the quorate partition or not. If not, then the node will fence itself.
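If you want to check this on a node, something along these lines shows whether the node is in the quorate partition and whether the HA services are active (a sketch; adapt it to your setup):
Code:
# Is this node currently part of the quorate partition?
pvecm status | grep -i quorate

# What does the HA stack think the cluster looks like?
ha-manager status

# Are the HA services running on this node? (the LRM is involved in self-fencing when quorum is lost)
systemctl status pve-ha-lrm pve-ha-crm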

NOTE: The communication between the two groups of nodes was checked. The new latency is around 0.700 ms.
How did you conduct the check?
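For example, something like this gives a rough picture of the latency and the corosync link state between the two sites (just a suggestion, using one of the addresses from your output):
Code:
# Sustained ping from a node at the first site to a node at the second site
ping -c 100 -i 0.2 172.200.0.4

# Corosync's own view of the knet links
corosync-cfgtool -s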

  • When node-2 and node-3 finished rebooting, only node-1 and node-5 had joined the cluster, and activity was blocked, so we had to run pvecm expected 1, and all VMs were migrated to node-1
The configuration of expected votes is only temporary; as soon as another node joins, the value is reset.
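To make that concrete, a quick way to watch the reset happen (a sketch, assuming part of the cluster is currently down):
Code:
# Lower the expected votes while some nodes are offline...
pvecm expected 3

# ...and watch the value jump back to the full count as soon as another node (re)joins
watch -n 5 "pvecm status | grep -i 'expected votes'"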

And in general, with 6 nodes split evenly across two locations, you will always run into split-brain situations when a location is the failure domain: if one site goes down, the other is left with only 3 of the 6 votes, which is below the quorum of 4.
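The arithmetic behind that, using the numbers from your pvecm status output:
Code:
# Quorum is a strict majority of the expected votes
EXPECTED=6
echo $(( EXPECTED / 2 + 1 ))   # -> 4, so a site holding only 3 of the 6 votes can never be quorate on its own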
 
