Unexpected behavior when physically moving nodes

nicof

Member
Jul 23, 2020
Hello,
I would appreciate it if you could shed some light on these issues we experienced recently.

Scenario
Originally we had a 3-node Proxmox cluster (node-1, node-2, node-3) with many VMs in a production environment:
  • Hardware: Dell R740
  • OS: Debian 10.3 Buster
  • Proxmox version: 6.2-4
  • SAN storage.
  • HA groups configured among them.
  • Latency among them is around 0.140 ms
Recently we successfully added 3 more nodes (node-4, node-5, node-6), physically located at a temporary site, resulting in a 6-node cluster. The last 3 nodes are:
  • Hardware: Dell R740
  • OS: Debian 10.3 Buster
  • Proxmox version: 6.2-4
  • Ceph storage cluster installed among them.
  • No HA groups configured yet.
  • Latency among them is around 0.140 ms
As I said, the first group of 3 nodes has many production VMs, but the last 3 nodes don't have any VMs yet. Latency between the two groups of nodes is around 0.200 ms.

Running pvecm status shows the following:
Code:
node-1 ~ # pvecm status
Cluster information
-------------------
Name:             proxmoxname
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Aug 14 12:53:01 2020
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000006
Ring ID:          1.118
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.200.0.1 (local)
0x00000002          1 172.200.0.2
0x00000003          1 172.200.0.3
0x00000004          1 172.200.0.4
0x00000005          1 172.200.0.5
0x00000006          1 172.200.0.6
node-1 ~ # pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 node-1 (local)
         2          1 node-2
         3          1 node-3
         4          1 node-4
         5          1 node-5
         6          1 node-6

Problems
Last weekend we transported the non-production group of nodes to their final destination, and since then we have had to deal with many issues.

First, when we shut down the second group of nodes (node-4, node-5, node-6), cluster activity was blocked because the quorum is 4, so we had to run pvecm expected 3 to restore normal cluster activity with the first three nodes.
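For reference, the recovery was roughly the following, run on node-1 (a sketch from memory, not the exact shell history):
Code:
# Check the current quorum state on one of the remaining nodes
pvecm status

# Temporarily lower the expected votes so the three remaining nodes become quorate again
pvecm expected 3

# Verify that the cluster reports "Quorate: Yes" again
pvecm status | grep -i quorate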

When we physically installed the nodes in their new location and booted them up, the following happened:
  • Only node-5 managed to join the cluster
  • When node-5 joined the cluster, node-2 and node-3 rebooted (by themselves?)
  • On each node of the first group we saw the following HA errors (even though there are no HA groups on the second group of nodes):
Code:
63669 Aug 15 10:52:35 node-1 pmxcfs[2634]: [dcdb] crit: cpg_send_message failed: 6
63670 Aug 15 10:52:35 node-1 pve-ha-lrm[2706]: lost lock 'ha_agent_node-1_lock - cfs lock update failed - Device or resource busy

63917 Aug 15 10:56:11 node-1 pmxcfs[2634]: [dcdb] crit: cpg_send_message failed: 6
63918 Aug 15 10:56:11 node-1 pvesr[9191]: error during cfs-locked 'file-replication_cfg' operation: got lock request timeout
63919 Aug 15 10:56:11 node-1 systemd[1]: pvesr.service: Main process exited, code=exited, status=16/n/a
63920 Aug 15 10:56:11 node-1 systemd[1]: pvesr.service: Failed with result 'exit-code'.
63921 Aug 15 10:56:11 node-1 systemd[1]: Failed to start Proxmox VE replication runner.
  • When node-2 and node-3 finished rebooting, only node-1 and node-5 had joined the cluster, and activity was blocked, so we had to run pvecm expected 1, and all VMs were migrated to node-1
  • We waited several minutes, but node-2 and node-3 never rejoined the cluster. Apparently, the reason was a split brain between those nodes.
  • So we had to disconnect node-4, node-5 and node-6 to keep the VMs working until we find a solution
NOTE: The communication between the two groups of nodes was checked. The new latency is around 0.700 ms.

Thank you in advance.
 
  • When node-5 joined the cluster, node-2 and node-3 rebooted (by themselves?)
Proxmox VE uses corosync for quorum information. HA uses that to decide if it is in the quorate partition or not. If not, then the node will fence itself.
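If you want to check this on a node, something along these lines shows whether the node is in the quorate partition and whether the HA services are active (a sketch; adapt it to your setup):
Code:
# Is this node currently part of the quorate partition?
pvecm status | grep -i quorate

# What does the HA stack think the cluster looks like?
ha-manager status

# Are the HA services running on this node? (the LRM is involved in self-fencing when quorum is lost)
systemctl status pve-ha-lrm pve-ha-crm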

NOTE: The communication between the two groups of nodes was checked. The new latency is around 0.700 ms.
How did you conduct the check?
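For example, something like this gives a rough picture of the latency and the corosync link state between the two sites (just a suggestion, using one of the addresses from your output):
Code:
# Sustained ping from a node at the first site to a node at the second site
ping -c 100 -i 0.2 172.200.0.4

# Corosync's own view of the knet links
corosync-cfgtool -s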

  • When node-2 and node-3 finished rebooting, only node-1 and node-5 had joined the cluster, and activity was blocked, so we had to run pvecm expected 1, and all VMs were migrated to node-1
The configuration of expected votes is only temporary; as soon as another node joins, the value is reset.
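To make that concrete, a quick way to watch the reset happen (a sketch, assuming part of the cluster is currently down):
Code:
# Lower the expected votes while some nodes are offline...
pvecm expected 3

# ...and watch the value jump back to the full count as soon as another node (re)joins
watch -n 5 "pvecm status | grep -i 'expected votes'"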

And in general, with 6 nodes split evenly across two locations, you will always run into split-brain situations when a location is the failure domain: if one site goes down, the other is left with only 3 of the 6 votes, which is below the quorum of 4.
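The arithmetic behind that, using the numbers from your pvecm status output:
Code:
# Quorum is a strict majority of the expected votes
EXPECTED=6
echo $(( EXPECTED / 2 + 1 ))   # -> 4, so a site holding only 3 of the 6 votes can never be quorate on its own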
 
