Hello,
I would appreciate it if you could shed some light on these issues we experienced recently.
Scenario
Originally we had a 3-node Proxmox cluster (node-1, node-2, node-3) running many VMs in a production environment:
- Hardware: Dell R740
- OS: Debian 10.3 Buster
- Proxmox version: 6.2-4
- SAN storage.
- HA groups configured among them.
- Latency among them is around 0.140 ms
Recently we joined a second, non-production group of nodes (node-4, node-5, node-6) to the cluster:
- Hardware: Dell R740
- OS: Debian 10.3 Buster
- Proxmox version: 6.2-4
- Ceph storage cluster installed among them.
- No HA groups configured yet.
- Latency among them is around 0.140 ms
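In case anyone wants to reproduce the latency figure, it can be checked with plain ping between the cluster addresses shown in the membership list further down (a minimal sketch, not necessarily the exact command we used):
Code:
# 10 ICMP probes from node-1 to node-2 over the cluster network,
# summary only; we see roughly 0.140 ms average round-trip time.
node-1 ~ # ping -c 10 -q 172.200.0.2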
Running
pvecm status
returns the following:
Code:
node-1 ~ # pvecm status
Cluster information
-------------------
Name: proxmoxname
Config Version: 6
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Fri Aug 14 12:53:01 2020
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 0x00000006
Ring ID: 1.118
Quorate: Yes
Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 6
Quorum: 4
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.200.0.1 (local)
0x00000002 1 172.200.0.2
0x00000003 1 172.200.0.3
0x00000004 1 172.200.0.4
0x00000005 1 172.200.0.5
0x00000006 1 172.200.0.6
node-1 ~ # pvecm nodes
Membership information
----------------------
Nodeid Votes Name
1 1 node-1 (local)
2 1 node-2
3 1 node-3
4 1 node-4
5 1 node-5
6 1 node-6
Problems
Last weekend we transported the non-production group of nodes to its final destination, and since then we have had to deal with many issues.
First, when we shut down the second group of nodes (node-4, node-5, node-6), cluster activity was blocked because the quorum is 4, so we had to run
pvecm expected 3
to recover normal cluster activity with the first three nodes.
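For clarity, this is roughly what that step looked like: with 6 expected votes the majority is floor(6/2) + 1 = 4, which matches the "Quorum: 4" line above, so the three remaining nodes could not be quorate on their own (prompts and comments below are illustrative):
Code:
# With all 6 nodes expected, a partition needs at least 4 votes.
# After powering off node-4/5/6 only 3 votes remain, so /etc/pve
# goes read-only and guest management is blocked.
node-1 ~ # pvecm status

# Tell votequorum to expect only 3 votes for now, so node-1,
# node-2 and node-3 regain quorum; the expected count goes back
# up on its own as the other nodes rejoin.
node-1 ~ # pvecm expected 3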
When we physically installed the nodes in their new location, we booted them up, which led to the following situation:
- Only node-5 managed to join the cluster
- As soon as node-5 joined the cluster, node-2 and node-3 rebooted (by themselves?)
- On every node of the first group we got the following HA errors (even though no HA groups are configured on the second group of nodes):
Code:
Aug 15 10:52:35 node-1 pmxcfs[2634]: [dcdb] crit: cpg_send_message failed: 6
Aug 15 10:52:35 node-1 pve-ha-lrm[2706]: lost lock 'ha_agent_node-1_lock - cfs lock update failed - Device or resource busy
Aug 15 10:56:11 node-1 pmxcfs[2634]: [dcdb] crit: cpg_send_message failed: 6
Aug 15 10:56:11 node-1 pvesr[9191]: error during cfs-locked 'file-replication_cfg' operation: got lock request timeout
Aug 15 10:56:11 node-1 systemd[1]: pvesr.service: Main process exited, code=exited, status=16/n/a
Aug 15 10:56:11 node-1 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Aug 15 10:56:11 node-1 systemd[1]: Failed to start Proxmox VE replication runner.
- When node-2 and node-3 finished rebooting, only node-1 and node-5 had joined the cluster and activity was blocked again, so we had to run
pvecm expected 1
and all VMs were migrated to node-1.
- We waited some minutes, but node-2 and node-3 never joined the cluster again. Apparently, the reason was a split brain between those nodes (see the checks sketched after this list).
- So we had to disconnect node-4, node-5 and node-6 to keep the VMs working until we find a solution.
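If it helps, these are the checks we can run on the affected nodes and post here (a sketch, commands only; the timestamp just matches the log excerpt above):
Code:
# Membership as seen from each side of the suspected split brain
node-2 ~ # pvecm status

# Corosync link state towards the other cluster members
node-2 ~ # corosync-cfgtool -s

# HA manager view from the first group (where HA groups exist)
node-1 ~ # ha-manager status

# Corosync / pmxcfs messages around the time of the unexpected reboots
node-2 ~ # journalctl -u corosync -u pve-cluster --since "2020-08-15 10:00"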
Thank you in advance.