Hi there,
We have a Proxmox cluster with all nodes in the same subnet. The cluster has consisted of 3 hosts and has worked as expected for the last few years. Recently we wanted to add a new host to the cluster. The new host is in the same subnet as well; nothing changed in the network configuration. We added the host following the tutorial, i.e. on the new node we ran:
Bash:
pvecm add <IP_Address_of_host_in_cluster>
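Membership is then inspected on the new node with the standard pvecm commands; the status output below was taken this way:
Bash:
# run on the newly added node to check cluster membership
pvecm status
pvecm nodes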
But after about 5 minutes we had a split brain in the cluster: the new node wasn't visible to the others and vice versa. The status output shows that quorum activity is blocked:
Code:
root@node4:~# pvecm status
Quorum information
------------------
Date:             Mon Jan 6 15:44:43 2020
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000004
Ring ID:          4/2067668
Quorate:          No

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      1
Quorum:           3 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 <ip_of_node4> (local)
The other nodes can still see each other:
Code:
root@node1:~# pvecm status
Quorum information
------------------
Date:             Mon Jan 6 15:46:25 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          3/2096608
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      3
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 <ip_node3>
0x00000002          1 <ip_node2>
0x00000001          1 <ip_node1> (local)
/etc/pve/corosync.conf seems to be the same on the nodes:
Code:
root@node1:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node2
  }
  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: <ip_node3>
  }
  node {
    name: node4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: <ip_node4>
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: VPSG4
  config_version: 7
  interface {
    bindnetaddr: <ip_node1>
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
and
Code:
root@node4:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node2
  }
  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: <ip_node3>
  }
  node {
    name: ndoe4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: <ip_node4>
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: VPSG4
  config_version: 7
  interface {
    bindnetaddr: <ip_node1>
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
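To rule out a stale local copy, it is probably also worth comparing the cluster-wide config with the copy corosync actually reads on each node; a quick checksum check along these lines should be enough:
Bash:
# run on every node; both files should have the same checksum on all four hosts
md5sum /etc/pve/corosync.conf /etc/corosync/corosync.conf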
I have discovered that the version of /etc/pve/.members differs between node1 and node4:
Code:
root@node4:~# cat /etc/pve/.members
{
"nodename": "node4",
"version": 7,
"cluster": { "name": "VPSG4", "version": 7, "nodes": 4, "quorate": 0 },
"nodelist": {
"node1": { "id": 1, "online": 0, "ip": "<ip_node1>"},
"node2": { "id": 2, "online": 0, "ip": "<ip_node2>"},
"node3": { "id": 3, "online": 0, "ip": "<ip_node3"},
"node4": { "id": 4, "online": 1, "ip": "<ip_node4>"}
}
}
Code:
root@node1:~# cat /etc/pve/.members
{
"nodename": "chzhc1px010",
"version": 42,
"cluster": { "name": "VPSG4", "version": 7, "nodes": 4, "quorate": 1 },
"nodelist": {
"node1": { "id": 1, "online": 1, "ip": "<ip_node1>"},
"node2": { "id": 2, "online": 1, "ip": "<ip_node2>"},
"node3": { "id": 3, "online": 1, "ip": "<ip_node3>"},
"node4": { "id": 4, "online": 0, "ip": "<ip_node4>"}
}
}
Is that a problem?
After we reboot node4, the nodes can see each other again, but about 5 minutes later we end up with a split brain once more.
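To catch the exact moment node4 drops out next time, I plan to keep the corosync ring status visible on it (plain corosync tooling, nothing custom):
Bash:
# refresh the local ring status every second on node4
watch -n 1 corosync-cfgtool -s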
The syslog of node4:
Code:
root@node4:~# tail /var/log/syslog -n 50
Jan 6 09:50:31 node4 corosync[1472]: warning [CPG ] downlist left_list: 3 received
Jan 6 09:50:31 node4 corosync[1472]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jan 6 09:50:31 node4 corosync[1472]: notice [QUORUM] Members[1]: 4
Jan 6 09:50:31 node4 corosync[1472]: notice [MAIN ] Completed service synchronization, ready to provide service.
Jan 6 09:50:31 node4 corosync[1472]: [TOTEM ] A new membership (<ip_node4>:2067668) was formed. Members left: 3 2 1
Jan 6 09:50:31 node4 corosync[1472]: [TOTEM ] Failed to receive the leave message. failed: 3 2 1
Jan 6 09:50:31 node4 corosync[1472]: [CPG ] downlist left_list: 3 received
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: members: 4/1307
Jan 6 09:50:31 node4 pmxcfs[1307]: [status] notice: members: 4/1307
Jan 6 09:50:31 node4 corosync[1472]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jan 6 09:50:31 node4 corosync[1472]: [QUORUM] Members[1]: 4
Jan 6 09:50:31 node4 corosync[1472]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 6 09:50:31 node4 pmxcfs[1307]: [status] notice: node lost quorum
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: cpg_send_message retried 1 times
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] crit: received write while not quorate - trigger resync
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] crit: leaving CPG group
Jan 6 09:50:31 node4 pve-ha-lrm[1678]: unable to write lrm status file - closing file '/etc/pve/nodes/node4/lrm_status.tmp.1678' failed - Operation not permitted
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: start cluster connection
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: members: 4/1307
Jan 6 09:50:31 node4 pmxcfs[1307]: [dcdb] notice: all data is up to date
Jan 6 09:51:00 node4 systemd[1]: Starting Proxmox VE replication runner...
Jan 6 09:51:01 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:02 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:03 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:04 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:05 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:06 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:07 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:08 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:09 node4 pvesr[2202]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:51:10 node4 pvesr[2202]: error with cfs lock 'file-replication_cfg': no quorum!
Jan 6 09:51:10 node4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jan 6 09:51:10 node4 systemd[1]: Failed to start Proxmox VE replication runner.
Jan 6 09:51:10 node4 systemd[1]: pvesr.service: Unit entered failed state.
Jan 6 09:51:10 node4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jan 6 09:52:00 node4 systemd[1]: Starting Proxmox VE replication runner...
Jan 6 09:52:01 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:02 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:03 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:04 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:05 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:06 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:07 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:08 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:09 node4 pvesr[2307]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 6 09:52:10 node4 pvesr[2307]: error with cfs lock 'file-replication_cfg': no quorum!
Jan 6 09:52:10 node4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jan 6 09:52:10 node4 systemd[1]: Failed to start Proxmox VE replication runner.
Jan 6 09:52:10 node4 systemd[1]: pvesr.service: Unit entered failed state.
Jan 6 09:52:10 node4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
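If it helps, I can also pull the matching time window from corosync and pmxcfs on the other three nodes, for example with:
Bash:
# adjust the window to the time of the membership change
journalctl -u corosync -u pve-cluster --since "2020-01-06 09:45" --until "2020-01-06 09:55"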
I've already read about multicast issues in the forums, but I believe this is something else, since the infrastructure hasn't changed and the rest of the cluster is still working fine. I'd appreciate any help with this. Thank you.
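P.S. In case multicast does turn out to matter after all, I could still run the usual omping test across all four nodes and post the results (omping would have to be installed on every node first; the hostnames below are placeholders):
Bash:
# run at the same time on all four nodes; noticeable packet loss would point at multicast/IGMP problems
omping -c 10000 -i 0.001 -F -q node1 node2 node3 node4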