First time posting; I'm axle-wrapped on this one.
I have a 15-node PVE cluster with Ceph. It has been running peachy since November. Today I went to add another node and the join hung on waiting for quorum (I did the join at the command line). Eventually I had to kill it. At that point all 15 original nodes were still members of the cluster and visible in the ring (pvecm status looked good). Note that every node has the same corosync.conf in both /etc/corosync and /etc/pve, at the same config version; the sketch just below is roughly how I compared them.
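(A sketch of the check; it assumes passwordless root SSH between nodes, and the hostnames are just a few examples from my nodelist.)
Code:
# Compare the two on-disk copies and the running config version on each node
for h in pve01 pve02 cassatpve01; do
    echo "== $h =="
    ssh root@$h 'md5sum /etc/corosync/corosync.conf /etc/pve/corosync.conf;
                 corosync-cmapctl -g totem.config_version'
done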
I decided to restart corosync on the first node (pve01) to see if that would break the deadlock. Instead, pve01 became isolated, and eventually every node was a singleton. After a significant amount of surgery I got 10 nodes back into the cluster; 5 still will not join. Two of those will join a cluster with each other, but not the main cluster. Here is my corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: cassatpve01
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 10.4.11.35
  }
  node {
    name: mathcspve01
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 10.4.11.142
  }
  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.4.11.130
  }
  node {
    name: pve02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.4.11.131
  }
  node {
    name: pve03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.4.11.132
  }
  node {
    name: pve04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.4.11.133
  }
  node {
    name: pve05
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.4.11.134
  }
  node {
    name: pve06
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.4.11.135
  }
  node {
    name: pve07
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.4.11.136
  }
  node {
    name: pve08
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.4.11.137
  }
  node {
    name: pve09
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.4.11.138
  }
  node {
    name: pve10
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.4.11.139
  }
  node {
    name: pve11
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.4.11.129
  }
  node {
    name: pve12
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 10.4.11.6
  }
  node {
    name: pve13
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.4.11.7
  }
  node {
    name: pve14
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 10.4.11.5
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Carleton
  config_version: 23
  interface {
    linknumber: 0
    knet_transport: sctp
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
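When a node starts flapping, this is what I've been running to compare membership against the knet link state (a sketch; all stock PVE/corosync tools):
Code:
# Membership and quorum as PVE and corosync each see them
pvecm status
corosync-quorumtool -s
# Per-link knet status to every other node
corosync-cfgtool -s
# Recent corosync chatter (token timeouts, retransmits, membership changes)
journalctl -u corosync --since "10 min ago"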
The cluster nodes keep reconfiguring and fencing for no reason. As I was typing this message, pve01 dropped out of the cluster (but still thinks it is in it). The sketch below is roughly the surgery I keep repeating on each node that falls out.
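(A sketch; the order matters, corosync comes up before pmxcfs, and it only holds for a while.)
Code:
# On the stuck node: restart the stack, corosync first, then pmxcfs
systemctl stop pve-cluster corosync
systemctl start corosync
systemctl start pve-cluster
# Then confirm it rejoined
pvecm status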
Each node is connected by 10G fiber to a dedicated switch; some nodes have a bond of two 10G NICs to the same switch. This all worked perfectly until the failed node add, and now things will not stay in sync. Below is how I've been sanity-checking the links themselves.
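(A sketch; the bond name is a guess for my boxes, and 10.4.11.131 is just one peer's ring0 address.)
Code:
# Physical/bond layer under ring0
ip -br link show
cat /proc/net/bonding/bond0        # member state, on the bonded nodes
# Quick loss check across the ring network to one peer
ping -c 100 -i 0.2 10.4.11.131
Any help is appreciated.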