Hi, tonight is the second time I've run into a serious problem while trying to add a node to my existing cluster.
My cluster contains approximately 15 nodes, I use Ceph for storage, and everything works pretty well.
All our nodes' FQDNs, including "future" nodes, are listed in /etc/hosts like this:
100.118.100.1 pve1 pve1.beecluster.abeille.com
100.118.100.2 pve2 pve2.beecluster.abeille.com
...
100.118.100.254 pve254 pve254.beecluster.abeille.com
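With that many hand-maintained /etc/hosts entries replicated across nodes, a single duplicated IP or hostname on one node can break corosync membership in confusing ways. Here is a minimal sketch of a sanity check for a hosts-format file; the file path and the deliberately broken example file are illustrative, not from the original post.

```shell
# Hedged sketch: flag duplicate IPs or hostnames in a hosts-format file.
# The path /tmp/hosts.test and its contents are made up for illustration.
check_hosts() {
  awk '!/^#/ && NF {
    if (seen_ip[$1]++) print "duplicate IP: " $1
    for (i = 2; i <= NF; i++) if (seen_name[$i]++) print "duplicate name: " $i
  }' "$1"
}

# Example with a deliberately broken file (pve1 appears twice):
cat > /tmp/hosts.test <<'EOF'
100.118.100.1 pve1 pve1.beecluster.abeille.com
100.118.100.2 pve1 pve2.beecluster.abeille.com
EOF
check_hosts /tmp/hosts.test   # prints "duplicate name: pve1"
```

Running the same check against the real /etc/hosts on each node (e.g. `check_hosts /etc/hosts`) would quickly rule out a copy-paste error as the cause.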
/etc/network/interfaces
#Hypervisor interface
auto eno1
iface eno1 inet dhcp

auto eno2
iface eno2 inet manual

#For VMs
auto vmbr0
iface vmbr0 inet manual
    bridge_ports eno2
    bridge_fd 0
    bridge_stp off

#10Gbits CEPH
auto enp1s0f0
iface enp1s0f0 inet static
    address 192.168.100.<Pve number>
    netmask 255.255.255.0
    broadcast 192.168.100.255
    mtu 9000
(over-provisioned yes)
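Since the Ceph interface uses mtu 9000, one thing worth ruling out is an MTU mismatch on a switch port along the path, which can produce exactly this "pings work but the cluster doesn't" behavior. A quick check, assuming a peer at a hypothetical address like 192.168.100.2:

```shell
# With MTU 9000, the largest unfragmented ICMP payload is:
# 9000 - 20 (IP header) - 8 (ICMP header) = 8972 bytes.
PAYLOAD=$((9000 - 20 - 8))
echo "$PAYLOAD"   # 8972

# -M do forbids fragmentation, so this fails if any hop has a smaller MTU.
# Peer address is illustrative; substitute a real node's Ceph IP:
# ping -M do -s "$PAYLOAD" -c 3 192.168.100.2
```

If the full-size ping fails while a plain ping succeeds, a switch in the path is not passing jumbo frames.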
But sometimes (not on every add) and for no apparent reason, when I try to add a node to this cluster with the command
pvecm add pve1
I get stuck in an endless "waiting for quorum" loop, and every time, all the nodes already in the cluster become unreachable or start restarting. Both times this happened, the only solution I found was to bring all my nodes back up (without the failed node):
- Shut down all nodes
- Power them on one after another, checking that each becomes reachable via corosync
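During that node-by-node power-on, the reachability check can be scripted instead of eyeballed. A minimal sketch, assuming the "Quorate:" field that `pvecm status` prints on a healthy Proxmox node (the function reads status text from stdin so it can be tried offline):

```shell
# Hedged sketch: report whether cluster status text says the node is quorate.
# The "Quorate: Yes/No" field is assumed from typical `pvecm status` output.
check_quorum() {
  grep -q "Quorate:.*Yes" && echo "quorate" || echo "NOT quorate"
}

# Live use on a node would be:  pvecm status | check_quorum
# Offline demonstration:
echo "Quorate: Yes" | check_quorum   # prints "quorate"
```

Waiting for "quorate" before powering on the next node would make the recovery procedure less error-prone.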
Tonight the trouble was a bit different: the nodes could ping each other, but they could not form a cluster together. After restarting all our switches and routers over and over, the "one-way fix" above solved the problem.
I'm worried about this problem because I will have to add many more nodes to this cluster, and I can't afford this happening every time. Does anyone have an idea of what might be going on?