Cluster keeps breaking

Jason Morris

Member
Jun 20, 2017
Hello All,

I had a two node cluster for the last month and everything seemed great. I was able to move VMs between nodes and things were fine. Friday I added another node and everything broke.

Now the cluster information says "Standalone node - no cluster defined". Each node sees itself fine, but the other nodes show up with a red X. I tried deleting all of the previous cluster information, which looked like it worked at first, and I was able to create a new cluster. But when I added the nodes back, I got various errors about why they couldn't join.

Now they all show the cluster name next to the datacenter heading but listed as standalone.
 
Check the content of your /etc/hosts files. The following line is important:
Code:
192.168.10.3 node3.domain.com node3

where:
192.168.10.3 is the node's IP address in the cluster subnet (all nodes have to be in the same subnet, of course)
node3 is the hostname
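
To check this on every node without reading the file by eye, something like the following could be used. It is only a sketch: check_hosts_entry is a made-up helper name, not a Proxmox tool, and the IP and hostname are the example values from above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox): succeed if the given hosts file
# has a non-comment line that starts with the IP and lists the short hostname.
check_hosts_entry() {
    hosts_file=$1   # e.g. /etc/hosts
    ip=$2           # e.g. 192.168.10.3
    short=$3        # e.g. node3
    awk -v ip="$ip" -v name="$short" '
        { sub(/#.*/, "") }                      # drop comments
        $1 == ip {                              # line for this IP
            for (i = 2; i <= NF; i++)
                if ($i == name) found = 1       # short name listed as an alias
        }
        END { exit found ? 0 : 1 }
    ' "$hosts_file"
}
```

Usage on a node would be along the lines of `check_hosts_entry /etc/hosts 192.168.10.3 node3 && echo "entry ok"`. Running `getent hosts node3` on the same node should then print the same address.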

If this is correct and the cluster is still down, check the service status
Code:
systemctl status corosync
and its log history.
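
A rough way to collect the status and the log history in one go is sketched below. run_if_present and cluster_diag are made-up helper names. The commands they call (systemctl, journalctl, pvecm) are the standard ones on a Proxmox VE node, but which of them shows the actual problem is an assumption that depends on the setup.

```shell
#!/bin/sh
# Hypothetical helper: run a command if its binary exists, otherwise say so.
run_if_present() {
    echo "== $* =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" || echo "(exit status $?)"
    else
        echo "($1 not found on this host)"
    fi
}

# Hypothetical wrapper: service state, recent corosync log, cluster view.
cluster_diag() {
    run_if_present systemctl status corosync --no-pager
    run_if_present systemctl status pve-cluster --no-pager
    # Last 50 corosync log lines since the current boot
    run_if_present journalctl -u corosync -b -n 50 --no-pager
    run_if_present pvecm status
}
```

Running `cluster_diag > /tmp/diag.txt 2>&1` on each node gives one file per node that can be compared or posted here.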
 
I ended up completely rebuilding the entire cluster from scratch. I re-added the 3rd node and BOOM it all broke again. So I rebuilt it again and threw that box in the dumpster. Found a new third system and things look fine. The only weird thing is that on both of the 3rd boxes, I cannot log in with Firefox. I know I have some DNS weirdness on my network. Could that be the issue?
 
I re-added the 3rd node and BOOM it all broke again

How are you adding the 3rd node to the cluster? What are your package versions? (you can check with 'pveversion -v') Do all nodes have the same versions?

The only weird thing is that on both of the 3rd boxes, I cannot log in with Firefox. I know I have some DNS weirdness on my network. Could that be the issue?

Have you tried accessing directly with the IP addresses (https://ip:8006)? Does that make any difference?
 
How are you adding the 3rd node to the cluster? What are your package versions? (you can check with 'pveversion -v') Do all nodes have the same versions?

I'm adding the nodes through the GUI, by copying the cluster information and then adding the node. All nodes are on the same version; here is the output of that command:

proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-33
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
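
To confirm that all nodes really match, the output can be saved on each node and compared instead of checked by eye. This is only a sketch: compare_versions is a made-up helper and the file names are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: compare two saved `pveversion -v` outputs.
# Prints "identical" and returns 0 if they match, otherwise prints the diff
# and returns 1.
compare_versions() {
    if out=$(diff "$1" "$2"); then
        echo "identical"
    else
        echo "differences:"
        printf '%s\n' "$out"
        return 1
    fi
}
```

For example, run `pveversion -v > /tmp/node1.txt` on the first node and the same with node2.txt on the second, copy both files to one machine, then run `compare_versions /tmp/node1.txt /tmp/node2.txt`. A difference in the "running kernel" part of the first line only means a node has not been rebooted into the newest installed kernel.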


Have you tried accessing directly with the IP addresses (https://ip:8006)? Does that make any difference?

I only ever use the IP when accessing the node. I'll flush my Firefox cache and see if it makes a difference.
 
