Communications failure (in some directions?) in web console - Fixed

symcbean

I have a 4 node cluster (planning to go to an odd number soon). From the web console
  • on node 1, I can see the status of all nodes and run a shell on all of them,
  • but on nodes 2, 3, and 4 I cannot connect to node 1: Connection Timed Out (595) / Communication Failure (0)
There do not appear to be any connectivity issues when using ping from the command line. All devices are plugged into the same switch, and I can't see anything relevant reported in /var/log/*.
corosync.conf is identical on each node (including the version number). `pvecm status` reports 4 nodes and gives the same output on all nodes. Corosync appears to be happy on all nodes:

● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-06-08 13:09:02 BST; 3 days ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 1042 (corosync)
    Tasks: 9 (limit: 4915)
   Memory: 147.0M
   CGroup: /system.slice/corosync.service
           └─1042 /usr/sbin/corosync -f

Jun 09 20:02:51 dev-i02-dg-virt corosync[1042]: [KNET ] rx: host: 3 link: 1 is up
Jun 09 20:02:51 dev-i02-dg-virt corosync[1042]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 09 20:02:51 dev-i02-dg-virt corosync[1042]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [TOTEM ] A new membership (1.2b) was formed. Members joined: 3
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [CPG ] downlist left_list: 0 received
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [CPG ] downlist left_list: 0 received
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [CPG ] downlist left_list: 0 received
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [CPG ] downlist left_list: 0 received
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [QUORUM] Members[4]: 1 2 3 4
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [MAIN ] Completed service synchronization, ready to provide service.
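
For anyone retracing the diagnosis, this is a rough sketch of the kind of per-node checks involved (standard Proxmox/corosync commands, not an exact transcript; output will differ on your cluster):

# cluster membership and quorum as Proxmox sees it
pvecm status
# knet link status for the local node's corosync links
corosync-cfgtool -s
# corosync log messages since the last boot
journalctl -b -u corosync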
 
After some more experimentation, `pvecm updatecerts` did not resolve the problem.

However, the first node was initially configured on a different subnet and was moved onto this network before the cluster was built. Despite `pvecm status` reporting the expected addresses:

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.2.3.20
0x00000002 1 10.2.3.21 (local)
0x00000003 1 10.2.3.22
0x00000004 1 10.2.3.23

It appears that the other nodes are still trying to connect to the historic address:

ssh: connect to host 10.1.0.51 port 22: Connection timed out

Checking on the other hosts, this address appears in /etc/pve/.members (on nodes 2, 3, and 4; there is no such file on node 1) and in /etc/pve/priv/known_hosts (I only searched under /etc).
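
A quick way to hunt for the stale address is to search under /etc on each node; a minimal sketch, using the old address from the ssh error above (substitute your own):

# list every file under /etc that still references the old address
grep -r "10.1.0.51" /etc/ 2>/dev/null
# the membership/address list that the web console relies on
cat /etc/pve/.members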

How do I resolve this?
 
The /etc/hosts file on node 1 still had an entry mapping the hostname to the original address. After manually correcting this and then running `systemctl restart pve-cluster` (which took a nail-bitingly long time to return), /etc/pve/.members showed the expected addresses and all the nodes were talking to each other.
:D
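
For reference, the fix amounted to a one-line edit in /etc/hosts on node 1, roughly like the following; the hostname here is a placeholder, the addresses are the ones from this thread:

# /etc/hosts on node 1
# before: stale entry left over from the old subnet
#10.1.0.51    node1.example.local node1
# after: the address corosync is actually using
10.2.3.20     node1.example.local node1

followed by restarting the cluster filesystem service:

systemctl restart pve-cluster

As far as I understand, pve-cluster runs pmxcfs, which resolves each node name to an address when it builds /etc/pve/.members, so a stale /etc/hosts entry ends up in the file the web console uses for node-to-node connections.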
 
