Communications failure (in some directions?) in web console - Fixed

symcbean

I have a 4-node cluster (planning to go to an odd number soon). From the web console:
  • on node 1, I can see the status of all nodes and run a shell on all nodes,
  • but on nodes 2, 3 and 4 I cannot connect to node 1: Connection Timed Out (595) / Communication Failure (0).
There do not appear to be any connectivity issues when using ping on the command line, and all devices are plugged into the same switch. I can't see anything relevant reported in /var/log/*.
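(For what it's worth: the web console is served by pveproxy on port 8006, so beyond a generic look through /var/log/* the more targeted places to check are roughly the ones below; the time window is just an example.)

# pveproxy serves the web console (port 8006); pvedaemon handles local API requests
journalctl -u pveproxy -u pvedaemon --since "1 hour ago"
# access log for the web console itself
tail -n 100 /var/log/pveproxy/access.log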
corosync.conf is identical on each node (including the version number). `pvecm status` reports 4 nodes, with the same output on all nodes. Corosync appears to be happy on all nodes:

● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-06-08 13:09:02 BST; 3 days ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 1042 (corosync)
    Tasks: 9 (limit: 4915)
   Memory: 147.0M
   CGroup: /system.slice/corosync.service
           └─1042 /usr/sbin/corosync -f

Jun 09 20:02:51 dev-i02-dg-virt corosync[1042]: [KNET ] rx: host: 3 link: 1 is up
Jun 09 20:02:51 dev-i02-dg-virt corosync[1042]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 09 20:02:51 dev-i02-dg-virt corosync[1042]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [TOTEM ] A new membership (1.2b) was formed. Members joined: 3
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [CPG ] downlist left_list: 0 received
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [CPG ] downlist left_list: 0 received
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [CPG ] downlist left_list: 0 received
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [CPG ] downlist left_list: 0 received
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [QUORUM] Members[4]: 1 2 3 4
Jun 09 20:02:52 dev-i02-dg-virt corosync[1042]: [MAIN ] Completed service synchronization, ready to provide service.
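To narrow this down from the command line, the channels the GUI depends on can be tested directly from one of the failing nodes: node 1's pveproxy port (8006), node-to-node SSH (which some cross-node operations rely on), and corosync's own view of its links. A rough sketch, with 10.2.3.20 standing in for node 1's cluster address:

# can this node reach node 1's web console / API port?
curl -k --connect-timeout 5 https://10.2.3.20:8006/
# can this node SSH to node 1?
ssh -o ConnectTimeout=5 root@10.2.3.20 true
# corosync's view of its knet links
corosync-cfgtool -s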
 
After some more experimentation, `pvecm updatecerts` did not resolve the problem.

However, the first node was initially configured on a different subnet and moved into this network before the cluster was built. Despite `pvecm status` reporting the expected addresses:

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.2.3.20
0x00000002          1 10.2.3.21 (local)
0x00000003          1 10.2.3.22
0x00000004          1 10.2.3.23

It appears that the other nodes are still trying to connect to the historic address:

ssh: connect to host 10.1.0.51 port 22: Connection timed out

Checking on the other hosts, this address appears in /etc/pve/.members (on nodes 2, 3 and 4; the file does not exist on node 1) and in /etc/pve/priv/known_hosts (I only searched under /etc).
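In case it helps anyone else hunt down a stale address like this, the search boils down to roughly the following ('10.1.0.51' is the historic address from the ssh error above; 'node1' stands in for node 1's real hostname):

# where does the old address still appear under /etc?
grep -rl '10.1.0.51' /etc 2>/dev/null
# the per-node addresses the cluster filesystem is currently advertising
cat /etc/pve/.members
# how this host resolves node 1's hostname
getent hosts node1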

How do I resolve this?
 
The /etc/hosts file on node 1 still had an entry mapping the host name to the original address. After manually correcting this and running `systemctl restart pve-cluster` (which took a nail-bitingly long time to return), /etc/pve/.members showed the expected addresses and all the nodes were talking to all the nodes.
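In other words: make the node's hostname resolve to its cluster address in /etc/hosts, then restart the cluster filesystem so /etc/pve/.members is regenerated from the corrected mapping. A minimal sketch, with a made-up hostname and node 1's address standing in:

# on node 1: /etc/hosts needs a line mapping the hostname to the cluster address, e.g.
#   10.2.3.20   node1.example.com   node1
# then restart the cluster filesystem so /etc/pve/.members picks up the new address
systemctl restart pve-cluster
# verify
cat /etc/pve/.members
pvecm status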
:D