Multihomed cluster (CMAN/corosync multicast configuration)

Hello.

I've been running a 2-node cluster. It was originally configured to work on the vmbr0 interface, which has the whole LAN subnet 192.168.9.0/24 behind it. Recently I added a new, faster NIC attached to vmbr2 and re-configured the cluster to work through it, on a dedicated VLAN with the 192.168.248.0/30 subnet. I changed /etc/hosts on both nodes to use the new addresses on the .248.0/30 subnet. Everything worked fine: the cluster switched over to the vmbr2/248.0 subnet.
The nodes are called node1 and node2. They "know" each other as 248.1 and 248.2 via /etc/hosts, while DNS records on the 9.0 network resolve them to their vmbr0/9.0 addresses (so the web interface works from the LAN, etc.).
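To make it concrete, the relevant /etc/hosts lines on node1 and node2 look roughly like this (a sketch of my setup; the addresses are the ones mentioned above):
Code:
# /etc/hosts on node1 and node2 - the cluster names point at the dedicated /30 VLAN
192.168.248.1   node1
192.168.248.2   node2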
Now I want to add a 3rd node to the cluster (named node-stor1), which sits on the vmbr0/9.0 subnet. It can reach node1 and node2 by name/SSH on the 9.0 subnet, but the cluster join fails with CMAN stuck at "Waiting for quorum".
I've read the corosync, CMAN and cluster.conf documentation, as well as these URLs: https://pve.proxmox.com/wiki/Multicast_notes and, most importantly, https://fedorahosted.org/cluster/wiki/MultiHome

So the problem is that CMAN registers the multicast group on vmbr2/248.0 only, even though the clusternode config lists the new node on the vmbr0/9.0 subnet. I tried enabling multicast pings to check whether multicast works at all. Using ping, netstat -g and smcroute I verified that all 3 nodes can join the same multicast group on the vmbr0/9.0 subnet - every node replies to a multicast ping, so no switch is blocking my multicast traffic.
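For what it's worth, this is roughly how I tested multicast on the 9.0 subnet (the icmp_echo_ignore_broadcasts sysctl and the all-hosts group 224.0.0.1 are just what I used for the test, not anything cluster-specific):
Code:
# let the nodes answer multicast/broadcast echo requests (ignored by default)
sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=0
# ping the all-hosts multicast group out of the LAN interface; all 3 nodes should reply
ping -c 3 -I vmbr0 224.0.0.1
# show which multicast groups each interface has actually joined
netstat -g
ip maddr show dev vmbr0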
Now, the last URL I posted above gives a good example of how to set up a multihomed multicast CMAN cluster. I followed it and added altname entries node1a and node2a to /etc/pve/cluster.conf. I also added these IPs to /etc/hosts on each node so that node1a and node2a resolve to the 192.168.9.0/24 addresses. After restarting proxmox-cluster and CMAN on these nodes, they register themselves with 2 multicast addresses:
Code:
# pvecm status
Version: 6.2.0
Config Version: 14
Cluster Name: cluster
Cluster Id: 821
Cluster Member: Yes
Cluster Generation: 3256
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Node votes: 1
Quorum: 2  
Active subsystems: 5
Flags: 
Ports Bound: 0  
Node name: node1
Node ID: 2
Multicast addresses: 239.192.3.56 239.192.3.57 
Node addresses: 192.168.248.1 192.168.9.231
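For reference, the clusternodes section of my /etc/pve/cluster.conf now looks roughly like this (a sketch: fencing is omitted, node2's nodeid is a placeholder, and the altname syntax follows the MultiHome wiki example):
Code:
<clusternodes>
  <clusternode name="node1" nodeid="2" votes="1">
    <!-- second ring over the vmbr0/9.0 LAN -->
    <altname name="node1a"/>
  </clusternode>
  <clusternode name="node2" nodeid="1" votes="1">
    <altname name="node2a"/>
  </clusternode>
</clusternodes>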

But the new node still can't "see" them. I guess that's because the new node has no altname and registers only for the 239.192.3.56 multicast address, while the other 2 nodes treat the 9.0 network as their second interface and register for it with the 239.192.3.57 multicast address.
Any suggestions on how to get that 3rd node to join the cluster with the following multihomed scheme?

Code:
<NODE2> vmbr0/9 <---
vmbr2/248           \
      ^              \
      |                --> vmbr0/9 <NODE-STOR1>
      v              /
vmbr2/248           /
<NODE1> vmbr0/9 <---
 
OK, after giving the 3rd node an alias node-stor1a and adding it as the altname in cluster.conf, the 3rd node seems to have finally joined the cluster.
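In case it helps anyone, the 3rd node's entry ended up roughly like this (again a sketch; nodeid 3 is just a placeholder, and node-stor1a resolves to the node's 192.168.9.0/24 address in /etc/hosts):
Code:
<clusternode name="node-stor1" nodeid="3" votes="1">
  <!-- this node only has the vmbr0/9.0 interface, so its altname points at the same network -->
  <altname name="node-stor1a"/>
</clusternode>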
But now I'm getting a steady stream of near-identical messages in /var/log/cluster/corosync.log, 2-3 per second:
Code:
May 08 15:38:22 corosync [TOTEM ] Automatically recovered ring 0