[SOLVED] How to configure a simple cluster of 3 nodes on unicast

fibo_fr

Member
Apr 21, 2014
Remoulins, France
Hi,
I'm going mad trying to configure this.
After my last reinstall of the 3 nodes I did the following:
(sorry, for legacy reasons my nodes are called mynode0, mynode3 and mynode4)
- each machine has two Ethernet connections, a public one and a "private" one, which is in fact on a dedicated network of my hosting provider
- each machine works fine, in CLI/SSH as well as in the web GUI, and all are updated to the latest versions
- the /etc/hosts on each machine starts with its public IP and corresponding name, followed by the 3 lines defining the IPs of the nodes, e.g.
127.0.0.1 localhost
xx.xxx.xxx.xxx myserver.com myserver
yy.71.84.10 mynode0
yy.71.80.16 mynode3
yy.71.83.18 mynode4
(ip obfuscated of course)
- none hosts any container or VM yet, and none will until the cluster is up and running
- I have copied the SSH keys etc.: from mynode0, ssh root@mynode3 and ssh root@mynode4 connect password-less (and similarly from the other two)
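(For what it's worth, here is a quick sanity check of the name resolution and the password-less SSH, run from any one node. Just a sketch using the hostnames above, adapt as needed:)

for n in mynode0 mynode3 mynode4; do
    getent hosts "$n"                          # should print the yy.71.x.x address from /etc/hosts
    ssh -o BatchMode=yes root@"$n" hostname    # should print the node's hostname with no password prompt
done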


I've started pvecm on mynode3
- then edited /etc/corosync/corosync.conf to activate two_node mode (in the hope of avoiding quorum issues), set a few other options found in the corosync.conf man page, and switched to the unicast "udpu" transport as described in the totem man page:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: mynode3
    nodeid: 1
    quorum_votes: 1
    ring0_addr: mynode3
  }
}

quorum {
  auto_tie_breaker: 1
  last_man_standing: 1
  last_man_standing_window: 10000
  provider: corosync_votequorum
  two_node: 1
}

totem {
  cluster_name: mycluster
  config_version: 2
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2
  interface {
    bindnetaddr: yy.71.80.16
    ringnumber: 0
  }
}

Then I rebooted the machine to be sure, and once it was up again (pvecm status checked: OK) I used a different SSH window to connect to mynode4 and typed "pvecm add mynode3".
It fails...
OK, let's try something else: SSH to mynode3 and type "pvecm addnode mynode0".
Fails too.
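(I can post diagnostics if useful; a sketch of what I would collect on both the existing node and the one being added, right after the failed add:)

pvecm status
corosync-quorumtool -s        # membership and quorum as corosync itself sees them
tail -n 50 /var/log/syslog    # corosync / pmxcfs errors usually land here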

1 - Any hint at where I went wrong? Or is it related to my config, and how can I work around it?

2 - Is there some way to resolve the situation without reinstalling (and having to reinstall fail2ban, postfix, etc.)?

3 - If no other way than reinstall, what should I do differently?
 
I must confess I find it quite frustrating that neither on this forum nor on the wiki can I find a step-by-step scenario for newbies starting from scratch.
So many things can and do go wrong that testing random tricks explained here and there simply does not work: they certainly need to be done in a defined order... but I could not find where.

Any plan from anyone to write that?
 
In your situation I would:

1. Remove custom corosync config
2. Remove SSH keys (Proxmox VE will transfer them for you when the cluster is created)
3. When all nodes are "clean" again, on one node (let's say NODE1) run:

# pvecm create CLUSTERNAME

4. On the other node (let's say NODE2):

# pvecm add IP-ADDRESS-NODE1

5. What does "pvecm status" show now?

If this is working you can change the corosync config step-by-step to see where it goes wrong (don't forget to increase config_version).
If this is not working, start over again and only add the unicast (udpu) transport to the corosync config.
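For example, the only real change on top of the file generated by pvecm create would be something like this (just a sketch; cluster_name and the version number are taken from your post, adjust them to your own file):

totem {
  cluster_name: mycluster   # as generated by pvecm create
  config_version: 3         # must be higher than the value currently in the file
  transport: udpu           # the only real addition: UDP unicast instead of multicast
  version: 2
  # leave the other generated totem settings (and the rest of the file) untouched
}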

If you still have problems, post some output and logfiles. Also test your multicast/unicast traffic with omping.
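For example, something like the tests suggested on the Proxmox multicast troubleshooting wiki page, started on all three nodes at roughly the same time (hostnames taken from your post):

omping -c 600 -i 1 -q mynode0 mynode3 mynode4            # longer low-rate test, ~10 minutes
omping -c 10000 -i 0.001 -F -q mynode0 mynode3 mynode4   # short aggressive burst test

Packet loss should stay minimal on a healthy network.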
 
Thanks for these suggestions.
In my attempts to succeed without reinstalling, I have followed a different route which seems to work, although not completely yet.
The main point was the bindnetaddr, which seems to be implicitly treated as a /24. I found in https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quorum_and_cluster_issues the suggestion to remove it completely, which I did (I removed the ringnumber as well, which maybe was not a good idea).

At the same time, Proxmox messages told me I could not have both auto_tie_breaker: 1 and last_man_standing: 1, and suggested setting the latter to 0.
The resulting file is below.
However, some things are not fully operational; I have to investigate more to identify and solve them.
(FWIW: I had to fight to bring them operational, manually forcing corosync.conf onto the other nodes. Key weapons here are corosync and pmxcfs -l; a rough outline of that manual sync is sketched after the config below.)

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: mynode0
    nodeid: 3
    quorum_votes: 1
    ring0_addr: mynode0
  }

  node {
    name: mynode3
    nodeid: 1
    quorum_votes: 1
    ring0_addr: mynode3
  }

  node {
    name: mynode4
    nodeid: 2
    quorum_votes: 1
    ring0_addr: mynode4
  }
}

quorum {
  two_node: 1
  auto_tie_breaker: 1
  last_man_standing: 0
  last_man_standing_window: 10000
  provider: corosync_votequorum
  wait_for_all: 0
}

totem {
  cluster_name: mycluster
  config_version: 9
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2
  interface {
  }
}
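For reference, the manual forcing on a node that refused the new config went roughly like this (a sketch from memory; corosync.conf.new is just a placeholder for wherever you keep the corrected copy, and the service commands are the ones on my PVE nodes, so double-check before running):

# stop the cluster services so /etc/pve can be modified locally
service pve-cluster stop
service corosync stop
# mount /etc/pve in local mode on this node only
pmxcfs -l
# put the corrected config in both places
cp /root/corosync.conf.new /etc/pve/corosync.conf
cp /root/corosync.conf.new /etc/corosync/corosync.conf
# leave local mode and bring the services back
killall pmxcfs
service corosync start
service pve-cluster start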
 
Once this is done, you need to restart at minimum:
service pve-cluster restart
(if any problem here, restart corosync and check that /etc/pve/corosync.conf is the expected version; a quick check is sketched below)
service pvedaemon restart
service pveproxy restart

... on the 3 nodes if the GUI admin still does not work
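Checking that both copies really carry the expected version is just something like this on each node (standard file locations assumed):

grep config_version /etc/pve/corosync.conf /etc/corosync/corosync.conf
pvecm status    # should show the expected number of nodes and "Quorate: Yes"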
 
