[SOLVED] Adding third host to unicast 3-host cluster fails

Asano

Well-Known Member
Jan 27, 2018
Hi there,

I'm currently trying to set up a 3-host cluster in a datacenter where multicast isn't possible (Hetzner), so after reading about UDPU in https://pve.proxmox.com/wiki/Multicast_notes and running some tests locally, I was fairly positive this was going to work.
However, I was only able to add one host (seemingly with no problems at all) and am now stuck on the third/last one. Unfortunately I cannot reinstall over and over, as I need to order a KVM console each time... So before I do this the next time, I thought maybe someone could help me figure out what's going on, as I am clueless at this point...

So here is what I did (I'm not sure I got the order right, as it isn't specified in much detail in the link above):
  • Added the IPs and hostnames of all three hosts to every /etc/hosts file
  • Checked connectivity between each host via hostname -> no problems, ping stable between 0.4 and 0.6 ms, which should work well for corosync from what I have read
  • `pvecm create MyCluster` on host1
  • edited /etc/pve/corosync.conf on host1 and added `transport: udpu` as well as the intended nodelist, where name and ring0_addr of each node is the configured hostname (so 3 nodes from host1 to host3 with nodeid: 1 to nodeid: 3 in total; a rough sketch of the resulting config is below this list)
  • restarted corosync on host1
  • reclaimed quorum on host1 with `pvecm e 1` (otherwise I don't think I could have run the add command on the other hosts)
  • `pvecm add [HOST1-IP]` on host2 (worked, no errors)
  • `pvecm add [HOST1-IP]` on host3 (did not work)
    • initial errors, a few `trying to aquire cfs lock 'file-corosync_conf' ...`, but only 2 or 3 of those, and then it got stuck endlessly on `waiting for quorum...`
    • restarted several times and tried adding with '-force', which did not help
    • /etc/corosync/corosync.conf on host3 seems to have the correct copy of the config version with `transport: udpu`
    • trying to re-add, I now get the following (I'm not sure if the ssh key database error was there the first time too, but I always see it now - however I can `ssh host1` and `ssh host2`):
      Code:
      can't create shared ssh key database '/etc/pve/priv/authorized_keys'
      trying to aquire cfs lock 'file-corosync_conf' ...
      trying to aquire cfs lock 'file-corosync_conf' ...
      node host3 already defined
      copy corosync auth key
      stopping pve-cluster service
      backup old database
      Job for corosync.service failed because the control process exited with error code.
      See "systemctl status corosync.service" and "journalctl -xe" for details.
      waiting for quorum...
  • systemctl status corosync.service says
    Code:
    Jan 27 14:48:49 host3 corosync[6523]: info    [WD    ] no resources configured.
    Jan 27 14:48:49 host3 corosync[6523]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
    Jan 27 14:48:49 host3 corosync[6523]: notice  [QUORUM] Using quorum provider corosync_votequorum
    Jan 27 14:48:49 host3 corosync[6523]: crit    [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
    Jan 27 14:48:49 host3 corosync[6523]: error   [SERV  ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'
    Jan 27 14:48:49 host3 corosync[6523]: error   [MAIN  ] Corosync Cluster Engine exiting with status 20 at service.c:356.
    Jan 27 14:48:49 host3 systemd[1]: corosync.service: Main process exited, code=exited, status=20/n/a
    Jan 27 14:48:49 host3 systemd[1]: Failed to start Corosync Cluster Engine.
    Jan 27 14:48:49 host3 systemd[1]: corosync.service: Unit entered failed state.
    Jan 27 14:48:49 host3 systemd[1]: corosync.service: Failed with result 'exit-code'.
  • `pvecm status` on host3 says `Cannot initialize CMAP service`
  • `pvecm status` on host1 or host2 says
    Code:
    Quorum information
    ------------------
    Date:             Sat Jan 27 14:59:27 2018
    Quorum provider:  corosync_votequorum
    Nodes:            2
    Node ID:          0x00000001
    Ring ID:          1/96
    Quorate:          Yes
    
    Votequorum information
    ----------------------
    Expected votes:   3
    Highest expected: 3
    Total votes:      2
    Quorum:           2
    Flags:            Quorate
    
    Membership information
    ----------------------
        Nodeid      Votes Name
    0x00000001          1 [HOST1-IP] (local)
    0x00000002          1 [HOST2-IP]
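
For completeness, here is roughly what this looked like, first the /etc/hosts entries and then /etc/pve/corosync.conf on host1 after my edit. The IPs/subnet are placeholders and the rest is from memory, so treat it as a sketch rather than a verbatim copy:
Code:
# /etc/hosts (identical on all three hosts)
[HOST1-IP]  host1
[HOST2-IP]  host2
[HOST3-IP]  host3

# /etc/pve/corosync.conf on host1 (sketch)
totem {
  cluster_name: MyCluster
  config_version: 2
  version: 2
  transport: udpu
  interface {
    ringnumber: 0
    # bindnetaddr as written by pvecm create
    bindnetaddr: [HOST1-NETWORK]
  }
}

nodelist {
  node {
    name: host1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: host1
  }
  node {
    name: host2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: host2
  }
  node {
    name: host3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: host3
  }
}

quorum {
  provider: corosync_votequorum
}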

So, any ideas how I can fix host3 and add it, or at least how to avoid the same problem when I reinstall host3 another time? It would be great if this worked, as Proxmox 5 with ZFS migration seems to be a perfect match for such 'lower budget clusters', and it has a very fitting subscription price for this project size compared to other solutions (which are more expensive, and besides Hyper-V I know of no other that comes with this kind of VM/snapshot replication) ;-)

Thanks for help!
 
Jan 27 14:48:49 host3 corosync[6523]: error [SERV ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'
Looks like something is missing in the corosync.conf. Don't forget to increment the serial (config_version) if you update the config.

In general, I wouldn't recommend a setup like this when you don't own the network equipment, as unicast imposes quite a limitation on the number of nodes (network traffic).
 
@Alwin that led to the solution, only it was a case of 'too much' rather than something missing ;-) Some more googling in that direction revealed that this error can be misleading, as it also occurs when you try to connect a server from a subnet that differs from the one in the interface stanza's `bindnetaddr`. And that was the case for the third server. On the corosync mailing list, one suggestion is to simply remove the whole interface stanza (for example here: lists.corosync.org/pipermail/discuss/2015-January/003426.html), which also worked in my case.
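
In case someone runs into the same thing later: the totem section afterwards looked roughly like this (again just a sketch, the config_version number is only an example):
Code:
totem {
  cluster_name: MyCluster
  # bump config_version so the change gets picked up
  config_version: 3
  version: 2
  transport: udpu
  # interface { bindnetaddr: ... } stanza removed entirely;
  # with udpu corosync takes the member addresses from the nodelist ring0_addr entries
}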

@guletz that link looks interesting as well; now that my cluster basis is set up, I'm still struggling with a GRE tunnel and good performance / an optimized MTU/MSS for a private spanned network for the VMs. Maybe Tinc would be a better approach here than GRE. I'll look into it. However, for the basic host cluster network I just used a very simple IPsec configuration, which I think gets the job done well.

Regarding the unicast limitations: from what I have read this should be okay. This is a smaller project and will probably never go beyond 3 hosts; maybe in 2 or 3 years it will be 4, if it is still running by then... Since Hyper-V came out I've done similar projects with it, as it also offers okayish replication for this budget class of projects, but I never liked it very much. And for really neat setups you normally pay as much, if not more, for the storage/fibre setup alone as you pay here for everything, so that is not really an option on this budget. So I'll look further into Proxmox and am curious how it will do here.
 
I thought I did ;-)

tl;dr removed the whole `interface` stanza in the corosync.conf

Edit: Oh, and regarding GRE tunnels vs. Tinc (also a little OT): I dropped the whole idea of having all VMs in a single subnet. Instead I gave every host a pfSense VM with a unique subnet for that host's VMs and IPsec tunnels to the pfSenses on the other two hosts. This works well (~0.1-0.2 ms latency overhead for host-to-host VM connections and ~10% reduced throughput, but in return a super simple setup and no headaches for me with MTU/MSS mismatches and (R)STP oddities).
 