[SOLVED] Adding third host to unicast 3-host cluster fails

Asano

Well-Known Member
Jan 27, 2018
Hi there,

I'm currently trying to set up a 3-host cluster in a datacenter where multicast isn't possible (Hetzner). After reading about UDPU in https://pve.proxmox.com/wiki/Multicast_notes and doing some tests locally, I was fairly confident this would work.
However, I was only able to add one host (seemingly with no problems at all) and am now stuck on the third/last one. Unfortunately I can't reinstall over and over, as I need to order a KVM console each time... So before I do that again, I thought maybe someone could help me figure out what's going on, as I'm clueless at this point...

So here is what I did (I'm not sure I got the order right, as it isn't specified in much detail in the link above):
  • Added the IPs and host names of all three hosts to every /etc/hosts file
  • Checked connectivity between the hosts via hostname -> no problem; ping is stable between 0.4 and 0.6 ms, which should work fine for corosync from what I have read
  • `pvecm create MyCluster` on host1
  • edited /etc/pve/corosync.conf on host1 and added `transport: udpu` as well as the intended nodelist, where name and ring0_addr of each host is the configured hostname (so 3 nodes, host1 to host3, with nodeid: 1 to nodeid: 3 in total; see the corosync.conf sketch after this list)
  • restarted corosync on host1
  • regained quorum on host1 with `pvecm e 1` (otherwise I don't think I could have run the add command on the other hosts)
  • `pvecm add [HOST1-IP]` on host2 (worked, no errors)
  • `pvecm add [HOST1-IP]` on host3 (did not work)
    • initially a few `trying to aquire cfs lock 'file-corosync_conf' ...` messages (only 2 or 3), then it got stuck endlessly on `waiting for quorum...`
    • restarted several times and tried adding with '-force'; did not help
    • /etc/corosync/corosync.conf on host3 seems to have the correct copy of the version with `transport: udpu`
    • trying to re-add, I now get the following (I'm not sure if the ssh key database error was there the first time too, but I always see it now - however, I can `ssh host1` and `ssh host2`):
      Code:
      can't create shared ssh key database '/etc/pve/priv/authorized_keys'
      trying to aquire cfs lock 'file-corosync_conf' ...
      trying to aquire cfs lock 'file-corosync_conf' ...
      node host3 already defined
      copy corosync auth key
      stopping pve-cluster service
      backup old database
      Job for corosync.service failed because the control process exited with error code.
      See "systemctl status corosync.service" and "journalctl -xe" for details.
      waiting for quorum...
  • `systemctl status corosync.service` says
    Code:
    Jan 27 14:48:49 host3 corosync[6523]: info    [WD    ] no resources configured.
    Jan 27 14:48:49 host3 corosync[6523]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
    Jan 27 14:48:49 host3 corosync[6523]: notice  [QUORUM] Using quorum provider corosync_votequorum
    Jan 27 14:48:49 host3 corosync[6523]: crit    [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
    Jan 27 14:48:49 host3 corosync[6523]: error   [SERV  ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'
    Jan 27 14:48:49 host3 corosync[6523]: error   [MAIN  ] Corosync Cluster Engine exiting with status 20 at service.c:356.
    Jan 27 14:48:49 host3 systemd[1]: corosync.service: Main process exited, code=exited, status=20/n/a
    Jan 27 14:48:49 host3 systemd[1]: Failed to start Corosync Cluster Engine.
    Jan 27 14:48:49 host3 systemd[1]: corosync.service: Unit entered failed state.
    Jan 27 14:48:49 host3 systemd[1]: corosync.service: Failed with result 'exit-code'.
  • `pvecm status` on host3 says `Cannot initialize CMAP service`
  • `pvecm status` on host1 or host2 says
    Code:
    Quorum information
    ------------------
    Date:             Sat Jan 27 14:59:27 2018
    Quorum provider:  corosync_votequorum
    Nodes:            2
    Node ID:          0x00000001
    Ring ID:          1/96
    Quorate:          Yes
    
    Votequorum information
    ----------------------
    Expected votes:   3
    Highest expected: 3
    Total votes:      2
    Quorum:           2
    Flags:            Quorate
    
    Membership information
    ----------------------
        Nodeid      Votes Name
    0x00000001          1 [HOST1-IP] (local)
    0x00000002          1 [HOST2-IP]
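For reference, this is roughly what my /etc/pve/corosync.conf on host1 looked like after the edit mentioned above - hostnames and the bindnetaddr are placeholders rather than my real values, so treat it as a sketch, not an exact copy:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: host1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: host1
  }
  node {
    name: host2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: host2
  }
  node {
    name: host3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: host3
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: MyCluster
  config_version: 2
  interface {
    bindnetaddr: [HOST1-IP]
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2
}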

So, any ideas how I can fix host3 and add it, or at least how to avoid the same problem when reinstalling host3 another time? It would be great if this worked, as Proxmox 5 with ZFS migration seems a perfect match for such 'lower budget' clusters, and the subscription price fits this project size very well compared to other solutions (which are more expensive, and besides Hyper-V I know of no other that comes with this kind of VM/snapshot replication) ;-)

Thanks for any help!
 
Jan 27 14:48:49 host3 corosync[6523]: error [SERV ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'
Looks like something is missing in the corosync.conf. Don't forget to increment the serial (config_version) if you update the config.
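That is the `config_version` in the totem section of /etc/pve/corosync.conf - e.g. if it currently reads 2, bump it to 3 together with your change (the numbers here are just an example):
Code:
totem {
  ...
  # must be increased with every change, otherwise the new config is not applied
  config_version: 3
}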

In general, I wouldn't recommend doing a setup like this when you don't own the network equipment, as unicast imposes quite some limitations on the number of nodes (network traffic).
 
@Alwin that led to the solution, only the config had 'too much' rather than something missing ;-) Some more googling in that direction revealed that this error can be misleading, as it also happens when you try to join a server from a subnet that differs from the one in the interface stanza's `bindnetaddr`. And that was the case for the third server. On the corosync mailing list, one suggestion is to simply remove the whole interface stanza (for example here: lists.corosync.org/pipermail/discuss/2015-January/003426.html), which also worked in my case.
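In case it helps someone else, after removing the interface stanza, bumping config_version and restarting corosync (like after the first edit), my totem section ended up roughly like this - values from memory, so take it as a sketch:
Code:
totem {
  cluster_name: MyCluster
  config_version: 3
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2
}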

@guletz that link looks interesting as well. Now that my cluster basis is set up, I'm still struggling with a GRE tunnel and getting good performance/optimized MTU/MSS for a private spanned network for the VMs. Maybe Tinc would be a better approach here than GRE; I'll look into it. However, for the basic/host cluster network I just used a very simple IPsec configuration, which I think gets the job done well.

Regarding the unicast limitations: from what I read this should be okay. This is a smaller project and will probably never go beyond 3 hosts - maybe 4 in 2 or 3 years, if it is still running by then... Since Hyper-V came out I've done similar projects with it, as it also offers okay-ish replication for this budget class of projects, but I never liked it very much. And for really neat setups you normally pay as much, if not more, for the storage/fiber setup alone as you pay here for everything, so that isn't really an option on this budget. So I'll look further into Proxmox and am curious how it will do here.
 
I thought I did ;-)

tl;dr removed the whole `interface` stanza in the corosync.conf

Edit: Oh, and regarding GRE tunnels vs. Tinc (this is a little OT): I dropped the whole idea of having all VMs in a single subnet. Instead I gave every host a pfSense VM with a unique subnet for that host's VMs, and IPsec tunnels to the pfSenses on the other two hosts. This works well (~0.1-0.2 ms extra latency for VM host-to-host connections and ~10% reduced throughput, but in return a super simple setup and no headaches with MTU/MSS mismatches and (R)STP oddities).
 