Hi there,
I'm currently trying to set up a 3-host cluster in a datacenter where multicast isn't possible (Hetzner), so after reading about UDPU in https://pve.proxmox.com/wiki/Multicast_notes and running some local tests I was fairly positive this was going to work.
However, I was only able to add one host (seemingly with no problems at all) and am now stuck on the third/last one. Unfortunately I cannot try reinstalling over and over, as I need to order a KVM console each time... So before I do that the next time, I thought someone might help me figure out what's going on, as I am clueless at this point...
So here is what I did (and I'm not sure if I got the order right, as it isn't specified in much detail in the link above):
- Added the IPs with all host names to every /etc/hosts file (see the example after the `pvecm status` output below)
- Checked connectivity between each pair of hosts via hostname -> no problem; ping is stable between 0.4 and 0.6 ms, which should work well for corosync from what I have read
- `pvecm create MyCluster` on host1
- edited /etc/pve/corosync.conf on host1 and added `transport: udpu` as well as the intended nodelist, where `name` and `ring0_addr` of each node is the configured hostname (so 3 nodes, host1 to host3, with `nodeid: 1` to `nodeid: 3` in total; a sketch of the resulting file is after the `pvecm status` output below)
- restarted corosync on host1
- reclaimed quorum on host1 with `pvecm e 1` (otherwise I don't think I could have run the add command on the other hosts)
- `pvecm add [HOST1-IP]` on host2 (worked, no errors)
- `pvecm add [HOST1-IP]` on host3 (did not work)
- initial errors were a few `trying to aquire cfs lock 'file-corosync_conf' ...` messages, but there were only 2 or 3 of those, and then it hung endlessly on `waiting for quorum...`
- restarting several times and trying to add with '-force' did not help
- /etc/corosync/corosync.conf on host3 seems to contain the correct copy of the config with `transport: udpu`
- trying to re-add, I now get the following (I'm not sure whether the ssh key database error was there the first time too, but I now always see it; `ssh host1` and `ssh host2` work fine, though):
Code:
can't create shared ssh key database '/etc/pve/priv/authorized_keys'
trying to aquire cfs lock 'file-corosync_conf' ...
trying to aquire cfs lock 'file-corosync_conf' ...
node host3 already defined
copy corosync auth key
stopping pve-cluster service
backup old database
Job for corosync.service failed because the control process exited with error code.
See "systemctl status corosync.service" and "journalctl -xe" for details.
waiting for quorum...
- `systemctl status corosync.service` says
Code:
Jan 27 14:48:49 host3 corosync[6523]: info    [WD    ] no resources configured.
Jan 27 14:48:49 host3 corosync[6523]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jan 27 14:48:49 host3 corosync[6523]: notice  [QUORUM] Using quorum provider corosync_votequorum
Jan 27 14:48:49 host3 corosync[6523]: crit    [QUORUM] Quorum provider: corosync_votequorum failed to initialize.
Jan 27 14:48:49 host3 corosync[6523]: error   [SERV  ] Service engine 'corosync_quorum' failed to load for reason 'configuration error: nodelist or quorum.expected_votes must be configured!'
Jan 27 14:48:49 host3 corosync[6523]: error   [MAIN  ] Corosync Cluster Engine exiting with status 20 at service.c:356.
Jan 27 14:48:49 host3 systemd[1]: corosync.service: Main process exited, code=exited, status=20/n/a
Jan 27 14:48:49 host3 systemd[1]: Failed to start Corosync Cluster Engine.
Jan 27 14:48:49 host3 systemd[1]: corosync.service: Unit entered failed state.
Jan 27 14:48:49 host3 systemd[1]: corosync.service: Failed with result 'exit-code'.
- `pvecm status` on host3 says `Cannot initialize CMAP service`
- `pvecm status` on host1 or host2 says
Code:
Quorum information
------------------
Date:             Sat Jan 27 14:59:27 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1/96
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 [HOST1-IP] (local)
0x00000002          1 [HOST2-IP]
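For reference (as mentioned in the first step above), this is roughly what the /etc/hosts entries look like on every node; the IPs and domain names here are placeholders, the real ones are the Hetzner public addresses:

Code:
# /etc/hosts (same entries on host1, host2 and host3; placeholder addresses)
10.0.0.1   host1.example.com host1
10.0.0.2   host2.example.com host2
10.0.0.3   host3.example.com host3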
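And this is, from memory, a sketch of the corosync.conf I ended up with on host1 (hostnames, `bindnetaddr` and `config_version` are placeholders here); host3 appears to have an identical copy under /etc/corosync/corosync.conf, which is why the 'nodelist or quorum.expected_votes must be configured!' error confuses me:

Code:
# /etc/pve/corosync.conf (sketch with placeholder values)
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: host1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: host1
  }
  node {
    name: host2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: host2
  }
  node {
    name: host3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: host3
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: MyCluster
  config_version: 4
  ip_version: ipv4
  secauth: on
  transport: udpu
  version: 2
  interface {
    bindnetaddr: 10.0.0.1
    ringnumber: 0
  }
}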
So, any ideas how I can fix host3 and add it, or at least how to avoid the same problem when reinstalling host3 another time? It would be great to get this working, as Proxmox 5 with ZFS replication seems to be a perfect match for such 'lower budget' clusters, and its subscription price fits this project size very well compared to other solutions (which are more expensive, and besides Hyper-V I know of no other that comes with this kind of VM/snapshot replication) ;-)
Thanks for any help!