[SOLVED] Creating first cluster and have wrong IP

rml

Member
Apr 24, 2019
I've made a mess and need some help, please. I'm not quite sure what I've done wrong and can see it getting worse at this stage. It's probably several things, but I'm learning

When I come to add the 2nd node to the cluster I can see it's using the peer address of 172.31, when I was hoping it would use the 192 direct connection, as I understood it was better to have corosync communicating over a separate connection

To explain:

2 instances of proxmox 5.4.3.
  • proxmox.rml - hostname=PVE 172.31.187.51 (also .52) 192.168.10.1
  • proxplay.rml - hostname=proxplay 172.31.187.56 192.168.10.2

Note: Hostname on proxmox.rml is still PVE as I wasn't sure I could change it after I'd added VMs- seen different answers on the forum/ help and didn't want to risk it. Long term plan was to cluster them, move over the VMs to proxrml, demote proxmox and rebuild it correctly.
They both get their IPs to the main network 172. from DHCP (I appreciate that's a bit strange for a server - is it enough to break it?)
192.168 addresses are a direct 1Gb connection cable (no switch) between them - hence the /etc/hosts .cluster entries. Probably wrong?

I've just created a cluster on proxmox but when I look at the Join information it's showing the 172. address. I put proxrml.cluster.rml in the Ring 0. I'm assuming that was wrong


See "Cannot use default address safely" in

https://imgur.com/a/sRqkir7

Questions:
1) Can I safely remove this "cluster" without affecting the virtual machines as it's the only one joined to it. So that I can redo it properly, though I'm not sure what I've done wrong
2) Should I have edited the hostname beforehand or, as it's still unique does it not matter? Could I have done that ok with virtual machines/ containers on there?
3) What should I have put in Ring 0 when creating the cluster?
4) I'm just backing up machines from proxmox now. They were created before I made this cluster mess but will that affect them in any way if I need to restore to a fresh install.

Please let me know if I can post more information, I'm still googling so will add what I work out but it's a bit above me at present.

Huge thanks in advance for any help and your patience

proxmox.rml pvecm status output
root@pve:/etc/pve/nodes# pvecm status
Quorum information
------------------
Date: Thu May 2 13:51:08 2019
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/4
Quorate: Yes

Votequorum information
----------------------
Expected votes: 1
Highest expected: 1
Total votes: 1
Quorum: 1
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.10.1 (local)
root@pve:/etc/pve/nodes# cat /etc/pve/.members
{
"nodename": "pve",
"version": 3,
"cluster": { "name": "rmlcluster", "version": 1, "nodes": 1, "quorate": 1 },
"nodelist": {
"pve": { "id": 1, "online": 1, "ip": "172.31.187.51"}
}
}

proxmox.rml ip a
root@pve:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp4s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:15:17:a6:cc:0a brd ff:ff:ff:ff:ff:ff
inet 192.168.10.1/24 brd 192.168.10.255 scope global enp4s0f0
valid_lft forever preferred_lft forever
inet6 fe80::215:17ff:fea6:cc0a/64 scope link
valid_lft forever preferred_lft forever
3: enp1s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
link/ether 00:25:90:69:e3:e4 brd ff:ff:ff:ff:ff:ff
4: enp4s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:15:17:a6:cc:0b brd ff:ff:ff:ff:ff:ff
inet 172.31.186.107/23 brd 172.31.187.255 scope global enp4s0f1
valid_lft forever preferred_lft forever
inet6 fe80::215:17ff:fea6:cc0b/64 scope link
valid_lft forever preferred_lft forever
5: enp1s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:25:90:69:e3:e5 brd ff:ff:ff:ff:ff:ff
inet 172.31.187.52/23 brd 172.31.187.255 scope global enp1s0f1
valid_lft forever preferred_lft forever
inet6 fe80::225:90ff:fe69:e3e5/64 scope link
valid_lft forever preferred_lft forever
6: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:25:90:69:e3:e4 brd ff:ff:ff:ff:ff:ff
inet 172.31.187.51/23 brd 172.31.187.255 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::225:90ff:fe69:e3e4/64 scope link
valid_lft forever preferred_lft forever
7: tap100i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN group default qlen 1000
link/ether 96:f7:cc:92:c6:71 brd ff:ff:ff:ff:ff:ff
8: tap102i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN group default qlen 1000
link/ether c6:29:2b:38:dc:13 brd ff:ff:ff:ff:ff:ff
10: tap101i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN group default qlen 1000
link/ether 26:c6:5c:13:fa:b8 brd ff:ff:ff:ff:ff:ff
14: tap104i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN group default qlen 1000
link/ether 7a:96:23:55:d1:62 brd ff:ff:ff:ff:ff:ff
21: veth106i0@if20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master fwbr106i0 state UP group default qlen 1000
link/ether fe:d9:e0:49:28:9d brd ff:ff:ff:ff:ff:ff link-netnsid 0
22: fwbr106i0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 3a:a9:52:ce:7d:95 brd ff:ff:ff:ff:ff:ff
23: fwpr106p0@fwln106i0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000
link/ether ee:7b:6b:0f:46:d5 brd ff:ff:ff:ff:ff:ff
24: fwln106i0@fwpr106p0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master fwbr106i0 state UP group default qlen 1000
link/ether 3a:a9:52:ce:7d:95 brd ff:ff:ff:ff:ff:ff


proxmox.rml /etc/hosts
root@pve:/etc/pve/nodes# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
172.31.187.51 proxrml.rml proxrml pve

# The following lines are desirable for IPv6 capable hosts

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

# corosync network hosts
192.168.10.1 proxrml.cluster.rml
192.168.10.2 proxplay.cluster.rml

proxmox.rml /etc/network/interfaces
root@pve:/etc/pve/nodes# cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface enp1s0f0 inet manual

auto vmbr0
iface vmbr0 inet dhcp
bridge_ports enp1s0f0
bridge_stp off
bridge_fd 0

auto enp1s0f1
iface enp1s0f1 inet dhcp

auto enp3s0f0
iface enp3s0f0 inet dhcp

auto enp3s0f1
iface enp3s0f1 inet dhcp

auto enp4s0f0
iface enp4s0f0 inet static
address 192.168.10.1
netmask 255.255.255.0

auto enp4s0f1
iface enp4s0f1 inet dhcp

proxplay.rml ip a
root@proxplay:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
link/ether d4:ae:52:d3:5e:95 brd ff:ff:ff:ff:ff:ff
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether d4:ae:52:d3:5e:96 brd ff:ff:ff:ff:ff:ff
inet 192.168.10.2/24 brd 192.168.10.255 scope global eno2
valid_lft forever preferred_lft forever
inet6 fe80::d6ae:52ff:fed3:5e96/64 scope link
valid_lft forever preferred_lft forever
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether d4:ae:52:d3:5e:95 brd ff:ff:ff:ff:ff:ff
inet 172.31.187.56/23 brd 172.31.187.255 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::d6ae:52ff:fed3:5e95/64 scope link
valid_lft forever preferred_lft forever

proxplay.rml /etc/hosts
root@proxplay:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
172.31.187.56 proxplay.rml proxplay

# The following lines are desirable for IPv6 capable hosts

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

# corosync network hosts
192.168.10.1 proxrml.cluster.rml
192.168.10.2 proxplay.cluster.rml

proxplay.rml /etc/network/interfaces
root@proxplay:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface eno1 inet manual

auto vmbr0
iface vmbr0 inet dhcp
bridge_ports eno1
bridge_stp off
bridge_fd 0

auto eno2
iface eno2 inet static
address 192.168.10.2
netmask 255.255.255.0

EDIT: I can post the imgur photo links properly
 
1) Can I safely remove this "cluster" without affecting the virtual machines as it's the only one joined to it. So that I can redo it properly, though I'm not sure what I've done wrong
Migrate all your VMs to one node and remove the other node. You can follow the instructions on https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node
Also check on the same page 6.5.1. Separate A Node Without Reinstalling

2) Should I have edited the hostname beforehand or, as it's still unique does it not matter? Could I have done that ok with virtual machines/ containers on there?
You can still technically rename the node after adding it to a cluster, but you will have to edit corosync.conf as well:
If your node is in a cluster, where it is not recommended to change its name, adapt /etc/pve/corosync.conf so that the nodes name is changed also there. See https://pve.proxmox.com/pve-docs/chapter-pvecm.html#edit-corosync-conf

3) What should I have put in Ring 0 when creating the cluster?
The network you want the cluster nodes to communicate on.

4) I'm just backing up machines from proxmox now. They were created before I made this cluster mess but will that affect them in any way if I need to restore to a fresh install.
Should you decide to go for a fresh install, you can always restore your guests from the backup files on GUI or CLI (with `qm restore` or `pct restore`)
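For reference, a rough sketch of what the CLI restore could look like. The archive paths and VMIDs are made up for illustration, and restore_cmd is a hypothetical helper (not a Proxmox tool) that only prints the command so it can be checked before running it. It assumes the usual vzdump naming, where VM archives start with vzdump-qemu- and container archives with vzdump-lxc-.

```shell
# Print (not run) the restore command that matches a vzdump archive.
# restore_cmd is a hypothetical helper; paths and VMIDs are examples only.
restore_cmd() {
    archive="$1"
    vmid="$2"
    case "$(basename "$archive")" in
        vzdump-qemu-*) echo "qm restore $archive $vmid" ;;   # VM backup
        vzdump-lxc-*)  echo "pct restore $vmid $archive" ;;  # container backup
        *) echo "unknown archive type: $archive" >&2; return 1 ;;
    esac
}

restore_cmd /var/lib/vz/dump/vzdump-qemu-100-2019_05_02-13_51_08.vma.lzo 100
restore_cmd /var/lib/vz/dump/vzdump-lxc-106-2019_05_02-14_02_11.tar.lzo 106
```

Note the argument order differs: `qm restore` takes the archive first, `pct restore` takes the VMID first.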
 
Aha, thanks again

That page looks very useful and I'll sit down and read it all. Potentially "Separate After Cluster Creation"
I may actually find it easier through the CLI rather than the GUI actually but that's what I tried.

Can I just check a couple of things whilst I have you there please to be sure I understand as I'm reading that page.
Migrate all your VMs to one node
You're clear that my current position is that I only have 1 node in the original cluster (which has the VMs on it and the wrong hostname)- I haven't added the 2nd node yet (that node has no VMs or containers) ?

So should I migrate by adding it to the cluster and doing it that way, even with the wrong IP, or through vzdumps?

and remove the other node
All the pages I've found involve removing a node from a cluster. Have you seen something for removing the last node from a cluster? Or removing the cluster information from the last node probably makes more sense, as I think that's what I need to do here and then recreate the cluster properly (?)

The network you want the cluster nodes to communicate on.

So in this stage I thought I was being clever by adding in proxrml.cluster.rml into the Ring0 box which should have resolved to itself on the 192. direct connection
https://imgur.com/a/ZGZh3Ba
What should I have entered, should it be the IP of the server creating the cluster, or is a resolvable hostname ok?? I'm not sure what I did wrong in this step to get the join information showing the 172. address rather than the 192 one


Should you decide to go for a fresh install, you can always restore your guests from the backup files on GUI or CLI (with `qm restore` or `pct restore`)
Don't worry if not but are you aware if this will trip Win10 activation?

I'll probably forge ahead if the cluster will work on the 172 address anyway - the traffic isn't large so perhaps not having the direct connection is ok for the moment .


I meant to keep this all clean to play with but inevitably I got a bit excited and need to keep some of the VM safe. Never learn
 
Hi.

You're clear that my current position is that I only have 1 node in the original cluster (which has the VMs on it and the wrong hostname)- I haven't added the 2nd node yet (that node has no VMs or containers) ?

So should I migrate by adding it to the cluster and doing it that way, even with the wrong IP, or through vzdumps?

Looks like I misunderstood your situation. I thought you had a cluster with 2 nodes and wanted to build it from scratch because it was acting funky. You will not need to migrate any nodes. I suggest you build your node/cluster from scratch.

If you would like to build the node from scratch, take backups of all your guests and copy them somewhere else. Then do a fresh install on the node and restore your guests as described before.

All the pages I've found involve removing a node from a cluster. Have you seen something for removing the last node from a cluster? Or removing the cluster information from the last node probably makes more sense, as I think that's what I need to do here and then recreate the cluster properly (?)

Doing a fresh install is the easiest method for removing the last node from a cluster, but you could possibly repeat the steps from "Separate A Node Without Reinstalling" on it and change it back to standalone mode.

So in this stage I thought I was being clever by adding in proxrml.cluster.rml into the Ring0 box which should have resolved to itself on the 192. direct connection
https://imgur.com/a/ZGZh3Ba
What should I have entered, should it be the IP of the server creating the cluster, or is a resolvable hostname ok?? I'm not sure what I did wrong in this step to get the join information showing the 172. address rather than the 192 one

Resolvable hostnames should be fine. One possible cause of this IP address discrepancy in your situation is that /etc/hosts contains both IPs, or that your DNS server returned the local IP instead. So you could check the hosts file for that, or you can use the IP.

Don't worry if not but are you aware if this will trip Win10 activation?
I can't really say much about this, since it's a Microsoft algorithm and we don't know how it functions exactly. Only Microsoft can give you correct information on this. But, from what we've seen before, it seems like if you keep the same PVE version and the same VM config there's a decent possibility that the activation won't get lost (if it's already activated).
 
One possible cause of this IP address discrepancy in your situation is that /etc/hosts contains both IPs, or that your DNS server returned the local IP instead. So you could check the hosts file for that, or you can use the IP.

This is the bit that's confusing me. I've posted my /etc/hosts but appreciate that's a lot to look through so I'll try to summarise

proxrml.rml is the machine that I'm creating the cluster on
It gets its 172.X.X.X address from dhcp
192.168.10.1 is manually applied by the /etc/network/interfaces

root@pve:~# cat /etc/network/interfaces | grep 192
address 192.168.10.1

And proxrml.cluster.rml is only resolvable from the /etc/hosts
root@pve:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
172.31.187.51 proxrml.rml proxrml pve

# The following lines are desirable for IPv6 capable hosts

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

# corosync network hosts
192.168.10.1 proxrml.cluster.rml
192.168.10.2 proxplay.cluster.rml

i.e. The DHCP/ DNS on the router isn't aware of the 192 address

root@pve:~# nslookup proxrml.cluster.rml
Server: 172.31.187.1
Address: 172.31.187.1#53

** server can't find proxrml.cluster.rml: NXDOMAIN

And I entered proxrml.cluster.rml in the Ring0 box when setting up the cluster

VCMeYjq.png


But the cluster has instead chosen to use the 172. address
tv4IT7V.png


Although the pvecm status seems to recognise the 192. address

What's the significance of 'Cannot use default address safely'?

root@pve:~# cat /etc/hostspvecm status
cat: /etc/hostspvecm: No such file or directory
cat: status: No such file or directory
root@pve:~# pvecm status
Quorum information
------------------
Date: Fri May 3 16:33:20 2019
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/4
Quorate: Yes

Votequorum information
----------------------
Expected votes: 1
Highest expected: 1
Total votes: 1
Quorum: 1
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.10.1 (local)

But /etc/pve/.members has the 172. address. My guess is that this is because it's using the pve hostname, which resolves to the 172 address in /etc/hosts.
But I'm pretty sure you can't have two IPs for a hostname in /etc/hosts

root@pve:~# cat /etc/pve/.members
{
"nodename": "pve",
"version": 3,
"cluster": { "name": "rmlcluster", "version": 1, "nodes": 1, "quorate": 1 },
"nodelist": {
"pve": { "id": 1, "online": 1, "ip": "172.31.187.51"}
}
}

So:
Should I add the 192 into dhcp/ dns so it can be resolved (even if not contacted by that router) or should entering it into the /etc/hosts work

I think I need to understand what I should have put in Ring0 or how I should have done it or I'll just hit the same problem if I rebuild the server. Also it would be great to get it working and

Is it insane to just manually change the Ip in /etc/pve/.members?

Am I even close with the idea that it's got something to do with resolving the pve hostname through /etc/hosts?

In an ideal world, I'd get this joined to the cluster, even if it's not pretty (but only if the cluster was correctly working), then I can learn the migration, demotion, rebuild and promotion steps.

Bit off topic but I've been reading the quorum parts a bit better. These servers are never going to be huge professional production, more hobby. I'm toying with the idea of using an old machine just for quorum, even if it's not powerful enough to actually work as a proper host for machines. Is that mad?

Notes to self:
Process looks easier/ more understandable through CLI for next time
3rd network useful for migrations, see end of https://pve.proxmox.com/wiki/Cluster_Manager
 
Please post your current corosync config: `cat /etc/pve/corosync.conf`
You most likely want to correctly configure corosync for the static 192.168.10.x network there, please also read the corresponding sections of the manual with care:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_configuration

If you plan to use a node just for quorum, this might be of interest for you:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_external_vote_support
 
thanks Chris
/etc/pve/corosync.conf
logging {
debug: off
to_syslog: yes
}

nodelist {
node {
name: pve
nodeid: 1
quorum_votes: 1
ring0_addr: proxrml.cluster.rml
}
}

quorum {
provider: corosync_votequorum
}

totem {
cluster_name: rmlcluster
config_version: 1
interface {
bindnetaddr: proxrml.cluster.rml
ringnumber: 0
}
ip_version: ipv4
secauth: on
version: 2
}

root@pve:~# ping proxrml.cluster.rml
PING proxrml.cluster.rml (192.168.10.1) 56(84) bytes of data.
64 bytes from proxrml.cluster.rml (192.168.10.1): icmp_seq=1 ttl=64 time=0.056 ms
64 bytes from proxrml.cluster.rml (192.168.10.1): icmp_seq=2 ttl=64 time=0.047 ms

Off to read those sections now
 
I *think* multicast is ok?

Over 172. address

root@pve:~# omping -c 10000 -i 0.001 -F -q proxrml proxplay
proxplay : waiting for response msg
proxplay : joined (S,G) = (*, 232.43.211.234), pinging
proxplay : waiting for response msg
proxplay : server told us to stop

proxplay : unicast, xmt/rcv/%loss = 9882/9882/0%, min/avg/max/std-dev = 0.055/0.123/0.440/0.029
proxplay : multicast, xmt/rcv/%loss = 9882/9882/0%, min/avg/max/std-dev = 0.059/0.135/0.454/0.029

root@proxplay:~# omping -c 10000 -i 0.001 -F -q proxrml proxplay
proxrml : waiting for response msg
proxrml : waiting for response msg
proxrml : joined (S,G) = (*, 232.43.211.234), pinging
proxrml : given amount of query messages was sent

proxrml : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.059/0.153/0.300/0.043
proxrml : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.098/0.221/0.372/0.045

Over 192 address

root@pve:~# omping -c 10000 -i 0.001 -F -q proxrml.cluster.rml proxplay.cluster.rml
proxplay.cluster.rml : waiting for response msg
proxplay.cluster.rml : joined (S,G) = (*, 232.43.211.234), pinging
proxplay.cluster.rml : waiting for response msg
proxplay.cluster.rml : server told us to stop

proxplay.cluster.rml : unicast, xmt/rcv/%loss = 9566/9566/0%, min/avg/max/std-dev = 0.039/0.148/3.604/0.058
proxplay.cluster.rml : multicast, xmt/rcv/%loss = 9566/9566/0%, min/avg/max/std-dev = 0.058/0.160/3.616/0.067
root@pve:~#

root@proxplay:~# omping -c 10000 -i 0.001 -F -q proxrml.cluster.rml proxplay.cluster.rml
proxrml.cluster.rml : waiting for response msg
proxrml.cluster.rml : waiting for response msg
proxrml.cluster.rml : waiting for response msg
proxrml.cluster.rml : waiting for response msg
proxrml.cluster.rml : waiting for response msg
proxrml.cluster.rml : waiting for response msg
proxrml.cluster.rml : waiting for response msg
proxrml.cluster.rml : waiting for response msg
proxrml.cluster.rml : joined (S,G) = (*, 232.43.211.234), pinging
proxrml.cluster.rml : given amount of query messages was sent

proxrml.cluster.rml : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.072/0.184/4.819/0.081
proxrml.cluster.rml : multicast, xmt/rcv/%loss = 10000/9999/0%, min/avg/max/std-dev = 0.086/0.204/4.824/0.080
 
Yes, your network seems fine. You can join the new node to the cluster by running `pvecm add 192.168.10.1` on the node you want to join. For this you should definitely use the 192.168.10.x addresses and not the DHCP-leased ones.
 
Thanks Chris,

Sounds like a good solution and I'll add it manually but, just for my own understanding, is there something I should have done differently to do it through the GUI? Seems wrong.
 
Ah sorry, I overlooked your /etc/hosts config
Code:
172.31.187.51 proxrml.rml proxrml pve
Your hostname resolves to 172.31.187.51? How can you know the IP if you get it assigned via DHCP?
This should resolve to your static IP, the 192.168.10.x I assume?
We do on some occasions rely on resolving the IP address based on the /etc/hostname and /etc/hosts.
So since your hostname resolves to the "wrong" IP, this is probably why you get that IP in the GUI.
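To illustrate with the hosts file posted earlier, here is a small sketch of a first-match lookup. hosts_lookup is a hypothetical helper written just for this example (it is not how PVE itself does the lookup internally), and the sample is a trimmed copy of the posted /etc/hosts.

```shell
# hosts_lookup NAME FILE: print the first IP that FILE maps NAME to.
# Hypothetical helper for illustration; first matching line wins.
hosts_lookup() {
    awk -v name="$1" '
        /^[[:space:]]*#/ { next }            # skip comment lines
        { for (i = 2; i <= NF; i++)          # field 1 is the IP, the rest are names
              if ($i == name) { print $1; exit } }
    ' "$2"
}

# Trimmed copy of the /etc/hosts from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
127.0.0.1 localhost.localdomain localhost
172.31.187.51 proxrml.rml proxrml pve
# corosync network hosts
192.168.10.1 proxrml.cluster.rml
192.168.10.2 proxplay.cluster.rml
EOF

hosts_lookup pve "$sample"                  # node name: DHCP network address
hosts_lookup proxrml.cluster.rml "$sample"  # ring0 name: direct link address
```

The node name `pve` maps to 172.31.187.51, while only the `.cluster.rml` name maps to 192.168.10.1. On the node itself, `getent hosts pve` should show the same thing against the live configuration.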
 
Ok, I think that makes sense to me, thanks and I completely appreciate that using a point to point cable probably isn't the best way to link the two servers.

You can still know your IP if it's through DHCP but reserved. Bit odd to do it that way I guess, but it's handy as all IPs are then locatable from the DHCP server. I must confess I don't know whether I set that in hosts, would guess so.

That said, wouldn't it make sense to have the cluster server use the IP you've specifically chosen to use in the ring0 address when setting up the server?
By 'use' I mean have it as the peer address in that Cluster Join window? If I'm right you want this network to be specifically dedicated to the cluster communications, whereas the other 'normal' traffic should go through the main network.
Having proxmox resolve to that 172 address would be right in all those other normal occasions, with the cluster join ring0 entry working as an override to that for corosync traffic.

If it will all work fine by manually adding it I guess it's a minor thing, just checking my understanding.
I'm still not sure I understand that 'Cannot use default address safely' in the 'cluster join' tab though?

A huge thank you to you and oguz for spending time on me with this one, really appreciate it and thanks for all the hard work.
 
Ok, I think that makes sense to me, thanks and I completely appreciate that using a point to point cable probably isn't the best way to link the two servers
Actually point to point is rather nice, no useless troubles with switches in-between :).
You can still know your IP if it's through DHCP but reserved. Bit odd to do it that way I guess, but it's handy as all IPs are then locatable from the DHCP server. I must confess I don't know whether I set that in hosts, would guess so.
Okay, yes, you can do that. Normally you would set the cluster nodes' IPs statically and adapt the range of leased IPs in the DHCP server for other clients to avoid collisions.
That said, wouldn't it make sense to have the cluster server use the IP you've specifically chosen to use in the ring0 address when setting up the server?
Well you have to set the ring0 address if you want corosync to run on a separate network, see section 6.4.1 in the docs https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_join_node_to_cluster
I'm still not sure I understand that 'Cannot use default address safely' in the 'cluster join' tab though?
You will get this if the ring0 address (contained in the join info) is different from the join IP. In that case you have to add the ring0 address manually.

So if you want corosync on the 192.168.10.x and the rest on 172.31.187.x you will use 172.31.187.x to join and 192.168.10.x as ring0
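As a sketch of that split on the CLI, using the addresses from this thread: join_cmd is a hypothetical helper that only prints the command, so nothing is changed until you run the printed line yourself on the joining node. It assumes `--ring0_addr` is given the joining node's own address on the corosync network.

```shell
# Build the join command for a node whose corosync traffic should use a
# separate network. join_cmd is a hypothetical helper; it only prints.
join_cmd() {
    cluster_ip="$1"   # address of an existing cluster node to join through
    ring0_ip="$2"     # this node's own address on the corosync network
    echo "pvecm add $cluster_ip --ring0_addr $ring0_ip"
}

# Intended for proxplay: join via pve's main-network address,
# and use the direct link for ring0.
join_cmd 172.31.187.51 192.168.10.2
```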
 
Well you have to set the ring0 address

So if you want corosync on the 192.168.10.x and the rest on 172.31.187.x you will use 172.31.187.x to join and 192.168.10.x as ring0

That's kind of my main question. I thought I had set that by putting in proxrml.cluster.rml in the box when setting up the cluster but I still don't know what I should have done differently at this point



However, I added the 2nd one (proxplay) to the cluster but it's screwed up, stopping when asking for quorum. Left it some time but had to ctrl c out so presume it didn't complete

pvecm add proxrml.cluster.rml

I rebooted it and proxrml wouldn't start any of the VMs, complaining about 500 quorum (happy to change it back but just so something worked in the meantime)

pvecm expected 1

Proxplay was showing in the GUI at this point but with a red x next to it

I looked at the 2 corosync.confs and the nodes had different IPs i.e. one had the 172 address, 1 had the .cluster.rml (=192.) address

So I've manually changed them both. I've run a diff to confirm they're now identical

corosync.conf
root@pve:~# cat /etc/corosync/corosync.conf
logging {
debug: off
to_syslog: yes
}

nodelist {
node {
name: proxplay
nodeid: 2
quorum_votes: 1
ring0_addr: proxplay.cluster.rml
}
node {
name: pve
nodeid: 1
quorum_votes: 1
ring0_addr: proxrml.cluster.rml
}
}

quorum {
provider: corosync_votequorum
}

totem {
cluster_name: rmlcluster
config_version: 2
interface {
bindnetaddr: proxrml.cluster.rml
ringnumber: 0
}
ip_version: ipv4
secauth: on
version: 2
}
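A quick way to check that every node entry points at the cluster network is to list the name and ring0_addr of each node block. This is only a sketch: ring0_addrs is a hypothetical helper, and it assumes the simple one-key-per-line layout shown above. The sample is a trimmed copy of that config.

```shell
# ring0_addrs FILE: print "name ring0_addr" for every node block.
# Hypothetical helper; assumes one "key: value" pair per line.
ring0_addrs() {
    awk '
        $1 == "name:"       { name = $2 }
        $1 == "ring0_addr:" { addr = $2 }
        $1 == "}" && name != "" { print name, addr; name = ""; addr = "" }
    ' "$1"
}

# Trimmed copy of the config above, written to a temp file for the demo.
conf=$(mktemp)
cat > "$conf" <<'EOF'
nodelist {
  node {
    name: proxplay
    nodeid: 2
    quorum_votes: 1
    ring0_addr: proxplay.cluster.rml
  }
  node {
    name: pve
    nodeid: 1
    quorum_votes: 1
    ring0_addr: proxrml.cluster.rml
  }
}
totem {
  cluster_name: rmlcluster
}
EOF

ring0_addrs "$conf"
```

Running the same function against the real corosync.conf on both nodes and comparing the output is a lighter check than reading a full diff.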

corosync seems happy on proxplay
root@proxplay:/etc/corosync# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2019-05-15 21:15:49 BST; 16min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 3773 (corosync)
Tasks: 2 (limit: 4915)
Memory: 38.6M
CPU: 8.906s
CGroup: /system.slice/corosync.service
└─3773 /usr/sbin/corosync -f

May 15 21:15:50 proxplay corosync[3773]: warning [CPG ] downlist left_list: 0 received
May 15 21:15:50 proxplay corosync[3773]: [CPG ] downlist left_list: 0 received
May 15 21:15:50 proxplay corosync[3773]: warning [CPG ] downlist left_list: 0 received
May 15 21:15:50 proxplay corosync[3773]: [CPG ] downlist left_list: 0 received
May 15 21:15:50 proxplay corosync[3773]: notice [QUORUM] This node is within the primary component and will provide service.
May 15 21:15:50 proxplay corosync[3773]: notice [QUORUM] Members[2]: 1 2
May 15 21:15:50 proxplay corosync[3773]: notice [MAIN ] Completed service synchronization, ready to provide service.
May 15 21:15:50 proxplay corosync[3773]: [QUORUM] This node is within the primary component and will provide service.
May 15 21:15:50 proxplay corosync[3773]: [QUORUM] Members[2]: 1 2
May 15 21:15:50 proxplay corosync[3773]: [MAIN ] Completed service synchronization, ready to provide service.


And I can now see the 2 nodes in the GUI but proxplay is spitting errors about keys....

proxplay journalctl -xe
May 15 21:24:44 proxplay pveproxy[6116]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1683.
May 15 21:24:49 proxplay pveproxy[6114]: worker exit
May 15 21:24:49 proxplay pveproxy[6115]: worker exit
May 15 21:24:49 proxplay pveproxy[1935]: worker 6114 finished
May 15 21:24:49 proxplay pveproxy[1935]: worker 6115 finished
May 15 21:24:49 proxplay pveproxy[1935]: starting 2 worker(s)
May 15 21:24:49 proxplay pveproxy[1935]: worker 6127 started
May 15 21:24:49 proxplay pveproxy[1935]: worker 6128 started
May 15 21:24:49 proxplay pveproxy[6127]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1683.
May 15 21:24:49 proxplay pveproxy[6128]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1683.
May 15 21:24:49 proxplay pveproxy[6116]: worker exit
May 15 21:24:49 proxplay pveproxy[1935]: worker 6116 finished
May 15 21:24:49 proxplay pveproxy[1935]: starting 1 worker(s)
May 15 21:24:49 proxplay pveproxy[1935]: worker 6129 started
May 15 21:24:49 proxplay pveproxy[6129]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1683.
May 15 21:24:54 proxplay pveproxy[6127]: worker exit
May 15 21:24:54 proxplay pveproxy[6128]: worker exit
May 15 21:24:54 proxplay pveproxy[1935]: worker 6128 finished
May 15 21:24:54 proxplay pveproxy[1935]: worker 6127 finished
May 15 21:24:54 proxplay pveproxy[1935]: starting 2 worker(s)
May 15 21:24:54 proxplay pveproxy[1935]: worker 6151 started
May 15 21:24:54 proxplay pveproxy[1935]: worker 6152 started
May 15 21:24:54 proxplay pveproxy[6151]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1683.
May 15 21:24:54 proxplay pveproxy[6152]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1683.
May 15 21:24:54 proxplay pveproxy[6129]: worker exit
May 15 21:24:54 proxplay pveproxy[1935]: worker 6129 finished
May 15 21:24:54 proxplay pveproxy[1935]: starting 1 worker(s)
May 15 21:24:54 proxplay pveproxy[1935]: worker 6153 started
May 15 21:24:54 proxplay pveproxy[6153]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1683.

Is there any way to rescue this now? Somehow complete the rest of the pvecm add process by manually adding keys?


If not, what should corosync.conf shown as the node IP before I added the
 
I think that

Code:
pvecm updatecerts

Has got rid of the errors

but if anyone is kind enough to confirm whether corosync.conf and

pvecm status
Quorum information
------------------
Date: Wed May 15 21:59:02 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1/20
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.10.1 (local)
0x00000002 1 192.168.10.2

look ok now?

Is it OK to keep the quorum at 1 in a 2-node cluster? It looks like it may have been set back to 2 anyway (?)
 
This seems strange, as your corosync.conf before joining seems OK and the IPs should be resolved correctly. Was the IP for proxplay resolved incorrectly?
Still not sure why you do not use IPs directly, as they are not going to change anyway.
If you want the cluster to remain consistent, and especially if you plan to add a quorum device later, it makes sense to keep the expected votes at 2 for the 2-node setup (which is indeed already set).
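The Quorum figure in the pvecm status output follows from simple majority arithmetic. A minimal sketch, assuming default votequorum behaviour with no special options such as two_node; the function name quorum_needed is made up for illustration:

```shell
# Hypothetical helper: votes needed for quorum under a simple majority.
# Assumes default votequorum settings (no two_node or similar options).
quorum_needed() {
    # $1 = expected votes
    echo $(( $1 / 2 + 1 ))
}

quorum_needed 2   # two nodes with one vote each
```

With 2 expected votes the result is 2, which matches the "Quorum: 2" line in the output above, so both nodes have to be online for the cluster to be quorate.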
 
Thanks for sticking with me here Chris, know it must be painful but it really is hugely appreciated.

I think it all comes back to the first node being wrong doesn't it? Look for cat /etc/pve/.members under 'proxmox.rml pvecm status output' in my first post, where only proxrml was in the cluster.

As I understand a little more, shouldn't that IP have always been the 192 (corosync) address, rather than the 172?

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.10.1 (local)
root@pve:/etc/pve/nodes# cat /etc/pve/.members
{
"nodename": "pve",
"version": 3,
"cluster": { "name": "rmlcluster", "version": 1, "nodes": 1, "quorate": 1 },
"nodelist": {
"pve": { "id": 1, "online": 1, "ip": "172.31.187.51"}
}
}


I'm sure you're right about using IPs rather than resolving hosts. It's just a habit I've got into in case they ever need to change. But as resolving doesn't seem to be the issue for this part, is it that much of a problem? Apart perhaps from the GUI resolving the ring0 information incorrectly when creating the cluster, which you've put down to the hosts file.
To be clear, are we saying that if I'd put the correct Ring0 IP in, it would have been OK?

To be fair to me though I thought I was following the manual https://pve.proxmox.com/wiki/Separate_Cluster_Network

Now configure the /etc/hosts file so that we can use hostnames in the corosync config. This isn't strictly necessary you can also set the addresses directly but helps to keep the overview and is considered as good practice.

and

ringX_addr
Hostname (or IP) of the corosync ringX (X can be 0 or 1) address of this node. There can be also two rings, see Redundant Ring Protocol for setup instructions.

Normally there for corosync defined hostname from the /etc/hosts file for that.

Perhaps the problem here is that the GUI for cluster creation doesn't have an option to set the bindnetaddr so that corosync uses the right network? Or, more likely, that I'm an idiot and still missing something?



That second sentence from the manual under ringX_addr could be clearer, but it follows setting up hosts resolution for the corosync network. Surely, if you're right and the GUI is resolving the hostname to the 172 address, this will always be wrong. You just can't have an /etc/hosts hostname resolving to two separate IPs? I'm happy to be wrong, but that Ring0 address should have resolved to 192 when entered into the GUI (?)
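One way to see what the hosts file itself says about a name is to read it directly. This is only a sketch: hosts_lookup is a made-up helper, it prints the IP from the first matching line of the file, and it does not tell you what the GUI or corosync actually resolved (for that, `getent hosts NAME` queries the system resolver).

```shell
# Hypothetical helper: print the first IP that a hosts-format file maps a name to.
# It only reads the file; it does not go through the system resolver.
hosts_lookup() {
    # $1 = hostname to look up, $2 = hosts file (defaults to /etc/hosts)
    awk -v name="$1" '
        /^[[:space:]]*#/ { next }                  # skip comment lines
        { for (i = 2; i <= NF; i++)                # fields after the IP are names
              if ($i == name) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}

# Usage: hosts_lookup proxrml.cluster.rml
```

If this prints a 192 address while the cluster ended up on the 172 one, the name was probably resolved somewhere other than this file.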

------- Moving forward --------

Although I'm now more confused, both nodes are appearing in the GUI OK after the manual changes to corosync and redoing the keys. I'm a bit nervous about moving a VM over at the moment, but it seems to be working.

It's quite possible I've done the manual editing incorrectly....

pvecm status shows the correct corosync information

root@proxplay:~# pvecm status
Quorum information
------------------
Date: Thu May 16 13:55:08 2019
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000002
Ring ID: 1/28
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.10.1
0x00000002 1 192.168.10.2 (local)

However, cat /etc/pve/.members shows the 172 addresses. Should these be the corosync 192 addresses as well, or at least consistent with pvecm status? Is there anywhere else I should be looking for information?

root@proxplay:~# cat /etc/pve/.members
{
"nodename": "proxplay",
"version": 8,
"cluster": { "name": "rmlcluster", "version": 2, "nodes": 2, "quorate": 1 },
"nodelist": {
"proxplay": { "id": 2, "online": 1, "ip": "172.31.187.56"},
"pve": { "id": 1, "online": 1, "ip": "172.31.187.51"}
}
}
 
Hmm, and the corosync.confs seem out of line as well, in that they mix the 172 and 192 addresses in ring0. But at least they both look the same

proxrml

root@pve:~# cat /etc/pve/corosync.conf
logging {
debug: off
to_syslog: yes
}

nodelist {
node {
name: proxplay
nodeid: 2
quorum_votes: 1
ring0_addr: 172.31.187.56
}
node {
name: pve
nodeid: 1
quorum_votes: 1
ring0_addr: proxrml.cluster.rml
}
}

quorum {
provider: corosync_votequorum
}

totem {
cluster_name: rmlcluster
config_version: 2
interface {
bindnetaddr: proxrml.cluster.rml
ringnumber: 0
}
ip_version: ipv4
secauth: on
version: 2
}

proxplay
root@proxplay:~# cat /etc/corosync/corosync.conf
logging {
debug: off
to_syslog: yes
}

nodelist {
node {
name: proxplay
nodeid: 2
quorum_votes: 1
ring0_addr: 172.31.187.56
}
node {
name: pve
nodeid: 1
quorum_votes: 1
ring0_addr: proxrml.cluster.rml
}
}

quorum {
provider: corosync_votequorum
}

totem {
cluster_name: rmlcluster
config_version: 2
interface {
bindnetaddr: proxrml.cluster.rml
ringnumber: 0
}
ip_version: ipv4
secauth: on
version: 2
}
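A quick way to spot mixed ring addresses like the ones above is to list each ring0_addr and compare it with the expected network. This is a rough sketch: check_ring0 is a made-up helper, the 192.168.10. prefix is taken from the pvecm status output, and hostnames are flagged too, because they would still have to be resolved separately.

```shell
# Hypothetical check: list each ring0_addr and flag entries outside a given prefix.
# Hostnames are flagged as CHECK as well, since they are not literal addresses.
check_ring0() {
    # $1 = corosync.conf path, $2 = expected address prefix, e.g. 192.168.10.
    awk -v prefix="$2" '
        $1 == "name:"       { node = $2 }          # remember the current node name
        $1 == "ring0_addr:" {
            status = (index($2, prefix) == 1) ? "ok" : "CHECK"
            print node, $2, status
        }
    ' "$1"
}

# Usage: check_ring0 /etc/pve/corosync.conf 192.168.10.
```

Run against the configs shown here, both entries would be flagged: proxplay because it uses the 172 address, and pve because it uses a hostname.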
 
Hmm, and the corosync.confs seem out of line as well, in that they mix the 172 and 192 addresses in ring0. But at least they both look the same
This is not correct. Didn't you have the correct config before? It seems that for some reason your name resolves to the wrong IP.
To be fair to me though I thought I was following the manual https://pve.proxmox.com/wiki/Separate_Cluster_Network
Yes, you are right, but as you can see, it can complicate things if your name resolution is off.

In your case I would probably follow the advice of my colleague oguz and start with a clean, fresh install, now that you are more familiar with the system and the possible hurdles.