V3 Cluster setup across two sites with udp failed, fixed manually (mostly)

mattmann72 · Jul 11, 2013

I am trying to set up a cluster with 4 machines. Two servers (Proxmox1 and Proxmox2) are in the same location and the other two (Proxmox3 and Proxmox4) are in a different location about 150 miles away (4 hour drive each way). I operate the entire network between them. I have added high priority QoS to all traffic between these servers. I enabled and verified multicast. I also added all hosts to each other's hosts.conf files.

The first two servers are successfully in a cluster. When I tried adding the other two, both failed to reach quorum. Upon further investigation they failed with the following error:

Starting pve cluster filesystem : pve-clustercan't create shared ssh key database '/etc/pve/priv/authorized_keys'

Proxmox3 is able to ssh into Proxmox1, but Proxmox1 cannot access Proxmox3. Therefore the key on Proxmox3 did not get created. I assume this caused communication to fail.

Question 1: Since /root/.ssh/authorized_keys is a symlink to /etc/pve/priv/authorized_keys, how do I manually add in the key so it can get mounted in /etc/pve/priv/authorized_keys?

I then proceeded to install proxmox in a VM on my computer and use this as a template to remove any existing configuration on Proxmox3 (the node I was unable to add). At this point I am only trying to add in this node.

The steps I took are as follows:

On Proxmox1:
pvecm delnode proxmox3
rm /root/.ssh/known_hosts
removed the line in authorized_keys for proxmox3

On Proxmox3:
service pve-cluster stop
service cman stop
rm -Rf /var/lib/pve-cluster/* /var/lib/pve-cluster/.*
rm /etc/cluster/cluster.conf
reboot
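
A note on the `rm` line above: in many shells the `.*` glob also expands to `.` and `..`. GNU rm refuses to remove those and prints an error, but the cleanup then depends on that refusal. Below is a minimal sketch, assuming GNU find is available, of emptying a directory (dotfiles included) without that glob. The function name `empty_dir` and the file names are hypothetical, and the demo runs against a scratch directory, not the real /var/lib/pve-cluster.

```shell
#!/bin/sh
# Empty a directory, dotfiles included, without ever expanding "." or "..".
# Demonstrated on a scratch directory; point it at a real path only deliberately.
empty_dir() {
    # -mindepth 1 keeps the directory itself; -delete removes everything below it.
    find "$1" -mindepth 1 -delete
}

scratch=$(mktemp -d)
touch "$scratch/example.db" "$scratch/.hidden"   # hypothetical file names
mkdir "$scratch/subdir"
touch "$scratch/subdir/file"

empty_dir "$scratch"

remaining=$(find "$scratch" -mindepth 1 | wc -l | tr -d ' ')
echo "entries left: $remaining"
```

The directory itself survives, which matters if a service expects the path to exist when it starts again.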

As far as I can tell, at this point I should have a system that should be considered freshly installed.

Question 2: Did I miss anything that I should have done to undo all the changes caused by the attempt to add the node to the cluster?

So I attempted to re-add by IP.

On Proxmox3 I ran: pvecm add <Proxmox1 IP>

And it didn't work. Here is the output:

The authenticity of host '<Proxmox1 IP> (<Proxmox1 IP>)' can't be established.
ECDSA key fingerprint is 2c:a9:f1:86:e1:f3:56:90:a9:37:2c:50:a1:e8:20:4e.
Are you sure you want to continue connecting (yes/no)? yes
root@<Proxmox1 IP>'s password:
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-clustercan't create shared ssh key database '/etc/pve/priv/authorized_keys'
.
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... Timed-out waiting for cluster
[FAILED]
waiting for quorum...

It hangs here and doesn't do anything. So I cancelled the operation with Ctrl+C.

I proceeded to re-verify multicast by following the guide at http://pve.proxmox.com/wiki/Multicast_notes#Troubleshooting. Multicast and unicast both work.

Question 3: Why is this occurring?

So at this point I tried to set up the cluster manually by copying all files across with scp.

Add new host to /etc/cluster/cluster.conf on proxmox1
scp cluster.conf to proxmox3
Add id_rsa.pub from proxmox3 to /etc/pve/priv/authorized_keys on proxmox1
Add id_rsa.pub from each other server to /etc/pve/priv/authorized_keys on proxmox3
ssh from each host to each other host to make entries in known_hosts file
scp /etc/pve/* from proxmox1 to /etc/pve/ on proxmox3
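
The "ssh from each host to each other host" step makes it easy to miss a pair once there are four nodes. Here is a small sketch that only prints the ordered pairs and executes nothing. The host names are the ones from this post; the function name `print_ssh_pairs` and the `ssh root@host true` form are illustrative assumptions.

```shell
#!/bin/sh
# Print one ssh command per ordered pair of distinct hosts, so no pair is
# missed when populating known_hosts. Nothing is executed; this only prints.
HOSTS="proxmox1 proxmox2 proxmox3 proxmox4"

print_ssh_pairs() {
    for src in $1; do
        for dst in $1; do
            # Skip connecting a host to itself.
            [ "$src" = "$dst" ] && continue
            echo "on $src: ssh root@$dst true"
        done
    done
}

print_ssh_pairs "$HOSTS"
```

With four hosts this prints 12 lines, one per direction, since known_hosts is kept per source host.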

After restarting pve-cluster and cman I got to here:

root@proxmox3:/etc# /etc/init.d/cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... corosync died: Could not read cluster configuration Check cluster logs for details
[FAILED]

So I looked in the cluster logs and found:

Jul 10 13:45:20 corosync [MAIN ] Could not open /var/lib/pve-cluster/corosync.authkey: No such file or directory
Jul 10 13:45:20 corosync [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1745.

After checking that the /var/lib/pve-cluster/corosync.authkey file is the same on both proxmox1 and proxmox2, I used scp to copy the file from proxmox1 to proxmox3. Then I started cman again.
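
One way to do the "are the key files the same" check is by checksum. This is a minimal sketch, assuming `sha256sum` from GNU coreutils is installed. The function name `same_checksum` is hypothetical, and the demo uses scratch files rather than the real corosync.authkey.

```shell
#!/bin/sh
# Compare two copies of a key file by checksum before relying on them.
# The files below are scratch stand-ins, not the real corosync.authkey.
same_checksum() {
    a=$(sha256sum "$1" | cut -d' ' -f1)
    b=$(sha256sum "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]
}

tmp=$(mktemp -d)
printf 'example-key-material' > "$tmp/node1.authkey"
cp "$tmp/node1.authkey" "$tmp/node2.authkey"
printf 'different' > "$tmp/node3.authkey"

if same_checksum "$tmp/node1.authkey" "$tmp/node2.authkey"; then
    echo "node1 and node2 keys match"
fi
if ! same_checksum "$tmp/node1.authkey" "$tmp/node3.authkey"; then
    echo "node3 key differs"
fi
```

On real nodes the same idea is to run `sha256sum /var/lib/pve-cluster/corosync.authkey` on each host and compare the printed hashes by eye.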

After restarting cman on all systems it appears to have worked.

root@proxmox1:/var/lib/pve-cluster# pvecm nodes
Node Sts Inc Joined Name
1 M 23116 2013-07-10 13:52:15 proxmox1
2 M 23264 2013-07-10 13:53:35 proxmox2
3 M 23116 2013-07-10 13:52:15 proxmox3

root@proxmox2:/var/lib/pve-cluster# pvecm nodes
Node Sts Inc Joined Name
1 M 23264 2013-07-10 13:53:35 proxmox1
2 M 23264 2013-07-10 13:53:35 proxmox2
3 M 23264 2013-07-10 13:53:35 proxmox3

root@proxmox3:/var/log/cluster# pvecm nodes
Node Sts Inc Joined Name
1 M 23116 2013-07-10 13:52:15 proxmox1
2 M 23264 2013-07-10 13:53:35 proxmox2
3 M 2072 2013-07-10 13:49:41 proxmox3
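
To check output like the above without reading it by eye, here is a small sketch that counts the rows whose Sts column is `M`. It assumes that `M` marks a joined member and that the column layout is the one shown in this post. The function name `count_members` is hypothetical, and the sample is the proxmox3 output copied from above.

```shell
#!/bin/sh
# Count nodes whose status column (Sts) is "M" in `pvecm nodes`-style output.
# Assumes the layout shown in the post: Node Sts Inc Joined(date time) Name.
count_members() {
    # Skip the header row, then count rows whose second field is "M".
    awk 'NR > 1 && $2 == "M" { n++ } END { print n + 0 }'
}

sample='Node Sts Inc Joined Name
1 M 23116 2013-07-10 13:52:15 proxmox1
2 M 23264 2013-07-10 13:53:35 proxmox2
3 M 2072 2013-07-10 13:49:41 proxmox3'

members=$(printf '%s\n' "$sample" | count_members)
echo "members: $members"
```

On a live node the input would come from `pvecm nodes | count_members` instead of the pasted sample.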

After repeating the manual steps with proxmox4 I was able to have a fully functional 4 node cluster across two sites.

However the web interface on proxmox3 doesn't work. The web interface on proxmox4 works, but does not show any of the other hosts.

On proxmox3 netstat -a shows it running and I can make an https connection to port 8006, but no data is returned:

root@proxmox3:~# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 localhost.localdomai:85 *:* LISTEN
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 localhost.localdom:smtp *:* LISTEN
tcp 0 0 *:8006 *:* LISTEN
tcp 0 0 *:sunrpc *:* LISTEN
tcp 0 0 *:34065 *:* LISTEN

Anyone know how I can fix this?

*EDIT

Following forum.proxmox.com/threads/14502-Proxmox-3-0-Cluster-Node-Web-Interface-Problems I was able to fix the issue.

spirit · Jul 11, 2013

Something seems to be wrong at the multicast layer. Maybe it is related to the distance; I really don't know.
How much latency do you have between the two sites?

Also, I would like to warn you about running a 4-node cluster with 2 nodes at each site.
If the link between the sites fails, you will lose quorum at both sites.
The same happens if one datacenter has a power failure.
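
The arithmetic behind that warning, as a minimal sketch. It assumes a plain majority quorum with one vote per node; vote counts can be changed in the cluster configuration, in which case the numbers differ. The function names are hypothetical.

```shell
#!/bin/sh
# Majority quorum for a cluster where every node has one vote (an assumption;
# vote counts can be changed in the cluster configuration).
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

has_quorum() {
    # $1 = votes still reachable, $2 = total votes expected
    [ "$1" -ge "$(quorum_needed "$2")" ]
}

total=4
echo "quorum for $total nodes: $(quorum_needed "$total")"
for reachable in 2 3; do
    if has_quorum "$reachable" "$total"; then
        echo "$reachable of $total votes: quorate"
    else
        echo "$reachable of $total votes: not quorate"
    fi
done
```

With 4 votes the threshold is 3, so a 2/2 split leaves neither site quorate, while a site that can still see 3 votes stays quorate.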

mattmann72 · Jul 11, 2013

It's a wireless network between the two sites.

root@proxmox4:/var/log/pveproxy# ping proxmox2 -c 10
--- proxmox2.sbbnet.com ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9013ms
rtt min/avg/max/mdev = 6.807/8.829/10.296/0.928 ms
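
For repeated checks while testing the link, here is a small sketch that pulls the average round-trip time out of a ping summary line. It assumes the Linux `rtt min/avg/max/mdev = ...` format shown above. The function name `avg_rtt` is hypothetical, and the sample line is copied from the output above.

```shell
#!/bin/sh
# Extract the average round-trip time (ms) from Linux ping's summary line.
avg_rtt() {
    # Split on "=", then split the value list on "/"; the second value is avg.
    awk -F'=' '/^rtt / { split($2, v, "/"); print v[2] }'
}

sample='rtt min/avg/max/mdev = 6.807/8.829/10.296/0.928 ms'
avg=$(printf '%s\n' "$sample" | avg_rtt)
echo "average rtt: $avg ms"
```

On a live host the input would be `ping -c 10 proxmox2 | avg_rtt`.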

Currently this is just for testing. I will be bringing up 2 more nodes at another location next week. If this all works well we will not be upgrading to vmware 5 and instead moving everything to proxmox with a DRBD backend.

mattmann72 · Jul 11, 2013

I just discovered that proxmox4 is not getting the changes in /etc/pve like the other hosts are. So I tried to scp /etc/pve/* from proxmox1 to /etc/pve/ on proxmox4. After restarting pve-cluster and cman it is now fixed. I am adding in this step to my post above.
