I am trying to set up a cluster with 4 machines. Two servers (Proxmox1 and Proxmox2) are in the same location, and the other two (Proxmox3 and Proxmox4) are in a different location about 150 miles away (a 4 hour drive each way). I operate the entire network between them. I have applied high-priority QoS to all traffic between these servers, and I enabled and verified multicast. I also added all hosts to each other's hosts.conf files.
The first two servers are successfully in a cluster. When I tried adding the other two, both failed to reach quorum. Upon further investigation they failed with the following error:
Starting pve cluster filesystem : pve-clustercan't create shared ssh key database '/etc/pve/priv/authorized_keys'
Proxmox3 is able to ssh into Proxmox1, but Proxmox1 cannot ssh into Proxmox3. Therefore the key on Proxmox3 did not get created. I assume this caused communication to fail.
Question 1: Since /root/.ssh/authorized_keys is a symlink to /etc/pve/priv/authorized_keys, how do I manually add in the key so it can get mounted in /etc/pve/priv/authorized_keys?
I then proceeded to install proxmox in a VM on my computer and use this as a template to remove any existing configuration on Proxmox3 (the node I was unable to add). At this point I am only trying to add in this node.
The steps I took are as follows:
On Proxmox1:
pvecm delnode proxmox3
rm /root/.ssh/known_hosts
removed the line in authorized_keys for proxmox3
On Proxmox3:
service pve-cluster stop
service cman stop
rm -Rf /var/lib/pve-cluster/* /var/lib/pve-cluster/.*
rm /etc/cluster/cluster.conf
reboot
As far as I can tell, at this point the system should be equivalent to a fresh install.
Question 2: Did I miss anything that I should have done to undo all the changes caused by the attempt to add the node to the cluster?
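For anyone repeating this cleanup, here is a minimal sketch of how the result could be checked. The `leftovers` helper is hypothetical (it is not part of Proxmox), and the two paths in the example call are only the ones mentioned in this post, so the list is probably not complete.

```shell
#!/bin/sh
# Hypothetical helper: print every path from the argument list that still
# exists. Exit status 0 means none of the listed paths was left behind.
leftovers() {
    found=0
    for p in "$@"; do
        # -L also catches dangling symlinks, which -e alone would miss.
        if [ -e "$p" ] || [ -L "$p" ]; then
            printf '%s\n' "$p"
            found=1
        fi
    done
    return "$found"
}

# Example call on the node that was reset (paths taken from this post):
# leftovers /etc/cluster/cluster.conf /var/lib/pve-cluster/corosync.authkey
```

If the call prints nothing and returns 0, the listed files are gone. It says nothing about state stored elsewhere.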
So I attempted to re-add the node by IP.
On Proxmox3 I ran: pvecm add <Proxmox1 IP>
And it didn't work. Here is the output:
The authenticity of host '<Proxmox1 IP> (<Proxmox1 IP>)' can't be established.
ECDSA key fingerprint is 2c:a9:f1:86:e1:f3:56:90:a9:37:2c:50:a1:e8:20:4e.
Are you sure you want to continue connecting (yes/no)? yes
root@<Proxmox1 IP>'s password:
copy corosync auth key
stopping pve-cluster service
Stopping pve cluster filesystem: pve-cluster.
backup old database
Starting pve cluster filesystem : pve-clustercan't create shared ssh key database '/etc/pve/priv/authorized_keys'
.
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... Timed-out waiting for cluster
[FAILED]
waiting for quorum...
It hangs here and doesn't do anything, so I cancelled the operation with Ctrl+C.
I proceeded to re-verify multicast by following the guide at http://pve.proxmox.com/wiki/Multicast_notes#Troubleshooting. Multicast and unicast both work.
Question 3: Why is this occurring?
So at this point I tried to set up the cluster manually by copying all files across with scp.
Add new host to /etc/cluster/cluster.conf on proxmox1
scp cluster.conf to proxmox3
Add id_rsa.pub from proxmox3 to /etc/pve/priv/authorized_keys on proxmox1
Add id_rsa.pub from each other server to /etc/pve/priv/authorized_keys on proxmox3
ssh from each host to each other host to make entries in known_hosts file
scp /etc/pve/* from proxmox1 to /etc/pve/ on proxmox3
After restarting pve-cluster and cman I got to here:
root@proxmox3:/etc# /etc/init.d/cman start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... corosync died: Could not read cluster configuration Check cluster logs for details
[FAILED]
So I looked in the cluster logs and found:
Jul 10 13:45:20 corosync [MAIN ] Could not open /var/lib/pve-cluster/corosync.authkey: No such file or directory
Jul 10 13:45:20 corosync [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1745.
After checking that the /var/lib/pve-cluster/corosync.authkey files on proxmox1 and proxmox2 are identical, I used scp to copy the file from proxmox1 to proxmox3. Then I started cman again.
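The "are the keys the same" check can be sketched roughly as below. This assumes root ssh already works between the nodes and that `sha256sum` is installed on each of them. `all_digests_match` is a hypothetical helper name, not a Proxmox command.

```shell
#!/bin/sh
# Hypothetical helper: read "digest  filename" lines (sha256sum output format)
# on stdin. Succeeds only if at least one line was read and every digest is
# identical.
all_digests_match() {
    awk '{ seen[$1] = 1; n++ }
         END { c = 0; for (d in seen) c++; exit !(n > 0 && c == 1) }'
}

# Example call (host names as used in this post; nothing is copied off a node):
# for h in proxmox1 proxmox2 proxmox3; do
#     ssh "root@$h" sha256sum /var/lib/pve-cluster/corosync.authkey
# done | all_digests_match && echo "authkey identical on all nodes"
```

Comparing digests over ssh avoids leaving extra copies of the key lying around in temporary directories.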
After restarting cman on all systems it appears to have worked.
root@proxmox1:/var/lib/pve-cluster# pvecm nodes
Node Sts Inc Joined Name
1 M 23116 2013-07-10 13:52:15 proxmox1
2 M 23264 2013-07-10 13:53:35 proxmox2
3 M 23116 2013-07-10 13:52:15 proxmox3
root@proxmox2:/var/lib/pve-cluster# pvecm nodes
Node Sts Inc Joined Name
1 M 23264 2013-07-10 13:53:35 proxmox1
2 M 23264 2013-07-10 13:53:35 proxmox2
3 M 23264 2013-07-10 13:53:35 proxmox3
root@proxmox3:/var/log/cluster# pvecm nodes
Node Sts Inc Joined Name
1 M 23116 2013-07-10 13:52:15 proxmox1
2 M 23264 2013-07-10 13:53:35 proxmox2
3 M 2072 2013-07-10 13:49:41 proxmox3
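To compare the three listings without reading them by eye, something like the following could be run on each node. It assumes the column layout shown above and that a status of "M" means the node is a cluster member. `count_members` is a hypothetical helper, not part of pvecm.

```shell
#!/bin/sh
# Hypothetical helper: count rows of `pvecm nodes` output (read from stdin)
# whose first column is a numeric node ID and whose status column is "M".
count_members() {
    awk '$1 ~ /^[0-9]+$/ && $2 == "M" { n++ } END { print n + 0 }'
}

# Example call on any node:
# pvecm nodes | count_members
```

With the output above, each node would report 3. The sketch does not look at the differing Inc values.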
After repeating the manual steps with proxmox4, I had a fully functional 4-node cluster across two sites.
However, the web interface on proxmox3 doesn't work. The web interface on proxmox4 works, but does not show any of the other hosts.
On proxmox3, netstat -a shows the service listening, and I can make an https connection to port 8006, but no data is returned:
root@proxmox3:~# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 localhost.localdomai:85 *:* LISTEN
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 localhost.localdom:smtp *:* LISTEN
tcp 0 0 *:8006 *:* LISTEN
tcp 0 0 *:sunrpc *:* LISTEN
tcp 0 0 *:34065 *:* LISTEN
Anyone know how I can fix this?
*EDIT
Following forum.proxmox.com/threads/14502-Proxmox-3-0-Cluster-Node-Web-Interface-Problems I was able to fix the issue.