Hi,
I'm attempting to set up a 4-node cluster whose members mostly sit on different switches. Two of the servers (the first one I added and the third) joined the cluster correctly. The second node fails to join with the following error:
root@pxmxv1001:/etc/pve# ssh pxmxv1002
Password:
Linux pxmxv1002 2.6.32-37-pve #1 SMP Wed Feb 11 10:00:27 CET 2015 x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Thu Apr 9 14:18:23 2015 from pxmxv1001.tableauprod.net
root@pxmxv1002:~# pvecm add pxmxv1001
unable to copy ssh ID
root@pxmxv1002:~# ssh-copy-id -vv pxmxv1001
OpenSSH_6.0p1 Debian-4+deb7u2, OpenSSL 1.0.1e 11 Feb 2013
Pseudo-terminal will not be allocated because stdin is not a terminal.
debug1: Reading configuration data /root/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug2: ssh_connect: needpriv 0
ssh: Could not resolve hostname umask 077; test -d ~/.ssh || mkdir ~/.ssh ; cat >> ~/.ssh/authorized_keys && (test -x /sbin/restorec: Name or service not known
It appears that ssh is treating the remote command string passed by ssh-copy-id as the hostname. This seems strange, because all the nodes were installed from the same Proxmox VE 3.4 release and I didn't get this error on the 03 or 04 nodes.
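One reading worth ruling out (an assumption, not a confirmed diagnosis): if the ssh-copy-id script on this release hands its first positional argument straight to ssh as the target host, then the -vv added for debugging would itself push the remote command into the hostname slot. The sketch below only simulates that argument handling. `fake_copy_id` is a hypothetical stand-in, not the real script, and it prints an ssh command line instead of connecting anywhere.

```shell
#!/bin/sh
# Hypothetical stand-in for a copy-id style script that assumes "$1" is the host.
# It prints the ssh argument list it would build; nothing is executed remotely.
fake_copy_id() {
    host="$1"   # assumption: the first argument is taken as the target host
    remote_cmd='umask 077; cat >> ~/.ssh/authorized_keys'
    printf 'ssh %s "%s"\n' "$host" "$remote_cmd"
}

fake_copy_id pxmxv1001        # host lands where ssh expects it
fake_copy_id -vv pxmxv1001    # the option takes the host's place
```

In the second call the option ends up where the host should be, so ssh would read the quoted command as the hostname, which has the same shape as the error above. Re-running ssh-copy-id without -vv would show whether the original "unable to copy ssh ID" failure has a different cause.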
I'd read in another forum that this could be due to multicast issues (though I don't understand why that would produce this error). Oddly, omping gets no response even from the 03 node, which appears to be a healthy member of the cluster:
root@pxmxv1001:/etc/pve# omping 239.192.89.254 pxmxv1001 pxmxv1002 pxmxv1003 pxmxv1004
239.192.89.254 : waiting for response msg
pxmxv1002 : waiting for response msg
pxmxv1003 : waiting for response msg
pxmxv1004 : waiting for response msg
^C
239.192.89.254 : response message never received
pxmxv1002 : response message never received
pxmxv1003 : response message never received
pxmxv1004 : response message never received
So multicast looks to be an issue, yet one node (03) apparently joined, and another node (04) partially joined but then went into an error state. I've verified that IGMP snooping is turned on at the switches.
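As far as I understand omping, a node only gets answers from peers that are running omping with the same host list at the same time, so a run started on pxmxv1001 alone may sit at "waiting for response msg" regardless of multicast health. A minimal sketch for starting the same test on every node follows. It assumes root ssh access to each node and that this omping build accepts -m (multicast group), -c (probe count) and -q (summary only). `DRY_RUN` is a hypothetical switch in this sketch that only prints the commands.

```shell
#!/bin/sh
# Sketch: launch the same omping test on every node at roughly the same time.
NODES="pxmxv1001 pxmxv1002 pxmxv1003 pxmxv1004"
MCAST="239.192.89.254"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = only print the commands

omping_cmd() {
    # -m sets the multicast group, -c limits the probe count, -q prints only the summary
    echo "omping -c 60 -q -m $MCAST $NODES"
}

run_on_all() {
    for node in $NODES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh root@$node $(omping_cmd)"
        else
            ssh "root@$node" "$(omping_cmd)" &   # background so the runs overlap
        fi
    done
    wait   # collect the background ssh sessions, if any were started
}

run_on_all
```

Setting DRY_RUN=0 would actually start the sessions. If every node then reports unicast replies but no multicast replies, that would point at the switches rather than at the nodes.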
Can you offer any suggestions for the pvecm add <master> failure on pxmxv1002? Also, do you have any suggestions as to how to complete the cluster sync for pxmxv1004?
Thanks!
------------------------
(more output below)
root@pxmxv1001:/etc/pve# pvecm status
Version: 6.2.0
Config Version: 3
Cluster Name: pxmx-c01
Cluster Id: 22949
Cluster Member: Yes
Cluster Generation: 16
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: pxmxv1001
Node ID: 1
Multicast addresses: 239.192.89.254
Node addresses: 10.194.0.50
root@pxmxv1001:/etc/pve# pvecm nodes
Node Sts Inc Joined Name
1 M 12 2015-04-09 16:20:22 pxmxv1001
2 M 16 2015-04-09 16:20:22 pxmxv1003
3 X 0 pxmxv1004
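For reading the vote figures in the status output: the numbers are consistent with a simple majority rule, quorum = floor(expected votes / 2) + 1. The sketch below is just that arithmetic. `quorum_for` is a hypothetical helper, not a pvecm feature, and the two input values are copied from the pvecm status output above.

```shell
#!/bin/sh
# Majority quorum: strictly more than half of the expected votes.
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}

expected=3   # "Expected votes" from pvecm status
total=2      # "Total votes" from pvecm status
needed=$(quorum_for "$expected")

if [ "$total" -ge "$needed" ]; then
    echo "quorate: $total votes present, $needed required"
else
    echo "not quorate: $total votes present, $needed required"
fi
```

With 3 expected votes the threshold is 2, matching the "Quorum: 2" line, so the cluster stays quorate on pxmxv1001 and pxmxv1003 while pxmxv1004 is marked X. A fourth expected vote would raise the threshold to 3.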
Output of pveversion -v:
root@pxmxv1002:~# pveversion -v
proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-37-pve: 2.6.32-147
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1