[SOLVED] new cluster node only partially working

woodstock

Renowned Member
Feb 18, 2016
After adding a new cluster node I noticed that migrations to the new node fail from some of the other nodes.
Everything is on the same no-subscription version:

Code:
proxmox-ve: 4.4-77 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.15-1-pve: 4.4.15-60
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.10-1-pve: 4.4.10-54
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-10
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 0.94.9-1~bpo80+1

Investigating further, I saw that some (2 of 5) existing nodes show the new node as down in their GUI.
The console output from pvecm status looks OK on all nodes:

Code:
Quorum information
------------------
Date:  Tue Jan  3 09:24:55 2017
Quorum provider:  corosync_votequorum
Nodes:  6
Node ID:  0x00000003
Ring ID:  1/9076
Quorate:  Yes

Votequorum information
----------------------
Expected votes:  6
Highest expected: 6
Total votes:  6
Quorum:  4
Flags:  Quorate

Membership information
----------------------
  Nodeid  Votes Name
0x00000001  1 172.16.**.*11
0x00000002  1 172.16.**.*12
0x00000003  1 172.16.**.*13 (local)
0x00000005  1 172.16.**.*14
0x00000004  1 172.16.**.*15
0x00000006  1 172.16.**.*16

But I can see differences in the pvecm nodes output. Only the nodes showing the added node as down display the FQDN of the new node:

Code:
# pvecm nodes

Membership information
----------------------
  Nodeid  Votes Name
  1  1 node1
  2  1 node2
  3  1 node3 (local)
  5  1 node4
  4  1 node5
  6  1 node6.example.com
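As an additional cross-check, the membership data the web interface works from can also be read directly from pmxcfs (assuming the usual layout); the per-node online flag there should match what the GUI shows:

Code:
# cat /etc/pve/.members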

Any idea how to fix this?
Thanks.
 
Just did that and now I'm even more confused:

Only the newly added node seems to have a multicast IP:
Code:
# corosync-cmapctl -g totem.interface.0.mcastaddr
totem.interface.0.mcastaddr (str) = 239.192.9.133

On all other nodes it's:
Code:
# corosync-cmapctl -g totem.interface.0.mcastaddr
Can't get key totem.interface.0.mcastaddr. Error CS_ERR_NOT_EXIST

So omping never gets any response.

But pvecm shows quorum on all nodes (even the new one) and all nodes are listed as members.

Some additional information:

All nodes are on the same switch, with eth0 as the dedicated cluster interface (the bridge is on eth1).
Corosync traffic (UDP ports 5404 and 5405) is prioritized on these switch ports.

I can also see corosync listening on the multicast IP on all nodes; the output is the same everywhere:
Code:
# netstat -tulpn | grep corosync
udp  0  0 172.16.**.**:5404  0.0.0.0:*  3239/corosync
udp  0  0 239.192.9.133:5405  0.0.0.0:*  3239/corosync
udp  0  0 172.16.**.**:5405  0.0.0.0:*  3239/corosync
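A way to double-check on each node that eth0 really joined the multicast group would be something like:

Code:
# ip maddr show dev eth0 | grep 239.192
# netstat -g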

Firewall is enabled but allow rules for corosync are in place.


How can I resolve this?
Thanks
 
Can you post the content of /etc/pve/corosync.conf and the /etc/hosts file of the nodes where this is not working?
 
/etc/pve/corosync.conf (IP replaced):
Code:
# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node5
    nodeid: 4
    quorum_votes: 1
    ring0_addr: node5
  }

  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: node3
  }

  node {
    name: node4
    nodeid: 5
    quorum_votes: 1
    ring0_addr: node4
  }

  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node2
  }

  node {
    name: node6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: node6
  }

  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node1
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: PVEC2
  config_version: 10
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: <IP_OF_FIRST_NODE>
    ringnumber: 0
  }
}

/etc/hosts (identical on all hosts except for pvelocalhost appended to the local node's line; IPs and domain name replaced):
Code:
127.0.0.1 localhost.localdomain localhost

172.16.**.*1 node1.example.com node1
172.16.**.*2 node2.example.com node2
172.16.**.*3 node3.example.com node3
172.16.**.*4 node4.example.com node4
172.16.**.*5 node5.example.com node5
172.16.**.*6 node6.example.com node6 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1  ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Thanks for looking into this.
 
After fixing a wrong firewall ruleset (I had missed allowing the corosync ports for the multicast IP range) and restarting corosync, all nodes now show each other as up in their web interfaces.
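In other words, UDP 5404/5405 has to be allowed towards the multicast range as well, not only between the node addresses. As a rough sketch in plain iptables terms (illustration only; the PVE firewall generates its own chains, and the networks here are placeholders):

Code:
# allow corosync towards the multicast range and between the cluster node IPs on eth0
iptables -A INPUT -i eth0 -p udp -m multiport --dports 5404,5405 -d 239.192.0.0/16 -j ACCEPT
iptables -A INPUT -i eth0 -p udp -m multiport --dports 5404,5405 -d 172.16.0.0/16 -j ACCEPT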

Output of corosync tools is ok on all nodes as well:

Code:
# corosync-cfgtool -s
Printing ring status.
Local node ID 4
RING ID 0
  id  = 172.16.**.*5
  status  = ring 0 active with no faults

Code:
# corosync-cmapctl | grep status
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
runtime.totem.pg.mrp.srp.members.3.status (str) = joined
runtime.totem.pg.mrp.srp.members.4.status (str) = joined
runtime.totem.pg.mrp.srp.members.5.status (str) = joined
runtime.totem.pg.mrp.srp.members.6.status (str) = joined

What still makes me wonder is the failing omping (I tried with and without specifying the multicast IP), as mentioned on https://pve.proxmox.com/wiki/Multicast_notes#Troubleshooting.

Code:
# omping node1 node2 node3 node4 node5 node6
node1 : waiting for response msg
node2 : waiting for response msg
node3 : waiting for response msg
node4 : waiting for response msg
node5 : waiting for response msg
...
node1 : response message never received
node2 : response message never received
node3 : response message never received
node4 : response message never received
node6 : response message never received

Is this supposed to work or is this somehow outdated?
 
Did you start omping on all nodes simultaneously? If you start it only on one node, it won't work ;)
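For reference, the test from the multicast notes wiki is meant to be started on every node at (roughly) the same time, e.g.:

Code:
# run in parallel on all nodes, otherwise nobody answers the probes
omping -c 10000 -i 0.001 -F -q node1 node2 node3 node4 node5 node6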
 
Oh my ... thanks.

Hopefully my last question: is this output OK, or should I investigate further regarding multicast packet loss?

Code:
node1 :  unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.035/0.098/0.367/0.036
node1 : multicast, xmt/rcv/%loss = 10000/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
node2 :  unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.039/0.187/4.067/0.159
node2 : multicast, xmt/rcv/%loss = 10000/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
node3 :  unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.034/0.102/0.944/0.048
node3 : multicast, xmt/rcv/%loss = 10000/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
node4 :  unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.028/0.087/0.842/0.031
node4 : multicast, xmt/rcv/%loss = 10000/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
node6 :  unicast, xmt/rcv/%loss = 8793/8793/0%, min/avg/max/std-dev = 0.031/0.086/3.684/0.060
node6 : multicast, xmt/rcv/%loss = 8793/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000

Thanks
 
Your output indicates that multicast does not work at all (100% loss), so I would investigate what is going wrong here.
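If the loss stays at 100%, the troubleshooting section of the multicast notes wiki linked above also suggests a longer test to catch IGMP snooping/querier problems on the switch (groups being dropped after a few minutes), for example:

Code:
# ~10 minute run, again started on all nodes in parallel
omping -c 600 -i 1 -q node1 node2 node3 node4 node5 node6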
 
