[SOLVED] new cluster node only partially working

woodstock

Renowned Member
Feb 18, 2016
After adding a new cluster node I noticed that migrations to the new node fail from some of the other nodes.
Everything is on the same no-subscription version:

Code:
proxmox-ve: 4.4-77 (running kernel: 4.4.35-1-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.13-1-pve: 4.4.13-56
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.15-1-pve: 4.4.15-60
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.10-1-pve: 4.4.10-54
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-10
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 0.94.9-1~bpo80+1

Investigating further, I saw that some (2 of 5) existing nodes show the new node as down in their GUI.
The console output from pvecm status looks OK on all nodes:

Code:
Quorum information
------------------
Date:  Tue Jan  3 09:24:55 2017
Quorum provider:  corosync_votequorum
Nodes:  6
Node ID:  0x00000003
Ring ID:  1/9076
Quorate:  Yes

Votequorum information
----------------------
Expected votes:  6
Highest expected: 6
Total votes:  6
Quorum:  4
Flags:  Quorate

Membership information
----------------------
  Nodeid  Votes Name
0x00000001  1 172.16.**.*11
0x00000002  1 172.16.**.*12
0x00000003  1 172.16.**.*13 (local)
0x00000005  1 172.16.**.*14
0x00000004  1 172.16.**.*15
0x00000006  1 172.16.**.*16

But I can see differences in the pvecm nodes output. Only the nodes showing the added node as down display the FQDN of the new node:

Code:
# pvecm nodes

Membership information
----------------------
  Nodeid  Votes Name
  1  1 node1
  2  1 node2
  3  1 node3 (local)
  5  1 node4
  4  1 node5
  6  1 node6.example.com
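As an additional cross-check, the membership data the web interface works from can also be read directly from pmxcfs (assuming the usual layout); the per-node online flag there should match what the GUI shows:

Code:
# cat /etc/pve/.members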

Any idea how to fix this?
Thanks.
 
Just did that and now I'm even more confused:

Only the newly added node seems to have a multicast IP:
Code:
# corosync-cmapctl -g totem.interface.0.mcastaddr
totem.interface.0.mcastaddr (str) = 239.192.9.133

On all other nodes it's:
Code:
# corosync-cmapctl -g totem.interface.0.mcastaddr
Can't get key totem.interface.0.mcastaddr. Error CS_ERR_NOT_EXIST

So omping never gets any response.

But pvecm shows quorum on all nodes (even the new one) and all nodes are listed as members.

Some additional information:

All nodes are on the same switch, with eth0 as the dedicated cluster interface (the bridge is on eth1).
Corosync traffic (UDP ports 5404 and 5405) is prioritized on these switch ports.

I can also see corosync listening on the multicast IP on all nodes; the output is the same everywhere:
Code:
# netstat -tulpn | grep corosync
udp  0  0 172.16.**.**:5404  0.0.0.0:*  3239/corosync
udp  0  0 239.192.9.133:5405  0.0.0.0:*  3239/corosync
udp  0  0 172.16.**.**:5405  0.0.0.0:*  3239/corosync
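A way to double-check on each node that eth0 really joined the multicast group would be something like:

Code:
# ip maddr show dev eth0 | grep 239.192
# netstat -g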

Firewall is enabled but allow rules for corosync are in place.


How can I resolve this?
Thanks
 
Can you post the content of /etc/pve/corosync.conf and the /etc/hosts file of the nodes where this is not working?
 
/etc/pve/corosync.conf (IP replaced):
Code:
# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node5
    nodeid: 4
    quorum_votes: 1
    ring0_addr: node5
  }

  node {
    name: node3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: node3
  }

  node {
    name: node4
    nodeid: 5
    quorum_votes: 1
    ring0_addr: node4
  }

  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: node2
  }

  node {
    name: node6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: node6
  }

  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: node1
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: PVEC2
  config_version: 10
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: <IP_OF_FIRST_NODE>
    ringnumber: 0
  }
}

/etc/hosts (identical on all hosts except for pvelocalhost appended to the local node's line; IPs and domain name replaced):
Code:
127.0.0.1 localhost.localdomain localhost

172.16.**.*1 node1.example.com node1
172.16.**.*2 node2.example.com node2
172.16.**.*3 node3.example.com node3
172.16.**.*4 node4.example.com node4
172.16.**.*5 node5.example.com node5
172.16.**.*6 node6.example.com node6 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1  ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Thanks for looking into this.
 
After fixing a wrong firewall ruleset (I had missed allowing the corosync ports for the multicast IP range) and restarting corosync, all nodes now show each other as up in their web interfaces.
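In other words, UDP 5404/5405 has to be allowed towards the multicast range as well, not only between the node addresses. As a rough sketch in plain iptables terms (illustration only; the PVE firewall generates its own chains, and the networks here are placeholders):

Code:
# allow corosync towards the multicast range and between the cluster node IPs on eth0
iptables -A INPUT -i eth0 -p udp -m multiport --dports 5404,5405 -d 239.192.0.0/16 -j ACCEPT
iptables -A INPUT -i eth0 -p udp -m multiport --dports 5404,5405 -d 172.16.0.0/16 -j ACCEPT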

Output of corosync tools is ok on all nodes as well:

Code:
# corosync-cfgtool -s
Printing ring status.
Local node ID 4
RING ID 0
  id  = 172.16.**.*5
  status  = ring 0 active with no faults

Code:
# corosync-cmapctl | grep status
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
runtime.totem.pg.mrp.srp.members.3.status (str) = joined
runtime.totem.pg.mrp.srp.members.4.status (str) = joined
runtime.totem.pg.mrp.srp.members.5.status (str) = joined
runtime.totem.pg.mrp.srp.members.6.status (str) = joined

What still makes me wonder is the failing omping (I tried with and without specifying the multicast IP), as mentioned on https://pve.proxmox.com/wiki/Multicast_notes#Troubleshooting.

Code:
# omping node1 node2 node3 node4 node5 node6
node1 : waiting for response msg
node2 : waiting for response msg
node3 : waiting for response msg
node4 : waiting for response msg
node5 : waiting for response msg
...
node1 : response message never received
node2 : response message never received
node3 : response message never received
node4 : response message never received
node6 : response message never received

Is this supposed to work or is this somehow outdated?
 
Did you start omping on all nodes simultaneously? If you start it only on one node, it won't work ;)
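For reference, the test from the multicast notes wiki is meant to be started on every node at (roughly) the same time, e.g.:

Code:
# run in parallel on all nodes, otherwise nobody answers the probes
omping -c 10000 -i 0.001 -F -q node1 node2 node3 node4 node5 node6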
 
Oh my ... thanks.

Hopefully my last question: is this output OK, or should I investigate further regarding multicast packet loss?

Code:
node1 :  unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.035/0.098/0.367/0.036
node1 : multicast, xmt/rcv/%loss = 10000/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
node2 :  unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.039/0.187/4.067/0.159
node2 : multicast, xmt/rcv/%loss = 10000/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
node3 :  unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.034/0.102/0.944/0.048
node3 : multicast, xmt/rcv/%loss = 10000/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
node4 :  unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.028/0.087/0.842/0.031
node4 : multicast, xmt/rcv/%loss = 10000/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
node6 :  unicast, xmt/rcv/%loss = 8793/8793/0%, min/avg/max/std-dev = 0.031/0.086/3.684/0.060
node6 : multicast, xmt/rcv/%loss = 8793/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000

Thanks
 
Your output indicates that multicast does not work at all (100% loss), so I would investigate what is going wrong here.
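If the loss stays at 100%, the troubleshooting section of the multicast notes wiki linked above also suggests a longer test to catch IGMP snooping/querier problems on the switch (groups being dropped after a few minutes), for example:

Code:
# ~10 minute run, again started on all nodes in parallel
omping -c 600 -i 1 -q node1 node2 node3 node4 node5 node6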
 
