[SOLVED] (I think) multicast problem

Stefano Giunchi

We have a low-budget 3-node cluster (5-year-old servers with 32 GB RAM each), with Ceph on two of them.
They are connected through an HP 1920G switch, with two NICs for Ceph and two for corosync and the LAN.

After a few minutes the cluster stops working (each node sees only itself as online), and after a while the nodes reboot.

I tried changing the switch used for the corosync network, without success.
I also enabled IGMP snooping on both switches, again without success.

Thanks for any help

journalctl -u corosync -u pve-cluster -b:

Jan 09 11:21:21 piedone systemd[1]: Starting The Proxmox VE cluster filesystem...
Jan 09 11:21:21 piedone pmxcfs[2172]: [quorum] crit: quorum_initialize failed: 2
Jan 09 11:21:21 piedone pmxcfs[2172]: [quorum] crit: can't initialize service
Jan 09 11:21:21 piedone pmxcfs[2172]: [confdb] crit: cmap_initialize failed: 2
Jan 09 11:21:21 piedone pmxcfs[2172]: [confdb] crit: can't initialize service
Jan 09 11:21:21 piedone pmxcfs[2172]: [dcdb] crit: cpg_initialize failed: 2
Jan 09 11:21:21 piedone pmxcfs[2172]: [dcdb] crit: can't initialize service
Jan 09 11:21:21 piedone pmxcfs[2172]: [status] crit: cpg_initialize failed: 2
Jan 09 11:21:21 piedone pmxcfs[2172]: [status] crit: can't initialize service
Jan 09 11:21:23 piedone systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 09 11:21:23 piedone systemd[1]: Starting Corosync Cluster Engine...
Jan 09 11:21:23 piedone corosync[2215]: [MAIN ] Corosync Cluster Engine ('2.4.0'): started and ready to provide service.
Jan 09 11:21:23 piedone corosync[2215]: [MAIN ] Corosync built-in features: augeas systemd pie relro bindnow
Jan 09 11:21:23 piedone corosync[2258]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Jan 09 11:21:23 piedone corosync[2258]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jan 09 11:21:23 piedone corosync[2258]: [TOTEM ] The network interface [10.73.73.3] is now up.
Jan 09 11:21:23 piedone corosync[2258]: [SERV ] Service engine loaded: corosync configuration map access [0]
Jan 09 11:21:23 piedone corosync[2258]: [QB ] server name: cmap
Jan 09 11:21:23 piedone corosync[2258]: [SERV ] Service engine loaded: corosync configuration service [1]
Jan 09 11:21:23 piedone corosync[2258]: [QB ] server name: cfg
Jan 09 11:21:23 piedone corosync[2258]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jan 09 11:21:23 piedone corosync[2258]: [QB ] server name: cpg
Jan 09 11:21:23 piedone corosync[2258]: [SERV ] Service engine loaded: corosync profile loading service [4]
Jan 09 11:21:23 piedone corosync[2258]: [QUORUM] Using quorum provider corosync_votequorum
Jan 09 11:21:23 piedone corosync[2258]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jan 09 11:21:23 piedone corosync[2258]: [QB ] server name: votequorum
Jan 09 11:21:23 piedone corosync[2258]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jan 09 11:21:23 piedone corosync[2258]: [QB ] server name: quorum
Jan 09 11:21:23 piedone corosync[2258]: [TOTEM ] A new membership (10.73.73.3:60708) was formed. Members joined: 1
Jan 09 11:21:23 piedone corosync[2258]: [QUORUM] Members[1]: 1
Jan 09 11:21:23 piedone corosync[2258]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 09 11:21:24 piedone corosync[2197]: Starting Corosync Cluster Engine (corosync): [ OK ]
Jan 09 11:21:24 piedone systemd[1]: Started Corosync Cluster Engine.
Jan 09 11:21:27 piedone pmxcfs[2172]: [status] notice: update cluster info (cluster name soasi, version = 5)
Jan 09 11:21:27 piedone pmxcfs[2172]: [dcdb] notice: members: 1/2172
Jan 09 11:21:27 piedone pmxcfs[2172]: [dcdb] notice: all data is up to date
Jan 09 11:21:27 piedone pmxcfs[2172]: [status] notice: members: 1/2172
Jan 09 11:21:27 piedone pmxcfs[2172]: [status] notice: all data is up to date


root@piedone:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: bambino
    nodeid: 3
    quorum_votes: 1
    ring0_addr: bambino
  }

  node {
    name: px1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: px1
  }

  node {
    name: piedone
    nodeid: 1
    quorum_votes: 1
    ring0_addr: piedone
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: soasi
  config_version: 5
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.73.73.3
    ringnumber: 0
  }
}
 
Hi,

you can test your multicast with this command:

omping -c 10000 -i 0.001 -F -q <node1 ip> <node2 ip> <node3 ip>

To change to unicast, read
man corosync.conf
and edit /etc/pve/corosync.conf
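
Roughly, on corosync 2.x that change means adding a transport line to the totem section and raising config_version so the new file gets picked up. A sketch only, please verify against the man page for your version:

totem {
  cluster_name: soasi
  # config_version raised from 5 so the changed file is distributed
  config_version: 6
  ip_version: ipv4
  secauth: on
  version: 2
  # UDP unicast instead of UDP multicast
  transport: udpu
  interface {
    bindnetaddr: 10.73.73.3
    ringnumber: 0
  }
}

After that, corosync usually has to be restarted on every node (systemctl restart corosync) before the new transport is in use.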
 
After a few minutes the cluster stops working (each node sees only itself as online), and after a while the nodes reboot.

You probably have IGMP snooping set up on the switch, but no IGMP querier active in the network.
Look in your switch's settings, disable snooping for now, and recheck whether the cluster works.
IGMP snooping is a technique that delivers multicast messages only to the interfaces belonging to the respective multicast group; this reduces traffic in the network and is normally recommended.
But for that to work, an IGMP querier needs to be active so that a valid multicast membership table is available.
If snooping is enabled but no querier is active, the switch sees no members and thus discards the multicast packets; this only takes effect (by default) about 5 minutes after the first multicast is sent.
(This explanation simplifies some things to convey the basic intent of IGMP snooping.)

For more information about cluster network requirements see: http://pve.proxmox.com/pve-docs/chapter-pvecm.html#cluster-network-requirements
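
If the cluster NICs sit on a Linux bridge on the PVE hosts, the bridge itself can act as the querier. A quick check and a temporary workaround look roughly like this (the bridge name vmbr0 is only an example, use your own; the echo does not survive a reboot):

# does the bridge snoop, and does it send IGMP queries itself?
cat /sys/class/net/vmbr0/bridge/multicast_snooping
cat /sys/class/net/vmbr0/bridge/multicast_querier

# let the bridge act as IGMP querier
echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier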
 
Thank you both, I will be back on those servers tomorrow.
I forgot to mention that I already tried omping, and it worked. Anyway, I think it's a multicast problem, because with a previous firmware version of the switch, enabling IGMP snooping made the nodes see each other again.
This time it doesn't work anymore; I tried with an HP and a ZyXEL switch, with snooping both enabled and disabled.
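
For reference, the longer multicast test from the Proxmox cluster docs runs for roughly ten minutes, so it should catch a cutoff around the 5-minute snooping timeout that a short run can miss (node IPs are placeholders):

omping -c 600 -i 1 -q <node1 ip> <node2 ip> <node3 ip>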

Anyway, thanks for the explanation of IGMP snooping and the querier.

Tomorrow I'll try to separate the cluster network and, as a last resort, the unicast method.

Thanks
 
I resolved it by using unicast.
Since it's less problematic than multicast and fine for up to 4 nodes, may I suggest making it the default for new installations?
I think there are far more installations with 4 nodes or fewer than with 5 or more.
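
In case it helps others with the same problem: after restarting corosync on all nodes, membership and ring status can be checked on each node with

pvecm status
corosync-cfgtool -s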

Thank you for the support and the explanations.
 
