[SOLVED] (I think) multicast problem

Stefano Giunchi

We have a low-budget 3-node cluster (5-year-old servers with 32 GB RAM each), with Ceph on two of them.
They are connected through an HP 1920G switch, with two NICs for Ceph and two for corosync and the LAN.

After a few minutes the cluster stops working (each node sees only itself as online), and after a while the nodes reboot.

I tried changing the switch used for the corosync network, without success.
I also enabled IGMP snooping on both switches, again without success.

Thanks for any help

journalctl -u corosync -u pve-cluster -b:

Jan 09 11:21:21 piedone systemd[1]: Starting The Proxmox VE cluster filesystem...
Jan 09 11:21:21 piedone pmxcfs[2172]: [quorum] crit: quorum_initialize failed: 2
Jan 09 11:21:21 piedone pmxcfs[2172]: [quorum] crit: can't initialize service
Jan 09 11:21:21 piedone pmxcfs[2172]: [confdb] crit: cmap_initialize failed: 2
Jan 09 11:21:21 piedone pmxcfs[2172]: [confdb] crit: can't initialize service
Jan 09 11:21:21 piedone pmxcfs[2172]: [dcdb] crit: cpg_initialize failed: 2
Jan 09 11:21:21 piedone pmxcfs[2172]: [dcdb] crit: can't initialize service
Jan 09 11:21:21 piedone pmxcfs[2172]: [status] crit: cpg_initialize failed: 2
Jan 09 11:21:21 piedone pmxcfs[2172]: [status] crit: can't initialize service
Jan 09 11:21:23 piedone systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 09 11:21:23 piedone systemd[1]: Starting Corosync Cluster Engine...
Jan 09 11:21:23 piedone corosync[2215]: [MAIN ] Corosync Cluster Engine ('2.4.0'): started and ready to provide service.
Jan 09 11:21:23 piedone corosync[2215]: [MAIN ] Corosync built-in features: augeas systemd pie relro bindnow
Jan 09 11:21:23 piedone corosync[2258]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Jan 09 11:21:23 piedone corosync[2258]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jan 09 11:21:23 piedone corosync[2258]: [TOTEM ] The network interface [10.73.73.3] is now up.
Jan 09 11:21:23 piedone corosync[2258]: [SERV ] Service engine loaded: corosync configuration map access [0]
Jan 09 11:21:23 piedone corosync[2258]: [QB ] server name: cmap
Jan 09 11:21:23 piedone corosync[2258]: [SERV ] Service engine loaded: corosync configuration service [1]
Jan 09 11:21:23 piedone corosync[2258]: [QB ] server name: cfg
Jan 09 11:21:23 piedone corosync[2258]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jan 09 11:21:23 piedone corosync[2258]: [QB ] server name: cpg
Jan 09 11:21:23 piedone corosync[2258]: [SERV ] Service engine loaded: corosync profile loading service [4]
Jan 09 11:21:23 piedone corosync[2258]: [QUORUM] Using quorum provider corosync_votequorum
Jan 09 11:21:23 piedone corosync[2258]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jan 09 11:21:23 piedone corosync[2258]: [QB ] server name: votequorum
Jan 09 11:21:23 piedone corosync[2258]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jan 09 11:21:23 piedone corosync[2258]: [QB ] server name: quorum
Jan 09 11:21:23 piedone corosync[2258]: [TOTEM ] A new membership (10.73.73.3:60708) was formed. Members joined: 1
Jan 09 11:21:23 piedone corosync[2258]: [QUORUM] Members[1]: 1
Jan 09 11:21:23 piedone corosync[2258]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 09 11:21:24 piedone corosync[2197]: Starting Corosync Cluster Engine (corosync): [ OK ]
Jan 09 11:21:24 piedone systemd[1]: Started Corosync Cluster Engine.
Jan 09 11:21:27 piedone pmxcfs[2172]: [status] notice: update cluster info (cluster name soasi, version = 5)
Jan 09 11:21:27 piedone pmxcfs[2172]: [dcdb] notice: members: 1/2172
Jan 09 11:21:27 piedone pmxcfs[2172]: [dcdb] notice: all data is up to date
Jan 09 11:21:27 piedone pmxcfs[2172]: [status] notice: members: 1/2172
Jan 09 11:21:27 piedone pmxcfs[2172]: [status] notice: all data is up to date


root@piedone:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: bambino
    nodeid: 3
    quorum_votes: 1
    ring0_addr: bambino
  }

  node {
    name: px1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: px1
  }

  node {
    name: piedone
    nodeid: 1
    quorum_votes: 1
    ring0_addr: piedone
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: soasi
  config_version: 5
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.73.73.3
    ringnumber: 0
  }
}
 
Hi,

you can test your multicast with this command:

omping -c 10000 -i 0.001 -F -q <node1 ip> <node2 ip> <node3 ip>

To change to unicast, read
man corosync.conf
and edit /etc/pve/corosync.conf
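
Roughly, on corosync 2.x that change means adding a transport line to the totem section and raising config_version so the new file gets picked up. A sketch only, please verify against the man page for your version:

totem {
  cluster_name: soasi
  # config_version raised from 5 so the changed file is distributed
  config_version: 6
  ip_version: ipv4
  secauth: on
  version: 2
  # UDP unicast instead of UDP multicast
  transport: udpu
  interface {
    bindnetaddr: 10.73.73.3
    ringnumber: 0
  }
}

After that, corosync usually has to be restarted on every node (systemctl restart corosync) before the new transport is in use.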
 
After a few minutes the cluster stops working (each node sees only itself as online), and after a while the nodes reboot.

You probably have IGMP snooping set up on the switch, but no IGMP querier active in the network.
Look in your switch's settings, disable snooping for now, and recheck whether the cluster works.
IGMP snooping is a technique that delivers multicast messages only to the interfaces belonging to the respective multicast group; this reduces traffic in the network and is normally recommended.
But for that to work, an IGMP querier needs to be active so that a valid multicast membership table is available.
If snooping is enabled but no querier is active, the switch sees no members and thus discards the multicast packets; this only takes effect (by default) about 5 minutes after the first multicast is sent.
(This explanation simplifies some things to convey the basic intent of IGMP snooping.)

For more information about cluster network requirements see: http://pve.proxmox.com/pve-docs/chapter-pvecm.html#cluster-network-requirements
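
If the cluster NICs sit on a Linux bridge on the PVE hosts, the bridge itself can act as the querier. A quick check and a temporary workaround look roughly like this (the bridge name vmbr0 is only an example, use your own; the echo does not survive a reboot):

# does the bridge snoop, and does it send IGMP queries itself?
cat /sys/class/net/vmbr0/bridge/multicast_snooping
cat /sys/class/net/vmbr0/bridge/multicast_querier

# let the bridge act as IGMP querier
echo 1 > /sys/class/net/vmbr0/bridge/multicast_querier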
 
Thank you both, I will be back on those servers tomorrow.
I forgot to mention that I already tried omping, and it worked. Anyway, I think it's a multicast problem, because with a previous firmware version of the switch, enabling IGMP snooping made the nodes see each other again.
This time it doesn't work anymore; I tried with an HP and a ZyXEL switch, with snooping both enabled and disabled.
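
For reference, the longer multicast test from the Proxmox cluster docs runs for roughly ten minutes, so it should catch a cutoff around the 5-minute snooping timeout that a short run can miss (node IPs are placeholders):

omping -c 600 -i 1 -q <node1 ip> <node2 ip> <node3 ip>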

Anyway, thanks for the explanation of IGMP snooping and the querier.

Tomorrow I'll try to separate the cluster network and, as a last resort, the unicast method.

Thanks
 
I resolved it by using unicast.
Since it's less problematic than multicast and fine for up to 4 nodes, may I suggest making it the default for new installations?
I think there are far more installations with 4 nodes or fewer than with 5 or more.
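
In case it helps others with the same problem: after restarting corosync on all nodes, membership and ring status can be checked on each node with

pvecm status
corosync-cfgtool -s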

Thank you for the support and the explanations.
 
