corosync problem

srozanski

Hi,

I have 3 nodes:

                                                        shv1                shv2            shv3
hostname                                                ok                  ok              ok
dns                                                     ok                  ok              ok
ssh                                                     ok                  ok              ok
pvecm status                                            ok                  ok              err
systemctl status corosync                               ok                  ok              err
systemctl status pve-cluster                            ok (rw)             ok (rw)         err (ro)
apt-get update && apt-get dist-upgrade                  ok (not restarted)  ok (restarted)  ok (restarted)
MULTICAST ADDRESS: netstat -g (vmbr0 1 239.192.2.227)   ok                  ok              err
MULTICAST PING (239.192.2.227)                          ok                  ok              err
MULTICAST PING (224.0.0.1)                              ok                  ok              ok
date / hwclock                                          ok                  ok              ok
ntp installed                                           ok                  ok              ok

MULTICAST PING (239.192.2.227):

Code:
ping 239.192.2.227
PING 239.192.2.227 (239.192.2.227) 56(84) bytes of data.
64 bytes from 10.64.2.1: icmp_seq=1 ttl=64 time=0.028 ms
64 bytes from 10.64.2.2: icmp_seq=1 ttl=64 time=0.218 ms (DUP!)

MULTICAST PING (224.0.0.1):

Code:
ping 224.0.0.1
PING 224.0.0.1 (224.0.0.1) 56(84) bytes of data.
64 bytes from 10.64.2.1: icmp_seq=1 ttl=64 time=0.022 ms
64 bytes from 10.64.2.3: icmp_seq=1 ttl=64 time=0.131 ms (DUP!)
64 bytes from 10.64.2.2: icmp_seq=1 ttl=64 time=0.241 ms (DUP!)
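
The checks in the table map roughly to commands like these (the DNS and SSH lines are a sketch of what I ran, not an exact copy):

Code:
# run on each node; 239.192.2.227 is the cluster multicast address
hostname -f
getent hosts shv1 shv2 shv3      # dns
ssh root@shv3 true               # ssh reachability
pvecm status
systemctl status corosync pve-cluster
netstat -g | grep 239.192.2.227  # multicast group membership on vmbr0
ping -c 2 239.192.2.227          # multicast ping
ping -c 2 224.0.0.1              # all-hosts multicast ping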

History of my problem:

After one reboot of shv2 there was a problem with quorum (corosync problems), so I updated all nodes. After the reboot shv2 was OK again, but after restarting node shv3 the problem came back on shv3 (after the upgrade). What could be the problem? Some extra info: the logs showed a problem reaching the Debian NTP servers, but I have set my local NTP server in /etc/ntp.conf and commented out all the default Debian servers... ?!?
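
For completeness, the relevant part of my /etc/ntp.conf looks roughly like this (the local server address below is a placeholder, not the real one):

Code:
# /etc/ntp.conf (excerpt) - default Debian servers commented out
#server 0.debian.pool.ntp.org iburst
#server 1.debian.pool.ntp.org iburst
#server 2.debian.pool.ntp.org iburst
#server 3.debian.pool.ntp.org iburst
server 10.64.0.254 iburst   # placeholder for the local NTP server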

VERSIONS
-----------

shv1 was not restarted after the update (live migration is broken and some VMs are still online there), so shv1 is still running the older kernel:
Code:
Linux shv1 4.2.2-1-pve #1 SMP Mon Oct 5 18:23:31 CEST 2015 x86_64 GNU/Linux

shv1:~# pveversion -v
proxmox-ve: 4.0-22 (running kernel: 4.2.2-1-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-21
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
openvswitch-switch: 2.3.2-1

Code:
Linux shv2 4.2.3-2-pve #1 SMP Sun Nov 15 16:08:19 CET 2015 x86_64 GNU/Linux

shv2:~# pveversion -v
proxmox-ve: 4.0-22 (running kernel: 4.2.3-2-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-21
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
openvswitch-switch: 2.3.2-1

Code:
Linux shv3 4.2.3-2-pve #1 SMP Sun Nov 15 16:08:19 CET 2015 x86_64 GNU/Linux

shv3:~# pveversion -v
proxmox-ve: 4.0-22 (running kernel: 4.2.3-2-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-21
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
openvswitch-switch: 2.3.2-1


SHV3 NODE PROBLEM LOGS
-------------------------------

Code:
systemctl status pve-cluster.service 
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
   Active: active (running) since Fri 2015-12-04 15:42:25 CET; 12min ago
  Process: 2152 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0
  Process: 2136 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 2150 (pmxcfs)
   CGroup: /system.slice/pve-cluster.service
           └─2150 /usr/bin/pmxcfs


Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_send_message failed: 9
Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_send_message failed: 9
Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_send_message failed: 9
Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_send_message failed: 9
Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_send_message failed: 9
Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_send_message failed: 9
Dec 04 15:54:36 shv3 pmxcfs[2150]: [quorum] crit: quorum_initialize failed: 2
Dec 04 15:54:36 shv3 pmxcfs[2150]: [confdb] crit: cmap_initialize failed: 2
Dec 04 15:54:36 shv3 pmxcfs[2150]: [dcdb] crit: cpg_initialize failed: 2
Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_initialize failed: 2
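
These pmxcfs errors only say that it cannot talk to corosync, which matches corosync being down on shv3 (see below). A quick way to check that from the shell (standard corosync tools, not taken from the original logs):

Code:
systemctl is-active corosync     # is the service up at all?
corosync-cfgtool -s              # ring status, once corosync starts
corosync-quorumtool -s           # quorum / membership view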

Code:
systemctl status corosync.service 
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
   Active: failed (Result: exit-code) since Fri 2015-12-04 15:43:25 CET; 11min ago
  Process: 2658 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAIL


Dec 04 15:42:25 shv3 corosync[2721]: [QB    ] server name: cpg
Dec 04 15:42:25 shv3 corosync[2721]: [SERV  ] Service engine loaded: corosync profile lo
Dec 04 15:42:25 shv3 corosync[2721]: [QUORUM] Using quorum provider corosync_votequorum
Dec 04 15:42:25 shv3 corosync[2721]: [QUORUM] Quorum provider: corosync_votequorum faile
Dec 04 15:42:25 shv3 corosync[2721]: [SERV  ] Service engine 'corosync_quorum' failed to
Dec 04 15:42:25 shv3 corosync[2721]: [MAIN  ] Corosync Cluster Engine exiting with statu
Dec 04 15:43:25 shv3 corosync[2658]: Starting Corosync Cluster Engine (corosync): [FAILE
Dec 04 15:43:25 shv3 systemd[1]: corosync.service: control process exited, code=exited s
Dec 04 15:43:25 shv3 systemd[1]: Failed to start Corosync Cluster Engine.
Dec 04 15:43:25 shv3 systemd[1]: Unit corosync.service entered failed state.
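
The unit status truncates the interesting lines; to see the full reason why votequorum fails to load, the journal or a foreground run helps (standard systemd/corosync commands, not from the original post):

Code:
journalctl -u corosync.service -b --no-pager   # full startup log for this boot
corosync -f                                    # run in the foreground, errors go to the terminal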
 
Corosync seems to be trouble for a lot of people, and multicast is at the center of the problems. I couldn't get it stable myself and am looking for another way to do clustering without multicast. Some people run corosync in unicast, but I couldn't find the info to get that working either. I suspect my earlier attempt at multicast left things in an unclean state.

http://pve.proxmox.com/wiki/Multicast_notes#Use_unicast_instead_of_multicast_.28if_all_else_fails.29

Also the instructions expect you to edit /etc/pve/cluster.conf and assume you have a working cluster to propagate the changes!
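
What the wiki boils down to (on PVE 4.x the file is /etc/pve/corosync.conf) is switching the totem transport to unicast; a rough sketch, not a drop-in config, and config_version has to be bumped so the change is accepted:

Code:
# /etc/pve/corosync.conf - totem section only, sketch
totem {
  version: 2
  config_version: 4        # must be higher than the current value
  cluster_name: mycluster  # placeholder
  transport: udpu          # unicast UDP instead of multicast
  interface {
    ringnumber: 0
    bindnetaddr: 10.64.2.0 # the cluster subnet from this thread
  }
}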
 
Do you have an IGMP querier configured on your network for multicast?

I set kernel options for linux bridges:

echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping
echo 0 > /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts

but I didn't set:

echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier
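
(If I did want to set the querier and keep all of this across reboots, one way, assuming the default ifupdown setup in /etc/network/interfaces and my vmbr0 addressing, would be post-up lines on the bridge:)

Code:
# /etc/network/interfaces - vmbr0 stanza, sketch only
auto vmbr0
iface vmbr0 inet static
    address 10.64.2.3
    netmask 255.255.255.0
    bridge_ports eth4
    bridge_stp off
    bridge_fd 0
    post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping
    post-up echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier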

My switches have IGMP turned off, so multicast behaves like broadcast now. Some tests:




                                                        SHV1   SHV2   SHV3
MULTICAST ADDRESS - netstat -g (vmbr0 1 239.192.2.227)  ok     ok     err
MULTICAST PING (239.192.2.227)                          ok     ok     err
MULTICAST PING (224.0.0.1)                              ok     ok     ok

MULTICAST PING (239.192.2.227):

Code:
ping 239.192.2.227
PING 239.192.2.227 (239.192.2.227) 56(84) bytes of data.
64 bytes from 10.64.2.2: icmp_seq=1 ttl=64 time=0.012 ms
64 bytes from 10.64.2.1: icmp_seq=1 ttl=64 time=0.156 ms (DUP!)

MULTICAST PING (224.0.0.1):

Code:
ping 224.0.0.1
PING 224.0.0.1 (224.0.0.1) 56(84) bytes of data.
64 bytes from 10.64.2.2: icmp_seq=1 ttl=64 time=0.011 ms
64 bytes from 10.64.2.3: icmp_seq=1 ttl=64 time=0.272 ms (DUP!)
64 bytes from 10.64.2.1: icmp_seq=1 ttl=64 time=0.275 ms (DUP!)

In the beginning the 3-node cluster had been working fine, until a week ago when I restarted one node. The switches weren't touched. Does anybody have an idea why the last node can't establish quorum?
 
I do not use bonding on vmbr0; my network setup:

vmbr0:
- eth4 (hypervisors network)

Code:
IFACE LIST:
----------
1)	UN (lo) 
	 127.0.0.1/8 
2)	UP (eth0) 
3)	DN (eth1) 
4)	DN (eth2) 
5)	DN (eth3) 
6)	UP (eth4) 
7)	DN (eth5) 
8)	UP (ib0) 
9)	UP (ib1) 
10)	UP (ib0.8001@ib0) 
11)	UP (ib1.8001@ib1) 
12)	UP (bond0) 
	 10.27.4.228/16
13)	UP (bond1) 
	 10.64.0.3/16
16)	UP (vmbr0) 
	 10.64.2.3/24
17)	UP (vmbr1) 
	 10.26.4.228/16
18)	UN (tap108i0) 
19)	UP (vmbr0v203) 
20)	UP (eth4.203@eth4) 


BRCTL SHOW:
----------
bridge name	bridge id		STP enabled	interfaces
vmbr0		8000.3464a9b87660	no		eth4
vmbr0v203		8000.3464a9b87660	no		eth4.203
							tap108i0
vmbr1		8000.3863bb4332ec	no		eth0
 
Not sure it's related, but are you sure there is no problem with overlapping networks?

13) UP (bond1)
10.64.0.3/16
16) UP (vmbr0)
10.64.2.3/24
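
For anyone finding this later: 10.64.0.0/16 on bond1 already contains 10.64.2.0/24 on vmbr0, so the kernel has two connected routes covering the corosync subnet, which can confuse interface selection for the cluster traffic. Roughly what it looks like (output reconstructed for illustration, not copied from the nodes):

Code:
ip route show | grep '10\.64'
# 10.64.0.0/16 dev bond1  proto kernel  scope link  src 10.64.0.3
# 10.64.2.0/24 dev vmbr0  proto kernel  scope link  src 10.64.2.3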

You have a sense of humor :)) And you are RIGHT!!! :D Thank you very much :)) It solved my problem ;)

And I also added these iptables rules:

Code:
# In case cman crashes with cpg_send_message failed: 9, add these to your rule set:
iptables -A INPUT -m addrtype --dst-type MULTICAST -j ACCEPT
iptables -A INPUT -p udp -m state --state NEW -m multiport --dports 5404,5405 -j ACCEPT
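
Rules added like this are gone after a reboot; one way to keep them (assuming the iptables-persistent package on Debian jessie) is:

Code:
apt-get install iptables-persistent
iptables-save > /etc/iptables/rules.v4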
 