corosync problem

srozanski

Hi,

I have 3 nodes:

                                                        shv1                shv2            shv3
hostname                                                ok                  ok              ok
dns                                                     ok                  ok              ok
ssh                                                     ok                  ok              ok
pvecm status                                            ok                  ok              err
systemctl status corosync                               ok                  ok              err
systemctl status pve-cluster                            ok (rw)             ok (rw)         err (ro)
apt-get update && apt-get dist-upgrade                  ok (not restarted)  ok (restarted)  ok (restarted)
MULTICAST ADDRESS: netstat -g (vmbr0 1 239.192.2.227)   ok                  ok              err
MULTICAST PING (239.192.2.227)                          ok                  ok              err
MULTICAST PING (224.0.0.1)                              ok                  ok              ok
date / hwclock                                          ok                  ok              ok
ntp installed                                           ok                  ok              ok

MULTICAST PING (239.192.2.227):

Code:
ping 239.192.2.227
PING 239.192.2.227 (239.192.2.227) 56(84) bytes of data.
64 bytes from 10.64.2.1: icmp_seq=1 ttl=64 time=0.028 ms
64 bytes from 10.64.2.2: icmp_seq=1 ttl=64 time=0.218 ms (DUP!)

MULTICAST PING (224.0.0.1):

Code:
ping 224.0.0.1
PING 224.0.0.1 (224.0.0.1) 56(84) bytes of data.
64 bytes from 10.64.2.1: icmp_seq=1 ttl=64 time=0.022 ms
64 bytes from 10.64.2.3: icmp_seq=1 ttl=64 time=0.131 ms (DUP!)
64 bytes from 10.64.2.2: icmp_seq=1 ttl=64 time=0.241 ms (DUP!)
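
The checks in the table map roughly to commands like these (the DNS and SSH lines are a sketch of what I ran, not an exact copy):

Code:
# run on each node; 239.192.2.227 is the cluster multicast address
hostname -f
getent hosts shv1 shv2 shv3      # dns
ssh root@shv3 true               # ssh reachability
pvecm status
systemctl status corosync pve-cluster
netstat -g | grep 239.192.2.227  # multicast group membership on vmbr0
ping -c 2 239.192.2.227          # multicast ping
ping -c 2 224.0.0.1              # all-hosts multicast ping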

History of my problem:

After one reboot of shv2 there was a problem with quorum (corosync problems), so I updated all nodes. After the reboot shv2 was OK again, but after restarting node shv3 the problem came back on shv3 (after the upgrade). What could be the problem? Some extra info: the logs showed a problem reaching the Debian NTP servers, but I have set my local NTP server in /etc/ntp.conf and commented out all the default Debian servers... ?!?
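
For completeness, the relevant part of my /etc/ntp.conf looks roughly like this (the local server address below is a placeholder, not the real one):

Code:
# /etc/ntp.conf (excerpt) - default Debian servers commented out
#server 0.debian.pool.ntp.org iburst
#server 1.debian.pool.ntp.org iburst
#server 2.debian.pool.ntp.org iburst
#server 3.debian.pool.ntp.org iburst
server 10.64.0.254 iburst   # placeholder for the local NTP server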

VERSIONS
-----------

shv1 was not restarted after the update (live migration is broken and some VMs are still online there), so shv1 is still running the older kernel:
Code:
Linux shv1 4.2.2-1-pve #1 SMP Mon Oct 5 18:23:31 CEST 2015 x86_64 GNU/Linux

shv1:~# pveversion -v
proxmox-ve: 4.0-22 (running kernel: 4.2.2-1-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-21
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
openvswitch-switch: 2.3.2-1

Code:
Linux shv2 4.2.3-2-pve #1 SMP Sun Nov 15 16:08:19 CET 2015 x86_64 GNU/Linux

shv2:~# pveversion -v
proxmox-ve: 4.0-22 (running kernel: 4.2.3-2-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-21
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
openvswitch-switch: 2.3.2-1

Code:
Linux shv3 4.2.3-2-pve #1 SMP Sun Nov 15 16:08:19 CET 2015 x86_64 GNU/Linux

shv3:~# pveversion -v
proxmox-ve: 4.0-22 (running kernel: 4.2.3-2-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-21
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve6~jessie
openvswitch-switch: 2.3.2-1


SHV3 NODE PROBLEM LOGS
-------------------------------

Code:
systemctl status pve-cluster.service 
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
   Active: active (running) since Fri 2015-12-04 15:42:25 CET; 12min ago
  Process: 2152 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0
  Process: 2136 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 2150 (pmxcfs)
   CGroup: /system.slice/pve-cluster.service
           └─2150 /usr/bin/pmxcfs


Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_send_message failed: 9
Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_send_message failed: 9
Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_send_message failed: 9
Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_send_message failed: 9
Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_send_message failed: 9
Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_send_message failed: 9
Dec 04 15:54:36 shv3 pmxcfs[2150]: [quorum] crit: quorum_initialize failed: 2
Dec 04 15:54:36 shv3 pmxcfs[2150]: [confdb] crit: cmap_initialize failed: 2
Dec 04 15:54:36 shv3 pmxcfs[2150]: [dcdb] crit: cpg_initialize failed: 2
Dec 04 15:54:36 shv3 pmxcfs[2150]: [status] crit: cpg_initialize failed: 2
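
These pmxcfs errors only say that it cannot talk to corosync, which matches corosync being down on shv3 (see below). A quick way to check that from the shell (standard corosync tools, not taken from the original logs):

Code:
systemctl is-active corosync     # is the service up at all?
corosync-cfgtool -s              # ring status, once corosync starts
corosync-quorumtool -s           # quorum / membership view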

Code:
systemctl status corosync.service 
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
   Active: failed (Result: exit-code) since Fri 2015-12-04 15:43:25 CET; 11min ago
  Process: 2658 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAIL


Dec 04 15:42:25 shv3 corosync[2721]: [QB    ] server name: cpg
Dec 04 15:42:25 shv3 corosync[2721]: [SERV  ] Service engine loaded: corosync profile lo
Dec 04 15:42:25 shv3 corosync[2721]: [QUORUM] Using quorum provider corosync_votequorum
Dec 04 15:42:25 shv3 corosync[2721]: [QUORUM] Quorum provider: corosync_votequorum faile
Dec 04 15:42:25 shv3 corosync[2721]: [SERV  ] Service engine 'corosync_quorum' failed to
Dec 04 15:42:25 shv3 corosync[2721]: [MAIN  ] Corosync Cluster Engine exiting with statu
Dec 04 15:43:25 shv3 corosync[2658]: Starting Corosync Cluster Engine (corosync): [FAILE
Dec 04 15:43:25 shv3 systemd[1]: corosync.service: control process exited, code=exited s
Dec 04 15:43:25 shv3 systemd[1]: Failed to start Corosync Cluster Engine.
Dec 04 15:43:25 shv3 systemd[1]: Unit corosync.service entered failed state.
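
The unit status truncates the interesting lines; to see the full reason why votequorum fails to load, the journal or a foreground run helps (standard systemd/corosync commands, not from the original post):

Code:
journalctl -u corosync.service -b --no-pager   # full startup log for this boot
corosync -f                                    # run in the foreground, errors go to the terminal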
 
Corosync seems to be trouble for a lot of people, and multicast is at the center of the problems. I couldn't get it stable myself and am looking for another way to do clustering without multicast. Some people run corosync in unicast, but I couldn't find the info to get that working either. I suspect my earlier attempt at multicast left things in an unclean state.

http://pve.proxmox.com/wiki/Multicast_notes#Use_unicast_instead_of_multicast_.28if_all_else_fails.29

Also the instructions expect you to edit /etc/pve/cluster.conf and assume you have a working cluster to propagate the changes!
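
What the wiki boils down to (on PVE 4.x the file is /etc/pve/corosync.conf) is switching the totem transport to unicast; a rough sketch, not a drop-in config, and config_version has to be bumped so the change is accepted:

Code:
# /etc/pve/corosync.conf - totem section only, sketch
totem {
  version: 2
  config_version: 4        # must be higher than the current value
  cluster_name: mycluster  # placeholder
  transport: udpu          # unicast UDP instead of multicast
  interface {
    ringnumber: 0
    bindnetaddr: 10.64.2.0 # the cluster subnet from this thread
  }
}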
 
Do you have an IGMP querier configured on your network for multicast?

I set kernel options for linux bridges:

echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping
echo 0 > /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts

but I didn't set:

echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier
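
(If I did want to set the querier and keep all of this across reboots, one way, assuming the default ifupdown setup in /etc/network/interfaces and my vmbr0 addressing, would be post-up lines on the bridge:)

Code:
# /etc/network/interfaces - vmbr0 stanza, sketch only
auto vmbr0
iface vmbr0 inet static
    address 10.64.2.3
    netmask 255.255.255.0
    bridge_ports eth4
    bridge_stp off
    bridge_fd 0
    post-up echo 0 > /sys/devices/virtual/net/vmbr0/bridge/multicast_snooping
    post-up echo 1 > /sys/devices/virtual/net/vmbr0/bridge/multicast_querier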

My switches have IGMP turned off, so multicast behaves like broadcast now. Some tests:




                                                        SHV1   SHV2   SHV3
MULTICAST ADDRESS - netstat -g (vmbr0 1 239.192.2.227)  ok     ok     err
MULTICAST PING (239.192.2.227)                          ok     ok     err
MULTICAST PING (224.0.0.1)                              ok     ok     ok

MULTICAST PING (239.192.2.227):

Code:
ping 239.192.2.227
PING 239.192.2.227 (239.192.2.227) 56(84) bytes of data.
64 bytes from 10.64.2.2: icmp_seq=1 ttl=64 time=0.012 ms
64 bytes from 10.64.2.1: icmp_seq=1 ttl=64 time=0.156 ms (DUP!)

MULTICAST PING (224.0.0.1):

Code:
ping 224.0.0.1
PING 224.0.0.1 (224.0.0.1) 56(84) bytes of data.
64 bytes from 10.64.2.2: icmp_seq=1 ttl=64 time=0.011 ms
64 bytes from 10.64.2.3: icmp_seq=1 ttl=64 time=0.272 ms (DUP!)
64 bytes from 10.64.2.1: icmp_seq=1 ttl=64 time=0.275 ms (DUP!)

In the beginning the 3-node cluster had been working fine, until a week ago when I restarted one node. The switches weren't touched. Does anybody have an idea why the last node can't establish quorum?
 
I do not use bonding on vmbr0; my network setup:

vmbr0:
- eth4 (hypervisors network)

Code:
IFACE LIST:
----------
1)	UN (lo) 
	 127.0.0.1/8 
2)	UP (eth0) 
3)	DN (eth1) 
4)	DN (eth2) 
5)	DN (eth3) 
6)	UP (eth4) 
7)	DN (eth5) 
8)	UP (ib0) 
9)	UP (ib1) 
10)	UP (ib0.8001@ib0) 
11)	UP (ib1.8001@ib1) 
12)	UP (bond0) 
	 10.27.4.228/16
13)	UP (bond1) 
	 10.64.0.3/16
16)	UP (vmbr0) 
	 10.64.2.3/24
17)	UP (vmbr1) 
	 10.26.4.228/16
18)	UN (tap108i0) 
19)	UP (vmbr0v203) 
20)	UP (eth4.203@eth4) 


BRCTL SHOW:
----------
bridge name	bridge id		STP enabled	interfaces
vmbr0		8000.3464a9b87660	no		eth4
vmbr0v203		8000.3464a9b87660	no		eth4.203
							tap108i0
vmbr1		8000.3863bb4332ec	no		eth0
 
Not sure it's related, but are you sure there is no problem with overlapping networks?

13) UP (bond1)
10.64.0.3/16
16) UP (vmbr0)
10.64.2.3/24
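
For anyone finding this later: 10.64.0.0/16 on bond1 already contains 10.64.2.0/24 on vmbr0, so the kernel has two connected routes covering the corosync subnet, which can confuse interface selection for the cluster traffic. Roughly what it looks like (output reconstructed for illustration, not copied from the nodes):

Code:
ip route show | grep '10\.64'
# 10.64.0.0/16 dev bond1  proto kernel  scope link  src 10.64.0.3
# 10.64.2.0/24 dev vmbr0  proto kernel  scope link  src 10.64.2.3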

You have a sense of humor :)) And you are RIGHT!!! :D Thank you very much :)) It solved my problem ;)

And I also added these iptables rules:

Code:
# In case cman crashes with cpg_send_message failed: 9, add these to your rule set:
iptables -A INPUT -m addrtype --dst-type MULTICAST -j ACCEPT
iptables -A INPUT -p udp -m state --state NEW -m multiport --dports 5404,5405 -j ACCEPT
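
Rules added like this are gone after a reboot; one way to keep them (assuming the iptables-persistent package on Debian jessie) is:

Code:
apt-get install iptables-persistent
iptables-save > /etc/iptables/rules.v4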
 