Corosync Service won't start after node reboot

Wojciech Szpunar

New Member
Jun 3, 2016
I've run into an issue with corosync after a node restart. When the machine is rebooted, corosync fails to start on one of the nodes, and I'm not sure why. Can somebody help me narrow down where the problem is?


systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
Active: failed (Result: exit-code) since Fri 2016-06-03 14:42:28 IST; 17min ago
Process: 1946 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

Jun 03 14:41:27 proxmoxn2 corosync[1957]: [SERV ] Service engine loaded: corosync configuration service [1]
Jun 03 14:41:27 proxmoxn2 corosync[1957]: [QB ] server name: cfg
Jun 03 14:41:27 proxmoxn2 corosync[1957]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun 03 14:41:27 proxmoxn2 corosync[1957]: [QB ] server name: cpg
Jun 03 14:41:27 proxmoxn2 corosync[1957]: [SERV ] Service engine loaded: corosync profile loading service [4]
Jun 03 14:41:27 proxmoxn2 corosync[1957]: [QUORUM] Using quorum provider corosync_votequorum
Jun 03 14:42:28 proxmoxn2 corosync[1946]: Starting Corosync Cluster Engine (corosync): [FAILED]
Jun 03 14:42:28 proxmoxn2 systemd[1]: corosync.service: control process exited, code=exited status=1
Jun 03 14:42:28 proxmoxn2 systemd[1]: Failed to start Corosync Cluster Engine.
Jun 03 14:42:28 proxmoxn2 systemd[1]: Unit corosync.service entered failed state.


journalctl -xn
-- Logs begin at Fri 2016-06-03 14:41:15 IST, end at Fri 2016-06-03 14:59:50 IST. --
Jun 03 14:59:38 proxmoxn2 pmxcfs[1931]: [dcdb] crit: cpg_initialize failed: 2
Jun 03 14:59:38 proxmoxn2 pmxcfs[1931]: [status] crit: cpg_initialize failed: 2
Jun 03 14:59:44 proxmoxn2 pmxcfs[1931]: [quorum] crit: quorum_initialize failed: 2
Jun 03 14:59:44 proxmoxn2 pmxcfs[1931]: [confdb] crit: cmap_initialize failed: 2
Jun 03 14:59:44 proxmoxn2 pmxcfs[1931]: [dcdb] crit: cpg_initialize failed: 2
Jun 03 14:59:44 proxmoxn2 pmxcfs[1931]: [status] crit: cpg_initialize failed: 2
Jun 03 14:59:50 proxmoxn2 pmxcfs[1931]: [quorum] crit: quorum_initialize failed: 2
Jun 03 14:59:50 proxmoxn2 pmxcfs[1931]: [confdb] crit: cmap_initialize failed: 2
Jun 03 14:59:50 proxmoxn2 pmxcfs[1931]: [dcdb] crit: cpg_initialize failed: 2
Jun 03 14:59:50 proxmoxn2 pmxcfs[1931]: [status] crit: cpg_initialize failed: 2


pvecm status
Cannot initialize CMAP service
pvecm nodes
Cannot initialize CMAP service

root@proxmoxn2:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.106.234 proxmoxn2.servergate.local proxmoxn2 pvelocalhost
192.168.106.235 proxmoxn1.servergate.local proxmoxn1
192.168.106.231 proxmoxn3.servergate.local proxmoxn3

Thanks

Wojtek
 
Hi,

Can you please send the output of

from one node:
cat /etc/pve/corosync.conf

from all nodes:
cat /etc/network/interfaces
pveversion -v
 
Hi,

This is the node that the machine is on, and from which I took the previous snippets:

root@proxmoxn1:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: proxmoxn3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: proxmoxn3
  }

  node {
    name: proxmoxn1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: proxmoxn1
  }

  node {
    name: proxmoxn2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: proxmoxn2
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cloud
  config_version: 5
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 192.168.106.235
    ringnumber: 0
  }
}

And the other requested snippets:

root@proxmoxn1:~# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage part of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface eth0 inet manual

iface eth1 inet manual

iface eth2 inet manual

iface eth3 inet manual

auto bond0
iface bond0 inet manual
    slaves eth0 eth2
    bond_miimon 100
    bond_mode balance-rr

auto bond1
iface bond1 inet manual
    slaves eth1 eth3
    bond_miimon 100
    bond_mode balance-rr

auto vmbr0
iface vmbr0 inet static
    address 192.168.106.235
    netmask 255.255.255.0
    gateway 192.168.106.1
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0
    bridge_vlan_aware yes

auto vmbr1
iface vmbr1 inet static
    address 192.168.106.120
    netmask 255.255.255.0
    bridge_ports bond1
    bridge_stp off
    bridge_fd 0
    bridge_vlan_aware yes

root@proxmoxn1:~# pveversion -v
proxmox-ve: 4.2-48 (running kernel: 4.4.6-1-pve)
pve-manager: 4.2-2 (running version: 4.2-2/725d76f0)
pve-kernel-4.4.6-1-pve: 4.4.6-48
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-72
pve-firmware: 1.1-8
libpve-common-perl: 4.0-59
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-14
pve-container: 1.0-62
pve-firewall: 2.0-25
pve-ha-manager: 1.0-28
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie


root@proxmoxn2:~# cat /etc/network/interfaces && pveversion -v
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage part of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface eth0 inet manual

iface eth1 inet manual

iface eth2 inet manual

iface eth3 inet manual

auto bond0
iface bond0 inet manual
    slaves eth0 eth2
    bond_miimon 100
    bond_mode balance-rr

auto bond1
iface bond1 inet manual
    slaves eth1 eth3
    bond_miimon 100
    bond_mode balance-rr

auto vmbr0
iface vmbr0 inet static
    address 192.168.106.234
    netmask 255.255.255.0
    gateway 192.168.106.1
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0

auto vmbr1
iface vmbr1 inet static
    address 192.168.106.121
    netmask 255.255.255.0
    bridge_ports bond1
    bridge_stp off
    bridge_fd 0
    bridge_vlan_aware yes

proxmox-ve: 4.2-48 (running kernel: 4.4.6-1-pve)
pve-manager: 4.2-2 (running version: 4.2-2/725d76f0)
pve-kernel-4.4.6-1-pve: 4.4.6-48
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-72
pve-firmware: 1.1-8
libpve-common-perl: 4.0-59
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-14
pve-container: 1.0-62
pve-firewall: 2.0-25
pve-ha-manager: 1.0-28
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie



root@proxmoxn3:~# cat /etc/network/interfaces && pveversion -v
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage part of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface eth0 inet manual

iface eth1 inet manual

iface eth2 inet manual

iface eth3 inet manual

auto bond0
iface bond0 inet manual
    slaves eth0 eth2
    bond_miimon 100
    bond_mode balance-rr

auto bond1
iface bond1 inet manual
    slaves eth1 eth3
    bond_miimon 100
    bond_mode balance-rr

auto vmbr0
iface vmbr0 inet static
    address 192.168.106.231
    netmask 255.255.255.0
    gateway 192.168.106.1
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0
    bridge_vlan_aware yes

auto vmbr1
iface vmbr1 inet static
    address 192.168.106.122
    netmask 255.255.255.0
    bridge_ports bond1
    bridge_stp off
    bridge_fd 0
    bridge_vlan_aware yes

proxmox-ve: 4.2-48 (running kernel: 4.4.6-1-pve)
pve-manager: 4.2-2 (running version: 4.2-2/725d76f0)
pve-kernel-4.4.6-1-pve: 4.4.6-48
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-72
pve-firmware: 1.1-8
libpve-common-perl: 4.0-59
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-14
pve-container: 1.0-62
pve-firewall: 2.0-25
pve-ha-manager: 1.0-28
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
 
Please update to the current version.
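
For reference, on a PVE 4.x node that is normally done with apt. A minimal sketch, assuming the no-subscription repository is configured (use the enterprise repository instead if you have a subscription):

# /etc/apt/sources.list.d/pve-no-subscription.list  (assumption: no subscription)
deb http://download.proxmox.com/debian jessie pve-no-subscription

apt-get update
apt-get dist-upgrade    # pulls in the newer pve-manager, corosync-pve, pve-cluster, etc.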
 
Something of note: I just had the same issue after reconfiguring the bridges to use Open vSwitch instead of Linux bridging.

Nothing I did would bring up the corosync service, so as a last-ditch effort I checked the /etc/network/interfaces file and found that vmbr1 was listed first and vmbr0 second. I swapped them in the interfaces file and rebooted the node, and all was well once it came back up. Apparently, because vmbr1 was initialized first and vmbr0 second, corosync couldn't start.

I don't know if this is a bug, but I thought I'd bring it to everyone's attention in case it needs a closer look.
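
For illustration only, a simplified sketch of the ordering described above (the actual file here used Open vSwitch stanzas; plain Linux-bridge syntax is shown), with the vmbr0 stanza that carries the cluster address defined before vmbr1:

auto vmbr0
iface vmbr0 inet static
    address 192.168.106.234    # cluster/corosync address (example values from this thread)
    netmask 255.255.255.0
    gateway 192.168.106.1
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0

auto vmbr1
iface vmbr1 inet static
    address 192.168.106.121
    netmask 255.255.255.0
    bridge_ports bond1
    bridge_stp off
    bridge_fd 0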
 
We've observed a new issue related to this "Corosync not starting after node reboot":
If an IPv4 gateway isn't available at boot, corosync won't start; bear in mind that the IPv6 gateway is fine at that point.
This seems like a serious issue, as there's no guarantee that nodes will always have an Internet connection, and one shouldn't be necessary for them to operate normally. Also, why would an IPv4 gateway be required when an IPv6 gateway is available?
This has caused serious outages for our customers a couple of times. We virtualize firewalls, so if the node carrying the gateway for the other nodes (or for itself) can't bring up the firewall at boot because it can't reach the Internet, the entire 9-node cluster may go down as well. This can happen when reboots are scheduled overnight to apply kernel updates, etc.
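
For reference, one way to confirm that the cluster links themselves don't depend on the gateway is to check the corosync ring status and multicast between the nodes, for example (omping must be installed and started on every node at roughly the same time; hostnames are the ones from this thread):

corosync-cfgtool -s                                 # ring status, on a node where corosync is up
omping -c 10 -i 1 proxmoxn1 proxmoxn2 proxmoxn3     # multicast test, run on every node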
 
First I added 4 bridges without assigning IP addresses and rebooted. In this case the bridges didn't come up automatically, so I started them manually, but I believe corosync was running (not verified).
Then I added an IP address to vmbr1 and rebooted; this time the bridges came up automatically, but corosync crashed.
Then I tried to remove the IP again and rebooted, with no luck. In /etc/network/interfaces the IP was gone, but ifconfig still showed it assigned to vmbr1. After ifdown/ifup on vmbr1 the IP was gone, but there was still no way to start corosync. After the next reboot, corosync came up again and the network configuration matched /etc/network/interfaces. I thought the problem might have been caused by manually editing the interfaces file in between, but the final edit was made via the GUI.

Then I reconfigured the other vmbr's again and removed the .<VLAN> suffix from some NICs to get plain ethX interfaces on the bridges (before that I had tested with VLANs, e.g. eth2.11). All changes were made via the GUI. After the first reboot corosync didn't start and the ifconfig output didn't match /etc/network/interfaces. After the second boot the networking was OK (both ifconfig and the interfaces file), but corosync was down.
After the third reboot, both networking and corosync were OK.

So in both cases I needed three reboots to get networking and corosync operational.
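
For reference, after editing the bridges it should usually be possible to re-apply the network configuration and restart the cluster stack without a full reboot, and the log of the current boot shows why corosync failed. A sketch, assuming classic ifupdown as on a default PVE 4.x install:

systemctl restart networking            # re-apply /etc/network/interfaces (may briefly drop connectivity)
systemctl restart corosync
systemctl restart pve-cluster
journalctl -b -u corosync --no-pager    # corosync startup errors from the current boot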

proxmox-ve: 4.4-86 (running kernel: 4.4.49-1-pve)
pve-manager: 4.4-13 (running version: 4.4-13/7ea56165)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+1
libqb0: 1.0.1-1
pve-cluster: 4.0-49
qemu-server: 4.0-110
pve-firmware: 1.1-11
libpve-common-perl: 4.0-94
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-97
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
openvswitch-switch: 2.6.0-2
 
