pve-cluster.service errors and corosync.service failed to start at boot time

Gaspar

New Member
May 18, 2016
Hello,

I get this error after a reboot:

Code:
# service corosync status
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
   Active: failed (Result: exit-code) since Wed 2016-05-18 11:58:55 CEST; 6min ago
  Process: 1219 ExecStart=/usr/share/corosync/corosync start (code=exited, status=1/FAILURE)

May 18 11:58:55 castor corosync[1228]: [MAIN  ] Corosync Cluster Engine ('2.3.5.15-e2b6b'): started and ready to provide service.
May 18 11:58:55 castor corosync[1228]: [MAIN  ] Corosync built-in features: augeas systemd pie relro bindnow
May 18 11:58:55 castor corosync[1228]: [MAIN  ] parse error in config: No interfaces defined
May 18 11:58:55 castor corosync[1228]: [MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1278.
May 18 11:58:55 castor corosync[1219]: Starting Corosync Cluster Engine (corosync): [FAILED]
May 18 11:58:55 castor systemd[1]: corosync.service: control process exited, code=exited status=1
May 18 11:58:55 castor systemd[1]: Failed to start Corosync Cluster Engine.
May 18 11:58:55 castor systemd[1]: Unit corosync.service entered failed state.

I think this is happening because of errors while pve-cluster.service is starting:

Code:
# service pve-cluster status
● pve-cluster.service - The Proxmox VE cluster filesystem
   Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled)
   Active: active (running) since Wed 2016-05-18 11:58:55 CEST; 6min ago
  Process: 1190 ExecStartPost=/usr/bin/pvecm updatecerts --silent (code=exited, status=0/SUCCESS)
  Process: 1126 ExecStart=/usr/bin/pmxcfs $DAEMON_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 1188 (pmxcfs)
   CGroup: /system.slice/pve-cluster.service
           └─1188 /usr/bin/pmxcfs

May 18 12:05:18 castor pmxcfs[1188]: [dcdb] crit: cpg_initialize failed: 2
May 18 12:05:18 castor pmxcfs[1188]: [status] crit: cpg_initialize failed: 2
May 18 12:05:24 castor pmxcfs[1188]: [quorum] crit: quorum_initialize failed: 2
May 18 12:05:24 castor pmxcfs[1188]: [confdb] crit: cmap_initialize failed: 2
May 18 12:05:24 castor pmxcfs[1188]: [dcdb] crit: cpg_initialize failed: 2
May 18 12:05:24 castor pmxcfs[1188]: [status] crit: cpg_initialize failed: 2
May 18 12:05:30 castor pmxcfs[1188]: [quorum] crit: quorum_initialize failed: 2
May 18 12:05:30 castor pmxcfs[1188]: [confdb] crit: cmap_initialize failed: 2
May 18 12:05:30 castor pmxcfs[1188]: [dcdb] crit: cpg_initialize failed: 2
May 18 12:05:30 castor pmxcfs[1188]: [status] crit: cpg_initialize failed: 2

When I start pve-cluster manually, it starts correctly, and corosync does too.

NB: I am working with UDPU because I can't use multicast.
Here is my Proxmox version:
Code:
# pveversion -v
proxmox-ve: 4.2-49 (running kernel: 4.4.8-1-pve)
pve-manager: 4.2-4 (running version: 4.2-4/2660193c)
pve-kernel-4.4.8-1-pve: 4.4.8-49
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-74
pve-firmware: 1.1-8
libpve-common-perl: 4.0-60
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-16
pve-container: 1.0-63
pve-firewall: 2.0-26
pve-ha-manager: 1.0-31
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1

Could anyone help me?

Thanks!
 
It seems there is a syntax error in your config ("parse error in config: No interfaces defined").
 
Yes, I didn't set an interface address because the nodes are not in the same subnet. I tried with one of the nodes' addresses and with 0.0.0.0, but corosync wouldn't start (see the sketch after the config below). With the following configuration it works, but not at boot time.

Code:
cat /etc/pve/corosync.conf
totem {
  version: 2
  secauth: on
  cluster_name: dioscures
  config_version: 6
  ip_version: ipv4
  transport: udpu
  interface {
    ringnumber: 0
  }
}
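
For reference, the variant we tried with an explicit bind address looked roughly like this (a sketch; we also tried one of the nodes' public addresses in place of 0.0.0.0):

Code:
totem {
  version: 2
  secauth: on
  cluster_name: dioscures
  config_version: 6
  ip_version: ipv4
  transport: udpu
  interface {
    ringnumber: 0
    bindnetaddr: 0.0.0.0
  }
}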
 
Good Morning,

Thank you for your answer. I work with Gaspar on this cluster.

Here is the full file:

Code:
root@castor:~# cat /etc/corosync/corosync.conf
totem {
  version: 2
  secauth: on
  cluster_name: dioscures
  config_version: 6
  ip_version: ipv4
  transport: udpu
  interface {
    ringnumber: 0
  }
}

nodelist {
  node {
    ring0_addr: castor
    name: castor
    nodeid: 1
    quorum_votes: 1
  }
  node {
    ring0_addr: pollux
    name: pollux
    nodeid: 2
    quorum_votes: 1
  }
}

quorum {
  provider: corosync_votequorum
  expected_votes: 1
  two_node: 1
}

logging {
  to_syslog: yes
  debug: off
}


Setting bindnetaddr to either the public address of the first node or to 0.0.0.0 doesn't allow the second node to join, so we removed it and the second node joined.
It works, more or less, but the node cannot join automatically at boot time. We have to restart pve-cluster manually:

Code:
service pve-cluster restart

Best regards,

Thierry
 
Apparently the network interface eth0 isn't up and configured with its DHCP-assigned address yet when pmxcfs starts.

Extracts from syslog:

Code:
May 18 10:12:50 castor kernel: [   14.031771] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
May 18 10:12:50 castor pmxcfs[1163]: [quorum] crit: quorum_initialize failed: 2
May 18 10:12:52 castor corosync[1230]:  [MAIN  ] parse error in config: No interfaces defined
May 18 10:12:52 castor ntpd_intres[1110]: host name not found: 0.debian.pool.ntp.org
May 18 10:12:53 castor kernel: [   17.587743] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May 18 10:12:53 castor dhclient: bound to {public IP address} -- renewal in 2147483648 seconds.
May 18 10:12:56 castor pmxcfs[1163]: [quorum] crit: quorum_initialize failed: 2
May 18 10:12:58 castor ntpd_intres[1110]: DNS 0.debian.pool.ntp.org -> 195.154.174.209

By that time corosync has already read its configuration file, tried to connect and failed.

What is surprising is that it doesn't keep retrying once the NIC comes up.

We will try to postpone the start of pmxcfs until the NIC is ready.
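
One way we might do that (an untested sketch on our side; the drop-in file name is our own choice) is to order pve-cluster.service after network-online.target with a systemd drop-in:

Code:
# order pve-cluster after the network is fully configured
mkdir -p /etc/systemd/system/pve-cluster.service.d
cat > /etc/systemd/system/pve-cluster.service.d/wait-online.conf <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
systemctl daemon-reload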

Thank you for your attention.
Best regards,

Thierry
 
OK, I found a solution; I don't know if it's the best one.

I changed the file /etc/systemd/system/multi-user.target.wants/corosync.service:

Code:
root@castor:~# diff -u corosync.service.old corosync.service.new
--- corosync.service.old    2016-05-18 17:44:53.249169049 +0200
+++ corosync.service.new    2016-05-18 17:45:50.849885682 +0200
@@ -11,6 +11,8 @@
 [Service]
 ExecStart=/usr/share/corosync/corosync start
 ExecStop=/usr/share/corosync/corosync stop
+Restart=on-failure
+RestartSec=5s
 Type=forking
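
Note that editing the unit under multi-user.target.wants directly may be undone by a package upgrade. The same two lines could probably go into a drop-in instead (an untested sketch):

Code:
# keep the override in a drop-in so package upgrades don't revert it
mkdir -p /etc/systemd/system/corosync.service.d
cat > /etc/systemd/system/corosync.service.d/restart.conf <<'EOF'
[Service]
Restart=on-failure
RestartSec=5s
EOF
systemctl daemon-reload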
 
