Cluster fails to start on boot

gerasiov

Aug 2, 2012
Hi there.
I have a two-host cluster.
It works, but the second node fails to start the cluster-related services on boot.

/etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="lvknet-cluster1" config_version="11">

<cman keyfile="/var/lib/pve-cluster/corosync.authkey">
</cman>

<clusternodes>
<clusternode name="glagol" votes="2" nodeid="1"/>
<clusternode name="coffin" votes="1" nodeid="2"/>
</clusternodes>

</cluster>
bootlog:

Thu Aug 2 12:36:53 2012: Starting pve cluster filesystem : pve-cluster.
Thu Aug 2 12:36:54 2012: Starting OpenBSD Secure Shell server: sshd.
Thu Aug 2 12:36:54 2012: Starting cluster:
Thu Aug 2 12:36:54 2012: Checking if cluster has been disabled at boot... [ OK ]
Thu Aug 2 12:36:54 2012: Checking Network Manager... [ OK ]
Thu Aug 2 12:36:54 2012: Global setup... [ OK ]
Thu Aug 2 12:36:54 2012: Loading kernel modules... [ OK ]
Thu Aug 2 12:36:54 2012: Mounting configfs... [ OK ]
Thu Aug 2 12:36:54 2012: Starting cman... [ OK ]
Thu Aug 2 12:36:59 2012: Waiting for quorum... Timed-out waiting for cluster
Thu Aug 2 12:37:44 2012: [FAILED]
Thu Aug 2 12:37:44 2012: Stopping cluster:
Thu Aug 2 12:37:44 2012: Stopping dlm_controld... [ OK ]
Thu Aug 2 12:37:44 2012: Stopping fenced... [ OK ]
Thu Aug 2 12:37:44 2012: Stopping cman... [ OK ]
Thu Aug 2 12:37:45 2012: Waiting for corosync to shutdown:[ OK ]
Thu Aug 2 12:37:45 2012: Unloading kernel modules... [ OK ]
Thu Aug 2 12:37:45 2012: Unmounting configfs... [ OK ]

syslog:

Aug 2 12:36:53 coffin nullmailer[1564]: Rescanning queue.
Aug 2 12:36:53 coffin pmxcfs[1575]: [quorum] crit: quorum_initialize failed: 6
Aug 2 12:36:53 coffin pmxcfs[1575]: [quorum] crit: can't initialize service
Aug 2 12:36:53 coffin pmxcfs[1575]: [confdb] crit: confdb_initialize failed: 6
Aug 2 12:36:53 coffin pmxcfs[1575]: [quorum] crit: can't initialize service
Aug 2 12:36:53 coffin pmxcfs[1575]: [dcdb] crit: cpg_initialize failed: 6
Aug 2 12:36:53 coffin pmxcfs[1575]: [quorum] crit: can't initialize service
Aug 2 12:36:53 coffin pmxcfs[1575]: [dcdb] crit: cpg_initialize failed: 6
Aug 2 12:36:53 coffin pmxcfs[1575]: [quorum] crit: can't initialize service
Aug 2 12:36:54 coffin kernel: DLM (built Jul 9 2012 08:30:07) installed
Aug 2 12:36:55 coffin corosync[1696]: [MAIN ] Corosync Cluster Engine ('1.4.3'): started and ready to provide service.
Aug 2 12:36:55 coffin corosync[1696]: [MAIN ] Corosync built-in features: nss
Aug 2 12:36:55 coffin corosync[1696]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Aug 2 12:36:55 coffin corosync[1696]: [MAIN ] Successfully parsed cman config
Aug 2 12:36:55 coffin corosync[1696]: [MAIN ] Successfully configured openais services to load
Aug 2 12:36:55 coffin corosync[1696]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Aug 2 12:36:55 coffin corosync[1696]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Aug 2 12:36:55 coffin corosync[1696]: [TOTEM ] The network interface [192.168.2.80] is now up.
Aug 2 12:36:55 coffin corosync[1696]: [QUORUM] Using quorum provider quorum_cman
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Aug 2 12:36:55 coffin corosync[1696]: [CMAN ] CMAN 1342001054 (built Jul 11 2012 12:04:18) started
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: openais cluster membership service B.01.01
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: openais event service B.01.01
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: openais message service B.03.01
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: openais distributed locking service B.03.01
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: openais timer service A.01.01
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: corosync configuration service
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: corosync profile loading service
Aug 2 12:36:55 coffin corosync[1696]: [QUORUM] Using quorum provider quorum_cman
Aug 2 12:36:55 coffin corosync[1696]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Aug 2 12:36:55 coffin corosync[1696]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Aug 2 12:36:55 coffin corosync[1696]: [CLM ] CLM CONFIGURATION CHANGE
Aug 2 12:36:55 coffin corosync[1696]: [CLM ] New Configuration:
Aug 2 12:36:55 coffin corosync[1696]: [CLM ] Members Left:
Aug 2 12:36:55 coffin corosync[1696]: [CLM ] Members Joined:
Aug 2 12:36:55 coffin corosync[1696]: [CLM ] CLM CONFIGURATION CHANGE
Aug 2 12:36:55 coffin corosync[1696]: [CLM ] New Configuration:
Aug 2 12:36:55 coffin corosync[1696]: [CLM ] #011r(0) ip(192.168.2.80)
Aug 2 12:36:55 coffin corosync[1696]: [CLM ] Members Left:
Aug 2 12:36:55 coffin corosync[1696]: [CLM ] Members Joined:
Aug 2 12:36:55 coffin corosync[1696]: [CLM ] #011r(0) ip(192.168.2.80)
Aug 2 12:36:55 coffin corosync[1696]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 2 12:36:55 coffin corosync[1696]: [QUORUM] Members[1]: 2
Aug 2 12:36:55 coffin corosync[1696]: [QUORUM] Members[1]: 2
Aug 2 12:36:55 coffin corosync[1696]: [CPG ] chosen downlist: sender r(0) ip(192.168.2.80) ; members(old:0 left:0)
Aug 2 12:36:55 coffin corosync[1696]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 2 12:36:59 coffin pmxcfs[1575]: [status] notice: update cluster info (cluster name lvknet-cluster1, version = 11)
Aug 2 12:36:59 coffin pmxcfs[1575]: [dcdb] notice: members: 2/1575
Aug 2 12:36:59 coffin pmxcfs[1575]: [dcdb] notice: all data is up to date
Aug 2 12:36:59 coffin pmxcfs[1575]: [dcdb] notice: members: 2/1575
Aug 2 12:36:59 coffin pmxcfs[1575]: [dcdb] notice: all data is up to date
Aug 2 12:37:03 coffin kernel: vmbr808: no IPv6 routers present
Aug 2 12:37:03 coffin kernel: vlan21: no IPv6 routers present
Aug 2 12:37:03 coffin kernel: vlan808: no IPv6 routers present
Aug 2 12:37:03 coffin kernel: vmbr802: no IPv6 routers present
Aug 2 12:37:03 coffin kernel: vlan802: no IPv6 routers present
Aug 2 12:37:03 coffin kernel: vmbr8: no IPv6 routers present
Aug 2 12:37:03 coffin kernel: vmbr21: no IPv6 routers present
Aug 2 12:37:03 coffin kernel: eth0: no IPv6 routers present
Aug 2 12:37:03 coffin kernel: vlan8: no IPv6 routers present
Aug 2 12:37:03 coffin kernel: vlan807: no IPv6 routers present
Aug 2 12:37:03 coffin kernel: vmbr807: no IPv6 routers present
Aug 2 12:37:24 coffin corosync[1696]: [CLM ] CLM CONFIGURATION CHANGE
Aug 2 12:37:24 coffin corosync[1696]: [CLM ] New Configuration:
Aug 2 12:37:24 coffin corosync[1696]: [CLM ] #011r(0) ip(192.168.2.80)
Aug 2 12:37:24 coffin corosync[1696]: [CLM ] Members Left:
Aug 2 12:37:24 coffin corosync[1696]: [CLM ] Members Joined:
Aug 2 12:37:24 coffin corosync[1696]: [CLM ] CLM CONFIGURATION CHANGE
Aug 2 12:37:24 coffin corosync[1696]: [CLM ] New Configuration:
Aug 2 12:37:24 coffin corosync[1696]: [CLM ] #011r(0) ip(192.168.2.80)
Aug 2 12:37:24 coffin corosync[1696]: [CLM ] Members Left:
Aug 2 12:37:24 coffin corosync[1696]: [CLM ] Members Joined:
Aug 2 12:37:24 coffin corosync[1696]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 2 12:37:24 coffin corosync[1696]: [CPG ] chosen downlist: sender r(0) ip(192.168.2.80) ; members(old:1 left:0)
Aug 2 12:37:24 coffin corosync[1696]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 2 12:37:27 coffin corosync[1696]: [CLM ] CLM CONFIGURATION CHANGE
Aug 2 12:37:27 coffin corosync[1696]: [CLM ] New Configuration:
Aug 2 12:37:27 coffin corosync[1696]: [CLM ] #011r(0) ip(192.168.2.80)
Aug 2 12:37:27 coffin corosync[1696]: [CLM ] Members Left:
Aug 2 12:37:27 coffin corosync[1696]: [CLM ] Members Joined:
Aug 2 12:37:27 coffin corosync[1696]: [CLM ] CLM CONFIGURATION CHANGE
Aug 2 12:37:27 coffin corosync[1696]: [CLM ] New Configuration:
Aug 2 12:37:27 coffin corosync[1696]: [CLM ] #011r(0) ip(192.168.2.80)
Aug 2 12:37:27 coffin corosync[1696]: [CLM ] Members Left:
Aug 2 12:37:27 coffin corosync[1696]: [CLM ] Members Joined:
Aug 2 12:37:27 coffin corosync[1696]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 2 12:37:27 coffin corosync[1696]: [CPG ] chosen downlist: sender r(0) ip(192.168.2.80) ; members(old:1 left:0)
Aug 2 12:37:27 coffin corosync[1696]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 2 12:37:44 coffin corosync[1696]: [SERV ] Unloading all Corosync service engines.
Aug 2 12:37:44 coffin corosync[1696]: [SERV ] Service engine unloaded: corosync extended virtual synchrony service
Aug 2 12:37:44 coffin corosync[1696]: [SERV ] Service engine unloaded: corosync configuration service
Aug 2 12:37:44 coffin pmxcfs[1575]: [status] crit: cpg_dispatch failed: 2
Aug 2 12:37:44 coffin pmxcfs[1575]: [status] crit: cpg_leave failed: 2

But if I start cman manually, it works:
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Unfencing self... [ OK ]

Any ideas?
 
No, the master node has votes="2" and it starts OK.

But then the slave tries to start and fails, and I can't find out why. (It's not easy to debug, because I have to reboot it every time to reproduce the problem.)
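
For reference, the vote arithmetic behind this exchange can be sketched as below. This is only an illustration: it assumes the usual majority rule (quorum threshold = floor(total votes / 2) + 1), and the function names quorum_threshold, has_quorum and report are made up for the example. The vote counts come from the cluster.conf posted above.

```shell
#!/bin/sh
# Sketch of majority-quorum arithmetic for the two nodes in cluster.conf.
# Assumption: quorum threshold = floor(total_votes / 2) + 1.

quorum_threshold() {  # quorum_threshold <total_votes>
    echo $(( $1 / 2 + 1 ))
}

has_quorum() {  # has_quorum <votes_present> <total_votes>
    [ "$1" -ge "$(quorum_threshold "$2")" ]
}

report() {  # report <node_name> <votes_present> <total_votes>
    if has_quorum "$2" "$3"; then
        echo "$1 alone: quorate"
    else
        echo "$1 alone: not quorate"
    fi
}

glagol_votes=2   # votes="2" in cluster.conf
coffin_votes=1   # votes="1" in cluster.conf
total=$(( glagol_votes + coffin_votes ))

echo "quorum threshold: $(quorum_threshold "$total")"
report glagol "$glagol_votes" "$total"
report coffin "$coffin_votes" "$total"
```

With 3 votes in total the threshold is 2, so glagol can become quorate on its own while coffin cannot. That matches the syslog above, where coffin forms a membership containing only itself ("Members[1]: 2") and then times out waiting for quorum.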
 
Aha, after I added sleep 120 at the beginning of the init.d/cman script, it started. So I think the problem is network-related (slow bridge device start, or slow negotiation on the Cisco router).

I'll continue my experiments.
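
A possible refinement of the fixed sleep 120, as a rough sketch only: poll until some check succeeds, and give up after a timeout. The helper name wait_until and the PEER variable are hypothetical and not part of the cman init script. The timeout is approximate, since the time each check takes is not counted.

```shell
#!/bin/sh
# wait_until <timeout_seconds> <command...>
# Runs the command about once per second until it succeeds.
# Returns 0 on success, 1 if the timeout expires first.
wait_until() {
    remaining=$1
    shift
    until "$@"; do
        remaining=$(( remaining - 1 ))
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
    done
    return 0
}

# Hypothetical usage near the top of init.d/cman instead of "sleep 120".
# PEER is a placeholder for the other node's address, which is not given
# in this thread:
#
#   wait_until 120 ping -c 1 -W 1 "$PEER" ||
#       echo "peer still unreachable after 120s, continuing anyway" >&2
```

Note that a successful ping only shows unicast reachability. Corosync here uses UDP multicast, so this check may pass before multicast traffic actually gets through the bridge or the switch.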
 
