cman keeps crashing

mdevilz

New Member
Dec 9, 2011
29
0
1
Code:
root@blackmesa:~# pveversion -v
pve-manager: 2.0-38 (pve-manager/2.0/af81df02)
running kernel: 2.6.32-7-pve
proxmox-ve-2.6.32: 2.0-60
pve-kernel-2.6.32-7-pve: 2.6.32-60
lvm2: 2.02.88-2pve1
clvm: 2.02.88-2pve1
corosync-pve: 1.4.1-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-1
pve-cluster: 1.0-23
qemu-server: 2.0-25
pve-firmware: 1.0-15
libpve-common-perl: 1.0-17
libpve-access-control: 1.0-17
libpve-storage-perl: 2.0-12
vncterm: 1.0-2
vzctl: 3.0.30-2pve1
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-5
ksm-control-daemon: 1.1-1

On two separate clusters (with 2 nodes each), cman keeps crashing. I can provide more details if you help me find them :)

Thanks!
 

mdevilz

What exactly do you mean by 'crashing'?

It shows as stopped in the Service Viewer, and quorum is lost.

On both systems I have then seen errors in the syslog similar to this:

Code:
Mar  9 15:06:06 proton pmxcfs[92900]: [status] crit: cpg_send_message failed: 9
Mar  9 15:06:06 proton pmxcfs[92900]: [status] crit: cpg_send_message failed: 9
Mar  9 15:06:06 proton pmxcfs[92900]: [status] crit: cpg_send_message failed: 9
Mar  9 15:06:06 proton pmxcfs[92900]: [status] crit: cpg_send_message failed: 9
Mar  9 15:06:06 proton pmxcfs[92900]: [status] crit: cpg_send_message failed: 9
 

mdevilz

Nothing that I can find. The last thing in messages was from corosync, and that was yesterday:
Code:
Mar 12 22:12:42 novaprospekt corosync[189622]:   [TOTEM ] Retransmit List: 41a 424 425 427 3ec 3ef 40b 41d 41e 41f 420 421 422 40c 40d 40e 40f 410 411 412 413 414 415 416 417 418 419 41b 41c 426
Mar 12 22:12:42 novaprospekt corosync[189622]:   [TOTEM ] Retransmit List: 419 41b 41c 426 3ed 3ee 405 406 407 408 409 40a 423 40c 40d 40e 40f 410 411 412 413 414 415 416 417 418 41a 424 425 427
Mar 12 22:12:42 novaprospekt corosync[189622]:   [TOTEM ] Retransmit List: 41a 424 425 427 3ec 3ef 40b 41d 41e 41f 420 421 422 40c 40d 40e 40f 410 411 412 413 414 415 416 417 418 419 41b 41c 426
Mar 12 22:12:42 novaprospekt corosync[189622]:   [TOTEM ] Retransmit List: 419 41b 41c 426 3ed 3ee 405 406 407 408 409 40a 423 40c 40d 40e 40f 410 411 412 413 414 415 416 417 418 41a 424 425 427
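To get a feel for how many messages are being re-requested, the retransmit lists can be summarized with a short script. This is only a minimal sketch, assuming syslog lines in the format shown above and that the list entries are hexadecimal sequence numbers; `summarize_retransmits` is a hypothetical helper name, not part of corosync, and the sample lines are shortened versions of the ones above.

```python
import re

# Matches corosync "[TOTEM ] Retransmit List:" lines and captures the
# run of hexadecimal sequence numbers that follows.
RETRANSMIT_RE = re.compile(r"\[TOTEM \] Retransmit List:\s*((?:[0-9a-f]+\s*)+)")


def summarize_retransmits(lines):
    """Return (number of retransmit lines, sorted unique sequence numbers)."""
    count = 0
    seqs = set()
    for line in lines:
        match = RETRANSMIT_RE.search(line)
        if not match:
            continue  # not a retransmit line
        count += 1
        seqs.update(int(token, 16) for token in match.group(1).split())
    return count, sorted(seqs)


if __name__ == "__main__":
    # Shortened sample lines; in practice, feed the lines of the syslog file.
    sample = [
        "Mar 12 22:12:42 novaprospekt corosync[189622]:   [TOTEM ] Retransmit List: 41a 424 425 427",
        "Mar 12 22:12:42 novaprospekt corosync[189622]:   [TOTEM ] Retransmit List: 419 41b 41c 426",
    ]
    n, seqs = summarize_retransmits(sample)
    print(n, "retransmit lines,", len(seqs), "distinct sequence numbers")
```

If the same small set of sequence numbers keeps coming back, as it does in the log above, that suggests the same messages are never getting through, rather than occasional random loss.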
 

mdevilz

Output of a cman restart:
Code:
root@proton:~# /etc/init.d/cman restart
Stopping cluster:
   Stopping dlm_controld... [  OK  ]
   Stopping fenced... [  OK  ]
   Stopping cman... [  OK  ]
   Waiting for corosync to shutdown:[  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs... [  OK  ]
Starting cluster:
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Unfencing self... [  OK  ]

This is after issuing "pvecm e 1" on both machines. Otherwise quorum fails.
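Why lowering the expected votes makes a difference on a two-node cluster: under a simple majority rule, quorum needs more than half of the expected votes, so with two expected votes a node that cannot see its peer can never be quorate. The following is a minimal sketch of that arithmetic, assuming one vote per node and a plain majority rule; the helper names are hypothetical and this is not the cman implementation.

```python
def quorum_threshold(expected_votes):
    """Votes needed for quorum under a simple majority rule."""
    return expected_votes // 2 + 1


def is_quorate(votes_present, expected_votes):
    """True if the votes currently visible reach the majority threshold."""
    return votes_present >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # Two-node cluster, one vote per node, peer not reachable.
    print(is_quorate(1, 2))  # False: threshold is 2, only 1 vote present
    # Same node after lowering expected votes to 1.
    print(is_quorate(1, 1))  # True: threshold is now 1
```

Note that this only explains why the workaround restores quorum. It does not address why the two nodes stop seeing each other in the first place.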

My corosync.log is attached (see Attachments below).

After that:

Code:
root@proton:/var/log/cluster# /etc/init.d/cman status
Found stale pid file

pvecm s
cman_tool: Cannot open connection to cman, is it running ?
 

Attachments

  • corosync.zip
    15.7 KB · Views: 11

mdevilz

Nothing has changed.

Same thing is happening on two different setups.

One has a managed switch (multicast enabled); the other has an unmanaged switch.

The hardware has not changed in either. The only thing I have done to them is keep them up to date with patches.
 

dietmar

Proxmox Staff Member
Staff member
Apr 28, 2005
17,137
534
133
Austria
www.proxmox.com
Do you still get the error if you stop all OpenVZ containers (reboot, do not start containers)?
 

mdevilz

I will disable starting the VMs on boot, restart the nodes, and see if that makes a difference.
 

mdevilz

Same thing happens. I did not see anything telltale in any of the logs I looked at.

I have run apt-get update/dist-upgrade, and there has been nothing new to install for a couple of weeks.

The current pveversion -v output is identical to the one in my first post.
 

dietmar

What happens if you stop the pve-cluster service?

# /etc/init.d/pve-cluster stop

Do the corosync messages disappear? (Please start the service again after the test.)
 

mdevilz

The messages turn into:

Mar 22 23:47:42 proton pvedaemon[1863]: WARNING: ipcc_send_rec failed: Connection refused

Would it matter that there are two different clusters on the same subnet?

For example, it is currently set up like this:

Fiber in -> managed switch (1)
  managed switch (1) -> node1 and node2 of cluster a (directly connected)
  managed switch (1) -> unmanaged switch -> node1 and node2 of cluster b
 

dietmar

Would it matter that there are two different clusters on the same subnet?

Well, that should be easy to test by disconnecting the unmanaged switch (cluster b).

In general, that should not matter as long as you use different cluster names (the multicast IP is computed from the cluster name).
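To illustrate how a multicast address can be derived from a cluster name, here is a rough sketch. It is an assumption based on a recollection of how cman generates a 16-bit cluster ID and a 239.192.x.y address, and may not match the exact behaviour of the version installed here; the function names and the cluster names `clustera` and `clusterb` are hypothetical. The address a running node actually uses can be checked in the output of `cman_tool status`.

```python
def cluster_id(name):
    """16-bit ID from the cluster name using a shift-and-add hash.

    ASSUMPTION: recalled from cman, not verified against this release.
    """
    value = 0
    for byte in name.encode("ascii"):
        value = (value << 1) + byte
    return value & 0xFFFF


def multicast_address(name):
    """Map the 16-bit cluster ID into the 239.192.0.0/16 range (assumed scheme)."""
    cid = cluster_id(name)
    return "239.192.%d.%d" % (cid >> 8, cid % 0xFF)


if __name__ == "__main__":
    # Hypothetical cluster names, for illustration only.
    for name in ("clustera", "clusterb"):
        print(name, "->", multicast_address(name))
```

With a hash this small, two different names can in principle map to the same ID, so different names make a collision unlikely rather than impossible. Comparing the addresses reported on both clusters settles it.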
 

mdevilz

I shut off the second cluster. No change.

I restarted cman and pve-cluster and got the TOTEM retransmit messages again (both nodes were talking and had quorum):

Code:
Mar 25 00:22:00 novaprospekt corosync[210647]:   [TOTEM ] Retransmit List: 1c8 1c9 1ca 1b3 1b4 1b7 1b8 1b9 1cd 1ba 1bb 1bc 1bd 1be 1bf 1c0 1c1 1c2 1c3 1c4 1c5 1c6 1c7 1cb 1d0 1d1
Mar 25 00:22:00 novaprospekt corosync[210647]:   [TOTEM ] Retransmit List: 1cb 1d0 1d1 1b2 1b5 1b6 1cc 1ce 1cf 1ba 1bb 1bc 1bd 1be 1bf 1c0 1c1 1c2 1c3 1c4 1c5 1c6 1c7 1c8 1c9 1ca
Mar 25 00:22:00 novaprospekt corosync[210647]:   [TOTEM ] Retransmit List: 1c8 1c9 1ca 1b3 1b4 1b7 1b8 1b9 1cd 1ba 1bb 1bc 1bd 1be 1bf 1c0 1c1 1c2 1c3 1c4 1c5 1c6 1c7 1cb 1d0 1d1
Mar 25 00:22:00 novaprospekt corosync[210647]:   [TOTEM ] Retransmit List: 1cb 1d0 1d1 1b2 1b5 1b6 1cc 1ce 1cf 1ba 1bb 1bc 1bd 1be 1bf 1c0 1c1 1c2 1c3 1c4 1c5 1c6 1c7 1c8 1c9 1ca
Mar 25 00:22:00 novaprospekt corosync[210647]:   [TOTEM ] Retransmit List: 1c8 1c9 1ca 1b3 1b4 1b7 1b8 1b9 1cd 1ba 1bb 1bc 1bd 1be 1bf 1c0 1c1 1c2 1c3 1c4 1c5 1c6 1c7 1cb 1d0 1d1
Mar 25 00:22:00 novaprospekt corosync[210647]:   [TOTEM ] Retransmit List: 1cb 1d0 1d1 1b2 1b5 1b6 1cc 1ce 1cf 1ba 1bb 1bc 1bd 1be 1bf 1c0 1c1 1c2 1c3 1c4 1c5 1c6 1c7 1c8 1c9 1ca
Mar 25 00:22:00 novaprospekt corosync[210647]:   [TOTEM ] Retransmit List: 1c8 1c9 1ca 1b3 1b4 1b7 1b8 1b9 1cd 1ba 1bb 1bc 1bd 1be 1bf 1c0 1c1 1c2 1c3 1c4 1c5 1c6 1c7 1cb 1d0 1d1
Mar 25 00:22:00 novaprospekt corosync[210647]:   [TOTEM ] Retransmit List: 1cb 1d0 1d1 1b2 1b5 1b6 1cc 1ce 1cf 1ba 1bb 1bc 1bd 1be 1bf 1c0 1c1 1c2 1c3 1c4 1c5 1c6 1c7 1c8 1c9 1ca
Mar 25 00:22:00 novaprospekt corosync[210647]:   [TOTEM ] FAILED TO RECEIVE
Mar 25 00:22:03 novaprospekt fenced[210704]: cluster is down, exiting
Mar 25 00:22:03 novaprospekt pmxcfs[211136]: [quorum] crit: quorum_dispatch failed: 2
Mar 25 00:22:03 novaprospekt dlm_controld[210722]: cluster is down, exiting
Mar 25 00:22:03 novaprospekt dlm_controld[210722]: daemon cpg_dispatch error 2
Mar 25 00:22:03 novaprospekt pmxcfs[211136]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Mar 25 00:22:03 novaprospekt pmxcfs[211136]: [confdb] crit: confdb_dispatch failed: 2
Mar 25 00:22:05 novaprospekt pmxcfs[211136]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Mar 25 00:22:05 novaprospekt pmxcfs[211136]: [dcdb] crit: cpg_dispatch failed: 2
Mar 25 00:22:05 novaprospekt kernel: dlm: closing connection to node 2
Mar 25 00:22:05 novaprospekt kernel: dlm: closing connection to node 1
Mar 25 00:22:07 novaprospekt pmxcfs[211136]: [dcdb] crit: cpg_leave failed: 2
Mar 25 00:22:09 novaprospekt pmxcfs[211136]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Mar 25 00:22:09 novaprospekt pmxcfs[211136]: [dcdb] crit: cpg_dispatch failed: 2
Mar 25 00:22:11 novaprospekt pmxcfs[211136]: [dcdb] crit: cpg_leave failed: 2
Mar 25 00:22:13 novaprospekt pmxcfs[211136]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Mar 25 00:22:13 novaprospekt pmxcfs[211136]: [quorum] crit: quorum_initialize failed: 6
Mar 25 00:22:13 novaprospekt pmxcfs[211136]: [quorum] crit: can't initialize service
Mar 25 00:22:13 novaprospekt pmxcfs[211136]: [confdb] crit: confdb_initialize failed: 6
Mar 25 00:22:13 novaprospekt pmxcfs[211136]: [quorum] crit: can't initialize service
Mar 25 00:22:13 novaprospekt pmxcfs[211136]: [dcdb] notice: start cluster connection
Mar 25 00:22:13 novaprospekt pmxcfs[211136]: [dcdb] crit: cpg_initialize failed: 6
Mar 25 00:22:13 novaprospekt pmxcfs[211136]: [quorum] crit: can't initialize service
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 2
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 2
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [dcdb] notice: start cluster connection
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [dcdb] crit: cpg_initialize failed: 6
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [quorum] crit: can't initialize service
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9
Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9

And cman stopped working on one node.
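The numeric codes in the pmxcfs lines are easier to follow with names attached. The sketch below assumes that the number after "failed:" is a corosync `cs_error_t` value and maps only the three codes that occur in this log (2, 6 and 9, named as in corosync's `corotypes.h`); `tally_failures` is a hypothetical helper. The libqb "Bad file descriptor (9)" warnings are errno values and are deliberately not matched.

```python
import re
from collections import Counter

# Subset of corosync's cs_error_t values; only the codes seen in this log.
CS_ERRORS = {
    2: "CS_ERR_LIBRARY",
    6: "CS_ERR_TRY_AGAIN",
    9: "CS_ERR_BAD_HANDLE",
}

# Matches pmxcfs lines such as "[dcdb] crit: cpg_dispatch failed: 2".
FAILED_RE = re.compile(r"pmxcfs\[\d+\]: \[\w+\] crit: (\w+) failed: (\d+)")


def tally_failures(lines):
    """Count (call, code, symbolic name) triples in pmxcfs 'failed: N' lines."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            code = int(match.group(2))
            counts[(match.group(1), code, CS_ERRORS.get(code, "unknown"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Mar 25 00:22:03 novaprospekt pmxcfs[211136]: [quorum] crit: quorum_dispatch failed: 2",
        "Mar 25 00:22:13 novaprospekt pmxcfs[211136]: [dcdb] crit: cpg_initialize failed: 6",
        "Mar 25 00:22:15 novaprospekt pmxcfs[211136]: [status] crit: cpg_send_message failed: 9",
    ]
    for (call, code, name), n in tally_failures(sample).items():
        print(call, code, name, n)
```

Read this way, the sequence looks like a consequence rather than a cause: the pmxcfs errors only start after corosync logs "FAILED TO RECEIVE" and fenced reports the cluster as down, which would be consistent with pmxcfs losing its connection to corosync and then failing to reconnect.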
 
