Constantly Losing Quorum

Jplanter

New Member
Jun 19, 2012
Good Morning

So I have been having this problem consistently with various nodes within my cluster for about a month now.

I have 4 nodes in my cluster (px1, px2, px3, px5).

Here are the specs for PX1, PX2, and PX3:
(2) 500GB SATA drives
(2) AMD Opteron 6168
64GB DDR3
(2) Intel 8257EB Gigabit Ethernet
(2) Intel 82576 Gigabit Ethernet

Here are the specs for PX5:
(2) 500GB SATA drives
(2) AMD Opteron 6234
64GB DDR3
(4) Intel 82576 Gigabit Ethernet


The problem is that 1 or 2 nodes will lose quorum at random. Corosync appears to be running normally when it suddenly hits a FAILED TO RECEIVE error and the node drops out of quorum.

In this example PX3 has lost quorum.

Here are the logs:

Code:
Aug 20 09:19:05 px3 corosync[3677]: [TOTEM ] Retransmit List: a1c a1d
Aug 20 09:19:05 px3 corosync[3677]: [TOTEM ] Retransmit List: a1e a1f
Aug 20 09:19:05 px3 corosync[3677]: [TOTEM ] Retransmit List: a1c a1d
Aug 20 09:19:05 px3 corosync[3677]: [TOTEM ] Retransmit List: a1e a1f
Aug 20 09:19:05 px3 corosync[3677]: [TOTEM ] FAILED TO RECEIVE
Aug 20 09:19:29 px3 pmxcfs[3520]: [quorum] crit: quorum_dispatch failed: 2
Aug 20 09:19:29 px3 dlm_controld[3750]: cluster is down, exiting
Aug 20 09:19:29 px3 dlm_controld[3750]: daemon cpg_dispatch error 2
Aug 20 09:19:29 px3 fenced[3731]: cluster is down, exiting
Aug 20 09:19:29 px3 fenced[3731]: daemon cpg_dispatch error 2
Aug 20 09:19:29 px3 pmxcfs[3520]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Aug 20 09:19:29 px3 pmxcfs[3520]: [confdb] crit: confdb_dispatch failed: 2
Aug 20 09:19:31 px3 kernel: dlm: closing connection to node 1
Aug 20 09:19:31 px3 kernel: dlm: closing connection to node 4
Aug 20 09:19:31 px3 kernel: dlm: closing connection to node 3
Aug 20 09:19:32 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:32 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:34 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:34 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:35 px3 pmxcfs[3520]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Aug 20 09:19:35 px3 pmxcfs[3520]: [dcdb] crit: cpg_dispatch failed: 2
Aug 20 09:19:36 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:36 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:37 px3 pmxcfs[3520]: [dcdb] crit: cpg_leave failed: 2
Aug 20 09:19:38 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:38 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:39 px3 pmxcfs[3520]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Aug 20 09:19:39 px3 pmxcfs[3520]: [dcdb] crit: cpg_dispatch failed: 2
Aug 20 09:19:40 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:40 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:42 px3 pmxcfs[3520]: [dcdb] crit: cpg_leave failed: 2
Aug 20 09:19:44 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:44 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:46 px3 pmxcfs[3520]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Aug 20 09:19:46 px3 pmxcfs[3520]: [quorum] crit: quorum_initialize failed: 6
Aug 20 09:19:46 px3 pmxcfs[3520]: [quorum] crit: can't initialize service
Aug 20 09:19:46 px3 pmxcfs[3520]: [confdb] crit: confdb_initialize failed: 6
Aug 20 09:19:46 px3 pmxcfs[3520]: [quorum] crit: can't initialize service
Aug 20 09:19:46 px3 pmxcfs[3520]: [dcdb] notice: start cluster connection
Aug 20 09:19:46 px3 pmxcfs[3520]: [dcdb] crit: cpg_initialize failed: 6
Aug 20 09:19:46 px3 pmxcfs[3520]: [quorum] crit: can't initialize service
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 2
Aug 20 09:19:48 px3 pmxcfs[3520]: [dcdb] notice: start cluster connection
Aug 20 09:19:48 px3 pmxcfs[3520]: [dcdb] crit: cpg_initialize failed: 6
Aug 20 09:19:48 px3 pmxcfs[3520]: [quorum] crit: can't initialize service
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9
Aug 20 09:19:48 px3 pmxcfs[3520]: [status] crit: cpg_send_message failed: 9

Also here are logs from PX1 while PX3 has lost quorum:

Code:
Aug 20 09:19:29 px1 corosync[3824]:   [TOTEM ] Process pause detected for 12061 ms, flushing membership messages.
Aug 20 09:19:29 px1 corosync[3824]:   [TOTEM ] Process pause detected for 12103 ms, flushing membership messages.
Aug 20 09:19:29 px1 corosync[3824]:   [TOTEM ] Process pause detected for 12131 ms, flushing membership messages.
Aug 20 09:19:29 px1 corosync[3824]:   [TOTEM ] Process pause detected for 12201 ms, flushing membership messages.
Aug 20 09:19:29 px1 corosync[3824]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug 20 09:19:29 px1 corosync[3824]:   [CLM   ] New Configuration:
Aug 20 09:19:29 px1 corosync[3824]:   [CLM   ] #011r(0) ip(10.10.12.230) 
Aug 20 09:19:29 px1 corosync[3824]:   [CLM   ] #011r(0) ip(10.10.12.233) 
Aug 20 09:19:29 px1 corosync[3824]:   [CLM   ] Members Left:
Aug 20 09:19:29 px1 corosync[3824]:   [CLM   ] Members Joined:
Aug 20 09:19:29 px1 corosync[3824]:   [CLM   ] CLM CONFIGURATION CHANGE
Aug 20 09:19:29 px1 corosync[3824]:   [CLM   ] New Configuration:
Aug 20 09:19:29 px1 corosync[3824]:   [CLM   ] #011r(0) ip(10.10.12.230) 
Aug 20 09:19:29 px1 corosync[3824]:   [CLM   ] #011r(0) ip(10.10.12.233) 
Aug 20 09:19:29 px1 corosync[3824]:   [CLM   ] Members Left:
Aug 20 09:19:29 px1 corosync[3824]:   [CLM   ] Members Joined:
Aug 20 09:19:29 px1 corosync[3824]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 20 09:19:29 px1 corosync[3824]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.12.230) ; members(old:2 left:0)
Aug 20 09:19:29 px1 corosync[3824]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 20 09:19:46 px1 pvedaemon[134410]: <root@pam> successful auth for user 'root@pam'
Aug 20 09:21:11 px1 pvedaemon[134803]: starting vnc proxy UPID:px1:00020E93:00C7F417:503263F7:vncshell::root@pam:
Aug 20 09:21:11 px1 pvedaemon[134803]: launch command: /usr/bin/vncterm -rfbport 5901 -timeout 10 -authpath /nodes/px3 -perm Sys.Console -c /usr/bin/ssh -c blowfish-cbc -t 10.10.12.232 /bin/bash -l
Aug 20 09:21:11 px1 pvedaemon[134410]: <root@pam> starting task UPID:px1:00020E93:00C7F417:503263F7:vncshell::root@pam:
Aug 20 09:21:11 px1 pvedaemon[134687]: <root@pam> successful auth for user 'root@pam'
Aug 20 09:21:12 px1 pvedaemon[134687]: <root@pam> successful auth for user 'root@pam'
Aug 20 09:21:52 px1 pvedaemon[134410]: <root@pam> successful auth for user 'root@pam'


We are a school district and upgraded from Proxmox 1.9 to 2.1 over the summer. We were having similar (though not identical) problems on 1.9, but 1.9 was much more stable. We had hoped 2.1 would resolve many of the issues we were having; sometimes it runs smoothly, but most of the time this is the kind of activity we see.

Please let me know if there is any more information I should post from the nodes, and I would be glad to grab it.

Thank you in advance for any help, as this is an urgent situation; we continue to have random downtime during school hours due to these issues.

Best Regards,

Jared Planter
I.T. Director
Escondido Charter High School
 
This usually indicates problems with multicast traffic. Do you run some kind of firewall between the nodes (or iptables)? Is there high load on the network when that error occurs?
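
A rough way to check both of those things (this is only a sketch; it assumes the omping package is available on the nodes, and the IP addresses are simply the ones visible in the logs above - add the fourth node's address as well):

Code:
# make sure nothing is filtering traffic between the nodes
iptables -L -n -v

# run this on all nodes at the same time to verify multicast delivery
# between them (omping reports both unicast and multicast loss)
omping 10.10.12.230 10.10.12.232 10.10.12.233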
 
Thank you for the response. I am not running any firewall or iptables rules between the nodes, and I have had the errors occur during both high and low traffic on the network. Would it help to isolate the node traffic on a VLAN separate from the VM traffic?
 
Here is the pveversion output:

Code:
pve-manager: 2.1-14 (pve-manager/2.1/f32f3f46)
running kernel: 2.6.32-14-pve
proxmox-ve-2.6.32: 2.1-73
pve-kernel-2.6.32-11-pve: 2.6.32-66
pve-kernel-2.6.32-13-pve: 2.6.32-72
pve-kernel-2.6.32-12-pve: 2.6.32-68
pve-kernel-2.6.32-14-pve: 2.6.32-73
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.92-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.8-1
pve-cluster: 1.0-27
qemu-server: 2.0-49
pve-firmware: 1.0-18
libpve-common-perl: 1.0-30
libpve-access-control: 1.0-24
libpve-storage-perl: 2.0-30
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.1-8
ksm-control-daemon: 1.1-1
 
Code:
Aug 20 09:19:29 px1 corosync[3824]:   [TOTEM ] Process pause detected for 12131 ms, flushing membership messages.

The above message would indicate that the kernel does not schedule the corosync task correctly (a 12-second delay!). Maybe there is too much load on those nodes? How many VMs do you run? What kind of IO subsystem do you use - HW RAID with BBU?
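
A quick way to see whether general load or scheduling pressure is involved is to watch the load averages and run queue while the cluster is active, for example (just a rough check):

Code:
# 1-, 5- and 15-minute load averages
uptime

# run queue length (r column), CPU usage and I/O wait, sampled every second
vmstat 1 10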

What is the output of

# pveperf

Please try to run that when there is no load on the server.
 
This may be unrelated, but with 2.1 I had to change the multicast method on our switches from dense mode multicasting (which floods all ports) to sparse mode multicasting (which builds trees and then unicasts). This may or may not be the cause of the issues on your network, but it took me forever to figure out, so I thought I would throw that in there.
 
Dietmar,

I am running a StoneFly SAN with 12TB of disk space that stores all of the VMs; occasionally Proxmox will give me I/O errors pertaining to the LVM target.

Here is the number of running VMs on each node:
PX1: 7
PX2: 2
PX3: 8
PX5: 9

PX2 has the fewest because it loses quorum the most. It is 11:45 AM PST right now, and so far today PX2 has dropped out of quorum 3 times; only rebooting PX2 resolves it. PX1 has lost quorum once, but running "/etc/init.d/pve-cluster stop" and then "pmxcfs" resolves that issue; this fix does not work for PX2.
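
For clarity, this is the exact sequence I run on PX1 when it drops out (the pvecm status call at the end is just how I confirm the node sees quorum again):

Code:
# stop the pve cluster filesystem service, then restart the daemon by hand
/etc/init.d/pve-cluster stop
pmxcfs

# confirm the node has rejoined and sees quorum
pvecm status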

Here is the pveperf output for each node from 7 AM this morning, when there was very little load on the nodes.

PX1
Code:
CPU BOGOMIPS:      91197.24
REGEX/SECOND:      756413
HD SIZE:           94.49 GB (/dev/mapper/pve-root)
BUFFERED READS:    126.60 MB/sec
AVERAGE SEEK TIME: 12.61 ms
FSYNCS/SECOND:     831.38
DNS EXT:           96.93 ms
DNS INT:           0.90 ms (ecsd.info)

PX2
Code:
CPU BOGOMIPS:      91203.00
REGEX/SECOND:      765228
HD SIZE:           94.49 GB (/dev/mapper/pve-root)
BUFFERED READS:    127.54 MB/sec
AVERAGE SEEK TIME: 11.09 ms
FSYNCS/SECOND:     747.71
DNS EXT:           119.74 ms
DNS INT:           1.09 ms (ecsd.info)

PX3
Code:
CPU BOGOMIPS:      91193.28
REGEX/SECOND:      764121
HD SIZE:           94.49 GB (/dev/mapper/pve-root)
BUFFERED READS:    125.98 MB/sec
AVERAGE SEEK TIME: 8.91 ms
FSYNCS/SECOND:     824.12
DNS EXT:           106.00 ms
DNS INT:           1.14 ms (ecsd.info)

PX5
Code:
CPU BOGOMIPS:      115188.48
REGEX/SECOND:      727431
HD SIZE:           94.49 GB (/dev/mapper/pve-root)
BUFFERED READS:    127.50 MB/sec
AVERAGE SEEK TIME: 10.76 ms
FSYNCS/SECOND:     1457.67
DNS EXT:           89.73 ms
DNS INT:           0.97 ms (ecsd.info)

I am thinking about separating the primary network traffic on the nodes onto its own VLAN this weekend; I will let you know the results if I do.

Thank you again for your help; it is greatly appreciated.

Best Regards,

Jared Planter
I.T. Director
Escondido Charter High School
 
This may be unrelated, but with 2.1 I had to change the multicast method on our switches from dense mode multicasting (which floods all ports) to sparse mode multicasting (which builds trees and then unicasts). This may or may not be the cause of the issues on your network, but it took me forever to figure out, so I thought I would throw that in there.

Thank you for that; I am going to look into it. Were you using Cisco switches? If so, could you give me some pointers on what you did to change the multicast mode? Thanks again!
 
Thank you for that; I am going to look into it. Were you using Cisco switches? If so, could you give me some pointers on what you did to change the multicast mode? Thanks again!

Our core switches are Dells, but on a Cisco stack it would be roughly the following (for IOS 12.4). The interface commands go on each port your nodes use, and each must be a routable layer 3 "no switchport" port:

Code:
configure terminal
 ip multicast-routing distributed
 ! repeat for each node-facing interface
 interface <node-facing interface>
  ip pim version 2
  ip pim sparse-mode
 end

Verify with show running-config, and after testing commit the changes to startup-config.


Reference: http://www.cisco.com/en/US/docs/swi...se/configuration/guide/swmcast.html#wp1024278
 
PX2 has the fewest because it loses quorum the most. It is 11:45 AM PST right now, and so far today PX2 has dropped out of quorum 3 times; only rebooting PX2 resolves it. PX1 has lost quorum once, but running "/etc/init.d/pve-cluster stop" and then "pmxcfs" resolves that issue; this fix does not work for PX2.

I am thinking about separating the primary network traffic on the nodes onto its own VLAN this weekend; I will let you know the results if I do.

Thank you again for your help; it is greatly appreciated.

Have you tested the quality of your cables? Bad cables can also result in the kind of problems you are seeing. Another question: do you use fibre cables? I have seen a network behave the way you describe either because of a bad connector or because of internal defects in the cable. Remember that a fibre cable can be broken inside, and this can be very hard to determine just by looking at the cable. If I were you, I would definitely consider a broken/bad cable or connector, in which case I would test the network with proper network testing tools. Another issue could be your switches. Maybe your switches simply cannot cope with the traffic on the network, or a connected trunk has failed. Do you have spanning tree configured on your switches?

You could try a simple network test with iperf.

What do you see for error stats and dropped packet stats on your nodes?
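
A minimal sketch of such a test, assuming iperf is installed on two of the nodes (the hostnames are only examples, and eth0 is assumed to be the interface carrying the cluster traffic):

Code:
# on one node, e.g. px1, start an iperf server
iperf -s

# on another node, e.g. px3, run a 30-second test against it
iperf -c px1 -t 30

# on each node, check the error and drop counters for the cluster interface
ip -s link show eth0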
 
pveperf is only showing local storage performance and not iSCSI... don't worry about that. I would highly recommend separating SAN traffic onto its own logical or physical network and having AT LEAST one dedicated 1GB NIC per node for nothing but SAN. All of my nodes have 3 NICs each, with about 10 - 14 Windows server VMs per node, and I saturate the links from time to time. I use a ZFS-based SAN with SSDs for read caches, and it helps tremendously for IO on the SAN side (it's not unusual to hit 80,000 IO/s from the ARC or L2ARC). My bottleneck is the network.

Multicast traffic is extremely latency sensitive and should not share a link with the SAN traffic if you can help it. Sparse mode multicasting helped my network. The default on most switches is dense mode, which floods all ports of a segment (it's much older and doesn't scale well).
 
Thank you all for your input; it has been incredibly useful in troubleshooting this issue. All of the cables have been tested thoroughly and they are good to go, and the SAN traffic already has a dedicated 1GB NIC on each node with its traffic separated onto a VLAN. After all of the conversation here and probing deeper into the issue, it dawned on me that all of my high-traffic VMs were bridged to eth0, the NIC used for traffic between the nodes. So I bridged those VMs to a separate NIC used for VM traffic and have not had a single problem in the last 30 hours or so. I am keeping my fingers crossed, but I believe the problem has been resolved.

Here is my NIC configuration for each node (a rough sketch of the bridge layout follows the list):

ETH0: Node Traffic
ETH1: VM Traffic (Trunk Port)
ETH2: VM Traffic (Trunk Port)
ETH3: SAN Traffic
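
For anyone finding this later, the change boils down to something like the following in /etc/network/interfaces (a simplified sketch only - the bridge names and addresses here are illustrative, not my exact config):

Code:
# vmbr0 on eth0: cluster/node traffic only
auto vmbr0
iface vmbr0 inet static
        address 10.10.12.232
        netmask 255.255.255.0
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0

# vmbr1 on eth1: dedicated bridge for VM traffic (trunk port); the
# high-traffic guests are now bridged here instead of onto eth0
auto vmbr1
iface vmbr1 inet manual
        bridge_ports eth1
        bridge_stp off
        bridge_fd 0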

This is something I should have caught earlier instead of continuing to use the configuration from my predecessor, who originally installed this cluster. However, better sooner than later, and this has been a great learning experience for me and my staff.

Thank you all for your insight; it has been invaluable and is greatly appreciated! :D

I will post an update later this week on the cluster status.
 
