problem with quorum and cman stopping

peetaur

I started this discussion here: http://forum.proxmox.com/threads/10629-New-Kernel-and-bug-fixes?p=60597#post60597

And moved it here since it's no longer on topic.



Problem symptoms:

Two nodes dropped out of the cluster for no apparent reason.
bcvm1 was then the only node left in the cluster.

Investigation summary:
According to various logs, they dropped out at separate times.

Maybe it happened after briefly losing the network connection.

Looking in Nagios, there was no network interruption at those times (19:11 first retransmit, 19:20 first node evicted, 19:33 second node evicted).

Assumptions/conclusions:
When a single node is disconnected, it loses quorum (whether or not the other nodes do).
When a single node loses quorum, cman stops on that node.
Because cman has stopped, the node never rejoins on its own.

Therefore:
If another node drops out, the remaining nodes can lose quorum simply because there are too few votes.

In my case, this means quorum is lost on the still-connected node (1 node vote + 1 qdisk vote = 2 out of 4 expected = inquorate).
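
For reference, this is how I read the vote arithmetic (assuming 1 vote per node and, at the time, 1 vote on the qdisk), and a quick way to check it:

Code:
# Quick check of the membership/vote state on the surviving node:
clustat
cman_tool status | grep -Ei 'votes|quorum'   # field names may vary by version

# With 3 nodes at 1 vote each plus a 1-vote qdisk:
#   Expected votes: 4  ->  Quorum: 4/2 + 1 = 3
# After two nodes drop out, the survivor has 1 (node) + 1 (qdisk) = 2 votes,
# which is below 3, so it goes inquorate.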

SOLUTION (bad solution):
Rebooting bcvm3 restored quorum between the qdisk, bcvm1, and bcvm3.
Then bcvm2 could rejoin simply by restarting cman.

The result of this was that everything looked fine on the command line, but the GUI still showed VMs offline and the hosts red...
So I logged into the GUI on a different node: same problem.
So I restarted pve-cluster on the red node, clicked through all the nodes in the GUI, and everything was green again.
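
In command form, the recovery boiled down to roughly this (bcvm3 additionally needed a full reboot first):

Code:
# on bcvm2, once the rest of the cluster was quorate again:
/etc/init.d/cman start

# on the node whose GUI still showed everything red/offline:
/etc/init.d/pve-cluster restart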

Prevention for next time:
Unknown... so far I have changed the number of votes on the qdisk to 5, so hopefully the last node alive will keep quorum and I won't have to reboot anything, only restart cman.
I was also thinking about writing a watchdog script that restarts cman and pve-cluster (pve-cluster seems to need a restart for the web GUI to work again, even though it is not related to quorum); a sketch is below.
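
The watchdog would just be a cron job, something like this untested sketch (it keys off clustat's "Member Status" line and the stock init scripts):

Code:
#!/bin/bash
# Untested sketch of a quorum watchdog, meant to run from cron every few minutes:
# if clustat is unreachable or reports Inquorate, restart cman and then
# pve-cluster so the web GUI comes back as well.

if ! clustat 2>/dev/null | grep -q 'Member Status: Quorate'; then
    logger -t quorum-watchdog "cluster inquorate or cman down; restarting cman and pve-cluster"
    /etc/init.d/cman restart
    /etc/init.d/pve-cluster restart
fi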

Here are large parts of some logs, taken after the problem had already occurred:

Code:
=================================
bcvm1
=================================

root@bcvm1:/etc/pve# clustat
Cluster Status for bcproxmox1 @ Thu Aug 30 10:19:07 2012
Member Status: Inquorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 bcvm2                                                               1 Offline
 bcvm3                                                               2 Offline
 bcvm1                                                               3 Online, Local
 /dev/loop1                                                          0 Online, Quorum Disk

root@bcvm1:/etc/pve# cat /var/log/cluster/qdiskd.log 
[...]
Aug 23 18:05:44 qdiskd Quorum Daemon Initializing
Aug 23 18:05:48 qdiskd Heuristic: 'ip addr | grep vmbr0 | grep -q UP' UP
Aug 23 18:05:59 qdiskd Node 1 is the master
Aug 23 18:06:11 qdiskd Initial score 2/3
Aug 23 18:06:11 qdiskd Initialization complete
Aug 23 18:06:11 qdiskd Score sufficient for master operation (2/3; required=2); upgrading
Aug 24 13:08:24 qdiskd qdiskd: write (system call) has hung for 15 seconds
Aug 24 13:08:24 qdiskd In 15 more seconds, we will be evicted
Aug 24 13:09:13 qdiskd qdisk cycle took more than 3 seconds to complete (63.980000)
Aug 24 13:09:13 qdiskd qdiskd on node 1 reports hung write()
Aug 24 13:09:13 qdiskd qdiskd on node 1 reports hung write()
Aug 24 13:09:13 qdiskd qdiskd on node 1 reports hung write()
Aug 24 13:40:13 qdiskd qdisk cycle took more than 3 seconds to complete (5.300000)
Aug 29 19:20:08 qdiskd Node 2 evicted
Aug 29 19:33:37 qdiskd Assuming master role
Aug 29 19:33:40 qdiskd Writing eviction notice for node 1
Aug 29 19:33:43 qdiskd Node 1 evicted
[...]


# gunzip -c /var/log/cluster/corosync.log.1.gz  | less
[...]
(first retransmit appears 19:11:12)
Aug 29 19:11:12 corosync [TOTEM ] Retransmit List: 17a7ad 17a7ae 17a7af 17a7b0 17a7b1 17a7b2 17a7b3 17a7b4 17a7b5 17a7b6 17a7b7 17a7b8 17a7b9 17a7ba 17a7bb 17a7bc 17a7bd 17a7be 17a7bf 17a7c0 
Aug 29 19:11:12 corosync [TOTEM ] Retransmit List: 17a7ad 17a7ae 17a7af 17a7b0 17a7b1 17a7b2 17a7b3 17a7b4 17a7b5 17a7b6 17a7b7 17a7b8 17a7b9 17a7ba 17a7bb 17a7bc 17a7bd 17a7be 17a7bf 17a7c0 
[...]
Aug 29 19:20:26 corosync [TOTEM ] A processor failed, forming new configuration.
Aug 29 19:20:28 corosync [CLM   ] CLM CONFIGURATION CHANGE
Aug 29 19:20:28 corosync [CLM   ] New Configuration:
Aug 29 19:20:28 corosync [CLM   ]       r(0) ip(10.3.0.19) 
Aug 29 19:20:28 corosync [CLM   ]       r(0) ip(10.3.0.20) 
Aug 29 19:20:28 corosync [CLM   ] Members Left:
Aug 29 19:20:28 corosync [CLM   ]       r(0) ip(10.3.0.58) 
Aug 29 19:20:28 corosync [CLM   ] Members Joined:
Aug 29 19:20:28 corosync [QUORUM] Members[2]: 1 3
Aug 29 19:20:28 corosync [CLM   ] CLM CONFIGURATION CHANGE
Aug 29 19:20:28 corosync [CLM   ] New Configuration:
Aug 29 19:20:28 corosync [CLM   ]       r(0) ip(10.3.0.19) 
Aug 29 19:20:28 corosync [CLM   ]       r(0) ip(10.3.0.20) 
Aug 29 19:20:28 corosync [CLM   ] Members Left:
Aug 29 19:20:28 corosync [CLM   ] Members Joined:
Aug 29 19:20:28 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 29 19:20:28 corosync [QUORUM] Members[2]: 1 3
Aug 29 19:20:28 corosync [CPG   ] chosen downlist: sender r(0) ip(10.3.0.20) ; members(old:3 left:1)
Aug 29 19:20:28 corosync [MAIN  ] Completed service synchronization, ready to provide service.
[...]
Aug 29 19:33:46 corosync [TOTEM ] A processor failed, forming new configuration.
Aug 29 19:34:42 corosync [CLM   ] CLM CONFIGURATION CHANGE
Aug 29 19:34:42 corosync [CLM   ] New Configuration:
Aug 29 19:34:42 corosync [CLM   ]       r(0) ip(10.3.0.19) 
Aug 29 19:34:42 corosync [CLM   ] Members Left:
Aug 29 19:34:42 corosync [CLM   ]       r(0) ip(10.3.0.20) 
Aug 29 19:34:42 corosync [CLM   ] Members Joined:
Aug 29 19:34:42 corosync [CMAN  ] quorum lost, blocking activity
Aug 29 19:34:42 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug 29 19:34:42 corosync [QUORUM] Members[1]: 3
Aug 29 19:34:42 corosync [CLM   ] CLM CONFIGURATION CHANGE
Aug 29 19:34:42 corosync [CLM   ] New Configuration:
Aug 29 19:34:42 corosync [CLM   ]       r(0) ip(10.3.0.19) 
Aug 29 19:34:42 corosync [CLM   ] Members Left:
Aug 29 19:34:42 corosync [CLM   ] Members Joined:
Aug 29 19:34:42 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 29 19:34:42 corosync [CPG   ] chosen downlist: sender r(0) ip(10.3.0.19) ; members(old:2 left:1)
Aug 29 19:34:42 corosync [MAIN  ] Completed service synchronization, ready to provide service.
[...]

=================================
bcvm2
=================================

root@bcvm2:~# clustat
Could not connect to CMAN: Connection refused
root@bcvm2:~# /etc/init.d/cman status
Found stale pid file
root@bcvm2:~# /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Starting qdiskd... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]

root@bcvm2:~# less /var/log/cluster/qdiskd.log

root@bcvm2:~# clustat
Cluster Status for bcproxmox1 @ Thu Aug 30 10:27:40 2012
Member Status: Inquorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 bcvm2                                                               1 Online, Local
 bcvm3                                                               2 Offline
 bcvm1                                                               3 Offline
 /dev/loop1                                                          0 Offline, Quorum Disk

root@bcvm2:~# cat /var/log/cluster/qdiskd.log
[...]
Aug 16 13:34:31 qdiskd Unable to match label 'proxmox1_qdisk' to any device
Aug 16 13:39:09 qdiskd Unable to match label 'proxmox1_qdisk' to any device
Aug 16 13:40:11 qdiskd Unable to match label '/mnt/pve/bcnas1san/proxmox1_qdisk' to any device
Aug 16 13:40:54 qdiskd Unable to match label '/mnt/pve/bcnas1san/proxmox1_qdisk' to any device
Aug 16 13:41:45 qdiskd Warning: /mnt/pve/bcnas1san/proxmox1_qdisk is not a block device
Aug 16 13:41:45 qdiskd qdisk_open: ioctl(BLKSSZGET)
Aug 16 13:41:45 qdiskd Specified partition /mnt/pve/bcnas1san/proxmox1_qdisk does not have a qdisk label
Aug 16 13:45:14 qdiskd Specified partition /dev/loop1 does not have a qdisk label
Aug 16 13:46:32 qdiskd Quorum Daemon Initializing
Aug 16 13:46:36 qdiskd Heuristic: 'ip addr | grep vmbr0 | grep -q UP' UP
Aug 16 13:46:59 qdiskd Initial score 2/3
Aug 16 13:46:59 qdiskd Initialization complete
Aug 16 13:46:59 qdiskd Score sufficient for master operation (2/3; required=2); upgrading
Aug 16 13:47:17 qdiskd Assuming master role
Aug 16 14:02:35 qdiskd Node 2 shutdown
Aug 20 11:04:19 qdiskd Node 2 shutdown
Aug 23 13:49:28 qdiskd Node 2 shutdown
Aug 23 15:07:50 qdiskd Node 2 shutdown
Aug 24 13:08:25 qdiskd qdiskd: write (system call) has hung for 15 seconds
Aug 24 13:08:25 qdiskd In 15 more seconds, we will be evicted
Aug 24 13:09:01 qdiskd qdisk cycle took more than 3 seconds to complete (51.240000)
Aug 24 13:09:01 qdiskd qdiskd on node 3 reports hung write()
Aug 24 13:09:01 qdiskd qdiskd on node 3 reports hung write()
Aug 24 13:09:01 qdiskd qdiskd on node 3 reports hung write()
Aug 24 13:09:09 qdiskd qdiskd on node 3 reports hung write()
Aug 24 13:40:13 qdiskd qdisk cycle took more than 3 seconds to complete (6.410000)
Aug 29 19:20:05 qdiskd Writing eviction notice for node 2
Aug 29 19:20:08 qdiskd Node 2 evicted
Aug 29 19:32:54 qdiskd cman_dispatch: Host is down
Aug 29 19:32:54 qdiskd Halting qdisk operations
Aug 30 10:17:58 qdiskd Quorum Daemon Initializing
Aug 30 10:18:02 qdiskd Heuristic: 'ip addr | grep vmbr0 | grep -q UP' UP
Aug 30 10:18:10 qdiskd Node 3 is the master
Aug 30 10:18:25 qdiskd Initial score 2/3
Aug 30 10:18:25 qdiskd Initialization complete
Aug 30 10:18:25 qdiskd Score sufficient for master operation (2/3; required=2); upgrading
[...]

# gunzip -c /var/log/cluster/corosync.log.1.gz  | less
[...]
Aug 29 19:20:26 corosync [TOTEM ] A processor failed, forming new configuration.
Aug 29 19:20:28 corosync [CLM   ] CLM CONFIGURATION CHANGE
Aug 29 19:20:28 corosync [CLM   ] New Configuration:
Aug 29 19:20:28 corosync [CLM   ]       r(0) ip(10.3.0.19) 
Aug 29 19:20:28 corosync [CLM   ]       r(0) ip(10.3.0.20) 
Aug 29 19:20:28 corosync [CLM   ] Members Left:
Aug 29 19:20:28 corosync [CLM   ]       r(0) ip(10.3.0.58) 
Aug 29 19:20:28 corosync [CLM   ] Members Joined:
Aug 29 19:20:28 corosync [QUORUM] Members[2]: 1 3
Aug 29 19:20:28 corosync [CLM   ] CLM CONFIGURATION CHANGE
Aug 29 19:20:28 corosync [CLM   ] New Configuration:
Aug 29 19:20:28 corosync [CLM   ]       r(0) ip(10.3.0.19) 
Aug 29 19:20:28 corosync [CLM   ]       r(0) ip(10.3.0.20) 
Aug 29 19:20:28 corosync [CLM   ] Members Left:
Aug 29 19:20:28 corosync [CLM   ] Members Joined:
Aug 29 19:20:28 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 29 19:20:28 corosync [CPG   ] chosen downlist: sender r(0) ip(10.3.0.20) ; members(old:3 left:1)
Aug 29 19:20:28 corosync [MAIN  ] Completed service synchronization, ready to provide service.
[...]

=================================
bcvm3
=================================

root@bcvm3:~# clustat
Could not connect to CMAN: Connection refused

root@bcvm3:~# cat /var/log/cluster/qdiskd.log
Aug 27 09:37:21 qdiskd qdisk cycle took more than 3 seconds to complete (8.450000)
Aug 29 19:19:34 qdiskd cman_dispatch: Host is down
Aug 29 19:19:34 qdiskd Halting qdisk operations

# gunzip -c /var/log/cluster/corosync.log.1.gz
Aug 29 19:19:32 corosync [TOTEM ] FAILED TO RECEIVE
 
I pulled the network on one machine, waited until it lost quorum, checked cman, and then plugged it back in.

That node lost quorum as expected. I didn't try unplugging a second node, because I didn't want to have to reboot them to fix it.

It didn't lose the qdisk, since that is on a separate network. This is a bit scary... maybe setting the qdisk votes high could then cause a serious problem, with two quorate partitions at once. I set the qdisk votes to 5 on Friday, and I'm not sure whether it is really set, or whether anything prevents this problem.
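
To check whether the votes=5 change actually took effect, something like this should work (assuming 1 vote per node and that the qdisk is defined by a quorumd entry in cluster.conf):

Code:
# Is the new vote count in the config the cluster actually loaded?
grep -i quorumd /etc/cluster/cluster.conf

# And is cman using it? (field names from memory; they may vary by version)
cman_tool status | grep -Ei 'votes|quorum'

# Assuming 1 vote per node plus a 5-vote qdisk:
#   Expected votes: 3 + 5 = 8  ->  Quorum: 8/2 + 1 = 5
# so a lone node that still counts the qdisk has 1 + 5 = 6 votes and stays
# quorate, while nodes that cannot count the qdisk never reach 5 on their own.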

cman did not stop. This was a surprise. It makes me think the network had nothing to do with cman stopping last week, and that it simply crashed. However, it's not perfectly consistent with last week, because the "FAILED TO RECEIVE" error didn't happen this time. And here's a note from dietmar about that:

Code:
Aug 29 19:19:32 corosync [TOTEM ] FAILED TO RECEIVE

This error causes cman/corosync to exit. Do you use iptables (see http://forum.proxmox.com/threads/8665-cman-keeps-crashing)?

And after plugging it back in, I didn't need to restart anything; the node rejoined, and clustat reported that it was quorate.
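
Regarding dietmar's iptables question: this is roughly how I would check for rules that could break corosync multicast (assuming corosync's default UDP ports 5404/5405; the ACCEPT rules are only an illustration, not something from my config):

Code:
# Look for anything dropping UDP/multicast traffic on the cluster interface:
iptables -L INPUT -v -n

# If the firewall were the culprit, rules along these lines would let the
# cluster traffic through (illustration only):
iptables -I INPUT -m addrtype --dst-type MULTICAST -j ACCEPT
iptables -I INPUT -p udp --dport 5404:5405 -j ACCEPT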
 
This happened again over the weekend. This time I had ssh clients connected and idle, and 2 of them were disconnected too (using the ssh settings TCPKeepAlive=yes and ServerAliveInterval=30). This doesn't happen with any other servers, so I'm still not convinced that the network hardware or configuration is causing this.

Also, perhaps due to my votes=5 on the qdisk, the last node said it still had quorum, unlike last time. Unfortunately, this did not mean the other nodes could rejoin the cluster just by restarting services in /etc/init.d/ (none that I tried, anyway). I had to reboot one node and restart cman on the last one, just like last time.

It doesn't always happen on the same 2 nodes. This time #1 and #3 were dropped, but last time #2 and #3 were dropped (#1 and #2 are identical hardware, and #3 is a bit newer). It only ever seems to happen on 2 of the 3 servers, never all 3. Note that the ssh connection to the one remaining node stayed up, so it's not simply that 'the last one' also fails unnoticed; this node did not drop its ssh connections either.

This leads me to an untestable hypothesis:
Disconnection of all the other nodes prevents disconnection of the last one. This implies that the network issue has nothing to do with the physical network (wiring, switches, etc.) and everything to do with these machines (the kernel, or OpenVZ, for example), and that whatever makes them fail is no longer running, or no longer able to break the last node's network connections (including ssh connections that have nothing to do with the cluster software), once all the other nodes are lost.
 
