A short story on a cluster failing

liori · Jul 10, 2019

Hi, I recently had the opportunity to debug a friend's cluster of six physical machines (four needed for quorum) and about 50 virtual machines, running the latest Proxmox release. I am not a sysadmin, I am a software developer, and this wasn't technically even my cluster, but I was asked to help. I wasn't smart enough to take notes, but here's what I remember; maybe this will be useful to someone.

It started when we noticed that backups had stopped being made. After logging in to some machines, I saw that the `vzdump` processes had been hanging in kernel calls for several days by then. Other symptoms: `ls /etc/pve/nodes/«somenodename»/` never returned (the process got stuck in a kernel call), while almost everything else could still be read. Most modifications to files on that file system also seemed to hang.
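For anyone who wants to check for the same symptom, here is a minimal sketch that lists processes in uninterruptible sleep. It assumes the hung processes show up with state `D` in `/proc/<pid>/stat`, which is typical for processes blocked in a kernel call but is not something I verified on that cluster. The function names are made up for illustration.

```python
from pathlib import Path


def parse_stat_state(stat_line: str) -> tuple[int, str, str]:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx].strip())
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, comm) for every process currently in state 'D'."""
    found = []
    for stat_path in Path(proc_root).glob("[0-9]*/stat"):
        try:
            pid, comm, state = parse_stat_state(stat_path.read_text())
        except (OSError, ValueError):
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

A process that stays in this list for minutes (let alone days) is a good candidate for a closer look.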

I started debugging `pmxcfs` with `gdb` on one of the machines, and it turned out that some fuse threads were hanging at `dfsm.c:339` (`g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex);`). These were the threads handling writes to the file system. It also turned out that `dfsm->state` was set to `DFSM_MODE_START_SYNC`, presumably for several days by then. One of the message queues (I don't remember which one) had 8 enqueued messages.

Then I noticed that two of the six machines (neither of them the one I debugged above) were stuck resending messages, logging `[dcdb] notice: cpg_join retry «some number»` every second.
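A quick way to spot nodes in this state is to scan their logs for that message. This is only a sketch: the regular expression is built from the one log line quoted above, and the assumption that the number at the end is a retry counter that keeps growing is mine. How you obtain the log lines (journal, syslog file) is left open.

```python
import re

# Pattern derived from the message quoted in the post; the service tag
# (e.g. "dcdb") is captured so different services are reported separately.
CPG_RETRY = re.compile(
    r"\[(?P<service>\w+)\] notice: cpg_join retry (?P<count>\d+)"
)


def highest_cpg_join_retry(log_lines):
    """Return {service: highest retry number seen} for the given log lines."""
    highest = {}
    for line in log_lines:
        match = CPG_RETRY.search(line)
        if match:
            service = match.group("service")
            count = int(match.group("count"))
            highest[service] = max(highest.get(service, 0), count)
    return highest


if __name__ == "__main__":
    import sys

    # Usage: pipe log output into the script and read the summary.
    for service, count in highest_cpg_join_retry(sys.stdin).items():
        print(f"{service}: cpg_join retried at least up to {count}")
```

A node whose number is large and still climbing is probably one of the nodes that needs its `corosync` and `pve-cluster` services looked at.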

I restarted the `corosync` and `pve-cluster` services on these two machines, and backups immediately started working again.

Conclusions and suggestions: there seems to be some kind of concurrency or communication bug in the initial sync between machines, where a failing minority left the cluster without a working quorum. Two things would help when debugging this kind of problem: (1) a notification along the lines of "hey, I've been stuck in `DFSM_MODE_START_SYNC` for far too long because I'm still expecting messages from nodes X and Y", and (2) logic that lets the quorum keep working when a minority of nodes has sent no messages for some time.
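To make suggestion (1) concrete, here is a rough Python analogue of a condition wait that complains when it has been waiting too long. This is not `pmxcfs` code (that is C using GLib); the function, its parameters, and the node names are all hypothetical, and it only illustrates replacing an unbounded wait with a timed one that reports which nodes are still missing.

```python
import threading
import time


def wait_for_sync(cond, is_synced, pending_nodes, warn_after=30.0, warn=print):
    """Block until is_synced() is true, warning every warn_after seconds.

    cond          -- threading.Condition notified when the sync state changes
    is_synced     -- callable returning True once syncing has finished
    pending_nodes -- callable returning the nodes we are still waiting for
    warn          -- callable taking the warning text (hypothetical hook)
    """
    started = time.monotonic()
    with cond:
        while not is_synced():
            # Condition.wait returns False when the timeout expired.
            notified = cond.wait(timeout=warn_after)
            if not notified and not is_synced():
                waited = time.monotonic() - started
                warn(
                    f"still syncing after {waited:.1f}s, "
                    f"waiting for: {', '.join(sorted(pending_nodes()))}"
                )
```

The wait itself still blocks for as long as it takes; the only change is that someone reading the logs would learn that the node is stuck and which peers it is waiting for.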

I hope this write-up will be helpful.

jdancer · Jul 10, 2019

I thought quorum requires an odd number of nodes to avoid split brain situations?

liori · Jul 10, 2019

That would be an interesting thing to know! I don't remember reading about it in the Administration Guide though.

Chris · Jul 10, 2019

Quorum requires a majority of nodes. For reference: https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_quorum

jim.bond.9862 · Jul 10, 2019

liori said:
That would be an interesting thing to know! I don't remember reading about it in the Administration Guide though.

I thought it only applies when the cluster has fewer than 4 nodes. For example, you cannot have a stable quorum with 2 nodes; you need 3.

Chris · Jul 10, 2019

You want n = 2f+1, where n is the minimum number of nodes you need to tolerate the failure of f of those nodes without losing quorum (assuming each node has the same number of votes). So to tolerate the failure of f=1 node, you need a minimum of n=3; to tolerate f=2 failed nodes, you need n=5; etc.

LnxBil · Jul 11, 2019

jdancer said:
I thought quorum requires an odd number of nodes to avoid split brain situations?

For a minimal solution, yes, but it also works for an even number of nodes:

5 nodes need 3 for quorum
6 nodes need 4 for quorum
7 nodes need 4 for quorum
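
The arithmetic behind these numbers can be sketched in a few lines of Python, assuming one vote per node and a strict-majority rule as described in the posts above. The function names are made up for illustration.

```python
def votes_for_quorum(nodes: int) -> int:
    """Smallest number of votes that is a strict majority of `nodes`."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while the rest still hold a majority."""
    return nodes - votes_for_quorum(nodes)


if __name__ == "__main__":
    for n in range(2, 8):
        print(
            f"{n} nodes: need {votes_for_quorum(n)} for quorum, "
            f"tolerate {tolerated_failures(n)} failed"
        )
```

This also shows why odd sizes are the minimal choice: going from 5 to 6 nodes raises the quorum from 3 to 4 votes, but the cluster still tolerates only 2 failed nodes.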
