A short story on a cluster failing

liori

New Member
Jul 4, 2019
Hi, I recently had the opportunity to debug a friend's cluster of six physical machines (four needed for quorum) running about 50 virtual machines on the latest Proxmox release. I am a software developer, not a sysadmin, and this wasn't technically even my cluster, but I was asked to help. I wasn't smart enough to take notes, but here's what I remember; maybe this will be useful to someone.

The case started when we noticed that backups had stopped being made. After logging in to some of the machines, I found `vzdump` processes that had been hanging in kernel calls for several days. Another symptom: `ls /etc/pve/nodes/«somenodename»/` would hang (the process got stuck in a kernel call), even though almost everything else on that file system could still be read. Most modifications to files on that file system also seemed to hang.
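A minimal sketch for spotting this kind of symptom, assuming the hung processes show up in uninterruptible sleep (state `D`); the post does not record which state they were in. It reads the standard Linux `/proc/<pid>/stat` files, and the helper names `parse_stat` and `uninterruptible` are made up for illustration.

```python
import os

def parse_stat(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    start = line.index("(")
    end = line.rindex(")")
    pid = int(line[:start].strip())
    comm = line[start + 1:end]
    state = line[end + 1:].split()[0]
    return pid, comm, state

def uninterruptible(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D."""
    found = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable while scanning
        if state == "D":
            found.append((pid, comm))
    return found

if __name__ == "__main__":
    for pid, comm in uninterruptible():
        print(pid, comm)
```

A process that stays in this list for minutes rather than moments is a candidate for the kind of hang described above.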

I attached `gdb` to `pmxcfs` on one of the machines and found that some FUSE threads were hanging at dfsm.c:339 (`g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex);`). These were the threads handling writes to the file system. It also turned out that `dfsm->state` was set to `DFSM_MODE_START_SYNC`, presumably for several days by then. One of the message queues (I don't remember which one) had 8 enqueued messages.

Then I noticed that two of the six machines (not the one I had debugged above) were stuck in a retry loop, logging `[dcdb] notice: cpg_join retry «some number»` every second.

I restarted the `corosync` and `pve-cluster` services on these two machines, and backups immediately started working again.

Conclusions and suggestions: there seems to be some kind of concurrency or communication bug in the initial sync between machines, where a failing minority of nodes left the cluster without a working quorum. Two things would have helped in debugging this: (1) a notification along the lines of "hey, I've been stuck in DFSM_MODE_START_SYNC for far too long because I'm still expecting messages from nodes X and Y", and (2) logic that lets the quorum keep working when a minority of nodes has sent no messages for some time.
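Until something like suggestion (1) exists, a stopgap is to watch the logs from outside. The sketch below counts `cpg_join retry` messages per process group, using only the message format quoted above; the function names and the threshold of 60 messages are hypothetical, and the syslog prefix in front of the message is deliberately not assumed.

```python
import re
from collections import Counter

# Pattern taken from the message quoted in the post; it is searched for
# anywhere in the line so that any timestamp/host prefix is ignored.
RETRY_RE = re.compile(r"\[(\w+)\] notice: cpg_join retry (\d+)")

def count_join_retries(lines):
    """Count cpg_join retry messages per process group, e.g. 'dcdb'."""
    counts = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def stuck_groups(lines, threshold=60):
    """Return groups with at least `threshold` retry messages (hypothetical cutoff)."""
    counts = count_join_retries(lines)
    return sorted(group for group, n in counts.items() if n >= threshold)
```

The input could be, for example, recent output of `journalctl -u pve-cluster` read line by line; a non-empty result from `stuck_groups` would then be a reason to look at that node.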

I hope this write-up will be helpful.
 
I thought quorum requires an odd number of nodes to avoid split brain situations?
 
That would be an interesting thing to know! I don't remember reading about it in the Administration Guide though.
 
I thought it only applies when the cluster has fewer than 4 nodes. For example, you cannot have a stable quorum with 2 nodes; you need 3.
 
You want n = 2f+1, where n is the minimum number of nodes needed to tolerate the failure of f of them without losing quorum (assuming each node has the same number of votes). So to tolerate the failure of f=1 node you need at least n=3, to tolerate f=2 failed nodes you need n=5, etc.
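The arithmetic can be checked with a few lines of Python. This is only a sketch of plain majority voting with one vote per node, as assumed above; the function names are made up for illustration.

```python
def quorum(total_votes):
    """Votes needed for a strict majority."""
    return total_votes // 2 + 1

def tolerated_failures(total_votes):
    """How many single-vote nodes can fail while a majority remains."""
    return total_votes - quorum(total_votes)

def min_nodes(failures):
    """Smallest cluster that survives `failures` failed nodes: n = 2f + 1."""
    return 2 * failures + 1

# Print: cluster size, quorum, tolerated failures
for n in range(2, 8):
    print(n, quorum(n), tolerated_failures(n))
```

For the six-node cluster from the first post this gives a quorum of 4 and 2 tolerated failures, the same tolerance as a five-node cluster. An even node count therefore does not prevent quorum; the extra node just does not buy any additional failure tolerance.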
 
