Hi, I recently had the chance to debug a friend's cluster of six physical machines (four needed for quorum) and about 50 virtual machines, running the newest released Proxmox. I am a software developer, not a sysadmin, and this wasn't technically even my cluster, but I was asked to help. I wasn't smart enough to take notes, but here's what I remember; maybe it will be useful to someone.
It started when we noticed that backups had stopped being made. After logging in to some of the machines I saw that the `vzdump` processes had been hanging in kernel calls for several days. Other symptoms: `ls /etc/pve/nodes/«somenodename»/` would hang (the process got stuck in a kernel call), while almost everything else could still be read. Also, most modifications to files on that file system seemed to hang.
I started debugging `pmxcfs` with `gdb` on one of the machines and found that some FUSE threads were hanging at dfsm.c:339 (`g_cond_wait (&dfsm->sync_cond, &dfsm->sync_mutex);`). These were the threads handling writes to the file system. It also turned out that `dfsm->state` was set to `DFSM_MODE_START_SYNC`, presumably for several days by then. One of the message queues (I don't remember which one) had 8 enqueued messages.
Then I noticed that two of the six machines (neither of them the one I debugged above) were stuck resending messages, logging `[dcdb] notice: cpg_join retry «some number»` every second.
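For anyone who wants to check their own nodes for the same symptom, here is a minimal sketch in Python that scans log lines for that message. It assumes the retry counter is a plain decimal number and that the quoted text appears somewhere in each line; the function name and the sample lines are made up for illustration, and where the `pmxcfs` log lines end up depends on your system.

```python
import re
from typing import Iterable, Optional

# Built only from the one message quoted above. Real log lines may carry
# extra prefixes (timestamp, host, process), so we search instead of match.
# Assumption: the retry counter is a decimal integer.
RETRY_RE = re.compile(r"\[dcdb\] notice: cpg_join retry (\d+)")


def highest_cpg_join_retry(lines: Iterable[str]) -> Optional[int]:
    """Return the highest retry counter seen, or None if no line matches."""
    highest = None
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            count = int(match.group(1))
            if highest is None or count > highest:
                highest = count
    return highest


if __name__ == "__main__":
    # Hypothetical sample lines, not copied from the real cluster.
    sample = [
        "node-a pmxcfs: [dcdb] notice: cpg_join retry 10",
        "node-a pmxcfs: something unrelated",
        "node-a pmxcfs: [dcdb] notice: cpg_join retry 11",
    ]
    print(highest_cpg_join_retry(sample))
```

A counter that keeps growing across runs would suggest the node is stuck in the same retry loop I saw.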
I restarted the `corosync` and `pve-cluster` services on these two machines and backups immediately started working again.
Conclusions and suggestions: it seems that there is some kind of concurrency/communication bug related to the initial sync between machines, where a failing minority led to the lack of a working quorum. Two things would help in debugging this kind of problem: (1) a notification along the lines of "hey, I've been stuck in DFSM_MODE_START_SYNC for far too long because I'm still expecting messages from nodes X and Y", and (2) logic that keeps the quorum working when no message has arrived from a minority of nodes for some time.
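To make suggestion (1) more concrete, here is a rough sketch of the idea in Python. This is not `pmxcfs` code and not a proposed patch: the class, its method names and the `give_up_after` parameter are all made up for illustration. The point is only that a timed wait on the condition variable, instead of an unbounded one, gives the waiting thread a chance to report which nodes it is still waiting for.

```python
import threading
import time
from typing import Callable, Set


class SyncWaiter:
    """Toy model of 'wait until every expected node has reported'."""

    def __init__(self, expected_nodes: Set[str], warn_after: float,
                 report: Callable[[str], None] = print) -> None:
        self._cond = threading.Condition()
        self._pending = set(expected_nodes)
        self._warn_after = warn_after  # seconds between warnings
        self._report = report

    def node_reported(self, node: str) -> None:
        """Called when a sync message from `node` has arrived."""
        with self._cond:
            self._pending.discard(node)
            if not self._pending:
                self._cond.notify_all()

    def wait_until_synced(self, give_up_after: float) -> bool:
        """Wait for all nodes, warning periodically about the missing ones.

        `give_up_after` exists only so this sketch terminates; a real
        implementation might keep waiting and keep warning instead.
        """
        deadline = time.monotonic() + give_up_after
        with self._cond:
            while self._pending:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                # Timed wait, so we wake up even if nobody notifies us.
                self._cond.wait(timeout=min(self._warn_after, remaining))
                if self._pending:
                    self._report("still syncing, waiting for nodes: "
                                 + ", ".join(sorted(self._pending)))
            return True


if __name__ == "__main__":
    waiter = SyncWaiter({"X", "Y", "Z"}, warn_after=0.1)
    waiter.node_reported("Z")
    waiter.wait_until_synced(give_up_after=0.3)
```

Even a periodic log line like this would have pointed me at the two misbehaving machines much earlier.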
I hope this write-up will be helpful.