help please: My cluster has fallen apart!?!

proxwolfe

Renowned Member
Jun 20, 2020
I have a small three-node cluster in my home lab. Everything was working fine until recently one of the nodes ("A") went into a reboot loop (I don't know what caused this yet), so I turned it off for later troubleshooting. The other two nodes ("B" and "C") have been working fine since then (Ceph did complain about several undersized PGs, but other than that everything was fine).

Then I added a replacement node ("D") to the mix. My cluster then had four nodes, only three of which were online (B, C, D). Everything was working fine.

Today, I wanted to find out what was causing the boot loop on node A. So I brought it back online to observe what happens (spoiler: the boot loop is gone). Well, what did happen was that I lost the connection to my other hosts for a moment. After a while, I was able to log in to two of the three remaining nodes (C, D) via the GUI again. The third one (B) is still running, but I can't log in to its GUI (ssh works fine).
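
(Side note: since ssh to B works, I assume the GUI problem there is with the pveproxy service that serves the web interface; something like the following on node B should show its state -- I'm noting this as a general check, not pasting my actual output:

    systemctl status pveproxy pvedaemon
    journalctl -u pveproxy --since "1 hour ago"

)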

Now, when I look at the GUI of each of the three nodes that I can access, I see this:

A: Cluster consists of A, B, C. Only A is online, B and C are offline
B: (can't access GUI)
C: Cluster consists of A, B, C, D: Only C is online, A is offline, B and D are unknown
D: Cluster consists of A, B, C, D: Only D is online, A is offline, B and C are unknown
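
(For completeness: as far as I know, the same membership view can be cross-checked from the shell on each node with the standard commands, e.g.

    pvecm status          # quorum state and member list as corosync sees it
    pvecm nodes           # node IDs and names known to the cluster
    corosync-cfgtool -s   # status of the corosync links

I'm listing these as a general pointer rather than pasting my actual output.)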

In the Ceph section of the GUI, all four nodes with their respective OSDs are shown as "In" and "Up". So the problem seems to be only with the Corosync part of the cluster.
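
(The Ceph side can likewise be confirmed from the shell of any node, e.g. with

    ceph -s         # overall health, monitor quorum, OSD summary
    ceph osd tree   # which OSDs are up/in and on which host

though here I'm going by what the GUI shows me.)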

Additional info: I'm not sure whether this had anything to do with the problem, but it is unusual, so I want to point it out. (Long) before node A had its issues, I had put a TFA requirement on the cluster. When I tried to add D to the cluster (after A was offline), I got an error message about the TFA, so I disabled it. After that, nodes B, C and D were accessible without TFA, but A did not know about this. So when A came back online, it asked for the TFA but no longer accepted the code from my app. So I disabled TFA for A as well.

Because I couldn't do this from the GUI -- it would not let me in without the TFA -- I had to do it via ssh. When trying to disable TFA via ssh, I got an error that some config file could not be locked (because the node had no quorum). I found another post here that suggested first reducing the number of votes expected for quorum to 1, so I did that on node A (only). After that, I was able to remove the TFA on node A as well -- or so I thought, because there were no more error messages. When trying to log in again, A still asked me for the TFA, but this time it accepted the code from my app again.
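(I assume the command the other post meant was this one, run on the isolated node:

    pvecm expected 1   # lower the expected vote count so the node regains quorum

My understanding is that this only changes the expected votes at runtime so that /etc/pve becomes writable again on that node; it does not edit corosync.conf. I'm saying that from memory, so corrections welcome.)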

How can I bring my cluster back together? I need A back in the cluster so I can get at the config files for a number of VMs that were running on it. Once that is done, I will remove it from the cluster for good.
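
(For the removal step later on, my understanding is that it is done with

    pvecm delnode A   # run on a node that still has quorum, after A is shut down for good

but that is for after I have rescued the configs.)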

Any help is appreciated!
 
Small update:

I can now also log in to the GUI of node B when I access its IP directly (all my nodes have LE certificates and can -- normally -- be reached internally via the respective domain names).

And when I look into node B's GUI, I see this: Cluster consists of A, B, C, D: B, C and D are online, only A is offline.

Does that make sense to anyone?
 
Btw, all VM/LXC config files are in /etc/pve on each node, so there is no need to get node A back into the cluster to access these configs.
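
(Concretely -- assuming the default layout, which is all I can speak for -- the configs live under paths like

    /etc/pve/nodes/<nodename>/qemu-server/<vmid>.conf   # VM configs
    /etc/pve/nodes/<nodename>/lxc/<vmid>.conf           # container configs

and since /etc/pve is the replicated cluster filesystem, every node that was in the cluster should still have a copy of node A's configs as of the last time A was in sync.)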