I have a small three-node cluster in my home lab. Everything was working fine until recently, when one of the nodes ("A") went into a reboot loop (I don't know what caused this yet), so I turned it off for later troubleshooting. The other two nodes ("B" and "C") have been working fine since then (Ceph did complain about several undersized PGs, but other than that everything was fine).
Then I added a replacement node ("D") to the mix. My cluster then had four nodes, only three of which were online (B, C, D). Everything was working fine.
Today, I wanted to find out what was causing the boot loop on node A, so I brought it back online to observe what happens (spoiler: the boot loop is gone). What did happen was that I lost the connection to my other hosts for a moment. After a while, I was able to log in to two of the three remaining ones (C, D) via the GUI again. The third one (B) is still running, but I can't log in to its GUI (ssh works fine).
Now, this is what I see in the GUI of each node (where I can access it):
A: Cluster consists of A, B, C. Only A is online, B and C are offline
B: (can't access GUI)
C: Cluster consists of A, B, C, D. Only C is online, A is offline, B and D are unknown
D: Cluster consists of A, B, C, D. Only D is online, A is offline, B and C are unknown
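Since the membership lists differ between the nodes (A still only knows about A, B and C), this is how I would compare the cluster configuration each node is actually using. This is just a sketch based on my understanding of where Proxmox VE keeps these files, so corrections are welcome:

```
# cluster-wide copy (distributed via pmxcfs) vs. the local copy corosync actually loads
cat /etc/pve/corosync.conf
cat /etc/corosync/corosync.conf

# ring/link status as corosync itself sees it
corosync-cfgtool -s
```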
In the Ceph section of the GUI, all four nodes with their respective OSDs are shown as "in" and "up", so the problem seems to be only with the Corosync part of the cluster.
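For reference, these are the commands I can run on each node to compare the two layers (standard Proxmox VE and Ceph CLI tools, nothing custom), in case the output would help with diagnosis:

```
# Ceph view: health, monitors, OSDs (all four nodes look fine here)
ceph -s
ceph osd tree

# Corosync/cluster view: membership, vote count, quorum state
pvecm status
pvecm nodes

# services behind the GUI and /etc/pve
systemctl status corosync pve-cluster pveproxy
```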
Additional info: I'm not sure whether this has anything to do with the problem, but it is unusual, so I want to point it out. (Long) before node A had its issues, I had put a TFA requirement on the cluster. When I tried to add D to the cluster (after A was offline), I got an error message about the TFA, so I disabled it. After that, nodes B, C and D were accessible without TFA, but A did not know about this. So when A came back online, it asked for the TFA but no longer accepted the code from my app, so I wanted to disable TFA on A as well. Because I couldn't do this from the GUI (it would not let me in without the TFA), I had to do it via ssh. When trying to disable TFA via ssh, I got an error that it could not lock some config file (because the node had no quorum). I found another post here that suggested first reducing the number of votes expected for quorum to 1, so I did that on node A (only). After that, I was able to remove the TFA on node A as well, or so I thought, since there were no more error messages. When trying to log in again, A still asked me for the TFA, but this time it did accept the code from my app again.
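Roughly, this is what I did on node A over ssh. I'm reconstructing it from memory, so the exact file and key names may be off:

```
# lower the expected vote count so the isolated node becomes "quorate"
# and /etc/pve becomes writable again
pvecm expected 1

# then remove the datacenter-wide TFA requirement
# (I removed the "tfa: ..." line in /etc/pve/datacenter.cfg;
#  quoting this from memory, so it may not be exact)
nano /etc/pve/datacenter.cfg
```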
How can I bring my cluster back together? I need to get A back into the cluster because I need access to the config files of a number of VMs that were running on it. Once that is done, I will remove it from the cluster for good.
Any help is appreciated!