help please: My cluster has fallen apart!?!

proxwolfe

Renowned Member
Jun 20, 2020
I have a small three-node cluster in my home lab. Everything was working fine until recently, when one of the nodes ("A") went into a reboot loop (I don't know yet what caused this), so I turned it off for later troubleshooting. The other two nodes ("B" and "C") have been working fine since then (Ceph did complain about several undersized PGs, but other than that everything was fine).

Then I added a replacement node ("D") to the mix. My cluster then had four nodes, only three of which were online (B, C, D). Everything was working fine.

Today, I wanted to find out what was causing the boot loop on node A. So I brought it back online to observe what happens (spoiler: the boot loop is gone). What did happen was that I lost the connection to my other hosts for a moment. After a while, I was able to log in to two of the three remaining nodes (C, D) via the GUI again. The third one (B) is still running, but I can't log in to its GUI, while ssh works fine.

Now, when I look at the GUI of each of the three nodes I can access, I see this:

A: Cluster consists of A, B, C. Only A is online, B and C are offline
B: (can't access GUI)
C: Cluster consists of A, B, C, D: Only C is online, A is offline, B and D are unknown
D: Cluster consists of A, B, C, D: Only D is online, A is offline, B and C are unknown

In the Ceph section of the GUI, all four nodes with their respective OSDs are shown as "In" and "Up". So the problem seems to be only with the Corosync part of the cluster.

Additional info: I'm not sure whether this has anything to do with the problem, but it is unusual, so I want to point it out. (Long) before node A had its issues, I had put a TFA requirement on the cluster. When I tried to add D to the cluster (after A was offline), I got an error message about the TFA, so I disabled it. After that, nodes B, C and D were accessible without TFA, but A did not know about this. So when A came back online, it asked for the TFA, but it no longer accepted the code from my app. So I had to disable TFA for A as well.

Because the GUI would not let me in without the TFA, I had to do this via ssh. When trying to disable TFA that way, I got an error that it could not lock some config file (because the node had no quorum). I found another post here that suggested first reducing the number of votes expected for quorum to 1. So I did this on node A (only). After that, I was able to remove the TFA on node A as well, or so I thought, because there were no more error messages. But when trying to log in again, A still asked me for the TFA; this time, though, it accepted the code from my app again.
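For reference, the suggestion from that post boils down to the following (a sketch, run on the isolated node; lowering the expected votes should only ever be a temporary measure):

    pvecm status      # show the current quorum state and vote counts
    pvecm expected 1  # temporarily lower the expected votes to 1 so that
                      # /etc/pve becomes writable on this node alone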

How can I bring my cluster back together? I need to get A back into the cluster because I need access to the config files for a number of VMs that were running on it. Once that is done, I will remove it from the cluster for good.

Any help is appreciated!
 
Small update:

I can now also log in to the GUI of node B when I access its IP directly (all my nodes have LE certificates and can normally be reached internally via their respective domain names).

And when I look into node B's GUI, I see this: Cluster consists of A, B, C, D: B, C and D are online, only A is offline.

Does that make sense to anyone?
 
After I took node A offline again, nodes B, C and D started operating normally again.
 
Btw. all VM/LXC config files are in /etc/pve of each node, so there is no need to get node A back into the cluster to access these configs.
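For reference, the layout looks like this (placeholder node name; /etc/pve is the pmxcfs cluster filesystem, shared by all joined nodes):

    ls /etc/pve/nodes/                    # one subdirectory per cluster node
    ls /etc/pve/nodes/nodeA/qemu-server/  # VM configs that belong to node A
    ls /etc/pve/nodes/nodeA/lxc/          # container configs on node A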
 
I tried that. I rebooted node A without access to the Corosync network (so that it would not mess things up again) and copied all VM conf files to a USB stick.

But when I try to copy an old conf file onto node D, it won't let me: it tells me that the file already exists (in /etc/pve/qemu-server), yet when I list this directory's contents, the file is not there.
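A way to check where the conflicting file actually lives (using a hypothetical VMID 100) would be to search the whole cluster filesystem:

    find /etc/pve/nodes -name 100.conf  # prints the node directory that
                                        # still holds this VM's config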

So how do I get the config files to my new node?
 
If node D is joined to the cluster, the configs are still there, as it gets them from the other two online nodes (B+C).
Anyway, if you want to change something in the VM/LXC config files (because some setting needs to be different from when node A was still online), you can just edit the config files with e.g. vi on any one of the online cluster nodes; after saving, the changes are already on the other nodes too.
 
The way I understand it, the folder /etc/pve/nodes/<node B>/qemu-server contains only the conf files for the VMs on node B. And the same goes for nodes C and D.

So if I want the VM to be on node D, I need to put the conf file into the <node D> folder (the VM's disks should be available on every node via Ceph). But I can't copy the conf file into any of the node folders.
 
Aah, yes, I understand you: all configs there are sorted under "nodes", and each exists only once. But the VM and LXC configs from your dead node A are still visible on nodes B+C(+D) under the path /etc/pve/nodes/<nodeA>/qemu-server (or /lxc/).
And you cannot copy a VM/LXC config file, because a machine can only exist once, on one node. So you have to move the config from one node's path to the other's, which works fine when the VM is powered down. That can be a problem when the image is local (we use NFS, so no problem here), but with Ceph it should work, I suppose. Otherwise, use the GUI!
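As a sketch of that move (with a hypothetical VMID 100 and placeholder node names; the VM must be powered off first):

    # move, don't copy: a VMID may exist under exactly one node at a time
    mv /etc/pve/nodes/nodeA/qemu-server/100.conf /etc/pve/nodes/nodeD/qemu-server/100.conf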
 