HA not working when I lose a Node

Savo · New Member · Nov 19, 2023
Hello All,

I am figuring I am missing something small so I hope someone can help me.

Setup:
2 Dell OptiPlex mini PCs with 2 NICs each: one NIC on the standard network and one dedicated to Ceph.

I configured the cluster and Ceph with 2 nodes, and added my Proxmox Backup Server as a QDevice, so I should be keeping quorum. I also configured a VM with HA. I can migrate between the two nodes just fine, although slowly with only a 1 Gb link. But if I pull the power on one of my nodes, HA does start to move the VM to the other node and tries to start it, but it fails with the following error: "TASK ERROR: start failed: rbd error: rbd: couldn't connect to the cluster!". It will not start until I bring the other node back online. Both nodes are also Ceph monitors.

Thanks in advance,
AJ
 
Ceph monitors need a majority to form a quorum, so with only two monitors, losing one stalls the whole cluster; you want at least three.
Furthermore, Ceph is designed for 3 replicas across three nodes. If you cannot achieve this, you have to adjust the distribution and ideally reduce the replica count to 2.
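For a two-node test, the replica count can be lowered per pool from the CLI. A sketch, assuming your RBD pool is named `ceph-vm` (check the real name with `ceph osd lspools`):

```shell
# Assumption: the pool is called "ceph-vm"; substitute your pool name.
ceph osd pool set ceph-vm size 2      # keep 2 copies instead of the default 3
ceph osd pool set ceph-vm min_size 1  # allow I/O with only 1 copy available
```

Note that `min_size 1` trades safety for availability and is only reasonable in a throwaway test setup.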
 
So I did reduce the replicas. Is there a way to add that backup server as a Ceph monitor? This is a test to see whether my deployment works before purchasing better hardware.

Would the solution be to install VE on my backup server and add it to the cluster, but not the HA group? Just run Backup Server as a VM?
 
You can implement this however you like, you just need to be able to create three independent CEPH Mons.
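If that third box runs PVE, a monitor can be created on it from the shell using the standard pveceph tooling (a sketch; run on the new node):

```shell
pveceph install      # install the Ceph packages on this node, if not already present
pveceph mon create   # create a Ceph monitor on this node
ceph quorum_status   # verify that all three mons are now in quorum
```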

A possible solution is to install PVE on the other server and put PBS in a VM. But I would never run PBS in a VM; performance on bare metal will be completely different than in a VM.

However, what you do here will not give you any meaningful results for the productive setup. I would always advise starting Ceph with at least three nodes / mons / mgrs. Each node should have at least 2 SSDs, and you should plan at least 4 GB of RAM per SSD (= OSD). Ideally you have at least a 10 GbE network, but a bond of 2x 1 GbE or 4x 1 GbE can also give you decent baseline performance.
With Ceph you should definitely also run LACP with layer3+4 hashing, which means a meshed network will potentially slow down your performance. When it comes to switches, you should use enterprise devices, as the latency of consumer devices is massively higher. Latency in particular is always a critical parameter in storage systems.
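An LACP bond with layer3+4 hashing in `/etc/network/interfaces` might look like this (a sketch; the interface names `enp1s0`/`enp2s0` and the address are assumptions for a dedicated Ceph network):

```
auto bond0
iface bond0 inet static
    address 10.10.10.11/24            # example address on the dedicated Ceph network
    bond-slaves enp1s0 enp2s0         # assumed NIC names; check with `ip link`
    bond-miimon 100
    bond-mode 802.3ad                 # LACP; the switch ports must be configured to match
    bond-xmit-hash-policy layer3+4    # hash on IP + port so flows spread across links
```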
 
OK, So I added a third node. But now when I lose power on one node, this is what happens in the HA section.
[attached screenshot: HA status after losing a node]

Very odd to me that all the timestamps go stale when I lose pmox01. Am I missing something here? I thought pmox02 and pmox03 would hold quorum. Sorry for these dumb questions.

Maybe this is because I am trying to make this work with low-end hardware, including low-end switches, in my test environment. I have 10 GbE ordered for the final product, but wanted to make sure this was going to be stable before I pull the trigger.
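To see whether the two surviving nodes actually hold quorum after pulling one, it can help to check both quorum layers separately (run on a surviving node):

```shell
pvecm status   # corosync/PVE quorum: with 2 of 3 votes you would expect "Quorate: Yes"
ceph -s        # Ceph health: expect HEALTH_WARN with 1/3 mons down, but I/O still working
```

If `pvecm status` is quorate but `ceph -s` hangs, the problem is on the Ceph side (monitors or network) rather than the PVE cluster itself.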
 