HA not working when I lose a Node

Savo · New Member · Nov 19, 2023
Hello All,

I am figuring I am missing something small so I hope someone can help me.

Setup:
2 Dell OptiPlex mini PCs with 2 NICs each: one NIC on the standard network and one dedicated to Ceph.

I configured the cluster and Ceph with 2 nodes, and added my Proxmox Backup Server as a QDevice, so I should be keeping quorum. I also configured a VM with HA. I can migrate between the two nodes just fine, although slowly with only a 1 Gb link. But if I pull the power on one of my nodes, HA does start to move the VM to the other node and tries to start it, but it fails with the following error: "TASK ERROR: start failed: rbd error: rbd: couldn't connect to the cluster!". It will not start until I bring the other node back online. Both nodes are also Ceph monitors.

Thanks in advance,
AJ
 
Ceph monitors need a majority to form a quorum, so with only two monitors, losing one stalls the whole cluster; you want at least three.
Furthermore, Ceph is designed for 3 replicas across three nodes. If you cannot achieve this, you have to adjust the distribution and ideally reduce the replica count to 2.
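For a two-node test, the replica count can be lowered per pool from the CLI. A sketch, assuming your RBD pool is named `ceph-vm` (check the real name with `ceph osd lspools`):

```shell
# Assumption: the pool is called "ceph-vm"; substitute your pool name.
ceph osd pool set ceph-vm size 2      # keep 2 copies instead of the default 3
ceph osd pool set ceph-vm min_size 1  # allow I/O with only 1 copy available
```

Note that `min_size 1` trades safety for availability and is only reasonable in a throwaway test setup.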
 
So I did reduce the replicas. Is there a way to add that backup server as a Ceph monitor? This is a test to see whether my deployment works before purchasing better hardware.

Would the solution be to install VE on my backup server and add it to the cluster, but not the HA group? Just run Backup Server as a VM?
 
You can implement this however you like, you just need to be able to create three independent CEPH Mons.
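If that third box runs PVE, a monitor can be created on it from the shell using the standard pveceph tooling (a sketch; run on the new node):

```shell
pveceph install      # install the Ceph packages on this node, if not already present
pveceph mon create   # create a Ceph monitor on this node
ceph quorum_status   # verify that all three mons are now in quorum
```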

A possible solution is to install PVE on the other server and put PBS in a VM. But I would never run PBS in a VM; performance on bare metal will be completely different than in a VM.

However, what you do here will not give you any meaningful results for the productive setup. I would always advise starting Ceph with at least three nodes / mons / mgrs. Each node should have at least 2 SSDs, and you should plan at least 4 GB of RAM per SSD (= OSD). Ideally you have at least a 10 GbE network, but a bond of 2x 1 GbE or 4x 1 GbE can also give you decent baseline performance.
With Ceph you should definitely also run LACP with layer3+4 hashing, which means a meshed network will potentially slow down your performance. When it comes to switches, you should use enterprise devices, as the latency of consumer devices is massively higher. Latency in particular is always a critical parameter in storage systems.
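An LACP bond with layer3+4 hashing in `/etc/network/interfaces` might look like this (a sketch; the interface names `enp1s0`/`enp2s0` and the address are assumptions for a dedicated Ceph network):

```
auto bond0
iface bond0 inet static
    address 10.10.10.11/24            # example address on the dedicated Ceph network
    bond-slaves enp1s0 enp2s0         # assumed NIC names; check with `ip link`
    bond-miimon 100
    bond-mode 802.3ad                 # LACP; the switch ports must be configured to match
    bond-xmit-hash-policy layer3+4    # hash on IP + port so flows spread across links
```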
 
OK, So I added a third node. But now when I lose power on one node, this is what happens in the HA section.
[attached screenshot: HA status after losing a node]

Very odd to me that all the timestamps go stale when I lose pmox01. Am I missing something here? I thought pmox02 and pmox03 would hold quorum. Sorry for these dumb questions.

Maybe this is because I am trying to make this work with low-end hardware, including low-end switches, in my test environment. I have 10 GbE ordered for the final product, but wanted to make sure this was going to be stable before I pull the trigger.
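To see whether the two surviving nodes actually hold quorum after pulling one, it can help to check both quorum layers separately (run on a surviving node):

```shell
pvecm status   # corosync/PVE quorum: with 2 of 3 votes you would expect "Quorate: Yes"
ceph -s        # Ceph health: expect HEALTH_WARN with 1/3 mons down, but I/O still working
```

If `pvecm status` is quorate but `ceph -s` hangs, the problem is on the Ceph side (monitors or network) rather than the PVE cluster itself.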
 