quorum impact of on-demand compute-only nodes in a cluster with Ceph?

abufrejoval

Member
Jun 27, 2023
In my home lab I run three NUCs as a 24x7 HCI cluster, and on demand I add workstations, which normally run other operating systems or are simply turned off, as extra compute nodes for experimentation, typically with VMs using GPU pass-through for CUDA.

I've been doing that for years with RHV/oVirt, tried it briefly with XCP-ng (not so promising), and am now trying the same with Proxmox.

With oVirt I ran across an issue that I'd rather not hit with Proxmox, so before I do, I'd like some feedback:

The main difference between the permanent nodes and the on-demand ones is that only the permanent nodes contribute to the [shared] storage; the ad-hoc compute nodes are meant to remain pretty much stateless, so they can be turned off (or used otherwise). Since these on-demand nodes do not contribute Gluster bricks or Ceph OSDs, their running state should not have any impact on storage consistency.

If you have six nodes in total and only three contribute HCI storage, the three on-demand nodes should not be involved in quorum counting... unless you deliberately involve them in tie breaking.

With RHV/oVirt that didn't always work as expected: I've seen quorum loss reports when nodes that didn't contribute bricks to a given Gluster volume were shut down, even if they only held bricks for other volumes. But at least it worked when they contributed no Gluster bricks at all.

With Proxmox, what has me concerned is that each node is given a "vote" in corosync, which also underpins the cluster file system.

Does that mean I'll get into trouble when I add more on-demand non-storage nodes than HCI nodes? Or will things already go south if my (currently) two ad-hoc nodes are shut down and I reboot one of my three-replica Ceph nodes, because only 2 out of 5 nodes (corosync quorum) but 2 out of 3 storage nodes (Ceph quorum) remain running?

And how much trouble would that be? Will I just not be able to create, launch or migrate VMs until corosync quorum is back?

I guess currently Proxmox treats all nodes the same at the corosync level, which is why you can manage the cluster just the same from every node, as long as you have a majority.

If you wanted to support a majority of on-demand nodes, you'd have to differentiate between votes from permanent nodes and on-demand nodes and treat the corosync file system as read-only from the on-demand nodes. They'd then have to proxy via one of the permanent nodes for their own operations.

I guess I could add a little Proxmox VM (or container?) on each one of the permanent nodes to give them double votes for a similar effect...

I quite like the ability to boot both my CUDA workstations and my kid's gaming rigs (which are often my older CUDA workstations) off one of those super-fast Kingston DataTraveler USB sticks with a basic Proxmox install, and then have them run VMs from the HCI cluster for machine-learning workloads...

But I can see that running into trouble as I add the next node, and while dropping a node in RHV/oVirt is a simple GUI action (it can also be re-inserted just as easily), in Proxmox that doesn't seem to be the case.
 
I guess currently Proxmox treats all nodes the same at the corosync level,
Rather than guess, read the docs: you can weight nodes by increasing their number of votes. Note this also requires increasing the minimum number of votes needed for quorum. If done right, I believe this should stop the minor nodes from ever forming a quorum on their own, and their loss shouldn't invalidate the quorum either, as the other nodes with more votes decide.

You can set the number of votes with pvecm when you add a node (there's a --votes option for pvecm add, if I recall correctly).

You can edit corosync.conf afterwards, I believe:
Code:
  node {
    name: uno
    nodeid: 1
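    # quorum_votes is the per-node vote count; raise it on the permanent nodes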
    quorum_votes: 1
    ring0_addr: 10.10.10.1
    ring1_addr: 10.20.20.1
  }

Good luck, I suspect the fun will be trying to find the right numbers to use...

I would give the 3 minor nodes 1 vote each, the major nodes 5 votes each, and set the quorum minimum at 10. YMMV
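
To make that concrete, here's a rough sketch of how the node entries (inside the nodelist section) of /etc/pve/corosync.conf could look with those weights; node names, IDs and addresses are made up, and only one node of each kind is shown:
Code:
  # permanent HCI/storage node (repeat for the other two, nodeid 2 and 3)
  node {
    name: perm1
    nodeid: 1
    quorum_votes: 5
    ring0_addr: 10.10.10.1
  }
  # on-demand compute-only node (repeat for further ones, nodeid 5, 6, ...)
  node {
    name: ondemand1
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.10.10.4
  }
  # 3 x 5 + 3 x 1 = 18 total votes, so the default majority is 10:
  # the three permanent nodes (15 votes) stay quorate on their own,
  # while the on-demand nodes (3 votes) can never form a quorum by themselves
  quorum {
    provider: corosync_votequorum
  }

If you edit /etc/pve/corosync.conf by hand, remember to bump config_version in the totem section so the change gets picked up.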
 
The docs aren't terribly detailed on the quorum mechanism or on where and how to use votes greater than 1, but I think I'll just use them naively for now, since my cluster isn't that big.

I've also read the original Corosync paper, but I guess I'd have to hunt down the references for quorum details.

Thanks most of all for pointing out what seems to be the simplest solution right now!

I guess the safest approach is to ensure that losing all the on-demand nodes' votes can never cost me corosync quorum while Ceph is still within its fault-tolerance levels: as long as the on-demand nodes together hold fewer votes than the permanent nodes, the permanent nodes alone keep the majority.

And then to only ever manage the cluster via one of the permanent nodes.

Votes could become trickier when it comes to HA and fault recovery, but I haven't progressed to HA and affinity rules yet. They are a really handy feature in RHV/oVirt, which has a rather intelligent management engine and host daemons to deal with them... but that comfort comes with a lot of complexity too, and since Red Hat has killed the project, bemoaning its demise is futile at best.
 
