Proxmox cluster limited to 2 nodes - adding Ceph-only nodes

mgaudette112

Member
Dec 21, 2023
Hi,

I am running a 2-node Proxmox cluster (2 x Dell R640, PVE 8.4), with VMs stored on separate shared NFS storage. Due to Microsoft licensing for its Server products, I am limited to 48 cores for the entire cluster.

My goal is to use Ceph (NVMe only) instead of NFS - I have used it in a lab (both Proxmox and cephadm), and I really like the idea of it. I understand I could turn this into a 3-node x 16-core system, but Ceph really starts shining at 5+ nodes anyway, and I cannot reasonably (due to Microsoft licensing) get to 5 nodes within my constraints without giving MS more of my money.

I am left with three options, and I would like to know what the community thinks of these.

Note: the existing nodes are overprovisioned in terms of cores and memory at the moment.

Option 1:
Use my current 2-node Proxmox cluster for Ceph and add (yet-to-be-understood/yet-to-be-invented) Ceph-only nodes. This raises the question: Can I somehow add Proxmox nodes that aren't actually PVEs, but are limited to only adding Ceph storage to an existing Proxmox cluster? So they don't have "cores that could be used for VMs" to a Microsoft audit?

Option 2: Just start an independent Ceph storage cluster (cephadm, etc.) on 5 new nodes, not using Proxmox's Ceph features. This is definitely the easier solution to understand, BUT
  • it seems a waste not to use those 2 x 8 U.2 NVMe slots in my current Proxmox nodes
  • my current UPS is reaching its limit, and I am not sure I can add 5 more nodes, even modest devices. Using the existing nodes would likely be more power efficient. Removing the NFS storage will free up about 2-3 nodes' worth of power, but that does not compensate for adding 5 nodes.
Option 3: Add 3 nodes of the smallest NVMe/ECC devices I can find (Lenovo P320s seem to fit the bill), and use NVMe passthrough to a single VM per Proxmox node (to get to 5 cephadm nodes), not using Proxmox's Ceph features. I know HA won't work with PCIe passthrough, but that's fine, as the HA part of storage would be handled by Ceph.
  • Is PCIe passthrough reliable for passing U.2 drives to VMs? Any gotchas? (A rough sketch of what that config might look like follows this list.)
  • Can a Proxmox cluster use a Ceph storage partly provisioned by its own VMs? Does it make sense from an HA perspective? When the VM starts it would try using its own storage for the OS... seems weird.
  • ...I guess I could easily put the Ceph VM's OS storage on the local ZFS disk of the node, so the virtualized Ceph node would not rely on its own Ceph storage to boot.
  • I do feel this solution becomes full of pitfalls... but then again it seems the best way to reuse the existing nodes' wattage and free U.2 slots.
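For the passthrough question above, here is a minimal sketch of how a whole U.2 NVMe device could be handed to a Ceph VM using Proxmox's standard PCIe passthrough (the PCI address 0000:41:00.0 and VMID 101 are placeholders, and IOMMU would already need to be enabled on the host):

# Identify the NVMe drive's PCI address on the host
lspci -nn | grep -i nvme

# PCIe passthrough needs the q35 machine type on the guest
qm set 101 --machine q35

# Hand the whole U.2 NVMe device (placeholder address) to the VM as a PCIe device
qm set 101 --hostpci0 0000:41:00.0,pcie=1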
I'd welcome any opinions, including "you're mad" (as long as it includes the "why" I am).
 
but Ceph really starts shining at 5+ nodes
Yes, that was my impression too, after a year of testing (in a homelab) --> https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/

Disclaimer: not using Ceph currently...

Can I somehow add Proxmox nodes that aren't actually PVEs, but are limited to only adding Ceph storage to an existing Proxmox cluster?
You can install PVE + Ceph and just not run any VMs. That way you get quorum more easily, and Ceph can be handled by the nice PVE web GUI :-)
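A rough sketch of what bringing up such a Ceph-only node might look like once it has joined the cluster (the repository choice and the device name /dev/nvme0n1 are assumptions):

# Install the Ceph packages on the new node (no-subscription repository assumed)
pveceph install --repository no-subscription

# Create a monitor on this node so it contributes to the Ceph quorum
pveceph mon create

# Turn each NVMe drive into an OSD (device name is a placeholder)
pveceph osd create /dev/nvme0n1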
 
My goal is to use Ceph (NVMe only) instead of NFS - I have used it in a lab (both Proxmox and cephadm), and I really like the idea of it. I understand I could turn this into a 3-node x 16-core system, but Ceph really starts shining at 5+ nodes anyway, and I cannot reasonably (due to Microsoft licensing) get to 5 nodes within my constraints without giving MS more of my money.
Any storage solution has a sweet spot, but that spot depends entirely on how it's used. Ceph scales well with the number of initiators, which in the hypervisor use case translates to the number of VMs. If you have 3 VMs, you can scale to 100 nodes and your performance will not improve meaningfully. Alternatively, if you have 1000 VMs drawing 10k IOPS each, cluster performance will scale with more OSDs and nodes.

The last part of your question is probably the most important one: "reasonably".

What are you after? What deficiency are you trying to cure? What money are you giving to Microsoft, and what for? How will Ceph alleviate that?
 
So they don't have "cores that could be used for VMs" to a Microsoft audit?
Only Microsoft can answer this question; this forum can't. The real question, however, is: what problem do you want to solve with Ceph that you can't solve with NFS or ZFS storage replication? ZFS would be able to use the NVMes in your nodes.
If you have only two nodes you shouldn't use Ceph in production (too much room for error). Your third option is way too hackish for my taste to trust production with it.
Option 2 will work, but again: which problem do you want to solve that is worth investing in five nodes and the additional complexity?

For learning/playing around with Ceph you could always create three VMs in your existing cluster.
 
Can I somehow add Proxmox nodes that aren't actually PVEs, but are limited to only adding Ceph storage to an existing Proxmox cluster? So they don't have "cores that could be used for VMs" to a Microsoft audit?
There was a thread a while back about using specific pools and a 1 GB disk to prevent VMs from moving to (or off of) certain nodes. I can find it later if you can't. PVE 9 has a new affinity feature which sounds similar, but I haven't looked at it yet.
 
Yes, that was my impression too, after a year of testing (in a homelab) --> https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/

Disclaimer: not using Ceph currently...


You can install PVE + Ceph and just not run any VMs. That way you get quorum more easily, and Ceph can be handled by the nice PVE web GUI :-)

As I said, I can't add more cores to my hypervisor cluster, so that's a no-go.

As for a separate Proxmox cluster just for Ceph - that works, and I hadn't thought of that, but it doesn't really help me reuse the existing nodes for Ceph purposes.
 
What are you after? What deficiency are you trying to cure? What money are you giving to Microsoft, and what for? How will Ceph alleviate that?
I am after a storage system that I can update/reboot without downtime (in Ceph's case, one node at a time).
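For context, the rolling maintenance I have in mind is roughly the standard Ceph procedure, sketched below as a per-node loop (timing and health checks simplified):

# Stop Ceph from rebalancing while one node's OSDs are briefly down
ceph osd set noout

# ...update and reboot that one node, then wait for its OSDs to rejoin...

# Check that the cluster is healthy again before moving to the next node
ceph -s

# Re-enable normal rebalancing once maintenance is done
ceph osd unset noout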

I am giving Microsoft money for Windows Server licenses, but that's slightly beside the point. Ceph won't help with that, and I never thought it would.
 
I am after a storage system that I can update/reboot without downtime (in Ceph's case, one node at a time).

For this you don't even need shared storage; you can migrate VMs even if the storage isn't shared at all (e.g. LVM-Thin). With ZFS storage replication you would reduce the migration time further:
https://pve.proxmox.com/wiki/Storage_Replication

To enable automatic migration you will need high-availability:

https://pve.proxmox.com/wiki/High_Availability#_requirements

Now this would need three nodes, but with a qdevice (e.g. on a small PC, a Raspberry Pi, your NAS or a Proxmox Backup Server) you don't need a full-blown PVE node, and thus no additional licensing exposure:
https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_external_vote_support
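A rough sketch of what that setup could look like from the command line (the qdevice IP, VMID 100, replication job id and target node name are placeholders; the qdevice machine needs the corosync-qnetd package installed):

# Add an external vote from a small machine outside the cluster
pvecm qdevice setup 192.168.1.50

# Replicate VM 100's ZFS disks to the other node every 15 minutes
pvesr create-local-job 100-0 pve2 --schedule '*/15'

# Put the VM under HA so it is restarted on the other node automatically
ha-manager add vm:100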


So in your case I would go with storage replication plus a qdevice. I would keep the NFS/NAS for VMs where even the minimal theoretical data loss of storage replication can't be tolerated.

Please note that even with a qdevice Ceph still needs at least three nodes.

Udo's idea with the extra Ceph-only PVE cluster is technically sweet, but it still needs a larger investment, and an audit might still cause problems since you could migrate VMs to it with qm remote-migrate on the command line.

Again: only MS can tell you how they would interpret such a setup, but I'm willing to bet gold that their interpretation will be in their favour, not yours.
So in my opinion you should rather try to get a budget for a backup server in a "two nodes plus combined PBS/qdevice" setup. It will be cheaper and less trouble.
 
Udo's idea with the extra Ceph-only PVE cluster is technically sweet
It seems I was misunderstood here.

"You can install PVE + Ceph and just not run any VMs." was meant to be a reply to "Option 3: Add 3 nodes of the smallest NVMe/ECC devices".

Some nodes (1, 2 or even 3) would extend the current cluster, provide additional quorum votes and add storage. But no VMs would be put there.

As usual: there is more than one way to skin a cat :-)
 
This is a PVE 9.0.3 Ceph cluster with 5 nodes. Set up HA so that the licensed VMs only have access to the first 2 nodes, thus staying within your CPU licensing limit.
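A rough sketch of how that restriction could be expressed with classic HA groups (PVE 8.x CLI; PVE 9 migrates groups to node-affinity rules, and the node names and VMID below are placeholders):

# Create a restricted HA group containing only the two licensed nodes
ha-manager groupadd licensed-only --nodes "pve1,pve2" --restricted 1

# Put a licensed VM under HA and pin it to that group
ha-manager add vm:100 --group licensed-only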


[Screenshot: the 5-node PVE 9.0.3 Ceph cluster in the web GUI]
 
This is a PVE 9.0.3 Ceph cluster with 5 nodes. Set up HA so that the licensed VMs only have access to the first 2 nodes, thus staying within your CPU licensing limit.



I think a lot of people are misunderstanding my problem - Microsoft limits the entire hypervisor cluster from exceeding x cores (48 cores in my case).

It doesn't matter if I configure the VMs to never go on nodes 3-5, or make a pinky promise not to use those nodes - the entire thing has to stay under 48 cores. But my storage doesn't count (as far as I know!), only the hypervisors.
 
Oh, I assumed you were using SPLA licensing, which is paid per core, monthly, and is common in (required for) hosting clients' servers. What license are you using?
I am using this for a small business (not a hosting company, just hosting our own services) with Windows Server Standard. Each license covers 2 instances of Windows Server on up to 16 cores.

We have 3 of those, so we are allowed a maximum of 6 server instances on a hypervisor cluster of 48 cores total. Which is exactly what I have built.

I am aware I could do Ceph with 3 nodes (by having 3 servers of 16 cores each instead of 2 with 24 cores each), but I wanted to go with 5 Ceph nodes.

I think I'll give up and keep on using NFS, at least for the short term.
 
my problem - Microsoft limits the entire hypervisor cluster from exceeding x cores (48 cores in my case).
Based on that requirement, it seems like option 2 is the only rational solution. Or, BTW, there are other ways to get fault-tolerant storage: you can buy it. A Dell ME50xx or HP MSA26xx would do the trick nicely.

Penny wise, pound foolish. (Unless, of course, the original requirement of being able to update the storage without downtime isn't a requirement at all, in which case not messing with it is the correct answer.)
 
@mgaudette112 AFAIK that's not how it works; each license would need to cover the 48 physical cores. The only limiting you could do in that direction would be to cap each VM at 16 virtual cores, but based on how they count for SPLA that doesn't matter.
 
I think I'll give up and keep on using NFS, at least for the short term.

Isn't ZFS storage replication in a two-node-plus-qdevice/PBS cluster an option? By default the VMs are replicated to the other node every 15 minutes, but you can change that to one minute, multiple hours or even days. Depending on your use case this might be enough.
 
Some nodes (1, 2 or even 3) would extend the current cluster, provide additional quorum votes and add storage. But no VMs would be put there.

If I understood everything correctly, this isn't a viable solution due to the MS licensing terms. If he could (even just in theory) use the extra nodes for running VMs, he would have to pay (quite literally).
 
Penny wise, pound foolish. (Unless, of course, the original requirement of being able to update the storage without downtime isn't a requirement at all, in which case not messing with it is the correct answer.)
It depends on what your definition of a requirement is - but yes, it is not a hard requirement from anybody other than myself, and I will stick with NFS for now. Thank you.