[SOLVED] 4 or 5 storage nodes?

godfather007

Renowned Member
Oct 17, 2008
Netherlands
Hi,

for my company I'm proposing a PVE cluster setup and a move away from VMware (a cost decision).

My goal is a full-flash (24-bay) setup with 4 storage (incl. compute) nodes (for Linux VMs) and 2 compute nodes (Windows Datacenter), 6 in total.
It will host around 200-250 VMs.

I've found a comparison on the Ceph blog about performance with 3 vs. 5 storage nodes:
https://ceph.io/community/part-3-rhcs-bluestore-performance-scalability-3-vs-5-nodes/

Should I go for 5 nodes, or will 4 already bring enough of a performance gain?

Thanks in advance,
Martijn
 
An even number of nodes always opens the door to split-brain problems. If you have the choice you should go for an odd number, and then 5 is better than 3 and 7 is better than 5. :)
 
Not at hand, but with an even number a failed network link could lead to a 2:2 situation.
With an odd number there's always a majority.
True, with four nodes "more than half" is three, which will probably circumvent the problem. But an odd number is still more robust against all sorts of failures that could occur.
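
To make the majority rule concrete, here's a tiny sketch (plain Python, purely illustrative) of how many votes are needed for quorum and how many node failures that tolerates:

```python
# Quorum requires strictly more than half of all votes.

def quorum_threshold(total_votes: int) -> int:
    """Minimum number of votes needed for quorum (> 50 %)."""
    return total_votes // 2 + 1

for nodes in (3, 4, 5):
    needed = quorum_threshold(nodes)
    # With an even count, a symmetric split (e.g. 2:2 after a failed
    # network link) leaves neither side with a majority.
    print(f"{nodes} nodes: need {needed} votes, tolerate {nodes - needed} failure(s)")

# 3 nodes: need 2 votes, tolerate 1 failure(s)
# 4 nodes: need 3 votes, tolerate 1 failure(s)
# 5 nodes: need 3 votes, tolerate 2 failure(s)
```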
 
Because you asked, I did a bit of research on the topic. Fact is that you need more than half of the MONs to have quorum, and Red Hat advises having an odd number of them. Thus, you could probably get away with an even number of storage nodes as long as you have an odd number of monitors. I understood that you want to use the integrated Ceph functionality of PVE. This way you're also better off with an odd number of nodes (or a separate qdevice).
So, all in all, I step back a bit from what I said first and think that four storage nodes could work as well with a separate qdevice/monitor (maybe on one of the compute nodes or a completely separate machine).
That said, Ceph really works better with more OSDs on more nodes (at least if you aim for a single-digit number of nodes).
Therefore I'd still suggest five nodes if your budget supports it.
 
First, a general note: you'd normally never risk a split brain in Proxmox VE with Ceph (at least as long as you do not set a Ceph pool's size to 2/1 or manually tinker with the cluster votes), as that's exactly what quorum is for. That's also why, with both four and three total nodes, only one node can fail: with two failed, the basic rule of > 50% of the votes isn't achievable for either.

To the topic: four storage nodes and a QDevice can work somewhat OK, but I'd still recommend going for 5 over 4. The main advantage is that in the worst still-working situation, when two nodes fail, the load from the two failed nodes gets spread over the remaining three nodes, whereas in the four-node + QDevice setup the two remaining nodes need to handle all of the failed nodes' load. So you'd need to keep at least 50% of each node's capacity free if you want to be prepared for a two-node failure; in the five-node case, 40% free capacity would be fine even in the worst case. Even if you only plan for the case where at most one node fully fails, each remaining node has to absorb roughly 33% vs. 25% extra load when the failed node's storage/compute load is taken over by the remaining three or four nodes, respectively.
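
As a back-of-the-envelope check of those numbers (assuming load and data are spread evenly across the storage nodes; the helper names are just for illustration):

```python
def free_fraction(nodes: int, failed: int) -> float:
    """Fraction of every node's capacity that must stay free so the
    survivors can absorb the data/load of the failed nodes."""
    return failed / nodes

def extra_load(nodes: int, failed: int) -> float:
    """Relative load increase seen by each surviving node."""
    return failed / (nodes - failed)

for nodes, failed in [(4, 2), (5, 2), (4, 1), (5, 1)]:
    print(f"{nodes} nodes, {failed} failed: keep {free_fraction(nodes, failed):.0%} free, "
          f"survivors see +{extra_load(nodes, failed):.0%} load")

# 4 nodes, 2 failed: keep 50% free, survivors see +100% load
# 5 nodes, 2 failed: keep 40% free, survivors see +67% load
# 4 nodes, 1 failed: keep 25% free, survivors see +33% load
# 5 nodes, 1 failed: keep 20% free, survivors see +25% load
```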

Also, Ceph pools are recommended to run with replica size/min = 3/2, meaning three copies of every object, and two copies successfully written before any write returns OK to the client. As a QDevice does not provide any real service, in the two-node-failure case you'd be left with two nodes but want three copies. Ceph does not like this much, as with the default failure domain settings it tries to spread copies over different OSDs (disks) and hosts to reduce the likelihood of all copies being destroyed if one or two nodes fail completely.
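
As a rough sketch of what that means in practice (assuming a replicated 3/2 pool with the default host failure domain, i.e. one copy per host; `pool_state` is just an illustrative helper, not a Ceph API):

```python
def pool_state(failed_hosts: int, size: int = 3, min_size: int = 2) -> str:
    """Worst-case pool state right after `failed_hosts` nodes die,
    assuming each of the `size` copies sits on a different host."""
    surviving = max(size - failed_hosts, 0)  # worst case: every failed host held a copy
    if surviving >= min_size:
        return "I/O continues"
    if surviving >= 1:
        return "I/O blocked until recovery (below min_size)"
    return "all copies gone"

for hosts in (4, 5):
    for failed in (1, 2):
        healable = hosts - failed >= 3       # enough hosts left to restore 3 copies
        print(f"{hosts} hosts, {failed} failed: {pool_state(failed)}, "
              f"full redundancy restorable: {healable}")

# 4 hosts, 1 failed: I/O continues, full redundancy restorable: True
# 4 hosts, 2 failed: I/O blocked until recovery (below min_size), full redundancy restorable: False
# 5 hosts, 1 failed: I/O continues, full redundancy restorable: True
# 5 hosts, 2 failed: I/O blocked until recovery (below min_size), full redundancy restorable: True
```

Note the last two lines: only with five hosts can Ceph restore the full three copies on its own after a two-node failure.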

IMO, it's better to run more, slightly smaller nodes than a few huge ones. Similarly with OSD disk size: it's tempting to just use a few huge disks and be done, but using smaller ones not only increases performance (a higher IOPS budget) but also means that if one OSD fails there's less data to re-balance. Naturally one needs to strike a trade-off, so it would be good to have a rough idea of the initial data usage required for your workload and the expected year-to-year growth in data usage.
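
To illustrate the re-balancing point (hypothetical disk sizes and utilisation, even data spread assumed):

```python
node_raw_tib = 96      # hypothetical raw flash capacity per storage node
utilisation = 0.5      # how full the OSDs are on average

for osd_size_tib in (16, 8, 4):
    osds_per_node = node_raw_tib // osd_size_tib
    to_rebalance = osd_size_tib * utilisation   # data sitting on the failed OSD
    print(f"{osds_per_node} OSDs of {osd_size_tib} TiB per node: "
          f"~{to_rebalance:.0f} TiB to re-balance if one OSD fails")

# 6 OSDs of 16 TiB per node: ~8 TiB to re-balance if one OSD fails
# 12 OSDs of 8 TiB per node: ~4 TiB to re-balance if one OSD fails
# 24 OSDs of 4 TiB per node: ~2 TiB to re-balance if one OSD fails
```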

A huge advantage of Ceph is how scalable it is: you can start out with a three-node cluster providing just a few TiB of space and end up with 15 nodes and hundreds of TiB, all without any downtime.

But Ceph also needs a bit of compute power to handle the data flow and those nice re-balancing features, so if you want to converge compute and storage you need to keep that in mind too; an extra node can really help to take off steam.

In any case, I'd recommend checking out our Ceph docs and the relatively recent performance paper:
https://pve.proxmox.com/pve-docs/chapter-pveceph.html
https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516/
 
