proxmox ceph minimum reasonable OSD?

syadnom

Renowned Member
Oct 19, 2009
So after having a ton of issues getting glusterfs working, I'm exploring ceph. My 2 primary hosts have 6 drives each.

My dilemma here is that if I install Proxmox on 2 drives in a mirror, I waste a TON of space and only have 4 drives left per host. That means a total of 8 OSDs with Ceph.

I'm tempted to install Proxmox on a pair of USB keys instead, to have all 6 drives available for 12 OSDs.

Is this too small a number? Red Hat suggests 50 OSDs here.
 
You are correct that GlusterFS is probably not a stable solution today for VM storage. Your "problems", though undescribed, are not surprising.

With Ceph, however, you have a bigger problem to consider than the minimum number of OSDs. Running Ceph on 2 hosts will likely not give you the resiliency you are looking for. You can do it, but because of the way quorum works on the MONs you'll find that the loss of a single host likely makes the cluster unavailable. Probably not what you are looking for.

In order to ensure a quorum can be maintained in the face of node failure you need an odd number of MONs, and at least 3 of them. Minimum Ceph configs with predictable resiliency require at least 3 nodes. Practical deployments that can ensure completely stable operation with a node failed require at least 5. Deployments that provide efficient disk utilization using erasure codes (again - realistic deployments) require at least 9.
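To make the quorum arithmetic concrete - a MON quorum is a simple majority, i.e. floor(MONs/2) + 1:

2 MONs -> quorum of 2, losing either MON stops the cluster
3 MONs -> quorum of 2, one MON can fail
5 MONs -> quorum of 3, two MONs can fail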

Ceph is a scale out solution. It doesn't "scale down" very well for small deployments.
 
You can run reasonably with a small number of OSDs. 12 would be fine. Assuming they will be on the two "primary" nodes you'll want to set replication to 2 (size = 2) and min_size to 1, which will allow you to keep writing with one node down. Not really ideal but workable. You'll need a MON on both of these same nodes and a third MON on the third node you mention above.
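If you go that route, the pool settings are just (sketch - "rbd" here stands in for whatever pool you actually create):

ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1
ceph osd pool get rbd size    # verify

or set "osd pool default size = 2" and "osd pool default min size = 1" in ceph.conf before creating the pools.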

Not completely ideal - but will be better than your experience with GlusterFS.
 
Hi,
I would not use a replication of two! There are good reasons why Ceph increased the default replication from two to three (years ago).

If you have 3 nodes, use all of them for OSDs too.
And forget the USB sticks - Ceph with three nodes is not the fastest storage anyway - with USB sticks you will get very slow storage and your data will be lost in a very short time (all writes are done 6 times: 3 replicas * (journal + disk))!
This is the reason why a journal SSD should be an "enterprise data center" ready one, like the Intel DC S3700.
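To spell out where the "6 times" comes from (assuming filestore with the journal on the same device as the data): each client write is replicated 3 times, and each replica is written once to the journal and once to the data partition, so 3 * 2 = 6 device writes per client write.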

Udo
 
Hi Udo

I understand the reasons that Sage et al. increased the default to size=3 some time ago. We're testing with a small SSD cluster (6 nodes, 24 x 800GB Intel enterprise SSDs + NVMe journals in each node) and are considering using size=2. Our reasoning is that backfills are almost instant if we down a node, and we monitor the SSDs' SMART stats religiously. Size=3 was meant to mitigate data corruption on spinners, plus the real chance of losing a second spinner while the cluster was rebalancing - right?

I know you are a Ceph expert - what are your thoughts on using size=2 in clusters with no moving parts?

Thanks!
 
Hi,
yes, the "replica=3" recommendation is for normal non-RAIDed spinning disks (the default for most Ceph installations).
In your case I think replica=2 is OK, because the rebuild time is not very long.
For a 4TB disk a rebuild takes more than 10 hours (it depends on many things), and if any other OSD dies during that time (with replica=2) you will have data loss!
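Rough back-of-the-envelope for that number: rebuilding a full 4 TB disk at roughly 100 MB/s (optimistic while client I/O is running) is 4,000,000 MB / 100 MB/s = 40,000 s, i.e. about 11 hours - and in practice often longer.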

Udo
 
I wouldn't really recommend what @syadnom is doing either. A two-node Ceph cluster (even if it has a third MON to manage quorum) won't be a very satisfying experience. But he didn't really ask if he should do it - he asked if he could do it. And if he has his OSDs spread over two nodes there is little real benefit from replica=3 (in fact, it's probably a really bad idea).

In general, as I said in my first reply, the practical minimum is 3 nodes. It really takes at least 5 to be effective. More if you are doing erasure codes (practical minimum is probably 9 nodes with OSDs to get any benefit from it and still maintain resiliency).
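For example (purely illustrative values), a k=6/m=3 erasure-code profile splits each object into 6 data and 3 coding chunks - with the failure domain set to host that is 9 chunks on 9 different hosts before the placement rule can even be satisfied:

ceph osd erasure-code-profile set ec63 k=6 m=3 crush-failure-domain=host
ceph osd pool create ecpool 128 128 erasure ec63

(Exact parameter names vary a little between Ceph releases.)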

But - if he goes forward with his proposed use case, he probably wants to set size (replicas) = 2 and min_size = 1.
 
