proxmox ceph minimum reasonable OSD?

syadnom

Renowned Member
Oct 19, 2009
So after having a ton of issues getting glusterfs working, I'm exploring ceph. My 2 primary hosts have 6 drives each.

My dilemma here is that if I install Proxmox on 2 drives in a mirror, I waste a TON of space and only have 4 drives left per host. That means a total of 8 OSDs with Ceph.

I'm tempted to install Proxmox on a pair of USB keys instead, to have all 6 drives available for 12 OSDs.

Is this too small a number? Red Hat suggests 50 OSDs here.
 
You are correct that GlusterFS is probably not a stable solution today for VM storage. Your "problems", though undescribed, are not surprising.

With Ceph, however, you have a bigger problem to consider than the minimum number of OSDs. Running Ceph on 2 hosts will likely not give you the resiliency you are looking for. You can do it, but because of the way quorum works on the MONs you'll find that the loss of a single host likely makes the cluster unavailable. Probably not what you are looking for.

In order to ensure a quorum can be maintained in the face of node failure you need an odd number of MONs, and at least 3 of them. Minimum Ceph configs with predictable resiliency require at least 3 nodes. Practical deployments that can ensure completely stable operation with a node failed require at least 5. Deployments that provide efficient disk utilization using erasure codes (again - realistic deployments) require at least 9.
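To make the quorum arithmetic concrete - a MON quorum is a simple majority, i.e. floor(MONs/2) + 1:

2 MONs -> quorum of 2, losing either MON stops the cluster
3 MONs -> quorum of 2, one MON can fail
5 MONs -> quorum of 3, two MONs can fail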

Ceph is a scale out solution. It doesn't "scale down" very well for small deployments.
 
You can run reasonably with a small number of OSDs. 12 would be fine. Assuming they will be on the two "primary" nodes you'll want to set replication to 2 (size = 2) and min_size to 1, which will allow you to keep writing with one node down. Not really ideal but workable. You'll need a MON on both of these same nodes and a third MON on the third node you mention above.
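If you go that route, the pool settings are just (sketch - "rbd" here stands in for whatever pool you actually create):

ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1
ceph osd pool get rbd size    # verify

or set "osd pool default size = 2" and "osd pool default min size = 1" in ceph.conf before creating the pools.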

Not completely ideal - but will be better than your experience with GlusterFS.
 
Hi,
I would not use a replication of two! There are good reasons why Ceph increased the default replication from two to three (years ago).

If you have 3 nodes, use all of them for OSDs too.
And forget the USB sticks - Ceph with three nodes is not the fastest storage anyway - with USB sticks you will get very slow storage and your data will be lost in a very short time (all writes are done 6 times: 3 replicas * (journal + disk))!
This is the reason why a journal SSD should be an "enterprise data center" ready one, like the Intel DC S3700.
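To spell out where the "6 times" comes from (assuming filestore with the journal on the same device as the data): each client write is replicated 3 times, and each replica is written once to the journal and once to the data partition, so 3 * 2 = 6 device writes per client write.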

Udo
 
Hi Udo

I understand the reasons that Sage et al. increased the default to size=3 some time ago. We're testing with a small SSD cluster (6 nodes, 24 x 800GB Intel enterprise SSDs + NVMe journals in each node) and are considering using size=2. Our reasoning is that backfills are almost instant if we down a node, and we monitor the SSDs' SMART stats religiously. Size=3 was meant to mitigate data corruption on spinners, plus the real chance of losing a second spinner while the cluster was rebalancing - right?

I know you are a Ceph expert - what are your thoughts on using size=2 in clusters with no moving parts?

Thanks!
 
Hi,
yes, the "replica=3" recommendation is for normal non-RAIDed spinning disks (the default for most Ceph installations).
In your case I think replica=2 is OK, because the rebuild time is not very long.
For a 4TB disk a rebuild takes more than 10 hours (it depends on many things), and if any other OSD dies during that time (with replica=2) you will have data loss!
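Rough back-of-the-envelope for that number: rebuilding a full 4 TB disk at roughly 100 MB/s (optimistic while client I/O is running) is 4,000,000 MB / 100 MB/s = 40,000 s, i.e. about 11 hours - and in practice often longer.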

Udo
 
I wouldn't really recommend what @syadnom is doing either. A two-node Ceph cluster (even if it has a third MON to manage quorum) won't be a very satisfying experience. But he didn't really ask if he should do it - he asked if he could do it. And if he has his OSDs spread over two nodes there is little real benefit from replica=3 (in fact, it's probably a really bad idea).

In general, as I said in my first reply, the practical minimum is 3 nodes. It really takes at least 5 to be effective. More if you are doing erasure codes (practical minimum is probably 9 nodes with OSDs to get any benefit from it and still maintain resiliency).
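For example (purely illustrative values), a k=6/m=3 erasure-code profile splits each object into 6 data and 3 coding chunks - with the failure domain set to host that is 9 chunks on 9 different hosts before the placement rule can even be satisfied:

ceph osd erasure-code-profile set ec63 k=6 m=3 crush-failure-domain=host
ceph osd pool create ecpool 128 128 erasure ec63

(Exact parameter names vary a little between Ceph releases.)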

But - if he goes forward with his proposed use case, he probably wants to set size (replicas) = 2 and min_size = 1.
 
