CephFS EC Pool recommendations

Proxmox India · Jan 25, 2019

Hi,

I have 6 nodes with 2 SSDs each, which makes 12 SSDs in total.

What will give me good capacity and resilience against failures?

I am undecided between:
EC21, i.e. K=2, M=1: 66% capacity
EC42, i.e. K=4, M=2: 66% capacity
EC32, i.e. K=3, M=2: 60% capacity
EC43, i.e. K=4, M=3: 57% capacity

Does choosing EC43 mean I need to have 7 hosts? I am using crush-failure-domain=host.
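As a rough sanity check on the numbers above, here is a minimal Python sketch. It assumes that with crush-failure-domain=host every one of the k+m chunks of a placement group must land on a different host, so a profile only fits if k+m does not exceed the host count. The function name `ec_summary` and the `HOSTS` constant are hypothetical, chosen just for illustration.

```python
from fractions import Fraction

def ec_summary(k, m, hosts):
    """Return usable fraction, hosts required, and whether the profile fits."""
    chunks = k + m  # assumption: one chunk per host with crush-failure-domain=host
    return {
        "usable": Fraction(k, chunks),   # share of raw capacity that holds data
        "hosts_needed": chunks,          # minimum hosts to place every chunk
        "fits": chunks <= hosts,
    }

HOSTS = 6  # hypothetical constant matching the 6-node cluster in this thread
for k, m in [(2, 1), (4, 2), (3, 2), (4, 3)]:
    s = ec_summary(k, m, HOSTS)
    print(f"k={k} m={m}: usable={float(s['usable']):.0%}, "
          f"needs {s['hosts_needed']} hosts, fits on {HOSTS}: {s['fits']}")
```

Under that assumption, a 4+3 profile would indeed need 7 hosts and so would not fit on 6.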

I don't mind having 57% vs 66% as long as I get better reliability.

If I set it to EC21, I am able to have 2 hosts down out of 6 and still have good performance and no loss of data.

Please advise, folks.

sseidel · Jan 25, 2019

It's been a while since I set our EC pool up, but I think K=2, M=1 won't work because it's not spread out enough to work with one failed host. We have k=4 and m=2, which I think is the minimum (we have 2 or 3 SSDs per host and 5 hosts). It works well, but there's not a lot more I can say about it.

With your setup, I think 4/2 will not allow 2 hosts to be down, because 2 hosts = 4 OSDs, so your M should be at least 4 if you want to tolerate 2 down hosts. 4/4 might work and while it's only 50% capacity, it's still a lot better than the 33% with the standard replicated rule size=3.

Proxmox India · Jan 26, 2019

Thanks sseidel,

2 queries.

a. What if we have staggered failures? Say one node is down, and while we are removing the failed node there is a loose contact and 2 more nodes have a power issue or something similar, which means we now have 3 nodes down. Would we survive this with EC43 or EC44?
Bringing the nodes back would take, let's say, 1-2 days, which means 3 hosts * 2 OSDs = 6 OSDs down out of 12.

b. Also, if we are to shift the cluster to another location, we will need to shut down all VMs and all the hosts and restart them at the new location. In that situation, if we shut down the hosts one by one, will the Ceph cluster come back up properly? It would see failed OSDs, because we cannot shut down all servers at the same time: you need to select each host in the GUI and give the shutdown command, which takes a few seconds each.

sseidel · Jan 28, 2019

Somebody can correct me if this is wrong, but to my understanding if you want to survive with 6 OSD down then you need m=6.

Shutdown/startup usually works fine if you follow proper procedures: first shut down all clients (VMs), then shut down all servers. Then start up again. Ceph will not start working if there is no quorum, but once it sees enough OSDs and MONs it will start recovery and everything should be fine. Setting the noout flag will help because then there'll be no useless shifting of data. (We have noout on all the time because we have monitoring on the cluster and so we can react if something goes wrong.)
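For the noout step, a small Python wrapper around the ceph CLI might look like the sketch below. It relies on the `ceph osd set noout` and `ceph osd unset noout` commands; the helper name `set_noout` and its `dry_run` switch are hypothetical, and the sketch assumes it runs on a node with a working ceph client and admin keyring.

```python
import subprocess

def set_noout(enable, dry_run=True):
    """Set or clear the cluster-wide noout flag via the ceph CLI.

    With dry_run=True the command is only returned, not executed.
    """
    cmd = ["ceph", "osd", "set" if enable else "unset", "noout"]
    if not dry_run:
        # Raises CalledProcessError if the ceph command fails
        subprocess.run(cmd, check=True)
    return cmd

# Before the planned shutdown (VMs already stopped):
print(" ".join(set_noout(True)))
# After all hosts are back up and the OSDs have rejoined:
print(" ".join(set_noout(False)))
```

Keeping the default at dry-run makes it easy to review the commands before running them against a live cluster.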

Alwin · Jan 28, 2019

While the cluster may be able to sustain a 50% loss, it will most likely not recover or likely lose data. Besides Ceph's data, if nodes are going down, the cluster will lose quorum on corosync (depending on which hosts failed, even MONs). So HA wouldn't work either.
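The quorum point can be illustrated with a tiny Python sketch. It assumes a plain strict-majority rule with one vote per node and no external tie-breaker (such as a QDevice); the function name `has_quorum` is hypothetical.

```python
def has_quorum(total_votes, votes_up):
    """Strict majority: more than half of all votes must be present."""
    return votes_up > total_votes // 2

# 6 nodes, one vote each (assumption): how many nodes may be down?
for down in range(0, 4):
    up = 6 - down
    print(f"{down} node(s) down, {up} up: quorum={has_quorum(6, up)}")
```

Under these assumptions a 6-node cluster keeps quorum with 2 nodes down but loses it with 3 down, which is the 50% loss case discussed here.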

sseidel · Jan 28, 2019

Alwin said:
While the cluster may be able to sustain a 50% loss, it will most likely not recover or likely lose data. Besides Ceph's data, if nodes are going down, the cluster will lose quorum on corosync (depending on which hosts failed, even MONs). So HA wouldn't work either.

Yes, it would be an interesting experiment (but not more!) to have a high redundancy, then take 3 of the servers and start them up at one location and the other 3 in a separate network, and see if the data can be read in both clusters.

alexskysilk · Jan 28, 2019

sseidel said:
Somebody can correct me if this is wrong, but to my understanding if you want to survive with 6 OSD down then you need m=6.

xK+ym refers to the stripe arrangement; in a node failure domain, the total (x+y) must not exceed the number of nodes, not OSDs. You can survive 6 OSD failures if they all happen to be on the same node; even if they're not, if you have a minimum of 6 nodes you'd still survive the fault. If you have fewer nodes, SOME files or objects would suffer.

In your case, the only arrangements that would make sense are 4k+2m or 3k+3m; the former has more usable capacity, the latter is more fault tolerant.
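To make the trade-off between 4k+2m and 3k+3m concrete, here is a minimal Python sketch that checks whether data stays readable after a number of whole hosts fail. It assumes a host failure domain (one chunk per host) and that any k of the k+m chunks are enough to reconstruct an object; the function name `survives_host_failures` is hypothetical. It only counts chunks, so whether the pool keeps serving I/O also depends on pool settings such as min_size.

```python
from itertools import combinations

def survives_host_failures(k, m, hosts, hosts_down):
    """True if every possible set of failed hosts leaves at least k chunks."""
    chunks = k + m
    if chunks > hosts:
        raise ValueError("k+m exceeds the number of hosts")
    # Assume one PG's chunks sit on hosts 0..chunks-1 (one chunk per host).
    placement = set(range(chunks))
    for failed in combinations(range(hosts), hosts_down):
        if len(placement - set(failed)) < k:
            return False  # some failure pattern leaves too few chunks
    return True

for k, m in [(4, 2), (3, 3)]:
    for down in (1, 2, 3):
        ok = survives_host_failures(k, m, hosts=6, hosts_down=down)
        print(f"{k}k+{m}m, {down} host(s) down: readable={ok}")
```

In this model the profile tolerates as many failed hosts as it has coding chunks (m), which is why 3k+3m rides out three hosts down and 4k+2m does not.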

-edit-
On reflection, I think it's worth noting that the above is ACADEMIC. The likelihood of 6 OSDs dying at the same time without a node failure is astronomically low, enough to not rate in the discussion. When OSDs fail you will either replace them or redistribute the PGs, or both.
