CephFS EC Pool recommendations

Proxmox India

Member
Oct 16, 2017
46
3
13
48
Bangalore
Hi,

I have 6 nodes with 2 SSDs each, which makes 12 SSDs in total.

Which erasure-coding profile will give me good capacity and resilience against failures?

I am undecided between:
EC21 --- i.e. K=2, M=1 - 66% capacity
EC42 --- i.e. K=4, M=2 - 66% capacity
EC32 --- i.e. K=3, M=2 - 60% capacity
EC43 --- i.e. K=4, M=3 - 57% capacity

Does choosing EC43 mean I need 7 hosts? I am using crush-failure-domain=host.

I don't mind having 57% instead of 66% as long as I get better reliability.

As I understand it, if I set it to EC21 I can have 2 hosts down out of 6 and still have good performance and no loss of data.

Please suggest, folks.
 

sseidel

Active Member
Jul 8, 2015
49
7
28
It's been a while since I set our EC pool up, but I think K=2, M=1 won't work because it's not spread out enough to work with one failed host. We have k=4 and m=2, which I think is the minimum (we have 2 or 3 SSDs per host and 5 hosts). It works well, but there's not a lot more I can say about it.

With your setup, I think 4/2 will not allow 2 hosts to be down, because 2 hosts = 4 OSDs, so your M should be at least 4 if you want to tolerate 2 down hosts. 4/4 might work, and while it's only 50% capacity, that is still a lot better than the 33% you get with the standard replicated rule (size=3).
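For reference, the capacity figures in this thread follow from k / (k + m) for an erasure-coded pool and 1 / size for a replicated pool. A minimal Python sketch of that arithmetic (the function names are made up for illustration, and real usable space will be somewhat lower because of overhead and free-space headroom):

```python
from fractions import Fraction


def ec_usable_fraction(k: int, m: int) -> Fraction:
    """Fraction of raw capacity that holds data in a k+m erasure-coded pool."""
    return Fraction(k, k + m)


def replicated_usable_fraction(size: int) -> Fraction:
    """Fraction of raw capacity that holds data in a replicated pool."""
    return Fraction(1, size)


# Profiles mentioned in the thread, as (k, m) pairs.
profiles = {
    "EC21": (2, 1),
    "EC42": (4, 2),
    "EC32": (3, 2),
    "EC43": (4, 3),
    "EC44": (4, 4),
}

for name, (k, m) in profiles.items():
    usable = float(ec_usable_fraction(k, m))
    print(f"{name}: k={k} m={m} usable={usable:.1%}")

print(f"replicated size=3: usable={float(replicated_usable_fraction(3)):.1%}")
```

This only covers capacity; how many hosts a profile needs and how many failures it tolerates is a separate question.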
 

Proxmox India

Thanks sseidel,

Two questions.

a. What if we have staggered failures? Say one node is down and, while we are trying to remove the failed node, a loose contact causes 2 more nodes to lose power or something similar. That leaves 3 nodes down. Would we survive this with EC43 or EC44?
Bringing the nodes back would take, let's say, 1-2 days, which means 3 hosts * 2 OSDs = 6 OSDs down out of 12.

b. Also, if we are to shift the cluster to another location, we will need to shut down all VMs and all hosts and restart them at the new location. If we shut down the hosts one by one, will the Ceph cluster come back up properly? It would see failed OSDs, because we cannot shut down all servers at the same time: you need to select each host in the GUI and issue the shutdown command, which takes a few seconds each.
 

sseidel

Somebody can correct me if this is wrong, but to my understanding, if you want to survive with 6 OSDs down then you need m=6.

Shutdown/startup usually works fine if you follow the proper procedure: first shut down all clients (VMs), then shut down all servers. Then start everything up again. Ceph will not start working while there is no quorum, but once it sees enough OSDs and MONs it will start recovery and everything should be fine. Setting the noout flag (ceph osd set noout) will help, because then there'll be no useless shifting of data. (We have noout set all the time because we have monitoring on the cluster, so we can react if something goes wrong.)
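The procedure described above could be written down as an ordered checklist. The following is only a sketch in Python that builds the list of steps and does not execute anything; the host names are hypothetical, and the only real commands in it are `ceph osd set noout`, `ceph osd unset noout` and `ceph -s`:

```python
def shutdown_plan(hosts):
    """Ordered steps for a planned full-cluster shutdown (nothing is executed)."""
    steps = [
        "stop all VMs and other Ceph clients",
        "ceph osd set noout",  # stop Ceph from marking down OSDs out and moving data
    ]
    steps += [f"shut down host {host}" for host in hosts]
    return steps


def startup_plan(hosts):
    """Ordered steps for bringing the cluster back at the new location."""
    steps = [f"power on host {host}" for host in hosts]
    steps += [
        "wait for MON quorum and all OSDs up (check with: ceph -s)",
        "ceph osd unset noout",  # restore normal behaviour for real failures
        "start VMs",
    ]
    return steps


# Hypothetical host names, used for illustration only.
hosts = ["pve1", "pve2", "pve3", "pve4", "pve5", "pve6"]
for step in shutdown_plan(hosts) + startup_plan(hosts):
    print(step)
```

The point of the ordering is that noout is set before the first host goes down and only cleared after all OSDs are back, so the few seconds between individual host shutdowns do not trigger rebalancing.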
 

Alwin

Proxmox Retired Staff
Retired Staff
Aug 1, 2017
4,617
457
88
While the cluster may be able to sustain a 50% loss, it will most likely not recover, or it will lose data. Besides Ceph's data, if nodes are going down, the cluster will lose quorum on corosync (and, depending on which hosts failed, even on the MONs). So HA wouldn't work either.
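To illustrate the quorum point: both corosync and the Ceph MONs need a strict majority of votes. A minimal Python sketch, assuming the default of one vote per node and no QDevice or other tie-breaker (the function name is made up for illustration):

```python
def has_quorum(total_votes: int, votes_present: int) -> bool:
    """True if the surviving members hold a strict majority of all votes."""
    return votes_present >= total_votes // 2 + 1


# A 6-node cluster, as in this thread, with one vote per node (assumption).
TOTAL_NODES = 6
for nodes_down in range(0, 4):
    alive = TOTAL_NODES - nodes_down
    print(f"{nodes_down} node(s) down: quorum={has_quorum(TOTAL_NODES, alive)}")
```

With 6 votes the majority is 4, so a cluster that has lost 3 of 6 nodes is no longer quorate, regardless of how much redundancy the EC profile provides.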
 

sseidel

Alwin said: "While the cluster may be able to sustain a 50% loss, it will most likely not recover, or it will lose data. Besides Ceph's data, if nodes are going down, the cluster will lose quorum on corosync (and, depending on which hosts failed, even on the MONs). So HA wouldn't work either."

Yes, it would be an interesting experiment (but not more!) to set up high redundancy, then start 3 of the servers at one location and the other 3 in a separate network, and see if the data can be read in both clusters :confused:
 

alexskysilk

Renowned Member
Oct 16, 2015
803
105
63
Chatsworth, CA
www.skysilk.com
sseidel said: "Somebody can correct me if this is wrong, but to my understanding, if you want to survive with 6 OSDs down then you need m=6."

The xk+ym notation refers to the stripe arrangement; with a node failure domain, k+m must not exceed the number of nodes, not the number of OSDs. You can survive 6 OSD failures if they all happen to be on the same node; even if they're not, if you have a minimum of 6 nodes you'd still survive the fault. If you have fewer nodes, SOME files or objects would suffer.

In your case, the only arrangements that would make sense are 4k+2m or 3k+3m; the former has more usable capacity, the latter is more fault tolerant.
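The host-count rule above can be checked with a small Python sketch. It assumes crush-failure-domain=host, so each of the k+m chunks lands on a different host and up to m hosts can be lost without losing data; whether the pool keeps serving I/O in that state also depends on its min_size, which is ignored here. The function names and the profile name `ec42-host` are hypothetical; the `ceph osd erasure-code-profile set` command itself is the regular Ceph CLI:

```python
def evaluate(k: int, m: int, hosts: int) -> dict:
    """Check a k+m profile against a host count (failure domain = host)."""
    needed = k + m  # one chunk per host
    return {
        "fits": needed <= hosts,
        "hosts_needed": needed,
        "host_failures_tolerated": m,  # data survives, I/O may still pause
        "usable_fraction": k / needed,
    }


def profile_command(name: str, k: int, m: int) -> str:
    """Build (but do not run) the command that would define the profile."""
    return (
        f"ceph osd erasure-code-profile set {name} "
        f"k={k} m={m} crush-failure-domain=host"
    )


HOSTS = 6
for k, m in [(4, 2), (3, 3), (4, 3), (4, 4)]:
    result = evaluate(k, m, HOSTS)
    print(
        f"{k}k+{m}m: fits={result['fits']} "
        f"hosts_needed={result['hosts_needed']} "
        f"host_failures_tolerated={result['host_failures_tolerated']} "
        f"usable={result['usable_fraction']:.0%}"
    )

print(profile_command("ec42-host", 4, 2))
```

On 6 hosts this leaves 4k+2m and 3k+3m as the candidates that use all hosts, while 4k+3m and 4k+4m would need 7 and 8 hosts respectively.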

-edit-
On reflection, I think it's worth noting that the above is ACADEMIC. The likelihood of 6 OSDs dying at the same time without a node failure is astronomically low, low enough not to rate in the discussion. When OSDs fail you will either replace them or redistribute the PGs, or both.
 