Proxmox VE Ceph Server released (beta)

symmcom

Renowned Member
Oct 28, 2012
1,081
30
68
Calgary, Canada
www.symmcom.com
You don't exactly NEED RAID controllers for Ceph to function.

- You need RAID controllers if you want more disks in your system than the mainboard's controller offers.
- You can use a RAID controller to benefit from its battery-backed read/write cache. Do note that regular hard drives already have their own memory cache, just on a much smaller scale than what controllers have. They really only exist because non-SSD drives are extremely slow...

Also, while we tend to call them RAID controllers, they really aren't. They are just additional disk controllers that happen to implement some RAID levels (which Ceph neither needs nor wants). With Ceph you don't use the RAID functionality in any way, shape or form. This single-disk RAID0 is really just a trick to present individual disks to Ceph while still making use of the controller's cache (which JBOD mode typically doesn't allow).
Very nicely put, MO! Ceph indeed diminishes the need for a RAID setup. I myself use a combination of RAID and expander cards to get JBOD for Ceph. I am not sure whether a battery-backed cache is needed, since Ceph can heal itself pretty well. This was my primary concern when I was introduced to Ceph. I ran several tests simulating complete Ceph node failure but never had any data issues.

If we are talking about VM caching in Proxmox, such as writeback or writethrough, that is of course a different story. Those modes have nothing to do with the RAID cache we are talking about here.
 

Sakis

Active Member
Aug 14, 2013
121
3
38
I am planning a 3-node Ceph-Proxmox cluster. I am trying to work out which replication count is best for my setup, and I am not sure about one thing.

Each node will have 4 OSDs (4 x 4TB SATA disks), which means a 48TB Ceph cluster in total.

If I use 3 times replication:
The usable space will be 16TB. Since I should stop at 85%, I have 13.6TB max data inside the pools. Let's say I use 10TB in a future scenario.
Boom, one CPU melts. Proxmox HA works like a charm. Will the Ceph cluster still be working 100% and rebalance?
By my calculations I will now have a Ceph cluster with 32TB of space, which means about 10.7TB usable, which at 85% means about 9TB of usable data. What will happen to the 1TB missing from my data? Sounds like a corruption disaster to me, or I am missing something crucial about how Ceph works.

So if I choose 3 times replication and want to be sure of Ceph and Proxmox cluster availability with 1 node down, I must use at most approximately 9TB. Correct? (That's a huge loss from 48TB of data drives.)

If I use 2 times replication, things get better.
The usable space will be 24TB. Stopping at 85% means 20.4TB max data inside the pools. If I now use 10TB of data, I won't have a problem if a node disappears, because it will leave me with a 32TB Ceph cluster, 16TB usable, and 85% of that is 13.6TB max data, which is OK with 10TB of used data.

So, if I am correct again, if I want to afford one node of downtime I can choose 2 times replication and gain some usable space. And if I want to be crazy and retain my data at all costs (even though the Proxmox cluster will lose quorum), I should use 3 times replication and use up to 4.5TB of data. (16TB Ceph cluster, 5.33TB usable due to 3 times replication, 85% = 4.5TB max data)
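
The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope calculation under the post's own assumptions (equal-sized OSDs, raw space divided by the replica count, an 85% fill limit). The function name and parameters are made up for illustration, and it does not model how Ceph actually places data.

```python
def max_data_tb(nodes, osds_per_node, osd_size_tb, replicas, fill_limit=0.85):
    """Rough ceiling on stored data, in TB.

    Assumes raw capacity is simply divided by the replica count and
    that the pools should stay below `fill_limit` (85% in the post).
    """
    raw_tb = nodes * osds_per_node * osd_size_tb
    usable_tb = raw_tb / replicas
    return usable_tb * fill_limit


# Scenario from the post: 4 x 4 TB OSDs per node, 3 nodes vs. 2 surviving nodes.
for replicas in (3, 2):
    healthy = max_data_tb(3, 4, 4, replicas)
    one_node_down = max_data_tb(2, 4, 4, replicas)
    print(f"{replicas}x replication: {healthy:.1f} TB with 3 nodes, "
          f"{one_node_down:.1f} TB with 2 nodes")
```

Changing `fill_limit` or the OSD size makes it easy to try other layouts before buying disks.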

Am I missing something?

Thank you
 

symmcom

In my opinion you should use the N(nodes)-1 formula to decide how many replicas you need. You have 3 nodes, so you should consider using 3-1=2 replicas. This will give you 24TB of usable space while still providing good performance.

From the rest of your message, I am not sure what you are asking. Are you afraid of losing data because data gets overwritten? Ceph will automatically stop writing to an OSD when it approaches a critical shortage of space, so you will never overwrite or lose data. If you lose too many OSDs or nodes and lose quorum, Ceph will simply stop writing data altogether. When you add more OSDs or nodes, it will start rebalancing again. Simply put, the only time you will lose any data is when you lose all replicas of it. For example, in a 3-node cluster with 2 replicas, if 2 of your nodes fail totally, with the physical nodes and all the OSDs in them going up in smoke, then you are facing massive data loss. But this is the worst of worst-case scenarios and will hardly ever occur. In such a case your only option will be to recover data from backup storage.
For a 3-node Ceph setup with 3 replicas, you can afford to lose 2 nodes completely and still have all data. So it really comes down to how valuable your data is and how effective your backup system is.
 

spirit

Famous Member
Apr 2, 2010
4,295
286
103
www.odiso.com
In my opinion you should use N(nodes)-1 formula to decide how many replicas you need
I think it also depends on how many disks you have in your nodes, and the probability of losing 2 disks at the same time.

Note that when you lose a disk, the data will be rebalanced, so if you have a fast network and fast disks, this can be quick.

I'm going to build small 3-node clusters (3x 6 OSDs, 1TB SSDs) with 2x 10G links and 2x replication.
If a disk dies, it will take a few minutes to re-replicate the data.

Now if you have 6TB 5.4k rpm disks with gigabit links, this can be a lot slower. And if your cluster is already under heavy I/O load, it can be even worse.




 

symmcom


I think it also depends on how many disks you have in your nodes, and the probability of losing 2 disks at the same time.
Very true! According to the Ceph guys and my own experience, the higher the number of OSDs, the higher the performance.


Note that when you lose a disk, the data will be rebalanced, so if you have a fast network and fast disks, this can be quick.

I'm going to build small 3-node clusters (3x 6 OSDs, 1TB SSDs) with 2x 10G links and 2x replication.
If a disk dies, it will take a few minutes to re-replicate the data.

Now if you have 6TB 5.4k rpm disks with gigabit links, this can be a lot slower. And if your cluster is already under heavy I/O load, it can be even worse.
To get really good performance out of the rebalancing process, you need more network bandwidth than 1 Gbps. There is simply no substitute for that. The following formula is a good start for estimating how long rebalancing might take:

Disk capacity (gigabits) / (Network bandwidth (Gbps) * (Nodes - 1)) = Recovery time (seconds)

Based on the formula, for a 27TB (27,648GB, or 221,184 gigabits) Ceph cluster with a 1 Gbps network and 3 nodes, we can calculate how long cluster recovery will take:
221,184 / (1 * (3-1)) = 110,592 seconds = 1,843.2 minutes

Same specs but with a 40 Gbps network:
221,184 / (40 * (3-1)) = 2,764.8 seconds = 46.08 minutes

Same specs as above but with 6 nodes:
221,184 / (40 * (6-1)) = 1,105.92 seconds = 18.43 minutes

Clearly, a higher number of nodes and higher network bandwidth are an advantage. This is of course not the whole picture; we still have to take the number of OSDs into consideration, but it provides a rough idea.
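
For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the formula. The function names are invented for this example. It takes capacity in gigabits, as the formula states, so a capacity given in gigabytes is multiplied by 8 first. Like the formula itself, it is only a rough estimate that ignores OSD count, disk speed and existing load.

```python
def gb_to_gbit(gigabytes):
    """Convert gigabytes to gigabits (1 byte = 8 bits)."""
    return gigabytes * 8


def recovery_time_seconds(capacity_gbit, bandwidth_gbps, nodes):
    """Capacity / (bandwidth * (nodes - 1)), as in the formula above."""
    if nodes < 2:
        raise ValueError("need at least 2 nodes")
    return capacity_gbit / (bandwidth_gbps * (nodes - 1))


capacity = gb_to_gbit(27 * 1024)  # 27 TB expressed in gigabits
for bandwidth_gbps, nodes in [(1, 3), (40, 3), (40, 6)]:
    seconds = recovery_time_seconds(capacity, bandwidth_gbps, nodes)
    print(f"{bandwidth_gbps} Gbps, {nodes} nodes: "
          f"{seconds:.2f} s = {seconds / 60:.2f} min")
```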

Currently I am converting a 1 Gbps Ceph cluster network to 40 Gbps InfiniBand. I will be sure to post some results once it is all done.
 

spirit

symmcom said:
Currently I am converting a 1 Gbps Ceph cluster network to 40 Gbps InfiniBand. I will be sure to post some results once it is all done.
On my side, I'm going to build a Ceph cluster on 40 Gbit next year too, but with Ethernet, using a Mellanox switch (SX1012).

It costs around 5000€ for 16 ports at 40 Gbit or 48 ports at 10 Gbit (with splitter cables, so it's possible to mix 40 Gbit for Ceph nodes and 10 Gbit for clients on the same switch).

I'm also waiting for RDMA support in Ceph, to see the difference compared with IP.
 

Sakis

Thank you for the answers,

This is the question I am mostly seeking an answer for.

If I use 3 times replication:
The usable space will be 16TB. Since I should stop at 85%, I have 13.6TB max data inside the pools. Let's say I use 10TB in a future scenario.
Boom, one CPU melts. Proxmox HA works like a charm. Will the Ceph cluster still be working 100% and rebalance?
By my calculations I will now have a Ceph cluster with 32TB of space, which means about 10.7TB usable, which at 85% means about 9TB of usable data. What will happen to the 1TB missing from my data? Sounds like a corruption disaster to me, or I am missing something crucial about how Ceph works.
In a scenario where 1 node dies completely in a 3-node cluster with 3 times replication, and the Ceph pools were almost full before the disaster, will the remaining OSDs have space for the rebalance? According to my calculations, I believe not. During the rebalance, won't the cluster recreate all the data 3 times? That would exceed the max space available afterwards.
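
To make the question concrete, here is a small Python check with a made-up helper name. It takes the premise of the question at face value (all replicas are recreated on the surviving nodes and the 85% limit still applies), which is an assumption of this sketch and not a statement about how Ceph actually behaves.

```python
def fits_after_node_loss(data_tb, nodes, osds_per_node, osd_size_tb,
                         replicas, fill_limit=0.85):
    """True if `data_tb` still fits under the fill limit with one node gone.

    Pure arithmetic: remaining raw space / replicas * fill_limit.
    """
    remaining_raw_tb = (nodes - 1) * osds_per_node * osd_size_tb
    limit_tb = remaining_raw_tb / replicas * fill_limit
    return data_tb <= limit_tb


# 10 TB stored on 3 nodes with 4 x 4 TB OSDs each, one node lost.
print(fits_after_node_loss(10, 3, 4, 4, replicas=3))  # False
print(fits_after_node_loss(10, 3, 4, 4, replicas=2))  # True
```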
 
According to the docs, I need to run the pveceph init command on just one node and the config will be spread to the other servers... But in practice, this does not work. I need to run pveceph init and createmon on each server to make Ceph work properly...
So this confuses me: do I need to run it on only one server or on each one? Or am I doing something wrong?
 

symmcom

According to the docs, I need to run the pveceph init command on just one node and the config will be spread to the other servers... But in practice, this does not work. I need to run pveceph init and createmon on each server to make Ceph work properly...
So this confuses me: do I need to run it on only one server or on each one? Or am I doing something wrong?
You DO NOT need to run #pveceph init on all nodes, just one! For MONs, you need to create 2 MONs on 2 nodes through the CLI, and the rest of the MONs can be installed from the GUI.
 
You DO NOT need to run #pveceph init on all nodes, just one! For MONs, you need to create 2 MONs on 2 nodes through the CLI, and the rest of the MONs can be installed from the GUI.
So I must be doing something wrong, because here I have deployed 3 VirtualBox machines with Proxmox, just as a lab for research, and I needed to run pveceph init blablabla... on each server...
I will test it again...
 

symmcom

So I must be doing something wrong, because here I have deployed 3 VirtualBox machines with Proxmox, just as a lab for research, and I needed to run pveceph init blablabla... on each server...
I will test it again...
The only thing pveceph init does is create the Ceph cluster and write the ceph.conf file and keys. Do you get any error message when you run pveceph init or createmon?
 
The only thing pveceph init does is create the Ceph cluster and write the ceph.conf file and keys. Do you get any error message when you run pveceph init or createmon?
These are my steps:

1 - Install Proxmox on three VirtualBox machines
2 - Update every VirtualBox machine
3 - Assemble the cluster
4 - Ran pveceph install -version firefly
5 - On node1: pveceph init --network 10.10.10.0/24
6 - On each node I ran pveceph createmon, but only the first one, node1, allowed me to create a MON... On the others, I received a message that Ceph is not initialized yet, or something similar...

I am doing a fresh installation right now to try it again...

I will report back as soon as possible...
 

symmcom

These are my steps:

1 - Install Proxmox on three VirtualBox machines
2 - Update every VirtualBox machine
3 - Assemble the cluster
4 - Ran pveceph install -version firefly
5 - On node1: pveceph init --network 10.10.10.0/24
6 - On each node I ran pveceph createmon, but only the first one, node1, allowed me to create a MON... On the others, I received a message that Ceph is not initialized yet, or something similar...

I am doing a fresh installation right now to try it again...

I will report back as soon as possible...
The only reason I can think of for the other nodes saying Ceph is not initialized is that the nodes are not talking to each other to sync cluster data. The command #pveceph init creates the Ceph cluster config file ceph.conf in /etc/pve, which should be available to all nodes in the same cluster. Maybe check whether all nodes are part of the Proxmox cluster with #pvecm nodes, or try to ping them from each other.

Also, you mentioned you are using VirtualBox; did you add a 2nd vNIC on all VMs? Is 10.10.10.0/24 your primary network?
 
The only reason I can think of for the other nodes saying Ceph is not initialized is that the nodes are not talking to each other to sync cluster data. The command #pveceph init creates the Ceph cluster config file ceph.conf in /etc/pve, which should be available to all nodes in the same cluster. Maybe check whether all nodes are part of the Proxmox cluster with #pvecm nodes, or try to ping them from each other.

Also, you mentioned you are using VirtualBox; did you add a 2nd vNIC on all VMs? Is 10.10.10.0/24 your primary network?
I checked, and I see that ceph.conf exists in /etc/pve, but I still get the Ceph not initialized message...
And I'm working here with two NICs in VirtualBox... One for the LAN and one just for cluster/Ceph...
Let me finish the configuration and I will try again here... I'll post again after that...

Thanks
 
