Proxmox VE Ceph Server released (beta)

All three of them are correct. #1, as tom mentioned, is for installing Proxmox+Ceph on one node and managing Ceph through the Proxmox GUI.
#2, installing Ceph on any Debian-based OS, has nothing to do with Proxmox.
And #3 is the official Ceph documentation, regardless of what hypervisor or OS you are using.

 
If you want to install Ceph Server, it's http://pve.proxmox.com/wiki/Ceph_Server
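
For reference, the wiki boils down to a handful of pveceph commands. A rough sketch only (the network and disk below are examples; check the wiki for the current flags):

# pveceph install
# pveceph init --network 192.168.10.0/24
# pveceph createmon
# pveceph createosd /dev/sdb

Run install on every node, init once per cluster, createmon on each node that should be a monitor, and createosd once per disk.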

Thank you very much.

Is there official documentation on how to set up HA for the VMs on Ceph?

I read somewhere that Ceph is itself HA, but that's the confusing part. Isn't Ceph's purpose storage only? That would mean I still need to configure HA using references like this:

http://pve.proxmox.com/wiki/Two-Node_High_Availability_Cluster

OR this video (part 5 of 6):

https://www.youtube.com/watch?v=iY8bSHzw7zQ
 
Hello,

Can anyone show me how to back up the VMs on Ceph? I can take snapshots, but when using the Backup option, the "storage" dropdown is empty.

Thanks in advance for your help.
 
I am assuming you are trying to use Ceph storage to store your backups. That is not possible: Ceph RBD storage can only store VM images, nothing else. If you want to use the Ceph platform to store your backups, you will have to set up CephFS; then you can store anything on it without needing a separate node running FreeNAS or anything else. Be cautious about putting backups and VMs on the same storage platform, though. If for any reason the Ceph cluster crashes, you will obviously lose both the VMs and the backups.
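
For what it's worth, once CephFS is up (it needs an MDS running), mounting it for backups can be a one-liner. A sketch, with the monitor address and secret file path assumed:

# mount -t ceph 192.168.10.11:6789:/ /mnt/cephfs-backup -o name=admin,secretfile=/etc/ceph/admin.secret

You can then point a Directory storage at /mnt/cephfs-backup and enable the backup content type on it.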
 

Thank you very much. How about backing up to local storage on the node itself? I should be able to do that, right? The backup storage dropdown is empty; it does not even offer local storage.

"Ceph cluster crashes" sounds scary and has been on my mind since the initial installation stage. I have thought about running another Proxmox node by itself, having nothing to do with the other three Ceph nodes.

Although it may be a waste of resources to have such a backup, I wonder whether there is any chance the entire Ceph cluster could crash. I thought Ceph was supposed to be distributed object storage and highly available. Have you seen or heard of a case where an entire Ceph cluster crashed? If so, what happened to the data replicated across the nodes?

Thanks in advance for your time.
 

I didn't mean to confuse you: when I said "Ceph cluster crashes", what I meant was your Proxmox+Ceph nodes. The crash here is not the software itself but the hardware. Let's say your motherboard/CPU went up in smoke, or multiple HDDs died permanently; you would not have any backup elsewhere to restore your VMs from, since you would not be able to access your nodes at all.

I have over a dozen Proxmox+Ceph setups running for many, many months without any issue, and I know other people who have no issues either.

It is common practice not to store backups on the same hardware the VMs run on. Even if you store backups locally for performance reasons, you should regularly move them elsewhere, onto other storage. I personally don't put Ceph OSDs on the same Proxmox nodes I run VMs from. I do end up with more Proxmox nodes than I otherwise need, but at least they are separated. For a large VM environment this is the recommended practice, because each OSD uses up some resources and I don't want my VMs and OSDs fighting over the same resources. If you have a big platform, such as dual-Xeon nodes with 128GB of RAM, then I don't think it's an issue; there are plenty of resources to go around.
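
About the empty "storage" dropdown when backing up: that usually means no storage has the backup content type enabled. A sketch of what the local entry in /etc/pve/storage.cfg could look like with backups allowed (the path shown is the stock default):

dir: local
	path /var/lib/vz
	content images,iso,vztmpl,backup
	maxfiles 3

The same can be set in the GUI by editing the storage's Content field under Datacenter -> Storage.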
 

Thank you very much. Do you mind sharing your hardware config?

How about a network diagram? I am still confused about the part where it was suggested to run the OSD nodes on a separate network while also keeping the nodes with VMs connected to it.

Is there a command-line reference guide for pveceph, or a Ceph command-line guide in general? I could not find a clear-cut path to the documents on their website. Let's say I want to do something simple like increasing the number of PGs; I don't see where the Ceph documentation covers this. Thanks in advance for your help.
 
Thank you very much. That is very helpful. So increasing the PGs can only be done via the command line, not the GUI?

Also, are these two commands pretty much the same?

ceph osd pool set poolname pg_num 512
ceph osd pool set poolname pgp_num 512

Thank you very much. Where can I get a command-line reference guide for Ceph?
 

There really is no single command-line reference guide. I built my own: whenever I came across a new command, I added it to my documentation.

To understand the separate network for Ceph, maybe the following diagram will help.
[attached diagram: Ceph cluster and Proxmox network separated by two switches and two subnets]

Ignore the MON and MDS nodes; this is a network diagram for Ceph on its own nodes, without Proxmox. The Ceph network and the Proxmox network are separated with two switches and two subnets.

Now let's say you are using Proxmox+Ceph on the same nodes and want to separate the network traffic. We are going to assume the following specs:

7 Nodes with Proxmox+Ceph
Each node has 2 NICs.

NIC 1 as vmbr0 = 192.168.1.0/24
NIC 2 as vmbr1 = 192.168.10.0/24

All NIC1s are connected to one switch and all NIC2s are connected to another. When you create the Ceph network, you will have to create it on the NIC2 subnet, which is 192.168.10.0/24. So your Ceph cluster creation will look like this:
# pveceph init --network 192.168.10.0/24

And that is all it takes to separate Proxmox and Ceph traffic. Hope this makes more sense now.
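
To make the layout concrete, a minimal /etc/network/interfaces sketch for one such node (eth0/eth1 and the .11 host addresses are assumptions):

auto lo
iface lo inet loopback

auto vmbr0
iface vmbr0 inet static
	address 192.168.1.11
	netmask 255.255.255.0
	gateway 192.168.1.1
	bridge_ports eth0
	bridge_stp off
	bridge_fd 0

auto vmbr1
iface vmbr1 inet static
	address 192.168.10.11
	netmask 255.255.255.0
	bridge_ports eth1
	bridge_stp off
	bridge_fd 0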
 

Thank you so much. You are a lifesaver.

So this means I do not need to route between the two switches; the nodes with VMs will just communicate on their own switch. Correct?

I plan to upgrade to a 10Gb network soon. Does the theory of having two switches to ease traffic throughput still apply?

I always thought switches are designed so that each port has its own dedicated path. If this is true, how much of a performance difference is there between running everything on one switch versus two independent switches?

Would you also kindly share your hardware configuration for the OSDs and VMs? The Ceph docs state you can run a dual core for the OSDs as long as you have enough memory for them. Does this mean we don't really need much CPU power for the OSD nodes?

I highly doubt Ceph can really run on cheap commodity hardware (as they state). From what I've seen and read so far, Ceph needs some beefed-up hardware to run the OSDs. Please correct me if I am wrong.

Thanks again for your time and effort.
 
If you have enough NICs I would suggest creating vmbr1 over a bond. This will increase bandwidth (as in several full-speed connections in parallel, not a single connection spanning several NICs) and also give you a redundant connection to your Ceph cluster. As for the switch: if the only traffic running through it is pve-node <-> Ceph cluster and ceph-node <-> ceph-node, then a simple switch is sufficient (for bonding I would strongly recommend a managed switch which supports LACP). I have not seen any 10Gb switch I would characterize as a simple switch, though.

Given 3 pve nodes and 3 Ceph nodes, each using 2-NIC bonding, this means you must look for a 16-port switch.
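
To make that concrete, here is a sketch of a bonded vmbr1 in /etc/network/interfaces, assuming eth2/eth3 are the spare NICs and the matching switch ports are configured as a dynamic (LACP) group:

auto bond0
iface bond0 inet manual
	bond-slaves eth2 eth3
	bond-mode 802.3ad
	bond-miimon 100
	bond-xmit-hash-policy layer2+3

auto vmbr1
iface vmbr1 inet static
	address 192.168.10.11
	netmask 255.255.255.0
	bridge_ports bond0
	bridge_stp off
	bridge_fd 0

802.3ad is the LACP mode; layer2+3 hashing spreads different node-to-node flows across the links.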
 

"several full-speed connections in parallel, not a single connection spanning several NICs"

So how do I configure things inside the Proxmox control panel and on the switch to get several full-speed connections in parallel, as opposed to spanning?

Thank you very much. My switch is a Dell 5324. So I also need to configure Link Aggregation on the switch and bonding on the Proxmox side?

I have 5 nodes, so I guess upgrading to a 48-port switch is in order.

I've tried Link Aggregation once before (also with a Dell 5324 and a Dell 6024). During the test I saw the same file come out with a different size and different bytes on the other server. I tried back and forth, and although it happened only at random, the file was corrupted. Perhaps I did something wrong, but I followed the Dell tech's suggestions to the letter. It scared the heck out of me and I never looked back at LACP. But I will give it another try, as I have been reading everywhere that bonding is recommended for Ceph clusters and nodes.

Thank you for your help.
 
Apart from configuring bonding on the pve nodes, the Ceph nodes, and the switch, no other configuration is necessary, provided you use LACP. See this YouTube video for an example: http://www.youtube.com/watch?v=-8SwpgaxFuk

Remember to configure the Link Aggregation as dynamic, using LACP.

Using a 48-port switch leaves room for adding more nodes, but with 5 pve nodes and 5 Ceph nodes you can manage with a 24-port switch. If money is no problem you might as well acquire a 48-port switch.

I've tried Link Aggregation once before (also with a Dell 5324 and a Dell 6024). During the test I saw the same file come out with a different size and different bytes on the other server. I tried back and forth, and although it happened only at random, the file was corrupted. Perhaps I did something wrong, but I followed the Dell tech's suggestions to the letter. It scared the heck out of me and I never looked back at LACP. But I will give it another try, as I have been reading everywhere that bonding is recommended for Ceph clusters and nodes.
Sounds like a firmware error in the switch. I have been running with bonding for two years now. Rock solid, no errors yet.
 
So this means I do not need to route between the two switches; the nodes with VMs will just communicate on their own switch. Correct?
Correct. You do not need to route any traffic between the two switches. VM traffic will be restricted to one switch only.

I plan to upgrade to a 10Gb network soon. Does the theory of having two switches to ease traffic throughput still apply?
This really comes down to the number of VMs you have and the number of people your cluster serves. While a 10Gb network will certainly increase Ceph cluster bandwidth, if your VM cluster is small you may be fine with a single network for both Proxmox and Ceph traffic, or you can use the 10Gb network for Ceph and the existing 1Gb network for Proxmox VMs. I prefer to separate the networks. As mir suggested, you can also go the network-bonding way.

I always thought switches are designed so that each port has its own dedicated path. If this is true, how much of a performance difference is there between running everything on one switch versus two independent switches?
Each port does have its own dedicated path. What you have to think about is how much traffic each port is going to handle. If you put Proxmox and Ceph both on, let's say, a 1Gb port, it is going to get consumed by both kinds of traffic on a first-come-first-served basis. Meaning: if Ceph uses up 700Mbps, that leaves 300Mbps for Proxmox traffic, and if the Proxmox traffic demands 500Mbps, there is none left. The same goes for 10Gb. During Ceph self-healing it is possible that almost all of the 10Gb bandwidth will be consumed.

Would you also kindly share your hardware configuration for the OSDs and VMs? The Ceph docs state you can run a dual core for the OSDs as long as you have enough memory for them. Does this mean we don't really need much CPU power for the OSD nodes?
This is from my personal experience, so don't take it as etched in stone. Ceph does not need a beefed-up node. I run my Ceph cluster with dual-core i3s and 16GB of RAM. Even on a bad day when the cluster rebalances, it performs just fine. I could probably get away with 8GB of RAM, but sometimes I move VMs to the Ceph nodes temporarily, so the additional 8GB helps.
I have the following specs per Ceph node:
Motherboard: Intel Server S1200BTLR
CPU: i3-3220
RAM: 16GB
NIC: Intel Pro 1Gb
RAID: Intel RS2WC040
Expander Card: Intel 24-port expander
OSDs: 10x Seagate Barracuda 2TB SATA

Total Ceph nodes : 3
Total OSDs : 30
Replica : 3
Total PG : 2432
 

Thank you very much. Your suggestions helped a great deal.
 

Thank you!!! You and mir are totally awesome. Man, I was in the dark about this whole ordeal, but I am beginning to see the light and am going to configure all afternoon long.

So regardless, in my situation with 5 nodes it's better to run two separate 24-port switches rather than one 48-port switch. From the way you described it, even if I separate the Ceph nodes and create bonding, it is better to be on a separate switch.

The VMs themselves don't need bonding, as a 1Gb connection is adequate for medium usage (4-6 VMs, medium to low traffic), correct? I have some heavy Windows users. Should I bond the NICs on those VM nodes?

I notice your total PG count is 2432, but according to the calculation it is supposed to be 1024 (30 x 100 / 3 = 1000 -> 1024). Even the next power of 2 up would be 2048, but you set it at 2432. Is there a reason for that?

I always thought more PGs are better when planning for growth, but earlier in the thread another person suggested increasing them only as needed.

What chassis do you use that can house 10 drives? Is it a 4U?

Your two cents are always much appreciated. Thank you.
 
Hi,

What are your suggestions on how to make two different distributed storages (one from SAS, one from SSD)?
As far as I can see there is no way to divide Ceph storage.

TY
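
(Ceph can actually do this with CRUSH rules: put the SSD OSDs and the SAS OSDs under separate CRUSH roots, give each root its own rule, and point one pool at each rule. A rough sketch of the usual steps, with pool and rule names invented for the example:)

# ceph osd getcrushmap -o crush.bin
# crushtool -d crush.bin -o crush.txt
(edit crush.txt: add separate "ssd" and "sas" roots containing the matching OSDs, plus one rule per root)
# crushtool -c crush.txt -o crush.new
# ceph osd setcrushmap -i crush.new
# ceph osd pool set ssd-pool crush_ruleset 1
# ceph osd pool set sas-pool crush_ruleset 2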
 
So regardless, in my situation with 5 nodes it's better to run two separate 24-port switches rather than one 48-port switch. From the way you described it, even if I separate the Ceph nodes and create bonding, it is better to be on a separate switch.
Correct. Two switches will always perform better, since the traffic is separated. In a small environment you may not even notice the difference on a day-to-day basis, but you will notice it when Ceph starts to self-heal/rebalance due to an OSD failure or addition.

The VMs themselves don't need bonding, as a 1Gb connection is adequate for medium usage (4-6 VMs, medium to low traffic), correct? I have some heavy Windows users. Should I bond the NICs on those VM nodes?
For 4-6 VMs I would not bother with bonding. But it depends on how much traffic those heavy Windows users cause. Are they consistently using up over half the bandwidth? If yes, then bonding might be a good idea.

I notice your total PG count is 2432, but according to the calculation it is supposed to be 1024 (30 x 100 / 3 = 1000 -> 1024). Even the next power of 2 up would be 2048, but you set it at 2432. Is there a reason for that?
Yes, my PG number indeed does not match the PG calculation formula. I do not always stick with powers of 2; I try to stay within 50 to 100 PGs per OSD, and for me that is 2432/30 = 81. If I used a power of 2 I would have to use 2048, which I believe is somewhat low, or 4096, which is very high.

Edit: Just noticed Ceph recommends 100 to 200 PGs per OSD. I will have to check on that one.

I always thought more PGs are better when planning for growth, but earlier in the thread another person suggested increasing them only as needed.
That is correct. You should increase the PG number as the need arises, usually when you are adding new OSDs. If you create a higher PG count from the beginning, you are just using up node resources such as CPU and RAM unnecessarily, since each individual PG consumes some resources. Keeping the PG count per OSD between 50 and 100 gives balance, according to the Ceph developers.
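
When you do increase them, note that the two commands asked about earlier are not the same thing. A sketch of the usual sequence, with the pool name assumed:

# ceph osd pool set rbd pg_num 2048
# ceph osd pool set rbd pgp_num 2048

pg_num splits the pool into more placement groups; pgp_num is how many of them CRUSH actually uses for placement, and data only starts rebalancing once pgp_num is raised to match.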

What chassis do you use that can house 10 drives? Is it a 4U?
I use this 12-bay chassis for all my Ceph nodes:
http://www.in-win.com.tw/Server/zh/goods.php?act=view&id=IW-RS212-02
I try to stick with the same brand and model: easy to replace, and I can always keep a spare on hand without buying a different make/model.
 