Proxmox VE Ceph Server released (beta)

symmcom

Renowned Member
Oct 28, 2012
1,078
26
68
Calgary, Canada
www.symmcom.com
Hi,

What are your suggestions on how to make two different distributed storages (one from SAS, one from SSD)?
As far as I can see, there is no way to divide Ceph storage.
There certainly is a way to divide a Ceph cluster between SSD and SAS/SATA. You have to edit the CRUSH map. The steps are as follows:
1. Extract CRUSHmap.
2. Decompile CRUSHmap to txt.
3. Edit CRUSHmap.
4. Compile CRUSHmap.
5. Inject CRUSHmap back in cluster.
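The steps above can be sketched with the standard Ceph CLI tools (the file names are arbitrary, and of course this assumes a running cluster):

```shell
# 1. Extract the current CRUSH map (binary) from the cluster
ceph osd getcrushmap -o crushmap.bin

# 2. Decompile it to editable text
crushtool -d crushmap.bin -o crushmap.txt

# 3. Edit crushmap.txt with your editor of choice

# 4. Compile the edited map back to binary
crushtool -c crushmap.txt -o crushmap.new

# 5. Inject the new map into the cluster
ceph osd setcrushmap -i crushmap.new
```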

If you are using the same node, you basically have to create a virtual host entry in the CRUSH map and assign the OSDs to it. This is an overly simplified explanation. If you are not familiar with the CRUSH map and its syntax, getting to know it first will help tremendously. Whether or not you want to separate the cluster, knowing the CRUSH map will help you greatly in customizing your Ceph cluster.

At this time the CRUSH map cannot be edited through the Proxmox GUI; it can only be viewed.
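To illustrate the "virtual host" idea, a decompiled CRUSH map might be edited along these lines (the host name, OSD numbers, IDs, and weights here are all hypothetical):

```text
# a virtual host grouping only the SSD OSDs of a physical node
host node1-ssd {
    id -5
    alg straw
    hash 0  # rjenkins1
    item osd.0 weight 1.000
    item osd.1 weight 1.000
}

# a separate root containing only the SSD hosts
root ssd {
    id -7
    alg straw
    hash 0  # rjenkins1
    item node1-ssd weight 2.000
}

# a rule that sends a pool's data only to the ssd root
rule ssd-pool {
    ruleset 3
    type replicated
    min_size 1
    max_size 10
    step take ssd
    step chooseleaf firstn 0 type host
    step emit
}
```

A pool can then be pointed at this rule so its data lands only on SSD OSDs.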
 

symmcom

Very good documentation on hardware selection for Ceph! It will help anybody seeking a deeper understanding of Ceph hardware.

The kind of hardware I use would be considered well below standard based on the documentation's suggestions. But at least I know what can be achieved on the lowest possible budget.

Note the hardware differences between the Performance and Capacity configurations. This was the most valuable piece of information to me. I had not taken into consideration that the higher the node count and the smaller the OSD count per node, the faster recovery/rebalancing will actually be. Until now, all of my Ceph setups have been optimized to use as many OSDs as possible per node to save on node count and space.
 

impire

Member
Jun 10, 2010
106
0
16
If you buy stackable switches you can configure a bond spanning both switches. See this for the Catalyst 3750 -> http://www.cisco.com/c/en/us/support/docs/switches/catalyst-3750-series-switches/69979-cross-stack-etherchannel.html.

For redundancy I would also make a bond for your VMs.

If you buy 2.5" disks (WD Red has a fairly priced 1TB 2.5") you should be able to fit 10 disks inside a 2U chassis. http://www.storagereview.com/wd_red_25_1tb_hdd_review_wd10jfcx
Thank you. I was overly excited for a moment as I thought the WD RED 2.5" also come in a 4TB version.

I've noticed the WD Red doesn't list the RPM on the drive. It states a 6 Gb/s transfer rate but doesn't list the RPM. I wonder why.

Also, this drive is made for NAS use and geared toward home and small businesses. Is it a good idea to put them in an enterprise environment?

Just a thought: wouldn't it be better to have a system with 6 x 4TB 3.5" drives versus a system with 10 x 1TB 2.5" drives?

I have been looking at different systems like the Dell C6220 and such. It is so hard to find larger-capacity drives in the 2.5" size. I gave up on the idea of having a chassis that could house 2.5" drives, as the price and availability were impractical. But the WD Red looks good, and I just hope they come out with a 2TB or 4TB soon; then it would make a lot more sense than trying to fill it with a bunch of 1TB drives.

Thanks for your time and kind effort.
 

symmcom

I have been looking at different systems like the Dell C6220 and such. It is so hard to find larger-capacity drives in the 2.5" size. I gave up on the idea of having a chassis that could house 2.5" drives, as the price and availability were impractical. But the WD Red looks good, and I just hope they come out with a 2TB or 4TB soon; then it would make a lot more sense than trying to fill it with a bunch of 1TB drives.
Have you looked at this chassis? This one takes 24 2.5" SSD in one node.
http://www.in-win.com.tw/Server/zh/goods.php?act=view&id=IW-RS224-02
 

impire

Each port does have its own dedicated path. What you have to think about is how much traffic each port is going to handle. If you put Proxmox and Ceph both on, let's say, a 1Gb port, then it is going to get consumed by both kinds of traffic on a first-come, first-served basis. Meaning if Ceph uses up 700 Mbps, that leaves 300 Mbps for Proxmox traffic. If the Proxmox traffic demands 500 Mbps, there is none left. The same goes for 10Gb traffic. During Ceph self-healing it is possible that almost all of the 10Gb bandwidth may be consumed.
Thank you. What if I put them on separate VLANs but on the same switch? This would mean they won't be fighting for traffic on the same ports. Technically, they would be on separate networks, just still on the same switch. Would that still not be recommended?

I am trying to avoid having 2 switches, which would mean I would need 4 switches total, as I always prefer to have backups in case the main switch dies. I would rather just have 1 main and 1 backup. But if it means better Ceph traffic, then a separate switch it shall be.

I was wishfully thinking about upgrading to a 10Gb network. I took one look at current pricing and realized I've forgotten to do one very important thing: go plant a money tree in my backyard and wait a few seasons. Hopefully by then the tree will grow enough money for me to buy the 10Gb hardware. We all know current 10Gb gear is way overpriced.

Which leads me to another creative thought: what about Fibre Channel? It's made for SAN, but why not? It's cheap, and at current prices it may be worthwhile. I see Fibre Channel switches and NIC cards cost next to nothing, and they give 2-4 Gb per port. Bond 2 of those 4 Gb ports and we would get excellent speed at 8 Gb. Any thoughts?

Thanks in advance for your help.
 

impire

Have you looked at this chassis? This one takes 24 2.5" SSD in one node.
http://www.in-win.com.tw/Server/zh/goods.php?act=view&id=IW-RS224-02
Thank you. That's a pretty awesome chassis. But then again, even at 24 x 1TB 2.5", that's only 24TB.

I like your current chassis better. I could fit 10 x 4TB = 40TB. The price difference between the 1TB 2.5" and the 4TB 3.5" SATA drives is not that big. I have been sticking to SATA largely because larger-capacity SAS drives are pricey at 7200 RPM.

Not too many people can afford 1TB SSDs nowadays, and 2.5" SATA drives are not available in larger sizes. But I am glad Mir pointed out the WD Red, as I will be watching for future releases of larger capacities.
 

impire

I use this 12-bay chassis for all my Ceph nodes.
http://www.in-win.com.tw/Server/zh/goods.php?act=view&id=IW-RS212-02
I try to stick with the same brand and model. It's easy to replace, and I can always keep a spare on hand without buying a different make/model.
Thanks a lot for this. Back in the '90s I used to build and ship thousands of PCs under my own brand. For the past years I have been using Dell, but venturing into cloud computing and Ceph, I realized their servers are limited and overpriced. Your link and comment inspired me to build an awesome one.

So, with my endeavor to build our dream servers, some questions:

I know where you get the chassis, but where do you get the rest of the components (motherboards, processors, RAM, RAID cards, etc.)?

Why do you need the Intel 24-port expander card?

Have you used any chassis and motherboard that could house 4 physical processors? I know Supermicro makes them.

What if I build a server with a lot of power (4 x 8 cores = 32 cores) and pack the Ceph nodes with a lot of RAM (128GB)? Can I use it for both the Ceph nodes and the cluster (VMs)?

Or, regardless of the horsepower of the Ceph nodes, is it still better to keep the clusters that run the VMs separate?
 

impire

This was the most valuable piece of information to me. I had not taken into consideration that the higher the node count and the smaller the OSD count per node, the faster recovery/rebalancing will actually be. Until now, all of my Ceph setups have been optimized to use as many OSDs as possible per node to save on node count and space.
Does this also mean the read/write process will be faster if you spread the OSDs across the Ceph nodes rather than packing each node full of OSDs?

Does this also mean it's better to have 2 x 4TB rather than 8 x 1TB OSDs in each node?
 

mir

Renowned Member
Apr 14, 2012
3,500
99
68
Copenhagen, Denmark
I've noticed the WD Red doesn't list the RPM on the drive. It states a 6 Gb/s transfer rate but doesn't list the RPM. I wonder why.
5400 RPM -> http://www.amazon.com/Western-Digital-Cache-Drive-WD10JFCX/dp/B00EHBES1U
Just a thought: wouldn't it be better to have a system with 6 x 4TB 3.5" drives versus a system with 10 x 1TB 2.5" drives?
Performance-wise you get higher throughput with 10 disks instead of 6. Reads can theoretically be spread across all 10 disks, which gives you 10x the read speed of a single disk instead of 6x.
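As a rough back-of-the-envelope sketch of that point (the ~150 MB/s per-disk sequential read speed is an assumption for illustration, not a measured figure):

```python
# assumed sequential read throughput of a single spinning disk, in MB/s
PER_DISK_MB_S = 150

# aggregate read throughput if reads are spread evenly across all disks
print(10 * PER_DISK_MB_S)  # 10 x 1TB layout -> 1500 MB/s
print(6 * PER_DISK_MB_S)   # 6 x 4TB layout  -> 900 MB/s
```

Same raw capacity either way, but the 10-disk layout has more spindles to read from in parallel.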
 

impire

For affordable 10Gb, check InfiniBand. We use that, thanks to active forum support. Here is the wiki link: http://pve.proxmox.com/wiki/Infiniband .

Prices on eBay go up and down... cables usually ship from China.
Thank you so much for the info.

I don't see a lot of NIC cards available out there. Which NIC cards do you use (single, dual, or quad port)?

Does it require any special driver for Proxmox/Debian to see the NICs? I have had problems before with third-party NICs and Debian needing the proper driver.

What option do you choose for the VMs (e1000, virtio, etc.)?

I also read that 10Gb is limited by the type of processors and hard drives you have. The current max hard drive interface speed is 6 Gb/s. Are you experiencing the full 10Gb speed?

Thanks in advance for your help.
 

RobFantini

Renowned Member
May 24, 2012
1,706
40
68
Boston,Mass
Thank you so much for the info.

I don't see a lot of NIC cards available out there. Which NIC cards do you use (single, dual, or quad port)?

Does it require any special driver for Proxmox/Debian to see the NICs? I have had problems before with third-party NICs and Debian needing the proper driver.

What option do you choose for the VMs (e1000, virtio, etc.)?

I also read that 10Gb is limited by the type of processors and hard drives you have. The current max hard drive interface speed is 6 Gb/s. Are you experiencing the full 10Gb speed?

Thanks in advance for your help.
1- The drivers are built into the kernel.

2- We use InfiniBand for the cluster/Ceph network, and vmbr0 etc. for the VMs.

3- We have not gotten to speed-testing Ceph yet...

4- We use these model cards (per the CLI command lspci): MT25418 and MT25208.

Code:
02:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)

03:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)
InfiniBand is very easy to set up. See the wiki and ask questions, but please use a different thread.
 

impire

Thank you very much. Just one last question on this topic: may I ask for the brand and model of the switch you are using? I see both Mellanox and Flextronics. Some of these go for 20Gb or 40Gb speeds. I also wonder whether the higher-speed switches can auto-sense the lower-speed NICs. Thanks for your help.
 

felipe

Member
Oct 28, 2013
152
1
18
Hi,

I purchased 3 servers for Ceph, each with two 10Gb NICs, and two 10G switches. Unfortunately the switches are not stackable (too expensive). What is the best way to get fault tolerance?

A bonded 10G network for the monitors?
A bonded 10G network for the OSDs?

Without stackable switches I cannot make any 20Gb bond, I think...?

Thank you
 

aderumier

Member
May 14, 2013
203
18
18
Hi,

I purchased 3 servers for Ceph, each with two 10Gb NICs, and two 10G switches. Unfortunately the switches are not stackable (too expensive). What is the best way to get fault tolerance?

A bonded 10G network for the monitors?
A bonded 10G network for the OSDs?

Without stackable switches I cannot make any 20Gb bond, I think...?

Thank you
Use active-backup bonding (so only one link of the bond will be used at a time).

If you can have four 10Gb NICs per Ceph server, then configure Ceph to use a public network (VM -> OSD) and a private network (OSD -> OSD replication).

Monitors don't need dedicated bandwidth.
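That public/private split is a minimal ceph.conf sketch away (the subnets here are placeholders, not a recommendation):

```text
[global]
    # VM/client -> OSD traffic
    public network = 10.10.10.0/24

    # OSD -> OSD replication and recovery traffic
    cluster network = 10.10.20.0/24
```

Replication and recovery traffic then stays off the network the clients use.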
 

impire

Use active-backup bonding (so only one link of the bond will be used at a time).

If you can have four 10Gb NICs per Ceph server, then configure Ceph to use a public network (VM -> OSD) and a private network (OSD -> OSD replication).

Monitors don't need dedicated bandwidth.
What about the Proxmox host? Is it advantageous to have it on the 10Gb network?

In the scenario below, the VMs communicate directly with the Ceph nodes (VM -> OSD). Is it worthwhile to have a fast network on the Proxmox host itself, or is it a waste of 10Gb ports? It seems the bandwidth demand is strictly between the VMs and the Ceph nodes.

Proxmox host = Subnet A (1Gb network)

VMs = Subnet B (10Gb network)

Ceph nodes = Subnet C (10Gb network on a separate switch)
 

impire

I am curious: why do we need so much bandwidth when a hard drive's interface transfer rate is only 3 Gb/s or 6 Gb/s? I read in various forums that even high-end graphics houses running these 10Gb networks hit the bottleneck at the hard drive or processor level.

Unless we run a network of beefed-up servers with many cores and a bunch of RAID drives, I don't see how we can ever take advantage of a full 10Gb network. Even at 20 Gb/s, the only thing that could use it fully would be running the VMs on a RAM disk.
 

udo

Famous Member
Apr 22, 2009
5,874
163
83
Ahrensburg; Germany
I am curious: why do we need so much bandwidth when a hard drive's interface transfer rate is only 3 Gb/s or 6 Gb/s? I read in various forums that even high-end graphics houses running these 10Gb networks hit the bottleneck at the hard drive or processor level.

Unless we run a network of beefed-up servers with many cores and a bunch of RAID drives, I don't see how we can ever take advantage of a full 10Gb network. Even at 20 Gb/s, the only thing that could use it fully would be running the VMs on a RAM disk.
Hi,
transfer speed is one thing, but for a storage network latency is just as important (or more so). And latency on a 10Gb network is much lower than on a 1Gb network (and 10GBase-T is not as good as an SFP+ network).

Udo
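Udo's latency point can be made concrete with a small calculation. The round-trip figures below are rough assumptions for illustration (on the order of 0.5 ms on 1GbE vs 0.05 ms on 10Gb SFP+):

```python
def max_sync_iops(round_trip_ms):
    """Upper bound on queue-depth-1 IOPS when every I/O
    must wait one network round trip before the next starts."""
    return 1000.0 / round_trip_ms

print(max_sync_iops(0.5))   # 1GbE-class latency  -> 2000 IOPS ceiling
print(max_sync_iops(0.05))  # SFP+-class latency -> ~20000 IOPS ceiling
```

So even when neither link is close to saturated on raw bandwidth, the lower-latency network allows many more small synchronous I/Os per second.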
 

impire

Hi,
transfer speed is one thing, but for a storage network latency is just as important (or more so). And latency on a 10Gb network is much lower than on a 1Gb network (and 10GBase-T is not as good as an SFP+ network).

Udo
Thank you Udo.

Is there a way to see how much bandwidth the Ceph nodes are using? The only way I can think of right now is to go to the switch and use its web GUI bandwidth-monitoring tool.

It would be nice to be able to see a graphical performance bar for the Ceph nodes in the Proxmox GUI.
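Until something like that exists in the GUI, a few CLI options can give a rough view from the nodes themselves (this assumes a running cluster; iftop is a separate package, and the interface name is a placeholder):

```shell
# watch cluster status, including client I/O and recovery throughput
ceph -w

# per-pool read/write throughput statistics
ceph osd pool stats

# raw bandwidth on the Ceph network interface of one node
iftop -i eth1
```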
 
