Proxmox VE Ceph Server released (beta)

Discussion in 'Proxmox VE: Installation and configuration' started by martin, Jan 24, 2014.

  1. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,062
    Likes Received:
    16
    Very nicely put, MO! Ceph indeed diminishes the need for a RAID setup. I myself use a combination of RAID controller and expander card to present JBOD disks to Ceph. I am not sure a battery-backed cache is needed, since Ceph can heal itself pretty well. That was my primary concern when I was introduced to Ceph, but I ran several tests simulating a complete Ceph node failure and never had any data issue.

    If we are talking about caching through Proxmox for VMs, such as writeback, writethrough, etc., that is of course a slightly different story. Those settings really have nothing to do with the RAID controller cache we are talking about here.
     
  2. pixel

    pixel Member
    Proxmox Subscriber

    Joined:
    Aug 6, 2014
    Messages:
    136
    Likes Received:
    2
    Too bad. I was also interested in connecting to existing storage.
     
  3. dietmar

    dietmar Proxmox Staff Member
    Staff Member

    Joined:
    Apr 28, 2005
    Messages:
    16,445
    Likes Received:
    304
    You can use existing Ceph storage anyway.
     
  4. pixel

    pixel Member
    Proxmox Subscriber

    Joined:
    Aug 6, 2014
    Messages:
    136
    Likes Received:
    2
  5. Sakis

    Sakis Member
    Proxmox Subscriber

    Joined:
    Aug 14, 2013
    Messages:
    119
    Likes Received:
    3
    I am planning a 3-node Ceph/Proxmox cluster and am trying to work out which replica count is better for my setup. I am not sure about one thing.

    Each node will have 4 OSDs (4 x 4TB SATA disks), which means a 48TB raw Ceph cluster in total.

    If I use 3x replication:
    The usable space will be 16TB. Since I should stop at 85% full, that means at most 13.6TB of data inside the pools. Let's say I use 10TB in a future scenario.
    Boom, one node melts. Proxmox HA works like a charm. Will the Ceph cluster still work at 100% and rebalance?
    By my calculations I would then have a Ceph cluster with 32TB raw space, which means 10.6TB usable, and at 85% at most 9TB of usable data. What happens to the 1TB of my data that no longer fits? That sounds like a corruption disaster to me, or I am missing something crucial about how Ceph works.

    So if I choose 3x replication and want to be sure of Ceph and Proxmox cluster availability with 1 node down, I must store at most approximately 9TB. Correct? (That is a huge loss out of 48TB of raw drives.)

    If I use 2x replication, things look better.
    The usable space will be 24TB. Stopping at 85% means at most 20.4TB of data inside the pools. If I use 10TB of data now, I won't have a problem if a node disappears, because that leaves a 32TB raw cluster, 16TB usable, and 85% of that is 13.6TB max, which is fine with 10TB of used data.

    So, if I am correct again, if I want to afford one node of downtime I can choose 2x replication and gain some usable space. And if I want to be really cautious and retain my data at all costs, even though the Proxmox cluster would lose quorum with two nodes down, I should use 3x replication and store at most 4.5TB of data (16TB raw cluster, 5.33TB usable with 3x replication, 85% = 4.5TB max data).

    Am I missing something?

    Thank you
     
  6. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,062
    Likes Received:
    16
    In my opinion you should use the N(nodes) - 1 formula to decide how many replicas you need. You have 3 nodes, so you should consider using 3 - 1 = 2 replicas. That gives you 24TB of usable space while keeping good performance.

    From the rest of your message I am not sure what you are asking. Are you afraid of losing data because you have overwritten it? Ceph automatically stops writing to an OSD when it approaches a critical shortage of space, so you will never overwrite or lose data that way. If you lose too many OSDs or nodes and lose quorum, Ceph simply stops writing data altogether. When you add OSDs or nodes back, it starts rebalancing again. Simply put, the only time you will lose data is when you lose all replicas of it. For example, in a 3-node cluster with 2 replicas, if 2 of your nodes fail totally, the physical nodes and all the OSDs in them gone up in smoke, then you are facing massive data loss. But this is the worst of worst-case scenarios and will hardly ever occur; in such a case your only option is to recover the data from backup storage.
    For a 3-node Ceph setup with 3 replicas, you can afford to lose 2 nodes completely and still have all your data. So it really comes down to how valuable your data is and how effective your backup system is.
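
    To make the capacity arithmetic from the question concrete, here is a minimal back-of-the-envelope sketch (plain Python; the node/OSD sizes are the 4 x 4TB example from above, and the 0.85 full ratio is the safety margin used in the question, not something I am asserting about your setup):

    Code:
    # Back-of-the-envelope capacity check for the 3-node, 4 x 4TB-per-node example.
    # Assumes capacity is spread evenly and pools are kept below an 85% full ratio.

    def max_safe_data_tb(nodes, osds_per_node, osd_tb, replicas, full_ratio=0.85):
        raw_tb = nodes * osds_per_node * osd_tb
        usable_tb = raw_tb / replicas
        return usable_tb * full_ratio

    # Healthy 3-node cluster:
    print(max_safe_data_tb(3, 4, 4, replicas=3))  # ~13.6 TB
    print(max_safe_data_tb(3, 4, 4, replicas=2))  # ~20.4 TB

    # Only 2 nodes' worth of raw space left after one node is lost:
    print(max_safe_data_tb(2, 4, 4, replicas=3))  # ~9.1 TB
    print(max_safe_data_tb(2, 4, 4, replicas=2))  # ~13.6 TB

    Note that this only mirrors the raw-space arithmetic from the question; whether Ceph actually re-replicates onto the surviving nodes depends on the CRUSH rule (with the default one-replica-per-host rule, a 3-replica pool cannot place a third copy on only 2 remaining hosts and stays degraded instead of filling them up).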
     
    #246 symmcom, Sep 8, 2014
    Last edited: Sep 8, 2014
  7. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,312
    Likes Received:
    131
    I think it also depends on how many disks you have in your nodes, and on the probability of losing 2 disks at the same time.

    Note that when you lose a disk the data gets rebalanced, so if you have a fast network and fast disks this can be quick.

    I'm going to build small 3-node clusters (3 x 6 OSDs, 1TB SSDs), with 2 x 10G links and 2x replication.
    If a disk dies, it will take some minutes to re-replicate the data.

    Now, if you have 6TB 5.4k-rpm disks on gigabit links, this can be a lot slower. And if your cluster is already under heavy I/O load, it can be worse still.




     
  8. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,062
    Likes Received:
    16
    Very true! According to the Ceph guys, and in my own experience, the higher the number of OSDs, the higher the performance.

    To get really good performance out of the rebalancing process, you need more network bandwidth than 1Gbps; there is simply no substitute for it. The following formula is a good starting point for estimating how long a rebalance might take:

    Disk capacity (GB) / (Network bandwidth (GB/s) * (Nodes - 1)) = Recovery time (seconds)

    Based on the formula, for a 27TB (27,648GB) Ceph cluster with a 1 GB/s network and 3 nodes, the recovery time would be:
    27648 / (1 * (3 - 1)) = 13,824 seconds = 230.4 minutes

    Same specs but with a 40 GB/s network:
    27648 / (40 * (3 - 1)) = 345.6 seconds = 5.76 minutes

    Same specs as above but with 6 nodes:
    27648 / (40 * (6 - 1)) = 138.24 seconds = 2.30 minutes

    Clearly, a higher number of nodes and higher network bandwidth are an advantage. This is of course not the whole picture; we still have to take the number of OSDs into consideration, but it gives a rough idea.
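
    The same rule of thumb as a tiny sketch (plain Python; the function name is just illustrative, and this is only the rough estimate above, not an exact model of Ceph recovery):

    Code:
    # Rough rebalance-time estimate: capacity to re-replicate divided by the
    # aggregate bandwidth of the remaining nodes. Ignores OSD count, disk
    # speed and existing I/O load.

    def recovery_time_seconds(capacity_gb, bandwidth_gb_per_s, nodes):
        return capacity_gb / (bandwidth_gb_per_s * (nodes - 1))

    print(recovery_time_seconds(27648, 1, 3) / 60)   # ~230.4 minutes
    print(recovery_time_seconds(27648, 40, 3) / 60)  # ~5.8 minutes
    print(recovery_time_seconds(27648, 40, 6) / 60)  # ~2.3 minutes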

    Currently I am converting a 1Gbps-network Ceph cluster to 40Gbps Infiniband. I will be sure to post the results when it is all done.
     
  9. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,312
    Likes Received:
    131
    On my side, I'm going to build a Ceph cluster on 40Gbit next year too, but Ethernet, with a Mellanox switch (SX1012).

    It is around 5000€ for 12 ports of 40Gbit, or 48 ports of 10Gbit with splitter cables, so it's possible to mix 40Gbit for the Ceph nodes and 10Gbit for the clients on the same switch.

    I'm also waiting for RDMA support in Ceph, to see the difference compared with plain IP.
     
  10. Sakis

    Sakis Member
    Proxmox Subscriber

    Joined:
    Aug 14, 2013
    Messages:
    119
    Likes Received:
    3
    Thank you for the answers.

    This is the question I am mostly seeking an answer to:

    In a scenario where 1 node dies completely in a 3-node cluster with 3x replication, and the Ceph pools were almost full before the disaster, will the remaining OSDs have enough space to rebalance? I believe not, according to my calculations. During the rebalance, won't the cluster try to recreate all the data 3 times again? That would have to exceed the space available afterwards.
     
  11. Gilberto Ferreira

    Joined:
    Aug 21, 2014
    Messages:
    47
    Likes Received:
    0
    According to the docs, I need to run the pveceph init command on just one node and the config will be spread to the other servers... But in practice this did not work: I had to run pveceph init and createmon on each server to make Ceph work properly...
    So this confuses me: do I need to run it on only one server, or on each one? Or am I doing something wrong?
     
  12. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,062
    Likes Received:
    16
    You do NOT need to run #pveceph init on all nodes, just one! For the MONs, you need to create 2 MONs on 2 nodes through the CLI, and the rest of the MONs can be created from the GUI.
     
  13. Gilberto Ferreira

    Joined:
    Aug 21, 2014
    Messages:
    47
    Likes Received:
    0
    So I am doing something wrong, because here I have deployed Proxmox on 3 VirtualBox VMs, just as a lab, and I needed to run pveceph init... on each server...
    I will test it again...
     
  14. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,062
    Likes Received:
    16
    The only thing pveceph init does is create the Ceph cluster, writing the ceph.conf file and the keys. Do you get any error message when you run pveceph init or createmon?
     
  15. Gilberto Ferreira

    Joined:
    Aug 21, 2014
    Messages:
    47
    Likes Received:
    0
    These are my steps:

    1 - Install Proxmox on three VirtualBox machines
    2 - Update every VirtualBox machine
    3 - Assemble the cluster
    4 - Run pveceph install -version firefly
    5 - On node1: pveceph init --network 10.10.10.0/24
    6 - Run pveceph createmon on each node; only the first one, node1, allowed me to create a MON. On the others I received a message that Ceph is not initialized yet, or something similar...

    I am doing a fresh installation right now to try it again...

    I will report back as soon as possible...
     
  16. symmcom

    symmcom Active Member

    Joined:
    Oct 28, 2012
    Messages:
    1,062
    Likes Received:
    16
    The only reason I can think of that the other nodes would say Ceph is not initialized is that the nodes are not talking to each other to sync cluster data. The command #pveceph init creates the Ceph cluster config file ceph.conf in /etc/pve, which should be available to all nodes in the same cluster. Maybe check whether all nodes are part of the Proxmox cluster with #pvecm nodes, or try to ping each node from the others.
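
    If it helps, here is a minimal check sketch, assuming Python 3 is available on the node; it only wraps the same two checks just mentioned (the presence of /etc/pve/ceph.conf and the output of pvecm nodes), run on every node:

    Code:
    # Quick per-node sanity check: has pmxcfs synced the ceph.conf written by
    # pveceph init, and does this node see the other cluster members?
    import os
    import subprocess

    def check_node():
        print("ceph.conf present:", os.path.exists("/etc/pve/ceph.conf"))
        # 'pvecm nodes' lists the Proxmox cluster members as this node sees them.
        result = subprocess.run(["pvecm", "nodes"], capture_output=True, text=True)
        print(result.stdout)

    if __name__ == "__main__":
        check_node()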

    Also, you mentioned you are using VirtualBox: did you add a 2nd vNIC to all the VMs? Is 10.10.10.0/24 your primary network?
     
  17. Gilberto Ferreira

    Joined:
    Aug 21, 2014
    Messages:
    47
    Likes Received:
    0
    I checked it, and I see that ceph.conf exists in /etc/pve, but I still get the "ceph not initialized" message...
    And I'm working with two NICs in VirtualBox: one for the LAN and one just for cluster/Ceph...
    Let me finish the configuration and I will try again here... I'll post the results later...

    Thanks
     
  18. Gilberto Ferreira

    Joined:
    Aug 21, 2014
    Messages:
    47
    Likes Received:
    0
    Sorry! My misunderstanding here...
    I was confused about pveceph init and pveceph createmon... The first command only needs to be run on one node, while the other command, pveceph createmon, is run on each node...


    Now I got it!


    Thanks
     
  19. meldieching

    meldieching New Member

    Joined:
    Oct 28, 2014
    Messages:
    3
    Likes Received:
    0
    Can I install Proxmox VE on both virtual and physical servers? For example, I would like to use some virtual machines (not managed by Proxmox) as MONs and some physical servers as OSD hosts.
     
  20. Norman Uittenbogaart

    Joined:
    Feb 28, 2012
    Messages:
    144
    Likes Received:
    4
    You don't need Proxmox to create MONs and OSDs, though?

    Sent from my SM-G920F using Tapatalk
     