Ceph configuration for replication on second node

Discussion in 'Proxmox VE: Installation and configuration' started by anthony, May 17, 2019.

  1. anthony

    anthony New Member

    I thought I did this all correctly: I installed Ceph with RBD and assumed the defaults would replicate the data between the two nodes, not just across the local OSDs. I have 2 nodes (a 3rd strictly for monitoring) with 8 drives per node and a minimum of 2 copies set in the pool. I'm new to Ceph, but it appears the CRUSH rule is what I need to be looking at? Can someone help me understand a little better how a rule would be set up to ensure there is an accessible copy of the data on each node? Is there a default place where I can directly edit the CRUSH map in a text editor, or is it all done through the crush commands?
     
  2. Alwin

    Alwin Proxmox Staff Member
    For Ceph you need three nodes as MONs, and for small clusters a size 3 / min_size 2, as chances are high that in-flight objects get lost on a subsequent failure. The default failure domain is the node (host) level, so it should already have distributed a copy to the second node. And yes, you change how Ceph replicates things with the CRUSH map, but I strongly advise against it. Better to take a look at ZFS with our pvesr (send/receive) to replicate data between nodes.
    https://pve.proxmox.com/pve-docs/chapter-pvesr.html
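    As a quick check (a sketch; the default pool name rbd from your setup is assumed), you can look at the pool's replica settings and export the CRUSH map to a plain text file for inspection:
    Code:
    # replica settings of the pool (pool name "rbd" assumed)
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
    # export and decompile the CRUSH map so it can be read in a text editor
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    A modified map would be recompiled with crushtool -c and injected with ceph osd setcrushmap -i, but as said above, changing it is not recommended here.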
     
  3. anthony

    anthony New Member

    Alwin, why do you advise against Ceph? One major drawback of ZFS with pvesr is that live migration is not possible, which is something I am looking for. Can you explain what you mean by "chances are high that objects in-flight might get lost on a subsequent failure"?
     
  4. Alwin

    Alwin Proxmox Staff Member
    You didn't mention that before. But for a hyper-converged setup with only two nodes, ZFS storage is the recommended way to go.

    The window in which an object has only one copy written to disk, while the second copy is still being written to a different disk on a different node, is longer in small clusters. This is because bigger clusters gain parallel write capability and therefore need less time to sync data as they grow. If a second failure occurs and it hits objects that are the last remaining copy of themselves, those objects will very likely be lost.
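    As a sketch (pool name rbd assumed again), the replica count is set per pool; note that with only two OSD hosts and the default host failure domain, the third copy cannot be placed and the pool would stay degraded:
    Code:
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2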
     
  5. anthony

    anthony New Member

    Ah okay, I understand now. I'm guessing there's no way to say there has to be a copy per node for the pool, and not just 2 copies spread across OSDs? With a 20 Gb (2x 10 Gb links bonded) replication network, I'd assume the disk write speed would be the limiting factor, so there would be minimal difference in speed compared to writing to 2 OSDs.

    I asked in a separate post a few days ago which of the many replicated storage solutions would be best and only got one response. In there I did ask about live migration.

    Unfortunately, live migration is a requirement for this particular setup. It's a shame that ZFS does not allow for it.

    Is there another more reliable solution that would allow for live migration?
     
  6. Alwin

    Alwin Proxmox Staff Member
    This is already the default: replication is across hosts (see ceph osd tree).

    Ceph does not have such an affinity setting, but you probably won't max out the 10 GbE bandwidth with Ceph anyway. Besides that, your corosync (cluster) traffic and client traffic need to be on separate networks too, otherwise the cluster will be slow/unstable.

    It depends on what reliable means for your setup. But live migration can be done with any shared storage supported by Proxmox VE (e.g. NFS/SMB/iSCSI/...).
    https://pve.proxmox.com/pve-docs/chapter-pvesm.html#_storage_types
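    To confirm the failure domain, the OSD tree and the rule itself can be checked (a sketch; replicated_rule is assumed as the default rule name, as used since Luminous):
    Code:
    # shows which OSDs sit under which host bucket
    ceph osd tree
    # "type host" in the chooseleaf step means copies go to different hosts
    ceph osd crush rule dump replicated_rule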
     
  7. anthony

    anthony New Member

    The 20 Gb link is strictly for the Ceph cluster network. There is another NIC for quorum, one for backups, and two for VM interfaces. I am still having trouble getting a VM to restart after a link failure. I assume it's due to the VM disk not being accessible on the other machine, but perhaps this is not due to Ceph. Is there a way to verify that there is a replica?

    Unfortunately, I only have the 2 servers for both storage and compute, so something like Ceph, or I suppose ZFS, would be my only option. Live migration is extremely beneficial for this use case, but it seems having reliability on 2 nodes with live migration is a tall order?
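    One way to check where the copies of each placement group actually live (a sketch; pool name rbd assumed) is to list the PGs with their UP/ACTING OSD sets and map those OSD IDs back to hosts:
    Code:
    ceph pg ls-by-pool rbd
    ceph osd tree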
     
  8. anthony

    anthony New Member

    It appears my PGs are not remapping.
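    For reference, the standard status commands to see why PGs are stuck (a sketch, nothing setup-specific assumed):
    Code:
    ceph -s
    ceph health detail
    ceph pg dump_stuck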
     
  9. Alwin

    Alwin Proxmox Staff Member
    Do you have the right replication count (size / min_size) set for your pool?
     
  10. anthony

    anthony New Member

    The OSDs are not being marked as down for some reason. I think there is a setting for the number of reports required for an OSD to be marked as down. I am going to heed your advice and go with ZFS for production, but I am determined to get this to work in testing now; I'll have to work around the lack of live migration. Is there a way to set a default replication for any new VMs added to the cluster?
     
  11. Alwin

    Alwin Proxmox Staff Member
    No, but you could use the API for your workflow.

    Usually two OSDs from another node need to report an OSD before it is marked down.
    Code:
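    # the relevant option in the output is mon_osd_min_down_reporters (default 2)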
    ceph daemon osd.0 config show | grep -i report
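    For the ZFS/pvesr route mentioned earlier, replication jobs are created per guest rather than as a cluster-wide default; as a sketch (VMID 100 and the node name targetnode are placeholders):
    Code:
    pvesr create-local-job 100-0 targetnode --schedule "*/15"
    pvesr status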
     