Ceph configuration for replication on second node

anthony

I thought I did this all correctly, but I installed Ceph with RBD and assumed the defaults would replicate data between the two nodes, not just across the local OSDs. I have 2 nodes (a 3rd strictly for monitoring) with 8 drives per node and a minimum of 2 copies set on the pool. I'm new to Ceph, but it appears the CRUSH rule is what I need to be looking at? Can someone help me understand a little better how a rule would be set up to ensure there is an accessible copy of the data on each node? Is there a default place where I can directly edit the CRUSH map in a text editor, or is it all done through the crush commands?
 
For Ceph you need three nodes as MONs and, for small clusters, a size 3 / min_size 2, as chances are high that objects in-flight might get lost on a subsequent failure. The default failure domain is the node level, so it should already have distributed a copy to the second node. And yes, you change how Ceph replicates things with the CRUSH map, but I strictly advise against editing it. Better to take a look at ZFS with our pvesr (send/receive) to replicate data between nodes.
https://pve.proxmox.com/pve-docs/chapter-pvesr.html
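
If you just want to see how the default rule distributes copies before touching anything, you can dump it; the rule name replicated_rule is the usual default on current releases, older ones may call it replicated_ruleset:
Code:
ceph osd crush rule ls
ceph osd crush rule dump replicated_rule
ceph osd getcrushmap -o crush.bin && crushtool -d crush.bin -o crush.txt
The chooseleaf step with type host in the dump is what forces each copy onto a different node; the getcrushmap/crushtool pair is the usual way to export the CRUSH map into a plain text file if you ever do need to edit it, which, again, I would not recommend here.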
 
Alwin, why do you advise against Ceph? One major drawback of ZFS and pvesr is that live migration is not possible, which is something I am looking for. Can you explain what you mean by "chances are high that objects in-flight might get lost on a subsequent failure"?
 
One major drawback of ZFS and pvesr is that live migration is not possible, which is something I am looking for.
You didn't mention this before. But for a hyper-converged setup with only two nodes, ZFS storage is the recommended way to go.

Can you explain what you mean by "chances are high that objects in-flight might get lost on a subsequent failure"?
The window during which an object might have only one copy written to disk, while the second copy is still being written to a different disk on a different node, is longer in small clusters. This is because bigger clusters grow in parallel write capability and therefore need less time to sync data as they get bigger. If a second failure occurs and it hits objects that are the last remaining copy of themselves, those objects will very likely be lost.
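If you do stay on Ceph and add a third OSD node, the size/min_size recommendation can be applied to an existing pool; the pool name rbd below is just a placeholder:
Code:
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
With only two OSD hosts and the host-level failure domain, the third copy has nowhere to go and the pool would stay degraded, which is why this really implies a third node with OSDs.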
 
Ah okay, I understand now. I'm guessing there's no way to say there has to be a copy per node for the pool and not just 2 copies across OSDs? With a 20 Gb (2x 10 Gb links bonded) replication network, I'd assume the disk write speed would be the limiting factor and there would be minimal difference in speed compared to writing to 2 OSDs.

I asked in a separate post a few days ago which of the many replication storage solutions would be best and only got one response. In that post I did ask about live migration.

Unfortunately, live migration is a requirement for this particular setup. It's a shame ZFS does not allow for it.

Is there another more reliable solution that would allow for live migration?
 
I'm guessing there's no way to say there has to be a copy per node for the pool and not just 2 copies across OSDs?
This is already the default: copies are distributed across hosts (see ceph osd tree).
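You can verify the placement yourself: ceph osd tree shows the host/OSD hierarchy, and ceph osd map computes which OSDs a copy would land on, even for a made-up object name; the pool and object names here are only examples:
Code:
ceph osd tree
ceph osd map rbd test-object
With the default rule, the two OSDs in the acting set should belong to different hosts.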

With a 20 Gb (2x 10 Gb links bonded) replication network, I'd assume the disk write speed would be the limiting factor and there would be minimal difference in speed compared to writing to 2 OSDs.
Ceph does not have an affinity for local OSDs, but you probably won't max out the 10 GbE bandwidth with Ceph anyway. Besides that, your corosync (cluster) and client traffic need to be on separate networks too, otherwise the cluster will be slow/unstable.
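As a rough sketch of that separation in /etc/ceph/ceph.conf (the subnets are placeholders, and corosync should get its own dedicated network on top of these):
Code:
[global]
public network = 10.10.10.0/24
cluster network = 10.10.20.0/24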

Is there another more reliable solution that would allow for live migration?
Depends on what reliable means for your setup. But live migration can be done with any shared storage supported by Proxmox VE (e.g. NFS/SMB/iSCSI/...).
https://pve.proxmox.com/pve-docs/chapter-pvesm.html#_storage_types
 
The 20 Gb link is strictly for the cluster network; there is another NIC for quorum, 1 for backups, and 2 for VM interfaces. I am still having trouble getting a VM to restart after a link failure. I assume it's due to the VM disk not being accessible on the other machine, though perhaps this is not due to Ceph. Is there a way to verify that there is a replica?

Unfortunately I only have the 2 servers for both storage and compute, so something like Ceph, or I suppose ZFS, would be my only option. Live migration is extremely beneficial for this use case, but it seems having reliability on 2 nodes with live migration is a tall order?
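A few commands that can help confirm whether the pool still has an accessible copy and can serve I/O after one node goes down; the pool name rbd is just an example:
Code:
ceph -s
ceph osd pool get rbd size
ceph osd pool get rbd min_size
Note that with size 2 / min_size 2, losing one node leaves only one copy, which is below min_size, so Ceph blocks I/O on the pool; that alone could be why the VM will not start on the surviving node.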
 
The OSDs are not being marked as down for some reason. I think there is a setting for the number of votes needed for an OSD to be marked as down. I am going to heed your advice and go with ZFS for production, but I am determined to get this to work in testing now; I'll have to work around the lack of live migration. Is there a way to set a default replication for any new VMs added to the cluster?
 
Is there a way to set a default replication for any new VMs added to the cluster?
No, but you could use the API for your workflow.
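As a sketch, a replication job for the ZFS/pvesr route can be created per guest from the CLI (or via the matching /cluster/replication API endpoint); the VMID 100, the job number, the target node name and the schedule are placeholders:
Code:
pvesr create-local-job 100-0 targetnode --schedule "*/15"
You would still have to run this once per new guest, e.g. from a small provisioning script, since there is no cluster-wide default.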

The OSDs are not being marked as down for some reason. I think there is a setting for the number of votes needed for an OSD to be marked as down.
Usually two OSDs from another node need to report an OSD before it is marked down.
Code:
ceph daemon osd.0 config show | grep -i report
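The option that usually matters here is mon_osd_min_down_reporters (default 2), together with the OSD heartbeat grace time; it can also be checked on the monitor's admin socket, assuming the MON id matches the short hostname, and option names can vary slightly between Ceph releases:
Code:
ceph daemon mon.$(hostname -s) config show | grep -Ei 'min_down_reporters|heartbeat_grace'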
 
