Ceph configuration for replication on second node

anthony

I thought I did this all correctly, but I installed Ceph with RBD and assumed the defaults would replicate data between the two nodes, not just across the local OSDs. I have 2 nodes (a 3rd strictly for monitoring) with 8 drives per node and a minimum of 2 copies set on the pool. I'm new to Ceph, but it appears the CRUSH rule is what I need to be looking at? Can someone help me understand a little better how a rule would be set up to ensure there is an accessible copy of the data on each node? Is there a default place where I can directly edit the CRUSH map in a text editor, or is it all done through the crush commands?
 
For Ceph you need three nodes as MONs and, for small clusters, a size 3 / min_size 2, as chances are high that objects in-flight might get lost on a subsequent failure. The default failure domain is the node level, so it should already have distributed a copy to the second node. And yes, you change how Ceph replicates things with the CRUSH map, but I strictly advise against editing it. Better to take a look at ZFS with our pvesr (send/receive) to replicate data between nodes.
https://pve.proxmox.com/pve-docs/chapter-pvesr.html
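
If you just want to see how the default rule distributes copies before touching anything, you can dump it; the rule name replicated_rule is the usual default on current releases, older ones may call it replicated_ruleset:
Code:
ceph osd crush rule ls
ceph osd crush rule dump replicated_rule
ceph osd getcrushmap -o crush.bin && crushtool -d crush.bin -o crush.txt
The chooseleaf step with type host in the dump is what forces each copy onto a different node; the getcrushmap/crushtool pair is the usual way to export the CRUSH map into a plain text file if you ever do need to edit it, which, again, I would not recommend here.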
 
Alwin, why do you advise against Ceph? One major drawback of ZFS and pvesr is that live migration is not possible, which is something I am looking for. Can you explain what you mean by "chances are high that objects in-flight might get lost on a subsequent failure"?
 
One major drawback of ZFS and pvesr is that live migration is not possible, which is something I am looking for.
You didn't mention this before. But for a hyper-converged setup with only two nodes, ZFS storage is the recommended way to go.

Can you explain what you mean by "chances are high that objects in-flight might get lost on a subsequent failure"?
The window during which an object might have only one copy written to disk, while the second copy is still being written to a different disk on a different node, is longer in small clusters. This is because bigger clusters grow in parallel write capability and therefore need less time to sync data as they get bigger. If a second failure occurs and it hits objects that are the last remaining copy of themselves, those objects will very likely be lost.
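If you do stay on Ceph and add a third OSD node, the size/min_size recommendation can be applied to an existing pool; the pool name rbd below is just a placeholder:
Code:
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
With only two OSD hosts and the host-level failure domain, the third copy has nowhere to go and the pool would stay degraded, which is why this really implies a third node with OSDs.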
 
Ah okay, I understand now. I'm guessing there's no way to say there has to be a copy per node for the pool and not just 2 copies across OSDs? With a 20 Gb (2x 10 Gb links bonded) replication network, I'd assume the disk write speed would be the limiting factor and there would be minimal difference in speed compared to writing to 2 OSDs.

I asked in a separate post a few days ago which of the many replication storage solutions would be best and only got one response. In that post I did ask about live migration.

Unfortunately, live migration is a requirement for this particular setup. It's a shame ZFS does not allow for it.

Is there another more reliable solution that would allow for live migration?
 
I'm guessing there's no way to say there has to be a copy per node for the pool and not just 2 copies across OSDs?
This is already the default: copies are distributed across hosts (see ceph osd tree).
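You can verify the placement yourself: ceph osd tree shows the host/OSD hierarchy, and ceph osd map computes which OSDs a copy would land on, even for a made-up object name; the pool and object names here are only examples:
Code:
ceph osd tree
ceph osd map rbd test-object
With the default rule, the two OSDs in the acting set should belong to different hosts.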

With a 20 Gb (2x 10 Gb links bonded) replication network, I'd assume the disk write speed would be the limiting factor and there would be minimal difference in speed compared to writing to 2 OSDs.
Ceph does not have an affinity for local OSDs, but you probably won't max out the 10 GbE bandwidth with Ceph anyway. Besides that, your corosync (cluster) and client traffic need to be on separate networks too, otherwise the cluster will be slow/unstable.
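As a rough sketch of that separation in /etc/ceph/ceph.conf (the subnets are placeholders, and corosync should get its own dedicated network on top of these):
Code:
[global]
public network = 10.10.10.0/24
cluster network = 10.10.20.0/24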

Is there another more reliable solution that would allow for live migration?
Depends on what reliable means for your setup. But live migration can be done with any shared storage supported by Proxmox VE (e.g. NFS/SMB/iSCSI/...).
https://pve.proxmox.com/pve-docs/chapter-pvesm.html#_storage_types
 
The 20 Gb link is strictly for the cluster network; there is another NIC for quorum, 1 for backups, and 2 for VM interfaces. I am still having trouble getting a VM to restart after a link failure. I assume it's due to the VM disk not being accessible on the other machine, though perhaps this is not due to Ceph. Is there a way to verify that there is a replica?

Unfortunately I only have the 2 servers for both storage and compute, so something like Ceph, or I suppose ZFS, would be my only option. Live migration is extremely beneficial for this use case, but it seems having reliability on 2 nodes with live migration is a tall order?
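A few commands that can help confirm whether the pool still has an accessible copy and can serve I/O after one node goes down; the pool name rbd is just an example:
Code:
ceph -s
ceph osd pool get rbd size
ceph osd pool get rbd min_size
Note that with size 2 / min_size 2, losing one node leaves only one copy, which is below min_size, so Ceph blocks I/O on the pool; that alone could be why the VM will not start on the surviving node.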
 
The OSDs are not being marked as down for some reason. I think there is a setting for the number of votes needed for an OSD to be marked as down. I am going to heed your advice and go with ZFS for production, but I am determined to get this to work in testing now; I'll have to work around the lack of live migration. Is there a way to set a default replication for any new VMs added to the cluster?
 
Is there a way to set a default replication for any new VMs added to the cluster?
No, but you could use the API for your workflow.
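As a sketch, a replication job for the ZFS/pvesr route can be created per guest from the CLI (or via the matching /cluster/replication API endpoint); the VMID 100, the job number, the target node name and the schedule are placeholders:
Code:
pvesr create-local-job 100-0 targetnode --schedule "*/15"
You would still have to run this once per new guest, e.g. from a small provisioning script, since there is no cluster-wide default.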

The OSDs are not being marked as down for some reason. I think there is a setting for the number of votes needed for an OSD to be marked as down.
Usually two OSDs from another node need to report an OSD before it is marked down.
Code:
ceph daemon osd.0 config show | grep -i report
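The option that usually matters here is mon_osd_min_down_reporters (default 2), together with the OSD heartbeat grace time; it can also be checked on the monitor's admin socket, assuming the MON id matches the short hostname, and option names can vary slightly between Ceph releases:
Code:
ceph daemon mon.$(hostname -s) config show | grep -Ei 'min_down_reporters|heartbeat_grace'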
 
