Ceph architecture - hyper-convergence

Raboo

New Member
Jan 10, 2016
Hi,

I'm wondering how this hyper-converged story with Ceph is deployed in Proxmox?
Because I think the Ceph documentation states that you shouldn't run Ceph on the same nodes as the hypervisor, due to resource competition. Ceph can easily eat up all available memory during recovery.
A workaround would be to run Ceph inside a local container or virtual instance that limits the memory and perhaps even the CPU.
Same with networking: you perhaps need QoS if you are running Ceph on the same NIC as the Proxmox virtualization traffic.
Well, at least that's how most "hyper-converged" vendors are doing it.

How does it work in proxmox?
 
Hi,
if you use Ceph, a dedicated network for it is a must, as with any shared or distributed storage.
As for the CPU and RAM point: you have to have a node that has enough resources for this task.
This is not a solution for over-committed nodes that have no resources free.
 
For network QoS you have a couple of options:
  • QoS via dedicated links
  • QoS via the switch
  • QoS via an SDN controller and Open vSwitch
  • Limit Ceph backfill/recovery via tuning
  • Ceph-related: spread scrub times over 24h (or keep scrubs from running during high-load times) and limit the backfill ratios for recoveries (see the sketch after this list)
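A minimal sketch of what that recovery/scrub throttling could look like; the exact values are assumptions and should be adapted to your cluster, and option names can differ between Ceph releases:

    # throttle recovery/backfill at runtime on all OSDs
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

    # persist the same limits and confine scrubbing to a quiet window, in ceph.conf:
    [osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1
    osd scrub begin hour = 1
    osd scrub end hour = 6
    osd scrub load threshold = 0.5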

For Ceph you have a bunch of tuning options if you choose to ignore the Ceph guidance and run it alongside what you need for VMs, but that is really a Ceph tuning topic, e.g.:
  • osd client message size cap
  • min_read_recency_for_promote
  • min_write_recency_for_promote
  • use a script to release memory regularly: "ceph tell (osd.x / mon.x / mds.x) heap release" (see the sketch after this list)
  • limiting your mons as described here: http://www.spinics.net/lists/ceph-devel/msg23048.html
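A rough sketch of such a heap-release cron job; the daemon IDs are assumptions, adjust them to whatever actually runs on the node:

    #!/bin/bash
    # release unused tcmalloc heap memory held by this node's Ceph daemons
    # (OSD IDs 0-3 and a mon named after the host are placeholders)
    for id in 0 1 2 3; do
        ceph tell osd.$id heap release
    done
    ceph tell mon.$(hostname -s) heap release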


Sure, you could go the way some solutions do and run a dedup-capable FS on the ground floor with Ceph in a VM, but does that really solve anything beyond maybe the "artificial OSD cap"? That is actually backwards: you want Ceph on the ground floor and a dedup algorithm running in front of it, or rather running passively on top of Ceph after your writes have been done.

I think it makes more sense to run Ceph on the ground floor, because the one thing you want when you use Ceph is proper continuity via multiple failure domains and the ability to separate your storage into tiers, with SSD/NVMe for hot storage and erasure-coded HDD for cold storage. Then wait for the "even colder storage" and dedup plugins that are being worked on.
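To give an idea of what such a hot/cold split can look like, here is a rough cache-tier sketch; the pool names, PG counts, the CRUSH rule "ssd-rule" and the EC profile "my-ec-profile" are all assumptions, and whether cache tiering fits your workload at all is its own discussion:

    # cold EC pool on HDDs, hot replicated pool on SSDs (rule and profile assumed to exist)
    ceph osd pool create cold-data 128 128 erasure my-ec-profile
    ceph osd pool create hot-cache 128 128 replicated ssd-rule
    # stack the hot pool in front of the cold pool as a writeback cache
    ceph osd tier add cold-data hot-cache
    ceph osd tier cache-mode hot-cache writeback
    ceph osd tier set-overlay cold-data hot-cache
    # the recency options from the tuning list above need a hit set to work
    ceph osd pool set hot-cache hit_set_type bloom
    ceph osd pool set hot-cache min_read_recency_for_promote 2
    ceph osd pool set hot-cache min_write_recency_for_promote 2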

Sure, 200 Windows VMs holding 20GB of the same data are worth avoiding, but is it really worth spending money on RAM (around 20GB per TB of pool data for ZFS dedup - http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe) when you could just buy an additional 8TB drive for 240 USD and stick the "useless" space of 200 more Windows VMs on it?


Otherwise you could just use a solution like GlusterFS, BeeGFS, or any of the gazillion other solutions going down that path.

 
Well, the biggest issue is memory. I was thinking that a containerized Ceph would protect you from the risk of over-committing when Ceph is doing rebuilds and similar actions.

Let's say you have a VM host.
- It has 100 units of RAM.
- Ceph requires 15 units of RAM during rebuild, but can easily use all available memory.
- Ceph uses 5 units of RAM during regular operations.
- You see that you have 95 units of RAM available.
- You deploy a bunch of VM nodes (KVM + LXC) that take 80-90 units of RAM.
- You have 5-15 units that are not allocated to VM nodes, but on the hypervisor you see all currently unused RAM, which could be 45 units, since the VMs have not yet touched all of their allocation.
- Another Ceph node breaks, Ceph rebuilds, and Ceph starts using all unused RAM, let's say 45 units, just because it's there and available on the hypervisor.
- A VM node (LXC) has an application that uses 1 unit of RAM but has a quota of 20 units; it believes it has access to 20 units and starts using them.
- OOM or some other trouble.

It's not a matter of having enough resources; it's a matter of dedicating/protecting the resources you have and need for certain workloads, so that you can't over-commit. If you put Ceph in an LXC container instead of on the hypervisor, it is protected from that type of over-commit, because you limit the container to 15 units of RAM in the above example.
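As a rough sketch of that kind of cap on a Proxmox node (the container ID 200 and the 15 GiB value are just placeholders for the "15 units" above):

    # hard-cap the hypothetical Ceph container at 15 GiB RAM and no swap
    pct set 200 -memory 15360 -swap 0
    # the resulting limits show up in the container config
    cat /etc/pve/lxc/200.conf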

I'm a novice when it comes to networking, but I mentioned it because typical server hardware might have a NIC with 2x 10Gb and 2x 1Gb ports.
I would prefer to use both 10Gb ports in an LACP bond to get 20Gb, and then use a virtual interface for storage and a couple of virtual NICs for compute on different VLANs, instead of having 20Gb to storage and 2Gb to compute.
And running one 10Gb port to storage and one 10Gb port to compute would leave you with a non-redundant solution.
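A minimal sketch of that bonded layout in Proxmox terms; the interface names, VLAN ID and addresses are assumptions:

    # /etc/network/interfaces - LACP bond carrying a storage VLAN and the VM bridge
    auto bond0
    iface bond0 inet manual
        bond-slaves eth0 eth1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4

    auto bond0.100
    iface bond0.100 inet static
        # dedicated Ceph/storage VLAN
        address 10.10.100.11
        netmask 255.255.255.0

    auto vmbr0
    iface vmbr0 inet static
        # bridge for VM/compute traffic
        address 192.168.1.11
        netmask 255.255.255.0
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0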

The Ceph project also maintains Docker containers for Ceph: https://github.com/ceph/ceph-docker.

If Proxmox containerized Ceph and protected Ceph and the hypervisor from resource over-commit, then Proxmox would be a decent production-ready hyper-converged solution...
 
No one is stopping you from doing that right now.
Just set up an LXC container or VM for Ceph per node, then install a Ceph cluster on it.

Your options are (see the sketch after this list):
  • Pass disks through to the VM (check the Proxmox wiki on why that is a bad idea - the example there is a virtual NAS)
  • Create a vDisk for each physical drive
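A quick sketch of both options with Proxmox's qm tool; the VM ID 101, the disk serial and the 200GB size are placeholders:

    # option 1: pass a whole physical disk through to the Ceph VM
    qm set 101 -scsi1 /dev/disk/by-id/ata-EXAMPLE_SERIAL
    # option 2: allocate a vDisk per physical drive instead (here 200GB on local-lvm)
    qm set 101 -scsi2 local-lvm:200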


Personally speaking:
I just know that at our scale the overhead of this approach is not worth the benefit it brings.




Also, a big FYI:
when you run Ceph/Proxmox you DO NOT NEED redundancy on the network level. The reason you use a multi-node setup of Proxmox and of Ceph is so that THEY safeguard against any failure in the failure domain "host" (and below). So there is no need for redundant network links; just set them up for maximum throughput across multiple destinations/sources (TCP ports).