Small Proxmox/Ceph cluster layout

Mar 4, 2015
Hi,

I'm currently evaluating a new (small) virtualisation cluster for about 20 Windows/Linux server VMs (coming from VMware).
One combination that caught my interest is Proxmox plus Ceph, as it allows me to renew the storage as well.
As I'm new to Ceph and Proxmox, I lack experience in how to lay out such a cluster. A first recommendation for a 3-node HA cluster is in the Proxmox wiki (https://pve.proxmox.com/wiki/Ceph_Server), using the local storage of the nodes and distributing it via Ceph. Has anyone built/run a similar setup? How is the performance? Can it handle about 30 mid-size server VMs (mail, database, web server, file server)?
As the storage in the example is connected directly via SAS/SATA, it should be quite fast (low latency). But since it is distributed via Ceph, does that mean a local write has to be replicated among all hosts before it is acknowledged? So a write access would result in one local SAS write plus two write accesses over 10 Gb Ethernet?

Would be happy to hear about your experience with such a setup.
clicks
 
Hi,
the VM data on Ceph is split into many 4 MB chunks. The chunks end up in different PGs (placement groups), which are on different (or the same) OSDs.
Ceph writes to the primary OSD first and after that to the remaining replicas. Only when all replicas have written the data to their journal is the write acknowledged.

This means only about 33% of your writes are local and you have to wait for all the other nodes (this is the reason why you should use fast and reliable SSDs for journaling, like the Intel DC S3700).
About latency: you should test your configuration beforehand.
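
To see why the journal device dominates the acknowledged write latency, here is a minimal back-of-envelope sketch of the write path described above (all latency numbers are assumptions, not measurements -- replace them with values from your own hardware):

```python
# Rough model of a replicated Ceph write (size=3): the client sends the data
# to the primary OSD, the primary forwards it to the two replicas in parallel,
# and the write is acknowledged once every journal has committed it.

NET_LATENCY_MS = 0.2   # one 10 GbE hop (assumed)
SSD_JOURNAL_MS = 0.1   # journal commit on an SSD like the DC S3700 (assumed)
HDD_JOURNAL_MS = 8.0   # journal commit on the OSD's spinning disk (assumed)

def ack_latency_ms(journal_ms, replicas=3):
    """Time until the client sees the ack for one write (simplified)."""
    client_to_primary = NET_LATENCY_MS
    primary_commit    = journal_ms
    # The primary sends to the remaining replicas in parallel and must wait
    # for the slowest one: network out, remote journal commit, network back.
    replica_commit    = NET_LATENCY_MS + journal_ms + NET_LATENCY_MS
    return client_to_primary + max(primary_commit, replica_commit)

print(f"SSD journals: {ack_latency_ms(SSD_JOURNAL_MS):.1f} ms")  # ~0.7 ms
print(f"HDD journals: {ack_latency_ms(HDD_JOURNAL_MS):.1f} ms")  # ~8.6 ms
```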

Udo
 
That's what we're doing, but using an existing Ceph cluster, so I don't know about Proxmox's built-in Ceph management.

Fast and reliable SSDs are also really expensive, so we went with journals on a small partition on each OSD disk. Probably not as fast, but it has been fast enough for us. Each Ceph node has 6x 4 TB disks and is connected to the Proxmox nodes via two 10G switches (LACP, so while both switches are up the bandwidth is 20G). We're using SAS for the OSDs, but for this purpose SATA would have been fine. Setting up the Ceph cluster was easy using Inktank's Ansible playbooks.
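
If you want a rough feel for what that kind of node layout gives you, here's a quick sizing sketch (the node count and per-disk write speed are just my assumptions, plug in your own numbers):

```python
# Ballpark sizing for nodes with 6 x 4 TB OSDs and journals colocated on a
# small partition of each OSD disk (filestore). Node count and disk speed
# below are assumptions for illustration only.

NODES         = 3     # assumed number of Ceph nodes
OSDS_PER_NODE = 6
DISK_TB       = 4
REPLICAS      = 3     # default size=3 pool
DISK_MB_S     = 150   # assumed sequential write speed of one 4 TB disk

raw_tb    = NODES * OSDS_PER_NODE * DISK_TB
usable_tb = raw_tb / REPLICAS

# With the journal on the same spindle, every write hits that disk twice
# (journal + data), so plan for roughly half the raw disk throughput per OSD,
# and then divide by the replication factor for client-visible throughput.
per_osd_write = DISK_MB_S / 2
cluster_write = per_osd_write * NODES * OSDS_PER_NODE / REPLICAS

print(f"raw: {raw_tb} TB, usable (size={REPLICAS}): {usable_tb:.0f} TB")
print(f"ballpark aggregate client write throughput: {cluster_write:.0f} MB/s")
```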

Setting up Proxmox is easy, except you have to manually configure fencing and HA for the Proxmox nodes (Ceph handles this fine on its own). We just got our fencing hardware, so I'm about to find out how that goes.

Before setting this up, I faked it on a Linux machine using libvirt. You can try things like live migration (KVM allows for nested virtualization), detaching an OSD's disk, killing a Ceph node, etc.; see the sketch below. If you're going to try this, I've found nested virtualization, at least with KVM, more reliable on AMD hosts. There is a libvirt fencing method, but Proxmox doesn't ship libvirt. If you really want, you can probably install it just for the fencing agent to try that.
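
Here's roughly what that failure testing looks like with the libvirt Python bindings on the test host (the domain name "ceph-node-1" and disk target "vdb" are just placeholders for whatever your test VMs use):

```python
import libvirt

conn = libvirt.open("qemu:///system")

# Simulate an OSD disk failure: hot-unplug one data disk from a Ceph node VM,
# then watch the cluster rebalance with "ceph -w" inside the VMs.
dom = conn.lookupByName("ceph-node-1")
disk_xml = """
<disk type='file' device='disk'>
  <target dev='vdb' bus='virtio'/>
</disk>
"""
# Depending on the libvirt version you may need the full disk definition
# taken from dom.XMLDesc() instead of this minimal snippet.
dom.detachDeviceFlags(disk_xml, libvirt.VIR_DOMAIN_AFFECT_LIVE)

# Simulate a whole node dying: hard power-off (no clean shutdown), bring it
# back later and check that recovery finishes.
dom.destroy()   # equivalent to pulling the power
dom.create()    # power it back on

conn.close()
```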

Our cluster itself runs on Dell R720s. The iDRAC console is a pain, but you only have to use it for the initial setup and when things go really bad. Ask them to set all the OSD disks up as single-disk RAID0. If you have to replace one, there's an annoying Dell web GUI that runs in Linux to re-initialize the RAID, but once that's done, if you're using the playbook, it can take over from there.
 
