Small Proxmox/Ceph cluster layout

Mar 4, 2015
Hi,

I'm currently evaluating a new (small) virtualisation cluster for about 20 Windows/Linux server VMs (coming from VMware).
One combination that caught my interest is Proxmox plus Ceph, as it allows me to renew the storage as well.
As I'm new to Ceph and Proxmox, I lack experience in how to lay out such a cluster. A first recommendation for a 3-node HA cluster is in the Proxmox wiki (https://pve.proxmox.com/wiki/Ceph_Server), using the local storage of the nodes and distributing it via Ceph. Has anyone built/run a similar setup? How is the performance? Can it handle about 30 mid-size server VMs (mail, database, web server, file server)?
As the storage in the example is connected directly via SAS/SATA, it should be quite fast (low latency). But since it is distributed via Ceph, does that mean a local write has to be replicated among all hosts before it is acknowledged? So a write access would result in one local SAS write plus two write accesses over 10 Gb Ethernet?

Would be happy to hear about your experience with such a setup.
clicks
 
Hi,
the VM data on Ceph is split into many 4 MB chunks. The chunks end up in different PGs (placement groups), which are on different (or the same) OSDs.
Ceph writes to the primary OSD first and after that to the remaining replicas. Only when all replicas have written the data to their journal is the write acknowledged.

This means only about 33% of your writes are local and you have to wait for all the other nodes (this is the reason why you should use fast and reliable SSDs for journaling, like the Intel DC S3700).
About latency: you should test your configuration beforehand.
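
To see why the journal device dominates the acknowledged write latency, here is a minimal back-of-envelope sketch of the write path described above (all latency numbers are assumptions, not measurements -- replace them with values from your own hardware):

```python
# Rough model of a replicated Ceph write (size=3): the client sends the data
# to the primary OSD, the primary forwards it to the two replicas in parallel,
# and the write is acknowledged once every journal has committed it.

NET_LATENCY_MS = 0.2   # one 10 GbE hop (assumed)
SSD_JOURNAL_MS = 0.1   # journal commit on an SSD like the DC S3700 (assumed)
HDD_JOURNAL_MS = 8.0   # journal commit on the OSD's spinning disk (assumed)

def ack_latency_ms(journal_ms, replicas=3):
    """Time until the client sees the ack for one write (simplified)."""
    client_to_primary = NET_LATENCY_MS
    primary_commit    = journal_ms
    # The primary sends to the remaining replicas in parallel and must wait
    # for the slowest one: network out, remote journal commit, network back.
    replica_commit    = NET_LATENCY_MS + journal_ms + NET_LATENCY_MS
    return client_to_primary + max(primary_commit, replica_commit)

print(f"SSD journals: {ack_latency_ms(SSD_JOURNAL_MS):.1f} ms")  # ~0.7 ms
print(f"HDD journals: {ack_latency_ms(HDD_JOURNAL_MS):.1f} ms")  # ~8.6 ms
```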

Udo
 
That's what we're doing, but using an existing Ceph cluster, so I don't know about Proxmox's built-in Ceph management.

Fast and reliable SSDs are also really expensive, so we went with journals on a small partition on each OSD disk. Probably not as fast, but it has been fast enough for us. Each Ceph node has 6x 4 TB disks and is connected to the Proxmox nodes via two 10G switches (LACP, so while both switches are up the bandwidth is 20G). We're using SAS for the OSDs, but for this purpose SATA would have been fine. Setting up the Ceph cluster was easy using Inktank's Ansible playbooks.
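
If you want a rough feel for what that kind of node layout gives you, here's a quick sizing sketch (the node count and per-disk write speed are just my assumptions, plug in your own numbers):

```python
# Ballpark sizing for nodes with 6 x 4 TB OSDs and journals colocated on a
# small partition of each OSD disk (filestore). Node count and disk speed
# below are assumptions for illustration only.

NODES         = 3     # assumed number of Ceph nodes
OSDS_PER_NODE = 6
DISK_TB       = 4
REPLICAS      = 3     # default size=3 pool
DISK_MB_S     = 150   # assumed sequential write speed of one 4 TB disk

raw_tb    = NODES * OSDS_PER_NODE * DISK_TB
usable_tb = raw_tb / REPLICAS

# With the journal on the same spindle, every write hits that disk twice
# (journal + data), so plan for roughly half the raw disk throughput per OSD,
# and then divide by the replication factor for client-visible throughput.
per_osd_write = DISK_MB_S / 2
cluster_write = per_osd_write * NODES * OSDS_PER_NODE / REPLICAS

print(f"raw: {raw_tb} TB, usable (size={REPLICAS}): {usable_tb:.0f} TB")
print(f"ballpark aggregate client write throughput: {cluster_write:.0f} MB/s")
```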

Setting up Proxmox is easy, except you have to manually configure fencing and HA for the Proxmox nodes (Ceph handles this fine on its own). We just got our fencing hardware, so I'm about to find out how that goes.

Before setting this up, I faked it on a Linux machine using libvirt. You can try things like live migration (KVM allows for nested virtualization), detaching an OSD's disk, killing a Ceph node, etc.; see the sketch below. If you're going to try this, I've found nested virtualization, at least with KVM, more reliable on AMD hosts. There is a libvirt fencing method, but Proxmox doesn't ship libvirt. If you really want, you can probably install it just for the fencing agent to try that.
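
Here's roughly what that failure testing looks like with the libvirt Python bindings on the test host (the domain name "ceph-node-1" and disk target "vdb" are just placeholders for whatever your test VMs use):

```python
import libvirt

conn = libvirt.open("qemu:///system")

# Simulate an OSD disk failure: hot-unplug one data disk from a Ceph node VM,
# then watch the cluster rebalance with "ceph -w" inside the VMs.
dom = conn.lookupByName("ceph-node-1")
disk_xml = """
<disk type='file' device='disk'>
  <target dev='vdb' bus='virtio'/>
</disk>
"""
# Depending on the libvirt version you may need the full disk definition
# taken from dom.XMLDesc() instead of this minimal snippet.
dom.detachDeviceFlags(disk_xml, libvirt.VIR_DOMAIN_AFFECT_LIVE)

# Simulate a whole node dying: hard power-off (no clean shutdown), bring it
# back later and check that recovery finishes.
dom.destroy()   # equivalent to pulling the power
dom.create()    # power it back on

conn.close()
```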

Our cluster itself runs on Dell R720s. The iDRAC console is a pain, but you only have to use it for the initial setup and when things go really bad. Ask them to set all the OSD disks up as single-disk RAID0. If you have to replace one, there's an annoying Dell web GUI that runs in Linux to re-initialize the RAID, but once that's done, if you're using the playbook, it can take over from there.
 
