Can you recommend Ceph for a 2-node proxmox setup?

RMM

Active Member
Oct 25, 2013
30
1
28
We are planning to set up a 2-node cluster (maybe later, when we need more resources, a 3rd node would be added); each node would have a 4-core Xeon processor with 32GB of RAM (maybe 64). Each node would have 4 HDDs, out of which we would make a software RAID 10 array. The nodes would be connected over an internal 1Gbps network, with another 1Gbps link to the world. I'm aware that the maximum write throughput would be about 125 MB/s.
If one node fails, the VMs should be restarted manually on the 2nd node.
We will rent two dedicated servers at hetzner.de, and it should stay affordable, so we won't invest in a 10Gbps network or SSDs.
I didn't find much up-to-date information about 2-3 node setups with only 1 OSD per node, so now my questions are:
- Is it recommendable to use Ceph with only 2 nodes?
- Would we be able to get a write throughput of around 100 MB/s, or would it be a lot lower?
- We won't be using an SSD for the journal. Should we still be able to get around 100 MB/s of write performance?
- For a 2-node setup, the read throughput should be almost as fast as without Ceph, right?
- Should we forget about Ceph as long as we only use 2 nodes and use DRBD directly, and look into Ceph again when we need a 3rd node?

thanks a lot for your input :)
 
- Is it recommendable to use Ceph with only 2 nodes?
No. You need 3 MONs for quorum and 10GbE to get performance, especially during recovery.
http://docs.ceph.com/docs/master/start/hardware-recommendations/

If one node fails, the VMs should be restarted manually on the 2nd node.
This calls more for an active-backup solution; check out pve-zsync.
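Roughly, a sync job for one VM looks like this (just a sketch; the VMID, target IP and ZFS dataset are placeholders, see the pve-zsync man page for the exact options):

# run on the node currently hosting VM 100; pushes ZFS snapshots to the other node
pve-zsync create --source 100 --dest 192.168.0.2:rpool/backup --name vm100job --maxsnap 2 --verbose
# list the configured jobs and their last sync state
pve-zsync list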

- Would we be able to get a write throughput of around 100 MB/s, or would it be a lot lower?
- We won't be using an SSD for the journal. Should we still be able to get around 100 MB/s of write performance?
- For a 2-node setup, the read throughput should be almost as fast as without Ceph, right?
Take a look at the above hardware recommendations and at how Ceph works. If you don't, you will be sorry afterwards, as it comes with a good amount of complexity.

- Should we forget about Ceph as long as we only use 2 nodes and use DRBD directly, and look into Ceph again when we need a 3rd node?
Mainly, forget about shared storage with two nodes; you can never ensure a simple majority in case of disaster.
For more, see https://en.wikipedia.org/wiki/Byzantine_fault_tolerance
A Byzantine fault is any fault presenting different symptoms to different observers. A Byzantine failure is the loss of a system service due to a Byzantine fault in systems that require consensus.
E.g. if the link between both servers fails, which one is now the up-to-date one?
 
I'm aware of quorums, split-brain and similar scenarios.
I've installed quite a few 2-node (sometimes dual-primary) DRBD systems for small companies, which usually do care about money and can barely afford 2 servers. It's impossible to convince them to get 3. But those 2-node setups have been working pretty well for more than 10 years...
My question was more related to performance, especially write performance.

No. You need 3 MONs for quorum and 10GbE to get performance, especially during recovery.
http://docs.ceph.com/docs/master/start/hardware-recommendations/
They also recommend not running VMs on the same hosts the Ceph storage is on. But Proxmox states in their wiki that it should work.

This calls more for an active-backup solution; check out pve-zsync.

My fault, I didn't express it well. There will be VMs running on both hosts all the time; just when one fails, we should be able to restart them on the other node.

I'm really mainly concerned about the write performance. It's almost impossible to find any actual benchmarks or experience reports on small-node setups (maybe there is a reason for that ;-)).

I'm already almost convinced I'll go for the usual DRBD setup; I just wanted to see if I could get some experience reports.
 
They also recommend not running VMs on the same hosts the Ceph storage is on. But Proxmox states in their wiki that it should work.
This gives them (Ceph), especially with bigger setups, a lot of trouble, as Ceph is very latency-sensitive (besides the usual CPU/RAM demands). On a hyper-converged setup, you need to have enough resources available to accommodate Ceph + VMs/CTs. In my experience, once you grow your cluster to more than ~5-6 PVE hosts, you will soon start to host monitors on separate servers (or run fewer/no VMs/CTs on the PVE hosts) for better latency.

My fault, I didn't express it well. There will be VMs running on both hosts all the time; just when one fails, we should be able to restart them on the other node.
I still think that storage replication might be a more feasible way (especially in a hosted environment).
https://pve.proxmox.com/pve-docs/chapter-pvesr.html

I'm really mainly concerned about the write performance. It's almost impossible to find any actual benchmarks or experience reports on small-node setups (maybe there is a reason for that ;-)).
In my experience, to get "good" performance (yes, vague), you start with 12 disks distributed over three servers. The cluster network needs 10GbE bandwidth to provide decent enough throughput when recovery (disk/host failure, OSD maintenance) happens, as the cluster and public network usually reside on the same link in small setups. Ceph performs better at scale: the more nodes it has, the faster it gets. This dependence on hardware, environment and use case makes it hard to give numbers; it usually comes down, to some extent, to trial and error.

I'm already almost convinced I'll go for the usual DRBD setup; I just wanted to see if I could get some experience reports.
There are quite a few people on the forum that have hosted setups, or Ceph/DRBD; I hope they can give you more insight.
 
I'm already almost convinced I'll go for the usual DRBD setup; I just wanted to see if I could get some experience reports.
If it's of any help, I have a MicroServer Gen10 X3216 with 8GB RAM running single-node Proxmox 5.1 and Ceph Luminous and did a benchmark yesterday. It gave me a write speed of 100+ MB/s and a read of 250+ MB/s. Multiple nodes should give better performance since you don't have all your writes on a single node. IOPS was about 100 I think.

These MicroServers are about as low-performance as a COTS server gets. Not a very useful setup for businesses, but OK for home use.
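If you want to produce comparable numbers on your own hardware, Ceph's built-in RADOS benchmark is the usual tool; roughly like this (the pool name is just an example, and --no-cleanup keeps the objects so the read test has something to read):

# write benchmark for 60 seconds into a test pool, keeping the objects
rados bench -p testpool 60 write --no-cleanup
# sequential read benchmark against the objects written above
rados bench -p testpool 60 seq
# remove the benchmark objects afterwards
rados -p testpool cleanup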
 
We just had a discussion, and we noticed that for the price of 2 nodes we could get 3 cheaper ones. So that's what we will do :).

This gives them (Ceph), especially with bigger setups, a lot of trouble, as Ceph is very latency-sensitive. I still think that storage replication might be a more feasible way (especially in a hosted environment).
https://pve.proxmox.com/pve-docs/chapter-pvesr.html
Thinking about it, it sounds more and more like a pretty acceptable solution. So in terms of replication with 3 nodes (with VMs running on each node) that would mean:
A replicates to B
B replicates to C
C replicates to A
So one node can fail and we'd lose up to a little more than one minute of data (assuming a replication schedule of every minute); that could be an e-mail, an order or so, but still all VMs would be running... Since we don't expect servers to fail on a regular basis, I think that should be acceptable. And read/write speed for the VMs wouldn't be limited by the network speed.
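If I read the pvesr docs right, that ring would be roughly three jobs like the following (node names and VMIDs are made up; each job is created on the node that currently runs the VM, and "*/1" should give the one-minute schedule):

# on node A, for VM 100 running there: replicate to B every minute
pvesr create-local-job 100-0 nodeB --schedule "*/1"
# on node B, for VM 101 running there: replicate to C
pvesr create-local-job 101-0 nodeC --schedule "*/1"
# on node C, for VM 102 running there: replicate to A
pvesr create-local-job 102-0 nodeA --schedule "*/1"
# check the state of the local replication jobs
pvesr status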
But if we want to add a 4th node, wouldn't it be pretty complicated to integrate it into the whole setup? (If some VMs replicate from A to B and some to C and some to D and vice versa, it sounds like we'd lose the overview immediately ;-).)

From what it sounds like, even if we use 3 nodes you still wouldn't recommend using Ceph (we are stuck with the dedicated 1Gbps link), right?

@MewBie, thanks for the benchmark, but I would like to see it with the network link in between (so 2 or 3 nodes).
 
With four nodes, you can split the replication logic in two and sync between A <-> B and between C <-> D. ;) You can also replicate to multiple hosts.

From what it sounds like, even if we use 3 nodes you still wouldn't recommend using Ceph (we are stuck with the dedicated 1Gbps link), right?
Yes, this link would carry Ceph client traffic and recovery. Your VMs on the nodes would slow down drastically and become unresponsive. Also, the latency is questionable in a hosted environment, as the servers might not be on the same switch. This can also lead to trouble with corosync; it should be kept in mind when buying/renting the servers.
 
With four nodes, you can split the replication logic in two and sync between A <-> B and between C <-> D. ;) You can also replicate to multiple hosts.
The most obvious solution, and I didn't see it ;-).

Yes, this link would carry Ceph client traffic and recovery. Your VMs on the nodes would slow down drastically and become unresponsive. Also, the latency is questionable in a hosted environment, as the servers might not be on the same switch. This can also lead to trouble with corosync; it should be kept in mind when buying/renting the servers.

We would have a dedicated switch only for our servers. But I guess it would still be too slow.

Thanks a lot for your help :).
 
Hi,
I do not know if Ceph can solve your problem, but 2 Proxmox nodes with ZFS surely can.

- you can create a 2-node cluster setup with ZFS
- you can do async replication for any VM from A to B and from B to A, let's say every 5 minutes
- the downside could be that your write speed might not be at the value you expect, but adding a small SSD could improve the performance (see the sketch after this list)
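A small SSD would typically be added as a separate ZFS log device (SLOG), which mainly helps with sync writes; a rough sketch, where the pool and device names are only examples:

# add a small SSD partition as a dedicated ZFS intent log (helps sync writes)
zpool add rpool log /dev/disk/by-id/ata-SOME_SSD-part1
# optionally use another partition of the SSD as a read cache
zpool add rpool cache /dev/disk/by-id/ata-SOME_SSD-part2
# verify that the new log/cache vdevs show up
zpool status rpool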

As a side note, I can say that I run this kind of setup and it is working for me. But in my case I do not need very high write performance. Also, do not ask if tool X could solve your problem ... ask how Proxmox/any tool can solve your problem.
 
Hi,
I do not know if Ceph can solve your problem, but 2 Proxmox nodes with ZFS surely can.

- the downside could be that your write speed might not be at the value you expect, but adding a small SSD could improve the performance
I guess the write speed slowdown comes from ZFS. How bad is the impact?
 
I guess the write speed slowdown comes from ZFS. How bad is the impact?

It is difficult to say. ZFS write speed depends on... RAM, disk access patterns, OS/ZFS settings, guest settings and so on.
The best way to find out is to try it and see how it is in your case.
 
