To Ceph or not to Ceph

SuperMiguel

Hello all, I currently have 4 servers, each running an NVMe disk on a PCIe card. I set up a Proxmox cluster and a Ceph cluster, but I'm getting really poor performance. When I run the Ceph bench I get about 140 MB/s over a 1G link. I'm going to upgrade to 10G next week (waiting on some NICs), but shouldn't I be getting more than 140 MB/s with 1G Ethernet? The NVMe disks are rated for something like 4000 MB/s (I know distributed storage will never be close to bare metal)...
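For context, a throughput number like that usually comes from the stock RADOS benchmark, roughly along these lines (the pool name here is just a placeholder):

Code:
    rados bench -p testpool 60 write --no-cleanup   # 60-second write test
    rados bench -p testpool 60 seq                  # sequential read test against the same objects
    rados -p testpool cleanup                       # remove the benchmark objects afterwards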

That being said, this is for a homelab; I run things like Home Assistant for my home automation that I want to keep up as much as possible, which is why I picked Ceph in the first place, for the redundancy. I'm also doing backups to an Unraid server. Should I keep using Ceph? Is it not worth it? Should I switch to something else? I keep reading about needing a ton of servers or disks for Ceph to make sense... I only have 4 servers and they each have 1 disk in Ceph.
 
How many OSDs do you have per node?
Did you set it up with the default size of 3?

1G is a huge bottleneck if you're only using NVMes. I'd recommend waiting for your 10G NICs to arrive and then see how the performance changes.
 
but shouldn't I be getting more than 140 MB/s with 1G Ethernet?
How so? 1G means 1 Gbit/s (bit, not byte), so 1000 Mbit/s / 8 bit/byte -> 125 MByte/s.

As Ceph works across multiple nodes and can do a bit locally, it's slightly faster than that, but not by much, as a write can only complete once at least two copies of an object have been written with the safe 3/2 size/min-size setting.
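Just to spell out the arithmetic behind those numbers (a minimal sketch, nothing Ceph-specific):

Code:
    # Link speeds are quoted in bits per second, disk/benchmark throughput in bytes per second.
    def line_rate_mbytes(gbit_per_s: float) -> float:
        return gbit_per_s * 1000 / 8  # e.g. 1 Gbit/s -> 125 MB/s

    print(line_rate_mbytes(1))   # 125.0  MB/s  - close to the ~140 MB/s observed (local I/O adds a bit)
    print(line_rate_mbytes(10))  # 1250.0 MB/s  - the theoretical ceiling after a 10G upgrade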

The NVMe disks are rated for something like 4000 MB/s (I know distributed storage will never be close to bare metal)...
Yeah, it's probably just the network that's limiting you that hard.
10G Ethernet for the Ceph cluster (private) network is pretty much the lower limit I'd go for in a Ceph setup nowadays. With small cluster sizes (<= 5 nodes) one can also do a full-mesh setup, avoiding the need to also buy an expensive 10G or 40G+ switch:
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server
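For illustration only, the simple routed variant described on that wiki page ends up looking roughly like this in /etc/network/interfaces on one of three nodes; the interface names and addresses are made-up placeholders, each mesh NIC is cabled directly to one of the other nodes, and the wiki is the authoritative reference for the exact setup:

Code:
    # Node 1 (10.15.15.50), directly connected to node 2 (.51) and node 3 (.52)
    auto ens19
    iface ens19 inet static
            address 10.15.15.50/24
            up ip route add 10.15.15.51/32 dev ens19
            down ip route del 10.15.15.51/32

    auto ens20
    iface ens20 inet static
            address 10.15.15.50/24
            up ip route add 10.15.15.52/32 dev ens20
            down ip route del 10.15.15.52/32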


Should I keep using Ceph? Is it not worth it? Should I switch to something else?
Depends, do you need shared storage? Some redundancy could also be achieved with a ZFS RAID1 (mirror), for example; you won't need to think about the network there, and you can expose it as a CIFS/NFS share rather easily too, FWIW.
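For example, a local ZFS mirror plus an NFS export boils down to roughly this (pool, dataset, and device names are placeholders):

Code:
    zpool create tank mirror /dev/sdb /dev/sdc   # two-disk mirror
    zfs create tank/share
    zfs set sharenfs=on tank/share               # requires an NFS server on the host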

Ceph, on the other hand, has the benefit of being shared storage, extending redundancy not only across a single disk but also across servers. It is highly scalable (so if you plan to grow your data a lot it may be better suited than other techs) and can be recovered from basically anything - as long as one goes slowly and avoids digging the hole deeper.

So, as said, depends on your needs.
 
How many OSDs do you have per node?
Did you set it up with the default size of 3?

1G is a huge bottleneck if you're only using NVMes. I'd recommend waiting for your 10G NICs to arrive and then see how the performance changes.

1 OSD per node, and yes, the default of 3. Sounds good, I'll do that.
 
How so? 1G means 1 Gbit/s (bit, not byte), so 1000 Mbit/s / 8 bit/byte -> 125 MByte/s.

As Ceph works across multiple nodes and can do a bit locally, it's slightly faster than that, but not by much, as a write can only complete once at least two copies of an object have been written with the safe 3/2 size/min-size setting.


Yeah, it's probably just the network that's limiting you that hard.
10G Ethernet for the Ceph cluster (private) network is pretty much the lower limit I'd go for in a Ceph setup nowadays. With small cluster sizes (<= 5 nodes) one can also do a full-mesh setup, avoiding the need to also buy an expensive 10G or 40G+ switch:
https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server



Depends, do you need shared storage? Some redundancy could also be achieved with a ZFS RAID1 (mirror), for example; you won't need to think about the network there, and you can expose it as a CIFS/NFS share rather easily too, FWIW.

Ceph, on the other hand, has the benefit of being shared storage, extending redundancy not only across a single disk but also across servers. It is highly scalable (so if you plan to grow your data a lot it may be better suited than other techs) and can be recovered from basically anything - as long as one goes slowly and avoids digging the hole deeper.

So, as said, depends on your needs.

Yeah, I was hoping the local part would be faster. So my theoretical highest speed will be 1250 MByte/s at 10G. I'm going to take a look at the full-mesh setup, never heard of it...

And as far as my needs go, this is my homelab's main cluster, so I want it to be up as much as possible. I don't have hot-swap-ready servers, so it needs to keep working with 1 or 2 of those servers down.
 
Yeah, I was hoping the local part would be faster.

For writes, with replication, no, it can't be faster, as the writes are synchronous.

For reads, maybe, if you're lucky, some blocks could be read locally (but it'll be pretty random, as the blocks are distributed across different nodes).
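If you're curious where things land, Ceph can tell you which placement group and which OSDs (and therefore which nodes) hold a given object; the pool and object names below are placeholders:

Code:
    # The "up"/"acting" sets in the output list the OSD IDs holding the object's copies.
    ceph osd map testpool some-object-name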
 
So as far as the network goes, the best way to set up Ceph is to have 2 dedicated 10G connections, one for the Ceph public network and one for the Ceph sync network?

So in the end it would look something like:

1 1G for Proxmox Network
1 10G for Ceph Public Network
1 10G for Ceph Sync Network
 
The "sync" network (cluster network) is where the replicas get sent over, so it may see about double (and a bit more) the traffic of the public one, which clients (e.g., VMs or CTs) read and write data through.
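That split is exactly what the two network options in the Ceph config express; the subnets below are placeholders:

Code:
    # [global] section of the cluster's ceph.conf
    public_network  = 10.10.10.0/24   # clients (VMs/CTs) talk to MONs and OSDs here
    cluster_network = 10.10.20.0/24   # OSD-to-OSD replication and recovery ("sync") traffic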

If you can afford both 10G networks I'd definitely go for it. The public one can also be configured for VM migration traffic, as it probably won't be fully saturated by Ceph public traffic alone; that way you can live-migrate running machines really fast, which is nice too.
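The migration part is, as far as I recall, a one-liner in Proxmox's datacenter.cfg; the subnet is a placeholder for whatever the Ceph public subnet is:

Code:
    # /etc/pve/datacenter.cfg - send live-migration traffic over the 10G public network
    migration: secure,network=10.10.10.0/24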

Just for the record, an odd node count is always nicer in terms of trade-offs:
  • three nodes: one node can fail with the others still being quorate
  • four nodes: still only one can fail, as once two fail the remaining two are no longer a majority and can't tell a network split from a real failure; but the remaining load can be spread over three nodes instead of two, which can still be nice.
  • five nodes: now two can fail, and if one fails the extra load can be spread over more remaining nodes than in the previous cases.
Just saying this to set expectations right, and in case you happen to have an extra server around ;)
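A minimal way to see that pattern is plain majority-quorum arithmetic (nothing Proxmox-specific):

Code:
    # A cluster stays quorate while a strict majority of all nodes is still reachable.
    def tolerable_failures(nodes: int) -> int:
        quorum = nodes // 2 + 1   # smallest strict majority
        return nodes - quorum     # nodes that may fail while quorum is kept

    for n in (3, 4, 5):
        print(n, "nodes ->", tolerable_failures(n), "may fail")
    # 3 nodes -> 1, 4 nodes -> 1, 5 nodes -> 2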
 
