Running Ceph on cheap NVMe?

Jannoke

Well-Known Member
Jul 13, 2016
So I'm slowly getting to the point where my small cluster is getting important enough to need redundancy. I'm already running local storage on ZFS with Samsung entry-level NVMe drives, and performance is great. But I'm looking at moving my mechanical backup to something more "solid". I know that Ceph does not do well on consumer QLC drives. Still, I'm wondering how bad it would be on shared storage like Ceph, as there are two main requirements:
* have redundancy on node failure
* have decent sequential write/read speed (NOT random RW)

Also, if anyone has experience with a reasonably fast and efficient backup solution that loses no data when one cluster node fails, now would be a good time to chime in.

Ceph seems like one way, but redirect me if you have good knowledge of and experience with something else.
 
With all respect, that was not the question. "Works" is a vague statement. I'm pretty sure it works on USB flash drives. I'm asking if someone has run some setup on el cheapo drives, 2 drives per node etc., and what the performance on direct IO (no random IO) would look like.
 
Yes, as I said, it works for smaller companies (I have one or two of them). As for USB, it works for 7 days at most before the whole cluster starts dying (also tested in practice with one of my customers).
 
If you are considering Samsung QVOs for VM storage, then don't! They are terrible once the cache is full. I have some of them and use them for backup storage alone, and even for that they might be on the verge of unusable. The write performance can drop to 50 MiB/s quite easily :-/

Ceph uses sync writes, which means a write only counts as done once the disk has ACKed it as safely stored. Datacenter SSDs with power loss protection (PLP) can ACK such writes quickly, so they are what you want if you run VMs on it. They are not as expensive as they used to be, at least the cheaper options.
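To get a rough feel for how a given drive handles sync writes, something like the following sketch could be used. It is not how Ceph itself writes data; it only times small writes that are each followed by an `fsync`, which is the access pattern that tends to expose drives without PLP. The function names, the 4 KiB block size and the write count are assumptions chosen for illustration.

```python
import os
import tempfile
import time


def sync_write_latencies(path, block_size=4096, count=50):
    """Write `count` blocks to `path`, fsync after each one, return latencies in ms."""
    block = os.urandom(block_size)
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)  # wait until the OS reports the data as flushed to the device
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies


def summarize(latencies):
    """Return min, median and max of a list of latencies (all in ms)."""
    ordered = sorted(latencies)
    return {
        "min_ms": ordered[0],
        "median_ms": ordered[len(ordered) // 2],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Assumption: replace the temporary directory with a path on the drive under test.
    with tempfile.TemporaryDirectory() as tmp:
        print(summarize(sync_write_latencies(os.path.join(tmp, "probe.bin"))))
```

The temporary directory is only a placeholder; the numbers are meaningful only when the file lives on the drive you actually want to judge.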

If all you care about is actual backup storage via CephFS, then they might work okayish enough. But don't expect anything great.
* have decent sequential write/read speed (NOT random RW)
Ceph is an object store. Anything you write will be distributed among the cluster nodes / OSDs. SSDs will also internally use wear levelling. I wouldn't expect much difference between random and sequential writes.
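A toy illustration of why a sequential write does not stay sequential from the cluster's point of view: the byte range is cut into fixed-size objects, and each object lands on a pseudo-randomly chosen OSD. This is only a sketch. Real Ceph places objects via placement groups and CRUSH, not the plain hash used here, and the 4 MiB object size, the object naming and the function names are assumptions for illustration.

```python
import hashlib
from collections import Counter

OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size in bytes


def objects_for_range(offset, length, object_size=OBJECT_SIZE):
    """Return the indices of all objects touched by a write of `length` bytes at `offset`."""
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))


def toy_osd_for(object_index, osd_count):
    """Stand-in for real placement: hash the object name and pick an OSD from the hash."""
    digest = hashlib.sha256(f"obj.{object_index}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % osd_count


if __name__ == "__main__":
    # One sequential 64 MiB write, spread over a hypothetical 6-OSD cluster.
    touched = objects_for_range(0, 64 * 1024 * 1024)
    spread = Counter(toy_osd_for(i, osd_count=6) for i in touched)
    print(f"{len(touched)} objects, objects per OSD: {dict(spread)}")
```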
 
We started our clusters with a mix of consumer and datacenter NVMEs.
We replaced all consumer NVMEs after a while.

The datacenter NVMEs had a read/write latency of 0ms to 1ms.
The consumer NVMEs had a read/write latency of 3ms to 7ms.

On top of that the main problem for you would be your Network.
1G will add another 2ms for each transaction.
40G will add only ~ 0.2ms for each transaction.
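As a back-of-the-envelope check, the drive and network figures above can simply be added up. The additive model is an assumption (real transactions may involve several hops and replicas), and the function names are made up for this sketch; only the millisecond values come from the post.

```python
def per_transaction_latency_ms(drive_ms, network_ms):
    """Simple additive model: drive latency plus the network's added latency."""
    return drive_ms + network_ms


def max_serial_ops_per_second(latency_ms):
    """Upper bound on operations per second when each one waits for the previous one."""
    return 1000.0 / latency_ms


# Figures quoted in the post (upper ends of the drive latency ranges).
DRIVES_MS = {"datacenter": 1.0, "consumer": 7.0}
NETWORKS_MS = {"1G": 2.0, "40G": 0.2}

if __name__ == "__main__":
    for drive, drive_ms in DRIVES_MS.items():
        for net, net_ms in NETWORKS_MS.items():
            total = per_transaction_latency_ms(drive_ms, net_ms)
            ops = max_serial_ops_per_second(total)
            print(f"{drive:10s} {net:>3s}: {total:4.1f} ms, ~{ops:.0f} ops/s at queue depth 1")
```

Under this model a consumer drive on 1G ends up several times slower per transaction than a datacenter drive on 40G, which is the gap the post is pointing at.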

If you want to store data and backups, I think 1G could work.
For running VMs I would recommend a 40G network.

I also recommend putting coolers on all NVMEs.
 
  • I'm running 10GbE.
  • All NVMe drives are in either dual or quad carriers that go into PCIe x8 or x16 slots (using PCIe bifurcation). They have either separate forced cooling or a full double-sided aluminium block heatsink on the whole assembly.
  • Temps are also monitored to make sure nothing is going wrong with the cooling.
  • I don't plan to run a large number of virtual machines on the storage, only a single machine with most of the Ceph storage attached to it.

    Just trying to get shared storage that would auto-heal between nodes without having a separate backup system node with single point of failure and without losing too much of disk storage because of storage being distributed between nodes.
 
Just trying to get shared storage that would auto-heal between nodes without having a separate backup system node with single point of failure and without losing too much of disk storage because of storage being distributed between nodes.
Well, it will accomplish that, but your config is not optimal at all. For starters, your goal of "without losing too much disk storage" is least served by this configuration, as only 33% of your raw capacity will be usable, and that's before you account for the maximum practical pool utilization of 80%. The other (and arguably bigger) issue is the carriers. Low-quality SSDs are more likely to fail and need replacement, and since they're in shared carriers you'd need to take the whole node offline and manually hunt down the failed module.

Not that it's not workable, but this sort of deployment is only good if you are massively overprovisioned, so you never actually have to replace anything.
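The capacity point can be worked through as a small calculation. The 33% and 80% figures are the ones from the post; reading 33% as three replicas is an assumption, and the node count and drive size in the example are hypothetical (only "2 drives per node" appears earlier in the thread).

```python
def usable_capacity_tb(raw_tb, replicas=3, max_fill=0.80):
    """Usable space after replication, capped at the practical maximum pool utilization."""
    return raw_tb / replicas * max_fill


if __name__ == "__main__":
    # Hypothetical cluster: 3 nodes, 2 drives per node, 2 TB per drive.
    nodes, drives_per_node, drive_tb = 3, 2, 2.0
    raw = nodes * drives_per_node * drive_tb
    usable = usable_capacity_tb(raw)
    print(f"raw: {raw:.1f} TB, usable: {usable:.1f} TB ({usable / raw:.0%} of raw)")
```

So with these assumptions roughly a quarter of the purchased capacity ends up available for data.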
 