Running CEPH? on cheap NVME

Jannoke · Jul 4, 2023

So I'm slowly getting to the point where my small cluster is getting important enough to have redundancy. I'm already running local storage on ZFS with Samsung entry-level NVMe drives, and performance is great. But I'm looking at moving my mechanical backup to something more "solid". I know that Ceph is not good on QLC consumer drives. Still, I'm wondering how bad it would be as shared storage, since there are two main points to follow:
* have redundancy on node failure
* have decent sequential write/read speed (NOT random RW)

Also, if anyone has experience with a reasonably fast and efficient "no loss on one cluster node failure" backup solution, now would be a good time to chime in.

Ceph seems like one way, but redirect me if you have good knowledge of and experience with something else.

ness1602 · Jul 4, 2023

Ceph works on consumer HDD, SSD and NVMe; it's just that the performance is not that great.

Jannoke · Jul 4, 2023

With all respect, that was not the question. "Works" is a vague statement. I'm pretty sure it works on USB flash drives. I'm asking if someone has done some setup on el cheapo drives, 2 drives per node etc., and what the performance on direct IO (no random IO) would look like.

ness1602 · Jul 5, 2023

Yes, as I said, it works for smaller companies (I have one or two of them). As for USB, it works for at most 7 days before the whole cluster starts dying (tested in practice with one of my customers, too).

aaron · Jul 5, 2023

If you consider some Samsung QVOs for VM storage, then don't! They are terrible once the cache is full. I have some of them and use them for backup storage alone. And even for that they might be on the verge of unusable. The write performance can drop down to 50 MiB/s quite easily :-/

Ceph will use sync writes, and that means the disk may only ACK a write once the data is safely stored. Datacenter SSDs (with power loss protection / PLP) handle that well and are what you want if you run VMs on it. They are not as expensive as they used to be. At least the cheaper options.

If all you care about is actual backup storage via CephFS, then they might work okayish enough. But don't expect anything great.
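To get a rough feel for how a given drive copes with sync writes, a small probe like the one below can help. It is only a sketch and not how Ceph itself writes: it appends 4 KiB blocks to a file and calls fsync after each one. The function name, block size and write count are assumptions made for illustration.

```python
import os
import statistics
import tempfile
import time


def sync_write_latencies(path, count=200, block_size=4096):
    """Write `count` blocks, fsync after each one, return per-write latencies in ms."""
    block = os.urandom(block_size)
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)  # block until the OS reports the data as flushed to the device
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies


if __name__ == "__main__":
    # Hypothetical target: a temp directory. Point this at the drive under test instead.
    with tempfile.TemporaryDirectory() as tmp:
        lat = sync_write_latencies(os.path.join(tmp, "probe.bin"))
        print(f"median {statistics.median(lat):.3f} ms, max {max(lat):.3f} ms")
```

If the temp directory lives on tmpfs the numbers are meaningless, so the path should be on the filesystem backed by the SSD in question. A dedicated benchmark tool will give more trustworthy results; this only shows the idea.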

Jannoke said:
* have decent sequential write/read speed (NOT random RW)

Ceph is an object store. Anything you write will be distributed among the cluster nodes / OSDs. SSDs will also internally use wear levelling. I wouldn't expect much difference between random and sequential writes.

BenediktS · Jul 5, 2023

We started our clusters with a mix of consumer and datacenter NVMEs.
We replaced all consumer NVMEs after a while.

The datacenter NVMEs had a read/write latency of 0ms to 1ms.
The consumer NVMEs had a read/write latency of 3ms to 7ms.

On top of that, the main problem for you would be your network.
1G will add another 2ms for each transaction.
40G will add only ~ 0.2ms for each transaction.

If you want to store data and backups, I think 1G could work.
For running VMs I would recommend a 40G network.
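Putting the numbers above together gives a rough per-write picture. The sketch below assumes disk and network latency simply add up, which is a simplification; the dictionary and function names are made up for illustration and only the millisecond figures come from this post.

```python
# Millisecond figures quoted in the post; the additive model is an assumption.
DISK_LATENCY_MS = {"datacenter": (0.0, 1.0), "consumer": (3.0, 7.0)}
NETWORK_LATENCY_MS = {"1G": 2.0, "40G": 0.2}


def per_write_latency_ms(disk, network):
    """Return (best, worst) latency in ms, assuming disk and network latency add."""
    low, high = DISK_LATENCY_MS[disk]
    extra = NETWORK_LATENCY_MS[network]
    return (low + extra, high + extra)


def qd1_writes_per_second(latency_ms):
    """Back-to-back writes per second at queue depth 1 for a given latency."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    for disk in DISK_LATENCY_MS:
        for network in NETWORK_LATENCY_MS:
            best, worst = per_write_latency_ms(disk, network)
            rate = qd1_writes_per_second(worst)
            print(f"{disk:10s} + {network:3s}: {best:.1f} to {worst:.1f} ms, "
                  f"about {rate:.0f} writes/s at worst-case latency")
```

Real workloads run with more than one request in flight, so throughput will be higher than the queue-depth-1 figure. It still shows why the slow disk plus slow network combination hurts the most.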

I also recommend putting coolers on all NVMEs.

ness1602 · Jul 5, 2023

2.5G is the minimum if there is a resync; on 1 Gbit/s it would take an eternity.
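A back-of-the-envelope estimate shows how link speed affects resync time. This is a sketch under stated assumptions: the network is the only bottleneck, and the 80% efficiency figure and the 2 TB data size are hypothetical values chosen for illustration.

```python
def resync_hours(data_tb, link_gbit, efficiency=0.8):
    """Hours to move `data_tb` terabytes over a `link_gbit` Gbit/s link.

    `efficiency` is an assumed fraction of line rate that is actually achieved.
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbit * 1e9 * efficiency)
    return seconds / 3600.0


if __name__ == "__main__":
    data_tb = 2.0  # hypothetical amount of data to re-replicate
    for link in (1.0, 2.5, 10.0):
        print(f"{link:4.1f} Gbit/s: {resync_hours(data_tb, link):.1f} h")
```

On slow consumer drives the disks may become the limit before the network does, so treat the result as a lower bound on resync time.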

Jannoke · Jul 7, 2023

I'm running 10GbE Ethernet.
All NVMe drives are in either dual or quad carriers that go into PCIe x8 or x16 slots (running PCIe bifurcation). They either have separate forced cooling or a full aluminium double-sided block heatsink on the whole assembly.
Temps are also monitored to ensure that nothing is going wrong with the cooling.
I don't plan to run any big number of virtual machines on the storage. Only a single machine with most of the Ceph storage attached to it.

Just trying to get shared storage that would auto-heal between nodes, without having a separate backup system node as a single point of failure and without losing too much disk storage because of the storage being distributed between nodes.

alexskysilk · Jul 7, 2023

Jannoke said:
Just trying to get shared storage that would auto-heal between nodes, without having a separate backup system node as a single point of failure and without losing too much disk storage because of the storage being distributed between nodes.

Well, it will accomplish that, but your config is not optimal at all. For starters, your goal of "without losing too much disk storage" is least served by this configuration, as you will have 33% usable space of your raw capacity, and that's before you account for the maximum practical pool utilization of 80%. The other (and arguably bigger) issue is the carriers. Low-quality SSDs are more likely to fail/need replacement, and since they're in shared carriers you'd need to take the whole node offline and manually hunt down the failed module.

Not that it's not workable, but this sort of deployment is only good if you are massively overprovisioned so you never actually replace anything.
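The capacity arithmetic above can be sketched as follows. The 33% figure corresponds to keeping three copies of everything (assumed here as `replicas=3`), and the 80% is the practical fill limit mentioned in the post. The node and drive counts in the example are hypothetical.

```python
def usable_capacity_tb(raw_tb, replicas=3, max_fill=0.8):
    """Practically usable capacity of a replicated pool.

    raw_tb   -- sum of all drive capacities in the cluster
    replicas -- number of copies kept of each object (assumed 3, matching the 33%)
    max_fill -- fraction of the pool that should be filled at most
    """
    return raw_tb / replicas * max_fill


if __name__ == "__main__":
    # Hypothetical cluster: 3 nodes with 2 drives of 2 TB each.
    raw = 3 * 2 * 2.0
    usable = usable_capacity_tb(raw)
    print(f"raw {raw:.1f} TB -> usable {usable:.1f} TB ({usable / raw:.0%} of raw)")
```

So in practice a bit more than a quarter of the raw capacity ends up usable, which is worth keeping in mind when sizing the drives.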
