Proxmox Ceph performance with consumer-grade Samsung SSDs

vasilis

Hello all,
I have a 3-node Proxmox cluster with Ceph. Each node has 2x 4TB Samsung 870 QVO SSDs.
I have noticed my VMs being really slow and I was wondering how much of that is because of the SSDs.
I have checked my network and everything else.
I'm here just to confirm whether what the AI assistant is telling me can be true and whether enterprise-grade SSDs really can make a huge difference. Is anybody else experiencing similar issues?



  • QLC NAND writes are slow and have high write amplification. The drive hides this with a small pseudo-SLC cache. Let C be the cache size and R_fold the background rate at which the drive folds SLC data into QLC (typically only tens of MiB/s). If your incoming write rate R_in > R_fold, the cache drains; once empty, each write must program QLC and perform garbage collection, causing stalls.
  • Ceph BlueStore issues frequent flush/FUA for DB/WAL and data commits. With no power-loss protection on the QVO, the drive must actually persist data before acknowledging, so fsync waits on the slow QLC path. When GC kicks in, the per-flush time T_flush can jump to O(0.5–5) s.
  • Your controller is in HBA mode, so there's no controller cache masking these latencies; barriers go straight to the SSDs. Even at low fill (~10%), sustained R_in above R_fold produces multi-second tails.
  • Consumer QVOs also lack on-drive PLP, so they can't safely "ack" writes early; enterprise TLC SSDs with PLP keep T_flush in the O(1–10) ms range under the same workload.
In short: the workload’s synchronous write/flush rate exceeds the QVO’s steady‑state consolidation capability, so the drives enter their slow QLC+GC path and become the limiting factor.
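To make the cache-drain arithmetic from the first bullet concrete, here is a minimal back-of-the-envelope sketch in Python. The cache size, incoming rate and fold rate are illustrative assumptions, not measured 870 QVO figures:

```python
# Rough model of a pseudo-SLC write cache draining under sustained writes.
# All numbers below are illustrative assumptions, not Samsung 870 QVO specs.

def time_until_cache_empty(cache_gib: float, r_in_mib_s: float, r_fold_mib_s: float) -> float:
    """Seconds of sustained writing before the SLC cache is exhausted.

    The cache fills at the incoming rate R_in and drains at the fold rate
    R_fold; if R_in <= R_fold it never fills up and the fast path holds.
    """
    net_fill = r_in_mib_s - r_fold_mib_s      # MiB/s accumulating in the cache
    if net_fill <= 0:
        return float("inf")                   # drive keeps up indefinitely
    return cache_gib * 1024 / net_fill        # GiB -> MiB, divided by net fill rate


def steady_state_rate(r_in_mib_s: float, r_fold_mib_s: float) -> float:
    """Once the cache is empty, writes are limited to roughly the fold/QLC rate."""
    return min(r_in_mib_s, r_fold_mib_s)


if __name__ == "__main__":
    cache_gib = 42   # assumed pseudo-SLC cache size at high fill level
    r_in = 200       # assumed aggregate VM + Ceph replication write rate, MiB/s
    r_fold = 40      # assumed QLC folding rate, MiB/s (tens of MiB/s range)

    t = time_until_cache_empty(cache_gib, r_in, r_fold)
    print(f"Cache absorbs the burst for ~{t:.0f} s, "
          f"then drops to ~{steady_state_rate(r_in, r_fold):.0f} MiB/s")
```

With these assumed numbers, a sustained burst is absorbed for roughly four to five minutes, after which every write on that drive is throttled to roughly the fold rate.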
 
For once, AI is right :)
Any consumer drive will have low Ceph performance due to RocksDB and sync writes, but those drives in particular are terrible for anything but PC archiving purposes due to their small SLC cache and very slow QLC NAND chips. It's hard to get more than ~40 MBytes/s from each disk once the cache is full.
Try to get your hands on some enterprise drives, especially for their PLP, which allows them to ack sync writes way faster and coalesce writes afterwards. Even second-hand ones will serve you perfectly fine and last longer than any consumer drive. If you're still unsure, get 3 small enterprise disks, create a Ceph pool just with them and try for yourself.
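If you want to measure this yourself before buying anything, a quick fsync latency probe against a file on each drive already tells most of the story. The sketch below only illustrates the idea (the test path is a placeholder); fio with fsync=1 is the more usual tool:

```python
# Minimal fsync latency probe: write small blocks and fsync after each one,
# similar in spirit to BlueStore's WAL commits. TEST_PATH is a placeholder --
# point it at a file on the disk you want to measure.
import os
import time

TEST_PATH = "/mnt/testdisk/fsync_probe.bin"   # placeholder path, adjust to your drive
BLOCK = b"\0" * 4096                          # 4 KiB writes
ITERATIONS = 200

fd = os.open(TEST_PATH, os.O_WRONLY | os.O_CREAT, 0o600)
latencies = []
try:
    for _ in range(ITERATIONS):
        os.write(fd, BLOCK)
        start = time.perf_counter()
        os.fsync(fd)                          # drive must persist before returning
        latencies.append(time.perf_counter() - start)
finally:
    os.close(fd)
    os.unlink(TEST_PATH)

latencies.sort()
print(f"median fsync: {latencies[len(latencies) // 2] * 1000:.2f} ms")
print(f"p99 fsync:    {latencies[int(len(latencies) * 0.99)] * 1000:.2f} ms")
```

On a drive with PLP the median typically stays well below a millisecond; on a cache-exhausted QLC drive both numbers climb by orders of magnitude.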
 
Thank you
I appreciate the reply. If I remember correctly, it wasn't too bad until I started adding multiple VMs.
Would any consumer-grade drive work reasonably well? The cost differential is a bit steep when going to enterprise-grade drives.
 
Yes, they cost more and will get really expensive in the coming months, but second-hand SATA/SAS drives are easy to find and not that costly. In the long run they end up being cheaper, as they don't degrade as fast as consumer ones, so you won't need to replace them as often. That depends on your workload too, of course, but server workloads demand server hardware.
 
3-node Proxmox cluster with Ceph. Each node has 2x 4TB
That's the worst design possible, besides the too-cheap devices...

Look at one node: when one OSD fails, the other one on the same node has to take over the data from the dead one. It cannot be sent to another node, because there are already copies on all other nodes and there is no fourth node. With this in mind, you can only store ~3.5 TB on Ceph (assuming the usual size=3/min=2 rule) if you do not want to stay degraded forever and drop the "self-healing" feature completely...
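For anyone who wants to check that figure, here is a quick sketch of the arithmetic, assuming 4 TB drives, size=3 (one replica per node) and the default Ceph nearfull ratio of 0.85:

```python
# Quick sanity check of the usable-capacity argument above.
# Assumptions: 3 nodes, 2 OSDs per node, 4 TB drives, size=3 (one replica
# per node), default nearfull ratio of 0.85.

nodes = 3
osds_per_node = 2
drive_tb = 4.0
nearfull_ratio = 0.85

raw_tb = nodes * osds_per_node * drive_tb
naive_usable_tb = raw_tb / 3                  # size=3: one full copy per node

# To keep self-healing after an OSD dies, the surviving OSD in that node must
# absorb the whole node's share of data and still stay under nearfull.
self_healing_usable_tb = drive_tb * nearfull_ratio

print(f"raw capacity:              {raw_tb:.1f} TB")
print(f"naive usable (size=3):     {naive_usable_tb:.1f} TB")
print(f"usable with self-healing:  {self_healing_usable_tb:.1f} TB")
```

The last number lands in the ballpark of the ~3.5 TB above: with only one OSD left in the node, there is simply nowhere else for the recovered replicas to go.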

See also: https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/