Hello all
I have a 3-node Proxmox cluster with Ceph. Each node has 2x 4 TB Samsung 870 QVO SSDs.
I have noticed my VMs being really slow and I was wondering how much of that is because of the SSDs.
I have checked my network and everything else.
I'm here just to confirm whether what the AI assistant is telling me can actually be true, i.e. that enterprise-grade SSDs really can make that huge a difference. Anybody else experiencing similar issues? This is what it told me:
- QLC NAND writes are slow and have high write amplification. The drive hides this with a small pseudo-SLC cache. Let C be the cache size and R_fold the background rate at which the drive folds SLC data into QLC (typically only tens of MiB/s). If your incoming write rate R_in > R_fold, the cache drains; once it is empty, each write must program QLC directly and perform garbage collection, causing stalls (see the sketches after this list).
- Ceph BlueStore issues frequent flush/FUA for DB/WAL and data commits. With no power-loss protection on the QVO, the drive must actually persist data before acknowledging, so fsync waits on the slow QLC path. When GC kicks in, the per-flush time T_flush can jump to O(0.5–5) s.
- Your controller is in HBA mode, so there's no controller cache masking these latencies; barriers go straight to the SSDs. Even at low fill (~10%), sustained R_in above R_fold produces multi-second tails.
- Consumer QVOs also lack on-drive PLP, so they can't safely "ack" writes early; enterprise TLC SSDs with PLP keep T_flush in the O(1–10) ms range under the same workload.
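To sanity-check the cache-drain claim I put it into numbers. This is only a back-of-envelope sketch; the cache size, fold rate, and write rate below are made-up assumptions for illustration, not the 870 QVO's datasheet values, so swap in real figures if you have them:

```python
# Back-of-envelope model of the pseudo-SLC cache argument from the list above.
# Every number here is an assumption for illustration, not a Samsung spec.

CACHE_GIB = 40.0   # assumed usable pseudo-SLC cache size C, in GiB
R_FOLD = 40.0      # assumed background SLC -> QLC fold rate R_fold, in MiB/s
R_IN = 120.0       # assumed sustained incoming write rate R_in, in MiB/s

cache_mib = CACHE_GIB * 1024  # convert C to MiB so the units match the rates

if R_IN <= R_FOLD:
    # The drive folds data out at least as fast as it comes in,
    # so the cache never exhausts and writes stay at SLC speed.
    print("R_in <= R_fold: cache never exhausts, no QLC stall expected")
else:
    # Free cache space shrinks at the net rate (R_in - R_fold);
    # once it hits zero, every new write has to program QLC directly.
    seconds_to_stall = cache_mib / (R_IN - R_FOLD)
    print(f"Cache exhausted after ~{seconds_to_stall / 60:.1f} min of sustained writes")
    # After that point the drive can only absorb roughly R_fold of new data,
    # so effective write throughput collapses to the QLC path.
    print(f"Post-stall throughput is roughly R_fold = {R_FOLD:.0f} MiB/s (plus GC overhead)")
```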
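And here is the flush-latency point in the same spirit: if every commit has to wait for a real flush at queue depth 1, the per-flush time T_flush caps sync write IOPS at roughly 1 / T_flush, so a GC stall turns thousands of commits per second into a handful. The latencies and the commit count below are assumed values, not measurements from my cluster:

```python
# Rough effect of per-flush latency T_flush on a sync-heavy workload (e.g. a VM
# doing many small fsync'd writes). All latencies here are illustrative assumptions.

def sync_iops(t_flush_s: float) -> float:
    """Upper bound on queue-depth-1 sync write IOPS if each write waits one flush."""
    return 1.0 / t_flush_s

scenarios = {
    "enterprise TLC with PLP (~1 ms flush)": 0.001,
    "QVO while the SLC cache holds (~5 ms flush)": 0.005,
    "QVO during a QLC/GC stall (~2 s flush)": 2.0,
}

N_COMMITS = 10_000  # assumed number of small sync writes in some VM-side operation

for name, t_flush in scenarios.items():
    iops = sync_iops(t_flush)
    total_s = N_COMMITS * t_flush
    print(f"{name}: ~{iops:,.1f} sync IOPS, {N_COMMITS} commits take ~{total_s:,.0f} s")
```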