Need advice: replacing a consumer NVMe used for Ceph DB/WAL in a 3-node cluster

royalj7
Hi all,

I’m running a 3-node Proxmox homelab cluster with Ceph for VM storage. Each node has two 800GB Intel enterprise SSDs for OSD data, and a single 512GB consumer NVMe drive used for the DB/WAL for both OSDs on that node. I'm benchmarking the cluster and seeing low IOPS and high latency, especially under 4K random workloads. I suspect the consumer NVMe is the bottleneck and would like to replace it with an enterprise NVMe (likely something with higher sustained write and DWPD).
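For reference, here's roughly how I've been benchmarking, in case anyone wants to compare numbers. The pool name is a placeholder, and the rados bench target should be a throwaway test pool, not one holding VM disks:

```bash
# Cluster-level 4K benchmark against a throwaway pool ("testpool" is a placeholder)
rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup
rados bench -p testpool 60 rand -t 16
rados -p testpool cleanup

# And a 4K random mixed workload from inside a guest, for the VM-level view
fio --name=randrw4k --filename=/root/fio.test --size=4G \
    --rw=randrw --bs=4k --iodepth=32 --ioengine=libaio \
    --direct=1 --runtime=60 --time_based --group_reporting
```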

Before I go ahead, I want to:
  1. Get community input on whether this could significantly improve performance.
  2. Confirm the best way to replace the DB/WAL NVMe without breaking the cluster.
My plan:
  • One node at a time: stop the OSDs using the DB/WAL device, zap them, shut the node down, swap the NVMe, and recreate the OSDs against the new DB/WAL target (rough command sketch after this list).
  • Monitor rebalance between each step.
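The per-node sequence I have in mind looks roughly like this; the OSD IDs and device paths below are placeholders for whatever `ceph osd tree` shows on the node being serviced:

```bash
# Take both OSDs on this node out and destroy them (IDs 0/1 are placeholders)
ceph osd out 0 && ceph osd out 1
systemctl stop ceph-osd@0 ceph-osd@1
pveceph osd destroy 0 --cleanup   # --cleanup also zaps the backing volumes
pveceph osd destroy 1 --cleanup

# ...shut the node down, swap the NVMe, boot back up...

# Recreate the OSDs with the new DB/WAL device (paths are placeholders)
pveceph osd create /dev/sda --db_dev /dev/nvme0n1
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1

ceph -s   # wait for HEALTH_OK before moving on to the next node
```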
Has anyone here done something similar or have better suggestions to avoid downtime or data issues? Any gotchas I should be aware of?

Thanks in advance!
 
I run a similar setup to the one you describe and, while I've not (yet) performed this process, the workflow you describe is the one I plan to follow once the time comes to replace my WAL devices. Just curious though: is your cluster a new setup, or has its performance degraded over time? If you're interested, feel free to mention your current WAL device and the performance figures you're seeing and we can compare notes.
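For apples-to-apples numbers, the usual quick check of a drive's fitness as a DB/WAL device is a single-threaded 4K sync write aimed straight at the device, which is exactly where consumer NVMe tends to fall apart. Fair warning: this sketch writes to the raw device and destroys whatever is on it, so only run it against a drive you're about to wipe anyway (`/dev/nvme0n1` is a placeholder):

```bash
# DESTRUCTIVE: overwrites data on the target device
fio --name=wal-test --filename=/dev/nvme0n1 --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting
```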
 
I would be interested to see benchmarks for the SSDs without the extra drive, since they are already SSDs and dropping it eliminates a single point of failure for both OSDs. (Does the NVMe DB device help much in that case?) Then you could maybe add a third disk, or use the NVMe as another OSD.

In general though, yes, your steps seem fine. On a bigger cluster I'd say wait until everything is green between steps, but that's not fully achievable with only two nodes left (using 3/2 replication).
 
@SteveITS Not all SSDs are created equal! In a lot of cases, when someone runs a separate NVMe device for their Ceph WAL, it's because the main block device sits on a slower interface (e.g. SATA). In those cases, the NVMe device lifts the OSD's overall IO performance beyond what the main block device could deliver on its own.
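If you want to confirm what a given OSD is actually backed by, something like this should show it (the OSD id is a placeholder):

```bash
# Devices behind an OSD's data and DB/WAL, as reported by the cluster
ceph osd metadata 0 | grep -E '"devices"|bluefs'

# Or, on the node hosting the OSD:
ceph-volume lvm list
```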