Hey folks,
I'm starting a greenfield data pipeline project for my startup. I need real-time stream processing, so I'm upgrading some Milan hosts I already have into a 3-node Proxmox + Ceph cluster.
Per-node snapshot
- Motherboard – Gigabyte MZ72-HB0
- Compute & RAM – 2 × EPYC 7T83 (128 c/256 t) + 1 TB DDR4-3200: ample pinned cores for Ceph, Kafka, Flink and a small Redis instance with no cross-NUMA chatter.
- Boot – 2 × 480 GB Samsung PM893 SATA (ZFS mirror): PLP-protected, zero PCIe-lane cost.
- Ceph data tier – 6 × 3.84 TB Samsung PM9A3 U.2 NVMe: three per socket for balanced IOPS and rapid rebuilds.
- Ceph block.db/WAL – 2 × Optane P5801X 400 GB: one per three OSDs for µs-class sync writes (rough OSD layout sketch below the list).
- Flink RocksDB – 1 × Optane P5801X 400 GB: sub-second checkpoints and restores (PyFlink config sketch below the list).
- Kafka log – 1 × 3.84 TB PM9A3: hot segments on NVMe; Kafka tiered storage off-loads cold data to a small MinIO S3 cluster (topic-config sketch below the list).
- Redis cache – runs in-memory on spare cores, serving low-latency look-ups to front-ends.
- Network NIC – Mellanox ConnectX-5 EX 100 GbE via passive QSFP28 twin-ax.
- Top-of-rack switch – MikroTik CRS520-4XS-16XQ-RM: 16 × QSFP28 (100 GbE) + 4 × SFP28; cost-effective for six 100 Gb links today and headroom for expansion; enable jumbo frames and ECN/WRED to offset the modest 6 MB buffers.
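For the DB/WAL split above, here's a rough dry-run sketch of how I intend to map the six PM9A3 data drives onto the two Optanes, three OSDs per Optane. It only prints the pveceph commands rather than running anything; the device paths are placeholders for my nodes, and the --db_dev/--db_size flags should be sanity-checked against the installed pveceph version.

```python
# Dry-run only: print the OSD creation commands instead of executing them.
# Device paths are placeholders; adjust to the actual NVMe enumeration per node.
DATA_DEVS = [f"/dev/nvme{i}n1" for i in range(6)]   # 6 x PM9A3 data devices
DB_DEVS = ["/dev/nvme6n1", "/dev/nvme7n1"]          # 2 x Optane P5801X for block.db/WAL
DB_SIZE_GIB = 120                                   # roughly 1/3 of each 400 GB Optane, with headroom

for i, data_dev in enumerate(DATA_DEVS):
    db_dev = DB_DEVS[i // 3]                        # OSDs 0-2 share Optane #1, OSDs 3-5 share #2
    print(f"pveceph osd create {data_dev} --db_dev {db_dev} --db_size {DB_SIZE_GIB}")
```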
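On the Flink side, this is a minimal PyFlink sketch of the state/checkpoint wiring I'm assuming (PyFlink 1.16-ish APIs; the local path and S3 bucket are placeholders, and the exact config keys are worth double-checking against the Flink docs for whatever version I end up on):

```python
from pyflink.common import Configuration
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

# Keep RocksDB working state on the local Optane, ship checkpoints to MinIO.
conf = Configuration()
conf.set_string("state.backend.rocksdb.localdir", "/mnt/optane-flink/rocksdb")  # placeholder mount
conf.set_string("state.checkpoints.dir", "s3://flink-checkpoints/pipeline")     # MinIO via the flink-s3-fs plugin

env = StreamExecutionEnvironment.get_execution_environment(conf)
env.set_state_backend(EmbeddedRocksDBStateBackend())
env.enable_checkpointing(10_000)  # checkpoint every 10 s
```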
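For the Kafka tiering, the broker side needs Kafka 3.6+ with remote.log.storage.system.enable=true and an S3-capable RemoteStorageManager plugin pointed at MinIO (plugin config omitted here). Per topic it then looks roughly like this with confluent-kafka; topic name, partition count and retention values are placeholders:

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Tiered topic: recent segments stay on the local PM9A3, the broker's remote
# storage plugin offloads older segments to MinIO. Broker addresses are placeholders.
admin = AdminClient({"bootstrap.servers": "node1:9092,node2:9092,node3:9092"})

topic = NewTopic(
    "ingest.events",
    num_partitions=12,
    replication_factor=3,
    config={
        "remote.storage.enable": "true",                # per-topic tiering (Kafka 3.6+)
        "local.retention.ms": str(6 * 60 * 60 * 1000),  # keep ~6 h hot on NVMe
        "retention.ms": str(30 * 24 * 60 * 60 * 1000),  # 30 d total, including the MinIO tier
    },
)

admin.create_topics([topic])["ingest.events"].result()  # raises if creation failed
```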
Why this mix?
Optane-backed NVMe Ceph keeps write latency low; Kafka buffers a 200 Mb/s stream ingest and tiers historical data to MinIO; Flink operates on Optane-fast state; Redis provides instant read access for the application layer; the CRS520 delivers affordable 100 Gb fabric that saturates Ceph recovery and Flink shuffles, while SATA boot mirrors leave PCIe lanes free for high-performance workloads.
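For scale, a quick back-of-envelope on that 200 Mb/s figure, assuming replication factor 3 across the three brokers and the single 3.84 TB Kafka drive per node from the list:

```python
# Back-of-envelope for the 200 Mb/s ingest figure (decimal units throughout).
ingest_mbit_s = 200
bytes_per_s = ingest_mbit_s * 1e6 / 8        # 25 MB/s
tb_per_day = bytes_per_s * 86_400 / 1e12     # ~2.16 TB/day of raw log

# With RF=3 on a 3-broker cluster every broker holds a full copy,
# so each node's 3.84 TB Kafka NVMe fills in under two days without tiering.
kafka_drive_tb = 3.84
days_until_full = kafka_drive_tb / tb_per_day  # ~1.8 days

print(f"{bytes_per_s / 1e6:.0f} MB/s, {tb_per_day:.2f} TB/day/broker, "
      f"local Kafka NVMe full in {days_until_full:.1f} days")
```

So the raw ingest is trivial for 100 GbE; the fabric is really there for Ceph replication/recovery and Flink shuffles, and the MinIO tier is what keeps the local Kafka NVMe from filling up.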
I've included an annotated block diagram - there are a few stale items/errors, but Adobe is a massive pain to re-edit, and 99% of it is correct.
The main issue there is that I'm kinda wasting some of the PCIe slots with the Optanes, but this setup will get me going for now, I reckon.
For the PCIe expansion on Slot 4 I plan to use this: https://www.aliexpress.com/item/1005003768261205.html
I've had these Milan servers for a couple of years, and they've been nothing but stable. I'm planning to separate Corosync out onto its own physical network.
Looking for any massive gotchas I've missed or problems with this design - all comments welcome!