Hey folks,
I'm starting a greenfield data pipeline project for my startup. I need real-time stream processing, so I'm upgrading some Milan hosts I already have into a 3-node Proxmox + Ceph cluster.
Per-node snapshot
- Motherboard – Gigabyte MZ72-HB0
- Compute & RAM – 2 × EPYC 7T83 (128 c/256 t) + 1 TB DDR4-3200: ample pinned cores for Ceph, Kafka, Flink and a small Redis instance with no cross-NUMA chatter.
- Boot – 2 × 480 GB Samsung PM893 SATA (ZFS mirror): PLP-protected, zero PCIe-lane cost.
- Ceph data tier – 6 × 3.84 TB Samsung PM9A3 U.2 NVMe: three per socket for balanced IOPS and rapid rebuilds.
- Ceph block.db/WAL – 2 × Optane P5801X 400 GB: one per three OSDs for µs-class sync writes (rough OSD layout sketch below the list).
- Flink RocksDB – 1 × Optane P5801X 400 GB: sub-second checkpoints and restores (PyFlink config sketch below the list).
- Kafka log – 1 × 3.84 TB PM9A3: hot segments on NVMe; Kafka tiered storage off-loads cold data to a small MinIO S3 cluster (topic-config sketch below the list).
- Redis cache – runs in-memory on spare cores, serving low-latency look-ups to front-ends.
- Network NIC – Mellanox ConnectX-5 EX 100 GbE via passive QSFP28 twin-ax.
- Top-of-rack switch – MikroTik CRS520-4XS-16XQ-RM: 16 × QSFP28 (100 GbE) + 4 × SFP28; cost-effective for six 100 Gb links today and headroom for expansion; enable jumbo frames and ECN/WRED to offset the modest 6 MB buffers.
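For the DB/WAL split above, here's a rough dry-run sketch of how I intend to map the six PM9A3 data drives onto the two Optanes, three OSDs per Optane. It only prints the pveceph commands rather than running anything; the device paths are placeholders for my nodes, and the --db_dev/--db_size flags should be sanity-checked against the installed pveceph version.

```python
# Dry-run only: print the OSD creation commands instead of executing them.
# Device paths are placeholders; adjust to the actual NVMe enumeration per node.
DATA_DEVS = [f"/dev/nvme{i}n1" for i in range(6)]   # 6 x PM9A3 data devices
DB_DEVS = ["/dev/nvme6n1", "/dev/nvme7n1"]          # 2 x Optane P5801X for block.db/WAL
DB_SIZE_GIB = 120                                   # roughly 1/3 of each 400 GB Optane, with headroom

for i, data_dev in enumerate(DATA_DEVS):
    db_dev = DB_DEVS[i // 3]                        # OSDs 0-2 share Optane #1, OSDs 3-5 share #2
    print(f"pveceph osd create {data_dev} --db_dev {db_dev} --db_size {DB_SIZE_GIB}")
```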
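On the Flink side, this is a minimal PyFlink sketch of the state/checkpoint wiring I'm assuming (PyFlink 1.16-ish APIs; the local path and S3 bucket are placeholders, and the exact config keys are worth double-checking against the Flink docs for whatever version I end up on):

```python
from pyflink.common import Configuration
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

# Keep RocksDB working state on the local Optane, ship checkpoints to MinIO.
conf = Configuration()
conf.set_string("state.backend.rocksdb.localdir", "/mnt/optane-flink/rocksdb")  # placeholder mount
conf.set_string("state.checkpoints.dir", "s3://flink-checkpoints/pipeline")     # MinIO via the flink-s3-fs plugin

env = StreamExecutionEnvironment.get_execution_environment(conf)
env.set_state_backend(EmbeddedRocksDBStateBackend())
env.enable_checkpointing(10_000)  # checkpoint every 10 s
```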
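For the Kafka tiering, the broker side needs Kafka 3.6+ with remote.log.storage.system.enable=true and an S3-capable RemoteStorageManager plugin pointed at MinIO (plugin config omitted here). Per topic it then looks roughly like this with confluent-kafka; topic name, partition count and retention values are placeholders:

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Tiered topic: recent segments stay on the local PM9A3, the broker's remote
# storage plugin offloads older segments to MinIO. Broker addresses are placeholders.
admin = AdminClient({"bootstrap.servers": "node1:9092,node2:9092,node3:9092"})

topic = NewTopic(
    "ingest.events",
    num_partitions=12,
    replication_factor=3,
    config={
        "remote.storage.enable": "true",                # per-topic tiering (Kafka 3.6+)
        "local.retention.ms": str(6 * 60 * 60 * 1000),  # keep ~6 h hot on NVMe
        "retention.ms": str(30 * 24 * 60 * 60 * 1000),  # 30 d total, including the MinIO tier
    },
)

admin.create_topics([topic])["ingest.events"].result()  # raises if creation failed
```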
Why this mix?
Optane-backed NVMe Ceph keeps write latency low; Kafka buffers a 200 Mb/s stream ingest and tiers historical data to MinIO; Flink operates on Optane-fast state; Redis provides instant read access for the application layer; the CRS520 delivers affordable 100 Gb fabric that saturates Ceph recovery and Flink shuffles, while SATA boot mirrors leave PCIe lanes free for high-performance workloads.
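For scale, a quick back-of-envelope on that 200 Mb/s figure, assuming replication factor 3 across the three brokers and the single 3.84 TB Kafka drive per node from the list:

```python
# Back-of-envelope for the 200 Mb/s ingest figure (decimal units throughout).
ingest_mbit_s = 200
bytes_per_s = ingest_mbit_s * 1e6 / 8        # 25 MB/s
tb_per_day = bytes_per_s * 86_400 / 1e12     # ~2.16 TB/day of raw log

# With RF=3 on a 3-broker cluster every broker holds a full copy,
# so each node's 3.84 TB Kafka NVMe fills in under two days without tiering.
kafka_drive_tb = 3.84
days_until_full = kafka_drive_tb / tb_per_day  # ~1.8 days

print(f"{bytes_per_s / 1e6:.0f} MB/s, {tb_per_day:.2f} TB/day/broker, "
      f"local Kafka NVMe full in {days_until_full:.1f} days")
```

So the raw ingest is trivial for 100 GbE; the fabric is really there for Ceph replication/recovery and Flink shuffles, and the MinIO tier is what keeps the local Kafka NVMe from filling up.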
I've included an annotated block diagram - there are a few stale items/errors, but Adobe is a massive pain to re-edit, and 99% of it is correct.
The main issue there is that I'm kinda wasting some of the PCIe slots with the Optanes, but this setup will get me going for now, I reckon.
For the PCIe expansion on Slot 4 I plan to use this: https://www.aliexpress.com/item/1005003768261205.html
I've had these Milan servers for a couple of years, and they've been nothing but stable. I'm planning to separate Corosync out onto its own physical network.
Looking for any massive gotchas I've missed or problems with this design - all comments welcome!