3-Node Proxmox EPYC 7T83 (7763) • 10 × NVMe per host • 100 GbE Ceph + Flink + Kafka - sanity-check me!

mangos

Member
Jan 26, 2022
Hey folks,

I'm starting a greenfield data pipeline project for my startup. I need real-time stream processing, so I'm upgrading some Milan hosts I have into a 3-node Proxmox + Ceph cluster.



Per-node snapshot
  • Motherboard – Gigabyte MZ72-HB0
  • Compute & RAM – 2 × EPYC 7T83 (128 c/256 t) + 1 TB DDR4-3200: ample pinned cores for Ceph, Kafka, Flink and a small Redis instance with no cross-NUMA chatter.
  • Boot – 2 × 480 GB Samsung PM893 SATA (ZFS mirror): PLP-protected, zero PCIe-lane cost.
  • Ceph data tier – 6 × 3.84 TB Samsung PM9A3 U.2 NVMe: three per socket for balanced IOPS and rapid rebuilds.
  • Ceph block.db/WAL – 2 × Optane P5801X 400 GB: one per three OSDs for µs-class sync writes (sizing sketch after this list).
  • Flink RocksDB – 1 × Optane P5801X 400 GB: sub-second checkpoints and restores.
  • Kafka log – 1 × 3.84 TB PM9A3: hot segments on NVMe; Kafka tiered storage off-loads cold data to a small MinIO S3 cluster.
  • Redis cache – runs in-memory on spare cores, serving low-latency look-ups to front-ends.
  • Network NIC – Mellanox ConnectX-5 EX 100 GbE via passive QSFP28 twin-ax.
  • Top-of-rack switch – MikroTik CRS520-4XS-16XQ-RM: 16 × QSFP28 (100 GbE) + 4 × SFP28; cost-effective for six 100 Gb links today with headroom for expansion; enable jumbo frames and ECN/WRED to offset the modest 6 MB buffers.
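
Rough math on that block.db split, since it's easy to get wrong (pure arithmetic, no Ceph involved; the 1-4% figures are just the commonly cited BlueStore rule of thumb, not anything specific to this build):

```python
# Rough block.db sizing check for the proposed layout:
# 2 x Optane P5801X 400 GB per node, each backing 3 PM9A3 OSDs.
OPTANE_GB = 400
OSDS_PER_OPTANE = 3
OSD_GB = 3840  # 3.84 TB PM9A3

db_per_osd = OPTANE_GB / OSDS_PER_OPTANE
print(f"block.db per OSD: {db_per_osd:.0f} GB")

# Commonly cited BlueStore guidance: ~1-4% of the data device,
# with the high end mattering mostly for object/RGW-heavy use.
for pct in (1, 2, 4):
    target = OSD_GB * pct / 100
    verdict = "OK" if db_per_osd >= target else "short"
    print(f"{pct}% of {OSD_GB} GB = {target:.0f} GB -> {verdict}")
```

So each OSD gets roughly 133 GB of Optane-backed db, which clears the 1-2% guidance easily and only misses the aggressive 4% figure; for RBD-style block workloads that's generally fine, but worth knowing before carving up the Optanes.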

Why this mix?
Optane-backed NVMe Ceph keeps write latency low; Kafka buffers the 200 Mb/s ingest stream and tiers historical data to MinIO; Flink operates on Optane-fast state; Redis provides instant read access for the application layer; the CRS520 delivers affordable 100 Gb fabric that saturates Ceph recovery and Flink shuffles, while SATA boot mirrors leave PCIe lanes free for high-performance workloads.
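
To put the 200 Mb/s figure in context, here's the rough hot-retention math for the single 3.84 TB Kafka NVMe per broker (assuming replication factor 3 so every broker carries a full copy, no compression, and ~80% fill before tiering kicks in):

```python
# How long do "hot" Kafka segments fit on the local 3.84 TB NVMe
# before tiered storage has to push them out to MinIO?
INGEST_MBPS = 200            # megabits per second, from the post
KAFKA_DISK_TB = 3.84         # one PM9A3 per broker
FILL_FACTOR = 0.8            # assumption: keep ~20% headroom
# Assumption: RF=3 across 3 brokers, so each broker stores a full copy.

ingest_mb_per_s = INGEST_MBPS / 8              # -> 25 MB/s
tb_per_day = ingest_mb_per_s * 86_400 / 1e6    # -> ~2.16 TB/day per broker
days_local = KAFKA_DISK_TB * FILL_FACTOR / tb_per_day

print(f"Ingest: {tb_per_day:.2f} TB/day per broker")
print(f"Local hot retention: ~{days_local:.1f} days before tiering to MinIO")
```

So roughly a day and a half of hot data stays local at that rate; anything older lives on the MinIO tier.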

I have included an annotated block diagram - there are a few stale items/errors, but Adobe is a massive pain to re-edit; 99% of it is correct.

The main issue there is that I'm kinda wasting some of the PCIe slots on the Optanes, but this setup will get me going for now, I reckon.

For the PCIe expansion on Slot 4 I plan to use this: https://www.aliexpress.com/item/1005003768261205.html

I've had these Milan servers for a couple of years, and they've been nothing but stable. I'm planning to separate corosync out onto its own physical network.

Looking for any massive gotchas I have missed, or problems with this design, all comments welcomed!
 

Attachments: annotated block diagram

Thanks for that.

The following things will be true:

1. 6 OSDs per node, all the exact same size NVMe Samsung PM9A3s (quick capacity math below)
2. Redundant ToR switches (gonna use some Dell Z9100s; I've changed my mind about the MikroTik)
3. 3 separate corosync rings using Dell 4048 10G switches (I once tried to have 20 NUCs in one cluster - I'll get it right this time)
4. 100 GbE networking
5. Back up everything to NVMe PBS
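
For item 1, the raw-vs-usable math I'm working from (assuming size=3, min_size=2, and the default ~0.85 nearfull ratio; decimal TB):

```python
# Usable Ceph capacity for 3 nodes x 6 x 3.84 TB PM9A3, replicated size=3.
NODES = 3
OSDS_PER_NODE = 6
OSD_TB = 3.84
REPLICAS = 3
NEARFULL = 0.85   # assumption: plan only up to the default nearfull ratio

raw_tb = NODES * OSDS_PER_NODE * OSD_TB
usable_tb = raw_tb / REPLICAS
planning_tb = usable_tb * NEARFULL

print(f"Raw:      {raw_tb:.1f} TB")        # ~69.1 TB
print(f"Usable:   {usable_tb:.1f} TB")     # ~23.0 TB at size=3
print(f"Plan for: {planning_tb:.1f} TB")   # ~19.6 TB before nearfull warnings
```

And yes, with only 3 nodes and size=3 there's nowhere to rebalance a failed host's PGs, so the cluster just runs degraded until that node is back - which is the Problem #1 trade-off I get into below.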

To rank the problems from the linked post by severity in terms of what I'm proposing:

Problem #1 - Only three nodes. They won't have immediate production duty, and I don't even expect them to have full duty/usage within a year. If things go well, I aim to have 4 nodes within 6 months, and 5 by a year. It's not an absolute death knell if a node goes down; it's only a massive problem if I can't get it back up. We don't have external stakeholders/clients, so it's all on us.

Problem #3 - Valid concern

Problem #2 - 6 OSDs per node mitigates this

Problem #6 - 1 TB of RAM per node, with the ability to upgrade this

Problem #4 - 2 × 100 GbE

Problem #5 - Decent-spec NVMe Samsung PM9A3s

Anything else I'm missing?
 
You have disks for everything, but what will be on CEPH in this case?
6 × NVMe PCIe 4.0 x4 Samsung PM9A3 3.84 TB per node - these will be the OSDs, three per CPU socket (dual-socket mobo)

2 × Optane P5801X per node, one for every 3 OSDs - block.db & WAL (rough provisioning sketch below)
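
In case it's useful, a sketch of how I'd generate the OSD creation commands for one node - it only prints the pveceph invocations for review rather than running anything, the /dev/nvme* paths are placeholders until I see what each box actually enumerates, and the flag name should be double-checked against the current pveceph man page:

```python
# Print (don't run) the OSD creation commands for one node:
# 6 x PM9A3 data devices, 2 x Optane P5801X for block.db/WAL,
# three OSDs sharing each Optane.
# Device paths are placeholders - check `lsblk` / `nvme list` first.
DATA_DEVS = [f"/dev/nvme{i}n1" for i in range(2, 8)]   # 6 x PM9A3 (hypothetical numbering)
DB_DEVS = ["/dev/nvme0n1", "/dev/nvme1n1"]             # 2 x Optane P5801X (hypothetical)

for i, data in enumerate(DATA_DEVS):
    db = DB_DEVS[i // 3]   # first three OSDs -> Optane 0, next three -> Optane 1
    # pveceph osd create takes a separate DB device via --db_dev;
    # the WAL lands on the DB device by default when they share one.
    print(f"pveceph osd create {data} --db_dev {db}")
```

Printing first and pasting by hand keeps a fat-finger from eating the wrong NVMe.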
 