3 x AMD EPYC 7713, 512GB RAM, 2x1TB SSD (RAID 1, OS), 5 x 3.84TB NVMe (Ceph)
Networking:
- 2 port embedded 1G NIC: 1 port for public internet access, 1 port for private network - both connected to switch ports acting as access ports to different VLANs
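For reference, a Proxmox-style ifupdown2 sketch of what two untagged access ports like that could look like on the node side (interface names and addresses are made-up examples, not from the original post):

```
# /etc/network/interfaces (sketch; NIC names and IPs are examples)
auto vmbr0
iface vmbr0 inet static
    address 203.0.113.10/24      # public side - switch port is an access port in the public VLAN
    gateway 203.0.113.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

auto vmbr1
iface vmbr1 inet static
    address 192.168.1.10/24      # private side - access port in the private VLAN
    bridge-ports eno2
    bridge-stp off
    bridge-fd 0
```

Since the switch handles the VLAN tagging on access ports, the node itself stays VLAN-unaware here.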
10G setup was:
2 port 10G NIC connected to a 10G switch via fiber links, ports were bonded with 802.3ad LACP (layer2 hash policy)
25G setup is:
2 port 25G NIC connected directly to the other nodes via DAC cables - no bridges, no bonds - a routed setup (each pair of directly connected ports gets its own /30 network).
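A full-mesh of three nodes means two point-to-point links per node. A minimal sketch of one node's side, assuming example interface names and 10.10.0.0/30-style addressing (the actual subnets and MTU are not from the original post):

```
# /etc/network/interfaces (sketch; names, IPs and MTU are examples)
auto ens1f0
iface ens1f0 inet static
    address 10.10.0.1/30    # direct link to node2 (10.10.0.2)
    mtu 9000                # jumbo frames, optional but common for Ceph links

auto ens1f1
iface ens1f1 inet static
    address 10.10.0.5/30    # direct link to node3 (10.10.0.6)
    mtu 9000
```

With /30s there are exactly two usable addresses per link, one for each end of the DAC cable.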
On top of that I've set up FRR with OSPF to handle access via "identity" IPs - IPs assigned to the loopback interface, with routing handled by FRR.
FRR is very quick in handling "node down" scenarios, so Ceph doesn't suffer when I shut down/reboot a node for maintenance.
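A hedged sketch of what such an FRR config could look like - a /32 identity IP on loopback advertised via OSPF, with aggressive hello/dead timers so a dead node is routed around quickly (router ID, IPs, interface names and timer values are illustrative, not from the original post):

```
! /etc/frr/frr.conf (sketch; addresses and timers are examples)
interface lo
 ip address 10.10.10.1/32
 ip ospf area 0
!
interface ens1f0
 ip ospf network point-to-point
 ip ospf area 0
 ip ospf hello-interval 1
 ip ospf dead-interval 4
!
interface ens1f1
 ip ospf network point-to-point
 ip ospf area 0
 ip ospf hello-interval 1
 ip ospf dead-interval 4
!
router ospf
 ospf router-id 10.10.10.1
!
```

Marking the links point-to-point skips DR/BDR election, and the 1s/4s timers mean a failed neighbor is detected within a few seconds - consistent with Ceph barely noticing a node reboot.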
Ceph uses that network for public and private communication, additionally Proxmox is set to use this network for migrations.
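Assuming the identity IPs live in one supernet, pointing Ceph and Proxmox migrations at it could look roughly like this (the subnet is an example; the `migration` property goes in Proxmox's datacenter.cfg):

```
# /etc/pve/ceph.conf excerpt (sketch; subnet is an example)
[global]
    public_network  = 10.10.10.0/24
    cluster_network = 10.10.10.0/24

# /etc/pve/datacenter.cfg (sketch)
migration: secure,network=10.10.10.0/24
```

Because the daemons bind to loopback-backed identity IPs, the traffic rides whichever /30 link OSPF currently prefers.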
This setup is working very well for me.
I've had a single situation that needs investigation, but I cannot reproduce it. For some reason one of my VMs was stuck on IO during single-node maintenance, even though Ceph did not report any PGs unavailable and the other VMs were fully functional. The VM regained fully operational status once the node was booted back up. I've spent 2 hours migrating VMs back and forth and restarting different nodes, and I couldn't reproduce the issue.