[Proxmox Cluster with Ceph Full-Mesh Network Design] Sanity Check & Advice for 3-node cluster with separate 10GbE/25GbE networks

StudyItalia
Sep 6, 2025
Hello everyone,

I'm reaching out to the community for a design review of my version 2 network setup for a new 3-node Proxmox/Ceph hyper-converged cluster. The goal is to build a stable and high-performance configuration, overcoming the issues I faced with my first implementation.

Context and Lessons Learned from v1

The first version of the cluster was running for about two months. The Ceph replication network was also a full-mesh on 25GbE interfaces, but with a different approach:

  • Each node had the same IP address (e.g., 10.10.10.1/32) configured on both of its 25GbE interfaces.
  • Traffic was managed using static routes to direct packets to the correct link.
Problems Encountered: I experienced severe performance and stability issues. When running tests with iperf using multiple concurrent streams, I observed massive packet loss on the 25GbE network. My hypothesis is that the Linux kernel struggled with routing (specifically ECMP - Equal-cost multi-path) when the same IP was present on multiple physical interfaces, leading to instability.
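To make the v1 approach concrete, it boiled down to something like this on pve1 (a rough reconstruction; the peer addresses are just examples, not the exact original config):
Code:
# v1 (reconstruction): the same /32 configured on both 25GbE ports
ip addr add 10.10.10.1/32 dev ens9f0np0
ip addr add 10.10.10.1/32 dev ens9f1np1
# static host routes pinning each peer to one physical link
ip route add 10.10.10.2/32 dev ens9f0np0   # towards pve2 (example peer IP)
ip route add 10.10.10.3/32 dev ens9f1np1   # towards pve3 (example peer IP)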

Goal for v2: For this reason, I am redesigning the network from scratch with the goal of maximizing simplicity and reliability. I want to avoid complex configurations and, if possible, additional software like dynamic routing protocols (e.g., FRR). Redundancy for the individual 25GbE links is not a priority; stability and performance are.


Hardware and Network Architecture v2

Nodes:

Network schema idea:
- [ ] **Ceph Public** network on `vmbr0` → `172.16.10.0/24`
- [ ] **Management + VM** network on `vmbr1` → `192.168.170.0/24` (VLAN 174)
- [ ] **Ceph Cluster (replication)** network on the 25GbE interfaces:
  - `10.10.10.0/30`, `10.10.11.0/30`, `10.10.12.0/30`
- [ ] MTU 9000 on all interfaces


[Diagram: full-mesh network topology (mermaid-diagram-2025-09-06-124324.png)]

The links for the Ceph replication network are direct connections between the 25GbE ports of the nodes.

Planned Network Configuration (/etc/network/interfaces)

Below is the complete network configuration I plan to apply to each node.

/etc/network/interfaces file on pve1:
Code:
auto lo
iface lo inet loopback

# 10GbE - Ceph Public Network
iface ens3f0 inet manual
mtu 9000
auto vmbr0
iface vmbr0 inet static
address 172.16.10.1/24
bridge-ports ens3f0
bridge-stp off
bridge-fd 0
mtu 9000

# 10GbE - Management & VM Network
iface ens3f1 inet manual
auto vmbr1
iface vmbr1 inet static
address 192.168.170.250/24
gateway 192.168.170.254
bridge-ports ens3f1
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes

# 25GbE - Ceph Cluster Network (Mesh)
# Link to pve2
auto ens9f0np0
iface ens9f0np0 inet static
address 10.10.10.1/30
mtu 9000

# Link to pve3
auto ens9f1np1
iface ens9f1np1 inet static
address 10.10.11.1/30
mtu 9000

/etc/network/interfaces file on pve2:
Code:
auto lo
iface lo inet loopback

# 10GbE - Ceph Public Network
iface ens2f0 inet manual
mtu 9000
auto vmbr0
iface vmbr0 inet static
address 172.16.10.2/24
bridge-ports ens2f0
bridge-stp off
bridge-fd 0
mtu 9000

# 10GbE - Management & VM Network
iface ens2f1 inet manual
auto vmbr1
iface vmbr1 inet static
address 192.168.170.251/24
gateway 192.168.170.254
bridge-ports ens2f1
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes

# 25GbE - Ceph Cluster Network (Mesh)
# Link to pve1
auto ens9f0np0
iface ens9f0np0 inet static
address 10.10.10.2/30
mtu 9000

# Link to pve3
auto ens9f1np1
iface ens9f1np1 inet static
address 10.10.12.1/30
mtu 9000

/etc/network/interfaces file on pve3:
Code:
auto lo
iface lo inet loopback

# 10GbE - Ceph Public Network
iface ens2f0 inet manual
mtu 9000
auto vmbr0
iface vmbr0 inet static
address 172.16.10.3/24
bridge-ports ens2f0
bridge-stp off
bridge-fd 0
mtu 9000

# 10GbE - Management & VM Network
iface ens2f1 inet manual
auto vmbr1
iface vmbr1 inet static
address 192.168.170.252/24
gateway 192.168.170.254
bridge-ports ens2f1
bridge-stp off
bridge-fd 0
bridge-vlan-aware yes

# 25GbE - Ceph Cluster Network (Mesh)
# Link to pve2
auto ens9f0np0
iface ens9f0np0 inet static
address 10.10.12.2/30
mtu 9000

# Link to pve1
auto ens9f1np1
iface ens9f1np1 inet static
address 10.10.11.2/30
mtu 9000
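
A quick sanity check after applying these three files is to confirm that jumbo frames actually pass on every direct link, since a silent MTU mismatch is easy to miss (the peer addresses below are the ones from the configs above):
Code:
# from pve1: 8972 bytes payload + 8 ICMP + 20 IP headers = 9000; -M do forbids fragmentation
ping -c 3 -M do -s 8972 10.10.10.2   # direct link to pve2
ping -c 3 -M do -s 8972 10.10.11.2   # direct link to pve3
# repeat the equivalent pair from pve2 and pve3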


Ceph Configuration (ceph.conf)

Consequently, the ceph.conf file would be configured as follows:
Code:
[global]
...
public_network = 172.16.10.0/24
cluster_network = 10.10.10.0/30,10.10.11.0/30,10.10.12.0/30
...
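
Related to question 4 below: independent of which separator ends up in the file, what Ceph actually uses can be checked against the running cluster. The commands below are a generic way to do that (OSD id 0 is just an example):
Code:
# on Proxmox the cluster-wide ceph.conf lives in the pmxcfs
grep network /etc/pve/ceph.conf
# check which addresses a running OSD actually bound to
# (front_addr = public_network, back_addr = cluster_network)
ceph osd metadata 0 | grep -E 'front_addr|back_addr'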


Questions for the Community


  1. v2 Design Validity: Given the failure of v1, is this new approach with separate /30 subnets for each link considered more stable and a "best practice" for a full-mesh topology? Is it the right path for the simplicity I'm aiming for?
  2. Ceph's Native Routing Handling: With this setup, can Ceph natively pick the direct link for OSD-to-OSD communication, or could the Linux kernel still run into routing issues between the two physical interfaces on the cluster_network?
  3. Simple Alternatives: Are there any alternatives to this design that maintain simplicity (no additional routing software) but are equally or more performant/stable?
  4. Proxmox GUI vs. Ceph Syntax: I've noticed a potential discrepancy. The official Ceph documentation states that multiple subnets in cluster_network should be separated by a comma (,), while the Proxmox GUI (Datacenter -> Ceph -> Configuration) seems to use a space as a separator for multiple values. What is the correct syntax that actually gets written to ceph.conf and interpreted by Ceph when using the Proxmox GUI? This might have been the root cause of my previous parsing issues.
  5. Practical Experiences: Has anyone implemented a similar setup (with separate /30 subnets) and can confirm its stability, especially under heavy I/O loads?
I would greatly appreciate any feedback, criticism, or suggestions you can provide, especially in light of the issues I encountered with my first configuration. Thank you very much!
 
FRR is the official way to go in a three-node setup. It would be better to look into its problems before trying to build something new. I'm running FRR on multiple clusters and have never had a problem with it. Setup is also very easy compared to an MLAG/LACP bond setup if you factor in the work on the switch.
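For reference, the routed variant from the Proxmox full-mesh wiki comes down to a small fabricd configuration per node. Below is a sketch along those lines, reusing the interface names from the first post; the loopback address and the NET identifier are placeholders, so follow the wiki for the exact recipe:
Code:
# /etc/frr/daemons: set fabricd=yes, then /etc/frr/frr.conf roughly:
frr defaults traditional
hostname pve1
!
interface lo
 ip address 10.10.10.1/32
 ip router openfabric 1
 openfabric passive
!
interface ens9f0np0
 ip router openfabric 1
!
interface ens9f1np1
 ip router openfabric 1
!
router openfabric 1
 net 49.0001.1111.1111.1111.00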
 
Hello,
I am three steps behind you - all I know is that I need to get our organisation to embrace Proxmox, and that I will be lucky if I get them to understand three nodes, however much I'd like more. HA is our goal for a couple of busy DB servers and some app servers with reasonably low usage, and Ceph will have to play a part. Thankfully we have fibre between the three buildings :0)
May I ask what hardware you are using? I am not yet at the stage where I can ask people to write me quotes, but it would be nice to get some hand-wavy idea of the investment (our current Hyper-V hardware is well loved and not expected to learn such new tricks :0)
Many thanks!
Hanry
 
I've only ever used mesh in a lab scenario, and that was some time ago. I used broadcast bonds and it worked well enough. I'm fairly certain that no logical topology will result in any meaningful difference in performance, but as for stability (and to extend your cluster to more than 3 nodes, which I'd say is a must if this is for production and not a home lab), just buy a switch; alternatively, just plug your 25GbE ports into your 10GbE switch. Since you only have 3 nodes you are not going to be subjected to rebalance storms anyway.
 
I run a 3-node Proxmox Ceph cluster using a full-mesh broadcast network per https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Broadcast_Setup

The nodes are directly connected to each other without a switch. Yes, since it's broadcast traffic, each node gets Ceph public, private, and Corosync traffic. As we know, nodes drop packets not addressed to them. I use a 169.254.x.y/24 network to make sure it's link-local traffic and never gets routed.

No issues. I just made sure the migration traffic is using this broadcast network and it's unencrypted. Makes for fast migrations.
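
For comparison, the broadcast setup from that wiki page is essentially one bond over both direct links per node; adapted to the interface names from the first post it would look roughly like this (the 169.254 address just follows the link-local idea above):
Code:
iface ens9f0np0 inet manual
iface ens9f1np1 inet manual

auto bond0
iface bond0 inet static
    address 169.254.10.1/24
    bond-slaves ens9f0np0 ens9f1np1
    bond-mode broadcast
    bond-miimon 100
    mtu 9000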
 
I run a 3-node Proxmox Ceph cluster using a full-mesh broadcast network per https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#Broadcast_Setup

The nodes are directly connected to each other without a switch. Yes, since it's broadcast traffic, each node gets Ceph public, private, and Corosync traffic. As we know, nodes drop packets not addressed to them. I use a 169.254.x.y/24 network to make sure it's link-local traffic and never gets routed.

Do you also have redundancy, e.g. for corosync? I would be wary of having it on just one link due to the latency sensitivity of corosync.
 
Hm, then parts of the docs contradict each other:
We recommend a network bandwidth of at least 10 Gbps, or more, to be used exclusively for Ceph traffic. A meshed network setup [4] is also an option for three to five node clusters, if there are no 10+ Gbps switches available.

Important: The volume of traffic, especially during recovery, will interfere with other services on the same network; especially the latency-sensitive Proxmox VE corosync cluster stack can be affected, resulting in possible loss of cluster quorum. Moving the Ceph traffic to dedicated and physically separated networks will avoid such interference, not only for corosync, but also for the networking services provided by any virtual guests.
For estimating your bandwidth needs, you need to take the performance of your disks into account. While a single HDD might not saturate a 1 Gbps link, multiple HDD OSDs per node can already saturate 10 Gbps. If modern NVMe-attached SSDs are used, a single one can already saturate 10 Gbps of bandwidth, or more. For such high-performance setups we recommend at least 25 Gbps, while even 40 Gbps or 100+ Gbps might be required to utilize the full performance potential of the underlying disks.

If unsure, we recommend using three (physical) separate networks for high-performance setups:

  • one very high bandwidth (25+ Gbps) network for Ceph (internal) cluster traffic.
  • one high bandwidth (10+ Gbps) network for Ceph (public) traffic between the Ceph server and Ceph client storage traffic. Depending on your needs this can also be used to host the virtual guest traffic and the VM live-migration traffic.
  • one medium bandwidth (1 Gbps) exclusive for the latency sensitive corosync cluster communication.

https://pve.proxmox.com/wiki/Deploy...r#_recommendations_for_a_healthy_ceph_cluster

Network Requirements

The Proxmox VE cluster stack requires a reliable network with latencies under 5 milliseconds (LAN performance) between all nodes to operate stably. While on setups with a small node count a network with higher latencies may work, this is not guaranteed and gets rather unlikely with more than three nodes and latencies above around 10 ms.
The network should not be used heavily by other members, as while corosync does not use much bandwidth it is sensitive to latency jitters; ideally corosync runs on its own physically separated network. Especially do not use a shared network for corosync and storage (except as a potential low-priority fallback in a redundant configuration).
Before setting up a cluster, it is good practice to check if the network is fit for that purpose. To ensure that the nodes can connect to each other on the cluster network, you can test the connectivity between them with the ping tool.
If the Proxmox VE firewall is enabled, ACCEPT rules for corosync will automatically be generated - no manual action is required.

https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_cluster_network

But since the tutorial you referenced is part of the wiki and not of the reference manual, I would still follow the recommendation to have dedicated network links for corosync and Ceph traffic.
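
For completeness, corosync redundancy does not need extra hardware beyond a second reachable network: kronosnet supports multiple links, declared per node in /etc/pve/corosync.conf. A sketch of what a nodelist entry could look like (addresses taken from the first post; using the Ceph public network as a fallback link is an assumption, and config_version must be bumped when editing the file):
Code:
node {
  name pve1
  nodeid 1
  quorum_votes 1
  ring0_addr 192.168.170.250   # management network, link 0
  ring1_addr 172.16.10.1       # Ceph public network as fallback link 1 (assumption)
}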
 
Hi everyone,


I wanted to give an update since I decided to start from scratch when building my 3-node Proxmox + Ceph cluster.


This time I dedicated 6 × 25Gb ports (2 per node) for Ceph, set up in a full mesh, and I’m using FRR (fabric) for routing and redundancy. The networking part is now stable and performant: from a single node with iperf3 I can easily saturate 25Gb (50Gb aggregated) towards the other two nodes.
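
For anyone who wants to reproduce that kind of test: saturating both links from one node is basically two parallel iperf3 runs towards the two peers (the addresses are placeholders for whatever the fabric assigns):
Code:
# on the two peers: iperf3 -s
# on the sending node: multiple streams to each peer simultaneously
iperf3 -c 10.10.10.2 -P 4 -t 30 &
iperf3 -c 10.10.10.3 -P 4 -t 30 &
wait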


On the storage side I’m trying to push my 9 × Samsung 990 Pro (3 per node) as much as possible, running with replica 3. Unfortunately, I can’t get past ~2000 MB/s for both reads and writes, even though the network could handle much more.


I’ve already gone through a lot of settings and checks, but I just can’t seem to get higher performance.
Has anyone got ideas or tips on what else I could look into?


With 64 threads I see ~2000 MB/s:
rados bench -p bench 60 write --no-cleanup -b 4M -t 64 --run-name bench1
2025-09-12T16:45:03.836289+0200 min lat: 0.035427 max lat: 0.327924 avg lat: 0.138118
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
60 24 27798 27774 1851.35 1960 0.0448852 0.138118
Total time run: 60.0931
Total writes made: 27798
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1850.33
Stddev Bandwidth: 75.8292
Max bandwidth (MB/sec): 2012
Min bandwidth (MB/sec): 1664
Average IOPS: 462
Stddev IOPS: 18.9573
Max IOPS: 503
Min IOPS: 416
Average Latency(s): 0.138074
Stddev Latency(s): 0.0233944
Max latency(s): 0.327924
Min latency(s): 0.0227445
root@nodo1:~#

rados bench -p bench 60 seq -t 32 --run-name bench1
2025-09-12T16:46:35.525433+0200 min lat: 0.00928774 max lat: 0.399551 avg lat: 0.0713205
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
60 18 26744 26726 1781.08 1844 0.232954 0.0713205
Total time run: 60.0845
Total reads made: 26744
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1780.43
Average IOPS: 445
Stddev IOPS: 19.2463
Max IOPS: 481
Min IOPS: 387
Average Latency(s): 0.0713596
Max latency(s): 0.399551
Min latency(s): 0.00928774
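
For comparison, the raw per-OSD write speed (which leaves the network out of the picture) can be checked with the built-in OSD bench; the sizes below mirror the 4M objects used in the rados bench runs:
Code:
# 1 GiB of 4 MiB writes executed inside every OSD, no client or network involved
ceph tell osd.* bench 1073741824 4194304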

Any tips for me?
 
Hi, my Ceph cluster (replication) network runs on 25GbE and the public network is on a dedicated 10GbE link.
So there's still plenty of bandwidth available.
- [ ] **Ceph Public** network on `vmbr0` → `172.16.10.0/24`

- [ ] **Management + VM** network on `vmbr1` → `192.168.170.0/24` (VLAN 174)

- [ ] **Ceph Cluster (replication)** network on the 25GbE interfaces

- [ ] MTU 9000 on all interfaces