Feedback on Using a Single /24 for All Traffic in a Meshed Proxmox Cluster with Ceph

mouk

Renowned Member
May 3, 2016
Hi,

We’ve reviewed the relevant wiki articles, and we’re looking for feedback on a networking strategy for our Proxmox and Ceph cluster setup. Specifically, we aim to avoid using multiple arbitrary IPs and would prefer to use a single /24 network for all traffic, including Ceph cluster traffic. Here’s our current idea:

Setup Overview:
  • Proxmox Cluster: We have five Proxmox servers with IPs 192.168.2.50 through 192.168.2.54.
  • Ceph Cluster Network: The first three machines (with additional 10G NICs) will be used to create the Ceph mesh network. These machines will have some extra network configuration for ens18 and ens19.
  • Routing: All five machines will use 192.168.2.1 as the default gateway, reached via a switch (4× 1G UTP connections bonded with LACP).
Network Configuration:

The first three nodes have additional configurations for ens18 and ens19 as follows:

Code:
/etc/network/interfaces

auto lo
iface lo inet loopback

# ens18 connected to Node2 (192.168.2.51)
auto ens18
iface ens18 inet static
    address 169.254.50.51
    netmask 255.255.0.0
    post-up ip route add 192.168.2.51/32 dev ens18
    post-down ip route del 192.168.2.51/32 dev ens18

# ens19 connected to Node3 (192.168.2.52)
auto ens19
iface ens19 inet static
    address 169.254.50.52
    netmask 255.255.0.0
    post-up ip route add 192.168.2.52/32 dev ens19
    post-down ip route del 192.168.2.52/32 dev ens19

# Primary bridge for external communication
auto vmbr0
iface vmbr0 inet static
    address 192.168.2.50
    netmask 255.255.255.0
    gateway 192.168.2.1
    bridge_ports ens20
    bridge_stp off
    bridge_fd 0

Key Points:
  • 169.254.x.y/16 link-local addresses: used for the direct Layer 2 peer-to-peer links between the meshed cluster nodes.
  • Regular Traffic: Routed to the default gateway/switch (no change).
  • Traffic to a meshed peer: only traffic destined for the directly connected peer (matched by the /32 routes) goes over the point-to-point link; see the quick verification sketch after this list.
  • IP Usage: Each cluster server uses just one IP from the 192.168.2.0/24 subnet, leaving the remaining IPs available for other purposes.
  • Ceph Cluster Access: Even the two non-meshed nodes (without 10G connections) can still access the Ceph cluster.
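
To sanity-check this from Node1, we'd confirm that the /32 routes win over the default route and that each peer answers over its direct link, roughly like this:

Code:
# which path does the kernel pick for each meshed peer?
ip route get 192.168.2.51
ip route get 192.168.2.52

# should be answered over the direct 10G links, not via the switch
ping -c 3 192.168.2.51
ping -c 3 192.168.2.52
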
Questions:
  • Is this approach valid as a meshed network configuration for Proxmox + Ceph, even though it’s not listed in the wiki?
  • Any potential issues with using 169.254.x.y addresses for peer-to-peer communication instead of using addresses from our /24?
We think this setup is clean and simple, but we'd love feedback or suggestions for improvement!

Thanks!
 
As a general note, I'm not sure why you're using the 169.254/16 IPs; why not use another range from RFC1918 there? And if it's for direct communication, surely something smaller than a /16 (or even a /24) would work, e.g. a /30 per link (see the sketch below).

Which leads to my other point: why bother? Either it's full mesh or it isn't; if you have machines that are not part of the mesh, then it's not a full mesh, is it? Maybe I'm missing something here. And if it's not full mesh, then just use bonding for the servers that can benefit from it?
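
For instance, a /30 per point-to-point link would be plenty; roughly like this on Node1 (the 10.15.15.x addresses are just an example I picked):

Code:
# Node1, direct link to Node2
auto ens18
iface ens18 inet static
    address 10.15.15.1
    netmask 255.255.255.252

# Node1, direct link to Node3
auto ens19
iface ens19 inet static
    address 10.15.15.5
    netmask 255.255.255.252

Node2 would then use 10.15.15.2 on its side of the first link, Node3 10.15.15.6 on the second, and so on.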
 
Hi Gilou,
Thanks for your response. Appreciated.

It's "full mesh" for the first three of the five nodes. The remaining two are not participating in the full-mesh Ceph replication and have no OSDs, but they are in the same /24 IP range and in the same PVE cluster. They would have access to the Ceph storage, but for 'consumption' only, more as a PoC, and they would access it through a 4× 1G LACP-bonded link via a switch.
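
For illustration, those two nodes would only carry something like this (the NIC names and the .53 address are just placeholders):

Code:
# 4x 1G LACP bond towards the switch
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2 eno3 eno4
    bond-miimon 100
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4

# single IP in the 'main' /24, same as on the meshed nodes
auto vmbr0
iface vmbr0 inet static
    address 192.168.2.53
    netmask 255.255.255.0
    gateway 192.168.2.1
    bridge_ports bond0
    bridge_stp off
    bridge_fd 0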

The reason for using 169.254.0.0/16 (or a /30 or /29) is that I'm trying to avoid claiming any other ranges, whether they already exist here or not. In the proposal above, each node has only one IP in the 'main' 192.168.2.0/24 subnet.

This also gives us a very easy upgrade path: if we ever want to get rid of the meshed network parts and move all hosts over to fast fibre or DAC, we would need to change almost nothing.
 
If you saturate your outbound interface, your cluster will break. This is easy to test: benchmark your storage, run some live migrations, and watch the logs for how many corosync messages show up, or even nodes rebooting. I would not suggest running such a setup in production. There is a reason you need multiple networks for this, especially if you're only operating on 1 GbE (LACP doesn't help here, since a single connection never gets more than one link's bandwidth). There are things you can do, like corosync link prioritization, to mitigate the problems of having only one interface; see the sketch below. I also don't get why you did not set up the system with FRR as described in the docs.

As a PoC, this is VERY bad. What you would show is a system that does not perform well and breaks under load. That can easily be mistaken for a flaw in PVE, which it is not. Why don't you just stick to the best practice that is documented EVERYWHERE (PVE handbook, PVE wiki, Corosync and Ceph docs)?
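
To give an idea: corosync over knet supports multiple links with priorities, so cluster traffic sticks to the preferred link and only falls back to the busy one. A totem section along these lines would do it (link numbers, priorities and the cluster name are just an example; on PVE you also have to bump config_version whenever you edit /etc/pve/corosync.conf):

Code:
totem {
  version: 2
  cluster_name: mycluster
  config_version: 4
  ip_version: ipv4-6
  interface {
    linknumber: 0
    knet_link_priority: 20   # preferred: a dedicated corosync link
  }
  interface {
    linknumber: 1
    knet_link_priority: 10   # fallback: the busy 192.168.2.0/24
  }
}

Each node entry in the nodelist then needs a matching ring0_addr and ring1_addr.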
 