Proxmox with ceph network setup

RandomVash

I've been playing with a 4-node cluster using some server nodes that used to be in a hyperconverged setup. I had everything running on the main network and performance was pretty slow. I then added a 1G switch for ceph_cluster and things improved roughly 2x. After that I added 4-port SFP+ NICs to all four nodes, with one cable connected from each node to a 10G switch. This improved performance more than 10x.

After this I decided to try moving ceph_public onto the same 10G port as ceph_cluster, and performance went back down to where it was when everything was on a 1G connection. Since then I've separated ceph_public and ceph_cluster onto different subnets, connected to two different 10G switches, and performance has not gone back up; if anything it's gone down.

All OSDs were restarted, then I restarted ceph.target on each node, then I restarted the nodes themselves. No improvement after separating onto two different 10G switches. I don't know what I'm missing here. We've also been using the storage pretty regularly on the server, so I don't want to scrap the whole thing and start over if I don't have to. Reading forums I saw some people suggest switching the MTU from the default 1500 to 9000, but it's weird that I had a configuration that was working decently and now that it's on a higher-bandwidth network it's tanking.
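For reference, a minimal sketch of how a jumbo-frames test would be checked and verified (the interface name enp5s0f0 and the 10.0.0.3 ping target are placeholders, and the MTU has to match on every NIC, bridge, and switch port along the Ceph path):

# Check current MTU and link state on a 10G port (interface name is an example)
ip link show enp5s0f0

# Temporarily set MTU 9000 for testing; every NIC, bridge, and switch port
# on the ceph_public/ceph_cluster path must be set to the same value
ip link set enp5s0f0 mtu 9000

# Verify that 9000-byte frames actually pass end-to-end without fragmentation
# (8972 = 9000 minus 28 bytes of IP/ICMP headers)
ping -M do -s 8972 10.0.0.3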

Current setup:
Main network delivered over a 1G switch to each node - 192.168.30.0/24
One 10G port on each node for ceph_public, connected to a 10G switch - 10.0.1.0/24
One 10G port on each node for ceph_cluster, connected to a separate 10G switch - 10.0.0.0/24
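Roughly, the per-node network config in /etc/network/interfaces for the layout above would look something like the sketch below (the interface names, the vmbr0 bridge, and the 192.168.30.x / 10.0.0.x host addresses are assumptions for illustration; only the 10.0.1.x monitor addresses come from the config further down):

auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

# Main network bridge for VMs and management (1G switch)
auto vmbr0
iface vmbr0 inet static
        address 192.168.30.11/24
        gateway 192.168.30.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0

# ceph_public on the first SFP+ port (10G switch #1)
auto enp5s0f0
iface enp5s0f0 inet static
        address 10.0.1.2/24

# ceph_cluster on the second SFP+ port (10G switch #2)
auto enp5s0f1
iface enp5s0f1 inet static
        address 10.0.0.2/24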
 

This is how each node is set up, in case I'm doing something wrong here:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.0.0.0/24
fsid = 421c8c2e-1155-43c2-8d86-8e294570195d
mon_allow_pool_delete = true
mon_host = 10.0.1.2 10.0.1.3 10.0.1.4 10.0.1.5
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.0.1.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.prox1]
host = prox1
mds_standby_for_name = pve

[mds.prox2]
host = prox2
mds_standby_for_name = pve

[mds.prox3]
host = prox3
mds_standby_for_name = pve

[mds.prox4]
host = prox4
mds_standby_for_name = pve

[mon.prox1]
public_addr = 10.0.1.2

[mon.prox2]
public_addr = 10.0.1.3

[mon.prox3]
public_addr = 10.0.1.4

[mon.prox4]
public_addr = 10.0.1.5
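For completeness, confirming whether the daemons actually rebound to the new public/cluster subnets looks something like this (standard Ceph commands; OSD id 0 is just an example):

# Confirm the monitors are on the 10.0.1.x public network
ceph mon dump

# Confirm an OSD's front (public) and back (cluster) addresses landed on the
# intended subnets; repeat for other OSD ids
ceph osd metadata 0 | grep addr

# Overall cluster health; slow ops, laggy PGs, or down/flapping OSDs show up here
ceph -s
ceph health detail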
 
First I would check that there are no network issues (packet loss, all links negotiating at 10G, no loops, etc.). Can each host reach every other host over that network, and what is the measured bandwidth between each pair?

How are you measuring performance?
What performance are you getting?

I would also stack the two switches, do LACP across them, and group the two ports together, but first you want to get to ~2-3 Gbps of throughput to an image (or better) under load.
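Something along these lines covers the basics (standard iperf3 / ethtool / ping usage; the addresses are the subnets described above and the interface name is an example):

# On one node, start an iperf3 server bound to its ceph_public address
iperf3 -s -B 10.0.1.2

# From each other node, test toward it in both directions; repeat for every
# pair of nodes, and again on the 10.0.0.x cluster network
iperf3 -c 10.0.1.2 -t 30
iperf3 -c 10.0.1.2 -t 30 -R

# Check the negotiated link speed on each 10G port
ethtool enp5s0f0 | grep Speed

# Look for packet loss between nodes
ping -c 1000 -i 0.01 10.0.1.3 | tail -2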
 
Agree with guruevi. Maybe you can test network performance with iperf to get a baseline of your current network throughput, then use nmon to collect a daily performance chart on all nodes. Comparing the two can show you which performance bottleneck you are facing.
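A minimal sketch of that kind of nmon baseline collection, assuming nmon is installed from the Debian repositories (the interval and count values are just examples):

# Install nmon on each node
apt install nmon

# Record one sample every 60 seconds, 1440 samples (~24 hours), to a .nmon
# file that can be graphed afterwards (e.g. with nmonchart)
mkdir -p /root/nmon-logs
nmon -f -s 60 -c 1440 -m /root/nmon-logs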
 
Agree with guruevi. Maybe you can test network performance with iperf to get a baseline of your current network throughput, then use nmon to collect a daily performance chart on all nodes. Comparing the two can show you which performance bottleneck you are facing.
I used iperf3 earlier from each server to every other server; the results were all practically the same. I'll have to look into using nmon.

[Screenshot: iperf3 output between nodes, showing throughput along with the Retr (retransmits) column]


First I would check that there are no network issues (packet loss, all links negotiating at 10G, no loops, etc.). Can each host reach every other host over that network, and what is the measured bandwidth between each pair?

How are you measuring performance?
What performance are you getting?

I would also stack the two switches, do LACP across them, and group the two ports together, but first you want to get to ~2-3 Gbps of throughput to an image (or better) under load.
I've been judging performance overall based on the Ceph read/write graphs. These are currently around 1-3 MiB/s on reads and 0.5-2 MiB/s on writes, which is the slowest it's been. When I first moved the cluster network to its own 1G dumb switch it was running around 8-10 MiB/s on both, and when I first switched the cluster to 10G it was running around 40-50 MiB/s.

I wanted to get everything working smoothly before I tried to stack the switches. Right now they are on the default config.

Doing a 4 GB read/write test on one of the VMs I'm also getting 1.52 GB/s writes and 2.87 GB/s reads, but navigating folders or moving files corresponds with the Ceph read/write speeds.
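For a more repeatable measurement than the dashboard graphs, something like rados bench against a throwaway pool, or fio inside a VM with small random I/O, might be worth running (the pool name, file path, and sizes below are just examples):

# Benchmark raw RADOS throughput against a temporary test pool, then clean up
ceph osd pool create bench-test 32
rados bench -p bench-test 30 write --no-cleanup
rados bench -p bench-test 30 seq
rados -p bench-test cleanup
ceph osd pool delete bench-test bench-test --yes-i-really-really-mean-it

# Inside a VM, test 4k random I/O, which is closer to "navigating folders"
# than a large sequential transfer
fio --name=randrw --filename=/root/fio-test --size=4G --bs=4k \
    --rw=randrw --ioengine=libaio --iodepth=16 --direct=1 \
    --runtime=60 --time_based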
 
I'm not sure what changed, but everything seemed great today. Speeds are much higher and there's no lag on the VMs. Maybe it just took a few days to settle after the changes?
 
Uhm, the retr (retransmits) are way too high. They should be 0 or near 0. You may have cabling or switch issues. Your congestion window swings wildly as a result.
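To narrow down whether it's a cable, NIC, or switch port, the interface and driver error counters are worth checking on each node (standard iproute2 / ethtool commands; the interface name is an example):

# Per-interface packet error and drop counters
ip -s link show enp5s0f0

# Driver/NIC-level counters; CRC or symbol errors usually point at a bad
# cable, transceiver, or switch port
ethtool -S enp5s0f0 | grep -iE 'err|drop|crc'

# Host-wide TCP retransmit statistics
nstat -az | grep -i retrans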
 
Uhm, the retr (retransmits) are way too high. They should be 0 or near 0. You may have cabling or switch issues. Your congestion window swings wildly as a result.
Thank you for pointing that out. I didn't realize that's what that value was; it explains the swings in performance. Checking the connections between each pair of nodes, the number of retransmits is significantly lower on certain interfaces, so I'm going to order new cables. Some of the cables I was using for testing were much longer than needed, and they were all used previously.
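A sketch of that kind of per-link comparison, looping iperf3 over every peer and watching the Retr column (assumes each peer is running "iperf3 -s", and that the cluster-network addresses mirror the public ones, which is an assumption rather than something stated in the thread):

# Run on each node; skip this node's own two addresses.
# A consistently high Retr count on one path points at that cable or port.
for ip in 10.0.1.2 10.0.1.3 10.0.1.4 10.0.1.5 \
          10.0.0.2 10.0.0.3 10.0.0.4 10.0.0.5; do
    echo "=== $ip ==="
    iperf3 -c "$ip" -t 10 | tail -4
done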