Confused... Ceph delivering the same performance on 100G as it did in the 1G test

jmpfas

Renowned Member
Oct 29, 2015
So... been using Proxmox for many years (almost 20), and have several existing server clusters.
Just building a new server stack: 6× Dell R760, 768GB RAM, dual 32-core CPUs, each server with 11× 850GB 24Gbps SAS SSDs, 4× 1G NICs, plus 2× dual-port 100G NICs.
The drives are rated at 3,400 MB/s.


Built the cluster using the 1G NICs because I did not have the 100G cables yet. No production load yet, just building things out.

So, testing the server-to-server speed on the 1G delivered exactly what was expected:
iperf showed 0.95 Gbit/s on the 1G NICs being used for Ceph.

Using fio to test disk speed on a VM with its disk on Ceph delivered:
WRITE: bw=115MiB/s (121MB/s), 28.8MiB/s-48.5MiB/s (30.2MB/s-50.9MB/s), io=4096MiB (4295MB), run=21115-35574msec
which I thought was pretty good over 1G.
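For reference, the fio run was along these lines (a hypothetical reconstruction from the output above; the actual job file was not posted, and flags like --numjobs=4 are guesses):

```shell
# Hypothetical reconstruction of the fio job (actual flags were not posted).
# io=4096MiB total with a per-job bandwidth range suggests 4 jobs of 1GiB each,
# direct I/O against the VM disk that lives on Ceph.
fio --name=ceph-write-test --rw=write --bs=4M --size=1G --numjobs=4 \
    --iodepth=16 --direct=1 --ioengine=libaio --group_reporting
```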

Then we plugged in the 100G cables and I moved the Ceph IPs from the 1G NIC to one of the 100G ports (will eventually bond two 100G for failover/speed, but starting simple).
And this is what I get:
WRITE: bw=138MiB/s (145MB/s), 34.6MiB/s-34.7MiB/s (36.3MB/s-36.3MB/s), io=4096MiB (4295MB), run=29543-29604msec

Disk stats (read/write):
dm-0: ios=0/10579, merge=0/0, ticks=0/6294053, in_queue=6294200, util=98.46%, aggrios=0/9163, aggrmerge=0/1413, aggrticks=0/5415964, aggrin_queue=5415960, aggrutil=98.46%
sda: ios=0/9163, merge=0/1413, ticks=0/5415964, in_queue=5415960, util=98.46%

Barely any improvement.
BUT, using iperf to test server-to-server on that NIC I get 94.1 Gbit/s.

So: the 100G interfaces are in fact delivering 100G, but somehow Ceph is not seeing it. On my other clusters, using 100G but 6G SATA SSDs (and fewer than half the number of OSDs), I get 577MB/s, which is just about the speed of the EVO 870 SSDs in that cluster.

Any thoughts here? This is one of those "the simple crap breaks and the complex stuff works flawlessly" situations.
Here is my Ceph config (all IPs are private, so no security issue):
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.180.195.211/24
fsid = d316d239-221c-495c-b59c-11b89120c17d
mon_allow_pool_delete = true
mon_host = 10.180.194.211 10.180.194.212 10.180.194.213 10.180.194.214
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.180.194.211/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.node21]
host = node21
mds_standby_for_name = pve

[mds.node22]
host = node22
mds_standby_for_name = pve

[mds.node23]
host = node23
mds_standby_for_name = pve

[mon.node21]
public_addr = 10.180.194.211

[mon.node22]
public_addr = 10.180.194.212

[mon.node23]
public_addr = 10.180.194.213

[mon.node24]
public_addr = 10.180.194.214

And the interfaces (with public IPs xx'd out):

auto lo
iface lo inet loopback

auto nic0_1G
iface nic0_1G inet static
address 10.180.192.211/24
#internal/backup

iface nic1_1G inet manual

auto nic2_11G
iface nic2_11G inet static
address 10.180.195.211/24
#cluster

iface nic3_1G inet manual

auto nic4_100G00
iface nic4_100G00 inet static
address 10.180.194.211/24
#CEPH

iface nic5_100G01 inet manual

iface nic6_100G10 inet manual

iface nic7_100G11 inet manual

auto vmbr0
iface vmbr0 inet static
address xx.xx.xx.xx/26
gateway xx.xx.xx.xx
bridge-ports nic1_1G
bridge-stp off
bridge-fd 0

auto vmbr2
iface vmbr2 inet static
address 10.180.191.211/24
bridge-ports nic7_100G11
bridge-stp off
bridge-fd 0

source /etc/network/interfaces.d/*
 
One more note: after I moved the Ceph interface from one NIC to the other, Ceph seemed fine, but the virtual bridges for the VMs were not working. Ended up rebooting the servers one at a time, and then those came back.
All tests were done after the server reboots.
 
May have found it: based on the install manual I had left num_pgs=auto, which resulted in 48 PGs... for 66 850GB drives!
I just set it to 4096 and will test after it finishes rebalancing.
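For reference, the usual rule of thumb is (OSDs × 100) / replica count, rounded to a power of two. A quick sanity check, assuming the 66 OSDs and size=3 from above (the function name is my own):

```python
# Rule-of-thumb PG count: (num_osds * target_pgs_per_osd) / replica_size,
# rounded to the nearest power of two (target of ~100 PGs per OSD).
def suggested_pg_num(num_osds: int, replica_size: int, pgs_per_osd: int = 100) -> int:
    raw = num_osds * pgs_per_osd / replica_size
    lower = 1 << (int(raw).bit_length() - 1)  # nearest power of two below
    upper = lower * 2                          # nearest power of two above
    return upper if raw - lower > upper - raw else lower

print(suggested_pg_num(66, 3))  # 2048
```

So 2048 is the textbook answer for this pool; 4096 just leaves headroom for growth. Either is orders of magnitude better than the 48 the autoscaler left it at.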
 
Well, increasing num_pgs to 4096 helped, but not enough.
Now getting:
WRITE: bw=6598MiB/s (6918MB/s), 217MiB/s-708MiB/s (228MB/s-743MB/s), io=476GiB (511GB), run=73787-73845msec

Disk stats (read/write):
dm-0: ios=0/33834, merge=0/0, ticks=0/12202032, in_queue=12202030, util=99.77%, aggrios=0/32873, aggrmerge=0/955, aggrticks=0/11576479, aggrin_queue=11576457, aggrutil=99.77%
sda: ios=0/32873, merge=0/955, ticks=0/11576479, in_queue=11576457, util=99.77%

But on the old cluster, with slower/fewer OSDs, I get:
WRITE: bw=12.3GiB/s (13.3GB/s), 618MiB/s-979MiB/s (648MB/s-1027MB/s), io=805GiB (864GB), run=62597-65199msec

Disk stats (read/write):
dm-0: ios=0/88709, merge=0/0, ticks=0/10806199, in_queue=10807338, util=99.61%, aggrios=0/84180, aggrmerge=0/4483, aggrticks=0/10070944, aggrin_queue=10070794, aggrutil=99.63%
sda: ios=0/84180, merge=0/4483, ticks=0/10070944, in_queue=10070794, util=99.63%
So: 13.3GB/s on slower hardware vs 6.9GB/s on faster hardware with more OSDs.
 
cluster_network = 10.180.195.211/24
public_network = 10.180.194.211/24
auto nic2_11G
iface nic2_11G inet static
address 10.180.195.211/24
#cluster
auto nic4_100G00
iface nic4_100G00 inet static
address 10.180.194.211/24
#CEPH

Either use only the public network or move the cluster network to one of the other 100G ports (on all nodes, of course).

[0] https://docs.ceph.com/en/squid/rados/configuration/network-config-ref
 
Move your cluster_network IP to the 100G NIC too. (cluster_network is used for OSD replication when defined, so it's limiting your write speed.)

Code:
auto nic4_100G00
iface nic4_100G00 inet static
    address 10.180.194.211/24
    address 10.180.195.211/24
#CEPH
 
Move your cluster_network IP to the 100G NIC too. (cluster_network is used for OSD replication when defined, so it's limiting your write speed.)

Code:
auto nic4_100G00
iface nic4_100G00 inet static
    address 10.180.194.211/24
    address 10.180.195.211/24
#CEPH
Hmm. I will try that (I was already planning on moving the cluster network to 100G), but the docs say the Ceph interface handles both replication and RADOS, and they talk about moving replication to a different network for performance. If what you said is correct, the docs are wrong.
But I will try it.
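For what it's worth, the change itself is just pointing cluster_network at the 100G subnet in /etc/pve/ceph.conf and restarting the OSDs; afterwards the OSD metadata shows which front (public) and back (replication) addresses each OSD is actually bound to. A sketch (commands assumed against a standard Proxmox/Ceph setup, not verified on this cluster):

```shell
# In /etc/pve/ceph.conf, point the cluster network at the 100G subnet:
#   cluster_network = 10.180.194.0/24
# Then restart the OSDs one node at a time:
systemctl restart ceph-osd.target

# Verify which addresses OSD 0 is actually using for public (front)
# and replication (back) traffic:
ceph osd metadata 0 | grep -E '"(front|back)_addr"'
```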
 
num_pgs=auto, which resulted in 48 PGs
On the off chance it applies: I found that if only a target size is set, the autoscaler seems not to make changes. Once I set a target ratio, it kicked in immediately.

The Ceph public network is the original communication, and the optional cluster network is replication and heartbeats.

https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster#pve_ceph_install_wizard
https://docs.ceph.com/en/latest/rad...f/#:~:text=dedicated replication network,-for

Above, it looks like that is on nic2_11G; I'm assuming the (11G) in the name is the speed?
 
move your cluster_network ip on the 100Gb too. (cluster_network is used for osd replication when defined, so it's limiting your write speed)
This doesn't result in any meaningful benefit vs. just having the same address for public and private traffic. OP, if you have multiple switches, I would create LAGs for public and private traffic, and make sure to cross physical NICs (presuming nic4 and nic5 are actually nic1s0p0 and nic1s0p1, bond0 would enslave nic1s0p0 and nic2s0p0, etc.). If you don't have multiple switches, just assign the public and private networks to separate interfaces.
 
On the off chance it applies: I found that if only a target size is set, the autoscaler seems not to make changes. Once I set a target ratio, it kicked in immediately.

The Ceph public network is the original communication, and the optional cluster network is replication and heartbeats.

https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster#pve_ceph_install_wizard
https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#:~:text=dedicated replication network,-for

Above, it looks like that is on nic2_11G; I'm assuming the (11G) in the name is the speed?
OK, the NIC labeled "cluster" is the Proxmox cluster network, not the optional Ceph cluster network.
Right now it is on 10G, but it is moving to 100G once I am 100% sure I have the 100G working properly.
Later today I am adding cross-connect cables between the two 100G switches and will try a LAG group for Ceph. Well, first just setting up a new interface on a LAG group (two 100G ports bonded balance-xor: one port on switch A from 100G NIC0 P0, one port on switch B from 100G NIC2 P0, for full redundancy).
Once I see 200G traffic flowing, I will shift the Ceph IPs to this set and see what we see.
As I indicated above, this is the same setup as on our older server cluster, so I know what to expect... just not seeing it yet.
 
This doesn't result in any meaningful benefit vs. just having the same address for public and private traffic. OP, if you have multiple switches, I would create LAGs for public and private traffic, and make sure to cross physical NICs (presuming nic4 and nic5 are actually nic1s0p0 and nic1s0p1, bond0 would enslave nic1s0p0 and nic2s0p0, etc.). If you don't have multiple switches, just assign the public and private networks to separate interfaces.
OK, the NIC labeled "cluster" is the Proxmox cluster network, not the optional Ceph cluster network.
Right now it is on 10G, but it is moving to 100G once I am 100% sure I have the 100G working properly.
Later today I am adding cross-connect cables between the two 100G switches and will try a LAG group for Ceph. Well, first just setting up a new interface on a LAG group (two 100G ports bonded balance-xor: one port on switch A from 100G NIC0 P0, one port on switch B from 100G NIC2 P0, for full redundancy).
Once I see 200G traffic flowing, I will shift the Ceph IPs to this set and see what we see.
As I indicated above, this is the same setup as on our older server cluster, so I know what to expect... just not seeing it yet.
 
Right now it is on 10G but moving to 100G
that explains your observed performance.
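As a back-of-the-envelope sketch (illustrative assumptions only: full link utilization, no local replicas, no caching or protocol overhead), with size=3 each primary forwards two replica copies over the cluster network, so the replication link caps client writes at roughly link/(size-1) per node:

```python
# Rough write ceiling when the cluster (replication) network is the bottleneck.
# Assumptions (illustrative only): replication factor `size`, each primary
# forwards (size - 1) copies over the cluster network, full link rate usable.
def write_ceiling_mb_s(link_gbit: float, size: int, nodes: int) -> float:
    per_node = link_gbit * 1000 / 8 / (size - 1)  # MB/s of client writes per node
    return per_node * nodes

print(write_ceiling_mb_s(10, 3, 6))   # ~3750 MB/s on a 10G cluster network
print(write_ceiling_mb_s(100, 3, 6))  # ~37500 MB/s on 100G
```

The absolute fio numbers above won't match this exactly (VM write caching and parallelism muddy the picture), but it shows why a 10G replication path throttles a cluster whose OSDs can each do 3,400 MB/s.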

first just set up a new interface on a LAG group (two 100G ports bonded balance-xor, one port on switch A from 100G NIC0, P0, one port on switch B from 100G NIC2 p0 for full redundancy)
LACP is your first choice. If that's not possible, use active-backup, and MAKE SURE the switches have plenty of bandwidth interconnecting them. balance-xor sounds good on paper but not in practice.
Once I see 200G traffic flowing, I will shift the Ceph IPs to this set and see what we see.
Set your expectations: bonding isn't the same as "adding." A single packet cannot go across two interfaces :) You should be able to realize ~160Gbit/s in aggregate using LACP.
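For completeness, once the cross-connects are in, an LACP bond for the Ceph network in /etc/network/interfaces would look roughly like this (a sketch only: the interface names and address are taken from the post above, bond0 is a hypothetical name, and the switch ports must be configured for LACP/MLAG as well):

```text
auto bond0
iface bond0 inet static
    address 10.180.194.211/24
    bond-slaves nic4_100G00 nic6_100G10
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-miimon 100
#CEPH (LACP across the two physical 100G NICs)
```

With layer3+4 hashing, different OSD connections land on different member links, which is what lets aggregate Ceph traffic exceed a single 100G port even though any one connection cannot.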