confused...CEPH delivering same performance on 100G as it did on 1G test

jmpfas

Renowned Member
Oct 29, 2015
So... been using Proxmox for many years (almost 20) and have several existing server clusters.
Just building a new server stack: 6x Dell R760, 768GB RAM, dual 32-core CPUs. Each server has 11x 850GB 24Gbps SAS SSDs, 4x 1G NICs plus 2x dual-port 100G NICs.
The drives are rated at 3,400 MB/s.


Built the cluster using the 1G NICs because I did not have the 100G cables yet. No production - just building things out.

SO - testing the server-to-server speed on the 1G delivered exactly what was expected:
iperf showed 0.95 Gbit/sec on the 1G NICs being used for Ceph.

Using fio to test disk speed on a VM with its disk on Ceph delivered:
WRITE: bw=115MiB/s (121MB/s), 28.8MiB/s-48.5MiB/s (30.2MB/s-50.9MB/s), io=4096MiB (4295MB), run=21115-35574msec
which I thought was pretty good over 1G
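The exact fio job isn't shown above; a sequential-write job of roughly this shape would produce that kind of multi-job WRITE summary (all parameters below are illustrative assumptions, not the poster's actual test):

```ini
; hypothetical fio job - parameters are illustrative, not the poster's actual command
[seq-write]
rw=write
bs=4M
size=1024M
numjobs=4
ioengine=libaio
iodepth=16
direct=1
filename=/path/to/testfile   ; a file on the VM's Ceph-backed disk
```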

Then we plugged in the 100G cables and I moved the Ceph IPs from the 1G NIC to one of the 100G ports (will eventually bond two 100G for failover/speed, but starting simple).
aaaand this is what I get
WRITE: bw=138MiB/s (145MB/s), 34.6MiB/s-34.7MiB/s (36.3MB/s-36.3MB/s), io=4096MiB (4295MB), run=29543-29604msec

Disk stats (read/write):
dm-0: ios=0/10579, merge=0/0, ticks=0/6294053, in_queue=6294200, util=98.46%, aggrios=0/9163, aggrmerge=0/1413, aggrticks=0/5415964, aggrin_queue=5415960, aggrutil=98.46%
sda: ios=0/9163, merge=0/1413, ticks=0/5415964, in_queue=5415960, util=98.46%

Barely any improvement.
BUT - using iperf to test server to server on that NIC I get: 94.1 Gbit/sec.

So - the 100G interfaces are in fact delivering 100G, but somehow Ceph is not seeing it. On my other clusters, using 100G but 6G SATA SSDs (and less than half the number of OSDs), I get 577 MB/s, which is just about the speed of the EVO 870 SSDs in that cluster.

Any thoughts here? This is one of those "the simple crap breaks and the complex stuff works flawlessly".
Here is my ceph config (all IPs are private so no security issue)
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.180.195.211/24
fsid = d316d239-221c-495c-b59c-11b89120c17d
mon_allow_pool_delete = true
mon_host = 10.180.194.211 10.180.194.212 10.180.194.213 10.180.194.214
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.180.194.211/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.node21]
host = node21
mds_standby_for_name = pve

[mds.node22]
host = node22
mds_standby_for_name = pve

[mds.node23]
host = node23
mds_standby_for_name = pve

[mon.node21]
public_addr = 10.180.194.211

[mon.node22]
public_addr = 10.180.194.212

[mon.node23]
public_addr = 10.180.194.213

[mon.node24]
public_addr = 10.180.194.214

and interfaces (with public IPs xx'd out)

auto lo
iface lo inet loopback

auto nic0_1G
iface nic0_1G inet static
address 10.180.192.211/24
#internal/backup

iface nic1_1G inet manual

auto nic2_11G
iface nic2_11G inet static
address 10.180.195.211/24
#cluster

iface nic3_1G inet manual

auto nic4_100G00
iface nic4_100G00 inet static
address 10.180.194.211/24
#CEPH

iface nic5_100G01 inet manual

iface nic6_100G10 inet manual

iface nic7_100G11 inet manual

auto vmbr0
iface vmbr0 inet static
address xx.xx.xx.xx/26
gateway xx.xx.xx.xx
bridge-ports nic1_1G
bridge-stp off
bridge-fd 0

auto vmbr2
iface vmbr2 inet static
address 10.180.191.211/24
bridge-ports nic7_100G11
bridge-stp off
bridge-fd 0

source /etc/network/interfaces.d/*
 
One more note: after I moved the Ceph interface from one NIC to the other, Ceph seemed fine but the virtual bridges for the VMs were not working. Ended up rebooting the servers one at a time; then those came back.
All tests done after the server reboot.
 
May have found it. Based on the install manual I had left pg_num on auto, which resulted in 48 PGs... for 66 850GB drives!
I just set it to 4096 and will test after it finishes rebuilding.
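For reference, the classic rule of thumb from the Ceph docs is roughly (OSDs x 100) / replica count, rounded up to a power of two. A quick sketch (the 100-PGs-per-OSD target and the round-up-to-power-of-two behavior are assumptions based on the old pgcalc guidance, not something stated in this thread):

```python
def suggested_pg_num(osds: int, replicas: int, pgs_per_osd: int = 100) -> int:
    """Rule of thumb: (OSDs * target PGs per OSD) / replica count,
    rounded up to the next power of two."""
    raw = osds * pgs_per_osd // replicas
    # round up to the next power of two
    return 1 << (raw - 1).bit_length()

# 66 OSDs, size=3 pool: raw = 2200, next power of two = 4096
print(suggested_pg_num(66, 3))  # -> 4096
```

Which lines up with the 4096 chosen above; 48 PGs across 66 OSDs leaves most drives nearly idle.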
 
Well, increasing pg_num to 4096 helped, but not enough.
Now getting:
WRITE: bw=6598MiB/s (6918MB/s), 217MiB/s-708MiB/s (228MB/s-743MB/s), io=476GiB (511GB), run=73787-73845msec

Disk stats (read/write):
dm-0: ios=0/33834, merge=0/0, ticks=0/12202032, in_queue=12202030, util=99.77%, aggrios=0/32873, aggrmerge=0/955, aggrticks=0/11576479, aggrin_queue=11576457, aggrutil=99.77%
sda: ios=0/32873, merge=0/955, ticks=0/11576479, in_queue=11576457, util=99.77%

But on the old cluster, with slower/fewer OSDs, I get:
WRITE: bw=12.3GiB/s (13.3GB/s), 618MiB/s-979MiB/s (648MB/s-1027MB/s), io=805GiB (864GB), run=62597-65199msec

Disk stats (read/write):
dm-0: ios=0/88709, merge=0/0, ticks=0/10806199, in_queue=10807338, util=99.61%, aggrios=0/84180, aggrmerge=0/4483, aggrticks=0/10070944, aggrin_queue=10070794, aggrutil=99.63%
sda: ios=0/84180, merge=0/4483, ticks=0/10070944, in_queue=10070794, util=99.63%
So: 13.3 GB/s on slower hardware vs 6.9 GB/s on faster hardware with more OSDs.
 
cluster_network = 10.180.195.211/24
public_network = 10.180.194.211/24

auto nic2_11G
iface nic2_11G inet static
address 10.180.195.211/24
#cluster

auto nic4_100G00
iface nic4_100G00 inet static
address 10.180.194.211/24
#CEPH

Either use only the public network or move the cluster network to one of the other 100G ports (on all nodes, of course).

[0] https://docs.ceph.com/en/squid/rados/configuration/network-config-ref
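For example, dropping the separate cluster network so OSD replication also runs over the 100G link could look like this (the /24 network-address form and removal of cluster_network are an assumed way to express the fix, per the network config reference above):

```ini
[global]
# everything over the 100G NIC; no separate cluster network
public_network = 10.180.194.0/24
# cluster_network removed - OSD replication then also uses public_network
```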
 
Move your cluster_network IP onto the 100G NIC too. (cluster_network is used for OSD replication when defined, so it's limiting your write speed.)

Code:
auto nic4_100G00
iface nic4_100G00 inet static
    address 10.180.194.211/24
    address 10.180.195.211/24
#CEPH
 