So... I've been using Proxmox for many years (almost 20) and have several existing server clusters.
Just building a new server stack: 6x Dell R760, 768G RAM, dual 32-core CPUs. Each server has 11x 850G 24Gbps SAS SSDs, 4x 1G NICs, plus 2x dual-port 100G NICs.
The drives are rated at 3,400 MB/s.
Built the cluster using the 1G NICs because I did not have the 100G cables yet. No production - just building things out.
SO - testing the server-to-server speed on the 1G delivered exactly what was expected:
iperf showed 0.95 Gbit/sec on the 1G NICs being used for Ceph.
Using fio to test disk speed in a VM with its disk on Ceph delivered:
WRITE: bw=115MiB/s (121MB/s), 28.8MiB/s-48.5MiB/s (30.2MB/s-50.9MB/s), io=4096MiB (4295MB), run=21115-35574msec
which I thought was pretty good over 1G
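In fact that's essentially wire speed. Quick sanity math (a throwaway one-liner, nothing cluster-specific, ignoring protocol overhead):

```shell
# Theoretical ceiling of a 1 Gbit/s link in fio's units:
# 1e9 bits/s -> bytes/s -> MiB/s
awk 'BEGIN { printf "1 Gbit/s ~ %.0f MiB/s\n", 1e9 / 8 / 1048576 }'
# prints: 1 Gbit/s ~ 119 MiB/s
```

So 115 MiB/s through Ceph over a 1G link is about as good as it gets.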
Then we plugged in the 100G cables and I moved the Ceph IPs from the 1G NIC to one of the 100G ports (we'll eventually bond two 100G links for failover/speed, but starting simple).
aaaand this is what I get:
WRITE: bw=138MiB/s (145MB/s), 34.6MiB/s-34.7MiB/s (36.3MB/s-36.3MB/s), io=4096MiB (4295MB), run=29543-29604msec
Disk stats (read/write):
dm-0: ios=0/10579, merge=0/0, ticks=0/6294053, in_queue=6294200, util=98.46%, aggrios=0/9163, aggrmerge=0/1413, aggrticks=0/5415964, aggrin_queue=5415960, aggrutil=98.46%
sda: ios=0/9163, merge=0/1413, ticks=0/5415964, in_queue=5415960, util=98.46%
barely any improvement.
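For reference, I didn't paste the exact fio command; the output shape above (io=4096MiB total, four per-job bandwidth figures in the min-max range) is what you'd get from a 4-job sequential write along these lines - job name, block size, and target path here are illustrative, not the exact invocation:

```ini
; illustrative fio jobfile, not the exact one used - a 4-job
; sequential write totalling 4 GiB, matching the io=4096MiB above
[global]
ioengine=libaio
direct=1
rw=write
bs=4M
size=1G
numjobs=4

[ceph-disk]
; target a directory on the Ceph-backed disk inside the VM
directory=/mnt/cephtest
```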
BUT - using iperf to test server-to-server on that NIC, I get 94.1 Gbit/sec.
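Same conversion for the 100G side, just to put the two numbers next to each other (plain arithmetic, ignoring overhead):

```shell
# What 94.1 Gbit/s of raw network is in fio's units:
awk 'BEGIN { printf "94.1 Gbit/s ~ %.0f MiB/s\n", 94.1e9 / 8 / 1048576 }'
# prints: 94.1 Gbit/s ~ 11218 MiB/s
```

i.e. the network could carry roughly 80x the 138 MiB/s fio is seeing.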
So the 100G interfaces are in fact delivering 100G, but somehow Ceph is not seeing it. On my other clusters, using 100G but 6G SATA SSDs (and less than half the number of OSDs), I get 577 MB/s, which is just about the speed of the EVO 870 SSDs in that cluster.
Any thoughts here? This is one of those "the simple crap works flawlessly and the complex stuff breaks" situations.
Here is my Ceph config (all IPs are private, so no security issue):
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.180.195.211/24
fsid = d316d239-221c-495c-b59c-11b89120c17d
mon_allow_pool_delete = true
mon_host = 10.180.194.211 10.180.194.212 10.180.194.213 10.180.194.214
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.180.194.211/24
[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring
[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring
[mds.node21]
host = node21
mds_standby_for_name = pve
[mds.node22]
host = node22
mds_standby_for_name = pve
[mds.node23]
host = node23
mds_standby_for_name = pve
[mon.node21]
public_addr = 10.180.194.211
[mon.node22]
public_addr = 10.180.194.212
[mon.node23]
public_addr = 10.180.194.213
[mon.node24]
public_addr = 10.180.194.214
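(One thing I know about the above: public_network and cluster_network are conventionally written as the network address rather than a host address - as far as I know Ceph masks the host bits either way, so I don't think it matters, but noting it. The conventional form would be:)

```ini
; conventional form - same /24s, written as network addresses
cluster_network = 10.180.195.0/24
public_network = 10.180.194.0/24
```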
and my network interfaces (with public IPs xx'd out):
auto lo
iface lo inet loopback
auto nic0_1G
iface nic0_1G inet static
address 10.180.192.211/24
#internal/backup
iface nic1_1G inet manual
auto nic2_11G
iface nic2_11G inet static
address 10.180.195.211/24
#cluster
iface nic3_1G inet manual
auto nic4_100G00
iface nic4_100G00 inet static
address 10.180.194.211/24
#CEPH
iface nic5_100G01 inet manual
iface nic6_100G10 inet manual
iface nic7_100G11 inet manual
auto vmbr0
iface vmbr0 inet static
address xx.xx.xx.xx/26
gateway xx.xx.xx.xx
bridge-ports nic1_1G
bridge-stp off
bridge-fd 0
auto vmbr2
iface vmbr2 inet static
address 10.180.191.211/24
bridge-ports nic7_100G11
bridge-stp off
bridge-fd 0
source /etc/network/interfaces.d/*