Ceph OSD Performance is Slow?

hanturaya

Hi guys,
I'm currently testing Ceph in Proxmox. I've followed the documentation and configured Ceph.

I have 3 identical nodes, configured as follows:
CPU: 16 x Intel Xeon Bronze @ 1.90GHz (2 Sockets)
RAM: 32 GB DDR4 2133 MHz
Boot/Proxmox Disk: Patriot Burst SSD 240GB
Disk: 3x HGST 10TB HDD SAS
NIC1: 1 GbE used for Corosync
NIC2: 2x 10GbE bonded with LACP for Ceph traffic

Before that, I tested each disk one by one using fio with this command:
Code:
fio --ioengine=libaio --filename=/dev/sdx --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio

These are the results (they are similar for every disk on each server).
For 4K block size:
[screenshot: fio 4K result]
For 4M block size:
[screenshot: fio 4M result]
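For the 4M run, presumably the same command was used with only the block size changed, i.e. something like:
Code:
fio --ioengine=libaio --filename=/dev/sdx --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio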


After that I set up Ceph and created one OSD per server (1 disk each), but the speed dropped:
[screenshot: Ceph benchmark result]

rados -p test bench 30 write
Code:
Total time run:         30.3825
Total writes made:      939
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     123.624
Stddev Bandwidth:       11.7765
Max bandwidth (MB/sec): 148
Min bandwidth (MB/sec): 100
Average IOPS:           30
Stddev IOPS:            2.94412
Max IOPS:               37
Min IOPS:               25
Average Latency(s):     0.514558
Stddev Latency(s):      0.255017
Max latency(s):         1.72565
Min latency(s):         0.124276
rados -p test bench 30 write -b 4K -t 1
Code:
Total time run:         30.0195
Total writes made:      3146
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     0.40937
Stddev Bandwidth:       0.0648556
Max bandwidth (MB/sec): 0.53125
Min bandwidth (MB/sec): 0.238281
Average IOPS:           104
Stddev IOPS:            16.603
Max IOPS:               136
Min IOPS:               61
Average Latency(s):     0.0095316
Stddev Latency(s):      0.00568832
Max latency(s):         0.047987
Min latency(s):         0.00263956
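For anyone reproducing these numbers, the test pool is assumed to be a plain replicated pool, created and cleaned up roughly like this (the PG count of 128 is only an example):
Code:
# create a replicated pool for benchmarking (PG count is an example)
ceph osd pool create test 128 128
# remove the benchmark objects afterwards
rados -p test cleanup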


My question is: is this really the best OSD speed I can get with my current configuration?
 
I forgot to mention that this topology is connected to a Cisco Nexus 3064 with the following configuration.

For each port:
Code:
interface Ethernet1/1
  description str1-enp129s0f0
  lacp rate fast
  switchport access vlan 459
  channel-group 1 mode active
interface Ethernet1/2
  description str1-enp129s0f1
  lacp rate fast
  switchport access vlan 459
  channel-group 1 mode active
interface Ethernet1/3
  description str2-enp129s0f0
  lacp rate fast
  switchport access vlan 459
  channel-group 2 mode active
interface Ethernet1/4
  description str2-enp129s0f1
  lacp rate fast
  switchport access vlan 459
  channel-group 2 mode active
interface Ethernet1/5
  description str3-enp129s0f0
  lacp rate fast
  switchport access vlan 459
  channel-group 3 mode active
interface Ethernet1/6
  description str3-enp129s0f1
  lacp rate fast
  switchport access vlan 459
  channel-group 3 mode active

For each bond:
Code:
interface port-channel1
  description bonding-str1
  switchport access vlan 459
  spanning-tree bpduguard enable
  spanning-tree bpdufilter enable
  no negotiate auto
interface port-channel2
  description bonding-str2
  switchport access vlan 459
  spanning-tree bpduguard enable
  spanning-tree bpdufilter enable
  no negotiate auto

interface port-channel3
  description bonding-str3
  switchport access vlan 459
  spanning-tree bpduguard enable
  spanning-tree bpdufilter enable
  no negotiate auto

The MTU is set to 9000.
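For completeness, the matching LACP bond on the Proxmox side would look roughly like this in /etc/network/interfaces (the NIC names follow the switch port descriptions; the IP address is only a placeholder, not the real one):
Code:
# Ceph network: LACP bond to the Nexus port-channel (access VLAN 459)
auto bond0
iface bond0 inet static
        address 192.168.100.11/24
        bond-slaves enp129s0f0 enp129s0f1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        bond-lacp-rate fast
        bond-miimon 100
        mtu 9000
Jumbo frames can be verified end to end with something like ping -M do -s 8972 <peer IP> (8972 = 9000 minus the 28 bytes of IP/ICMP headers).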
 
Hello spirit, thank you for your answer.

"do you use some cache with a raid controller ?"

I've tried my best to avoid any cache being used in my test. I ran this command before the test and changed some settings on my RAID controller (I'm using a Dell R740xd with an H730P RAID controller).

Drop cache:
Code:
sync; echo 3 > /proc/sys/vm/drop_caches

Disable cache for Non-RAID disks (I'm using HBA/pass-through mode):
[screenshot: controller cache setting]

Disable write cache:
[screenshot: write cache setting]
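The drive-level write cache state can also be double-checked from the OS; for SAS disks something like the following should work (the device name is just an example):
Code:
# SAS/SCSI disks: WCE=0 means the volatile write cache is disabled
sdparm --get=WCE /dev/sdb
# SATA disks: reports whether write-caching is currently on or off
hdparm -W /dev/sdb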

But the test still shows that I get around 3500 IOPS.

"because you shouldn't have more than 150-200 iops for 1 hdd disk"
So the rados test was right, and my disks are only capable of reaching that level of performance?

Thank you
 
In my opinion this low performance is expected.
Ceph needs more OSDs to improve performance.

Your cluster has only 3 nodes, which is the default minimum requirement.
By the way, bonding both NICs for Ceph traffic is not recommended; you can use one for the Public Network and the other for the Cluster Network, roughly as sketched below.
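As an example, the split would look roughly like this in /etc/pve/ceph.conf (the subnets are placeholders, not your real ones):
Code:
[global]
        # client and monitor traffic
        public_network = 192.168.10.0/24
        # OSD replication and heartbeat traffic
        cluster_network = 192.168.20.0/24
The OSDs need to be restarted to pick up a changed cluster_network.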



I manage a cluster with:

4 nodes
CPU: 2x E5-2690 v2
RAM: 192 GB
Cache tier: 1x Samsung PM1725b 1.6TB
HDD tier: 10x Toshiba MG06 8TB
 
Hello kenneth, thank you for the answer.

"you can use one for the Public Network and the other for the Cluster Network"
Will separating the cluster network and the public network improve Ceph OSD performance?

"Your cluster has only 3 nodes, which is the default minimum requirement."
Actually, right now I have 6 servers with identical hardware. This is the test with 6 servers, each running 1 OSD:
[screenshot]
This is the result I get:
rados -p test bench 30 write -b 4K -t 1
Code:
Total time run:         30.0309
Total writes made:      2412
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     0.31374
Stddev Bandwidth:       0.0318584
Max bandwidth (MB/sec): 0.398438
Min bandwidth (MB/sec): 0.261719
Average IOPS:           80
Stddev IOPS:            8.15574
Max IOPS:               102
Min IOPS:               67
Average Latency(s):     0.0124404
Stddev Latency(s):      0.00842863
Max latency(s):         0.0689223
Min latency(s):         0.00282277

rados -p test bench 30 write
Code:
Total time run:         30.4837
Total writes made:      1427
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     187.247
Stddev Bandwidth:       17.7798
Max bandwidth (MB/sec): 212
Min bandwidth (MB/sec): 148
Average IOPS:           46
Stddev IOPS:            4.44494
Max IOPS:               53
Min IOPS:               37
Average Latency(s):     0.338962
Stddev Latency(s):      0.245017
Max latency(s):         1.54223
Min latency(s):         0.050242
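One thing that can help narrow this down is checking whether a single slow OSD is dragging the averages down; Ceph has a built-in per-OSD write test (osd.0 below is only an example):
Code:
# writes 1 GiB in 4 MiB objects to one OSD and reports the throughput
ceph tell osd.0 bench
# shows per-OSD commit/apply latencies
ceph osd perf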

"In my opinion this low performance is expected."
What do you think the problem is here?

Thank you
 
  • In the SUSE Enterprise Storage 7 documentation, there are 2 different views:

https://documentation.suse.com/ses/7/html/ses-all/storage-bp-hwreq.html#storage-bp-net-private
Code:
If you do not specify a cluster network during Ceph deployment, it assumes a single public network environment. While Ceph operates fine with a public network, its performance and security improves when you set a second private cluster network. To support two networks, each Ceph node needs to have at least two network cards.

https://documentation.suse.com/ses/7/html/ses-all/storage-bp-hwreq.html#ses-bp-minimum-cluster
Code:
A minimal product cluster configuration consists of:
At least four physical nodes (OSD nodes) with co-location of services
Dual-10 Gb Ethernet as a bonded network
A separate Admin Node (can be virtualized on an external node)



  • In the PVE documentation:

https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster
Code:
Public Network: You can set up a dedicated network for Ceph. This setting is required. Separating your Ceph traffic is highly recommended. Otherwise, it could cause trouble with other latency dependent services, for example, cluster communication may decrease Ceph’s performance.

Cluster Network: As an optional step, you can go even further and separate the OSD replication & heartbeat traffic as well. This will relieve the public network and could lead to significant performance improvements, especially in large clusters.



Considering you have only one 2-port 10GbE card, you can test the performance of both configurations.

"In my opinion this low performance is expected" is based on the various references in these documents; you need more OSDs. The SUSE 7 minimum-cluster recommendation is 4 nodes, each with 8 OSDs.

In simple clusters or small production clusters, low 4K I/O performance is normal. Improving it needs powerful CPUs, a good switch, Optane SSDs for the WAL/RocksDB, and a lot of clients to push the stress test, as in the sketch below.
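If you later add a fast device for the RocksDB/WAL, Proxmox can place the DB there when the OSD is created, roughly like this (the device names are only examples):
Code:
# HDD as the data device, NVMe device/partition for RocksDB + WAL
pveceph osd create /dev/sdb -db_dev /dev/nvme0n1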

English is not my first language, so my grammar may be wrong.
 
Hello kenneth, sorry for the late question. Just for my reference, could you show me your cluster's performance? Thank you.
 
I have very similar fio results to yours, and I would like to understand if there is anything I can change to improve performance. I am running a 3-node cluster with a very similar configuration.
 
