Enterprise SSDs, Dell R730xd servers, 20Gbps link, yet Ceph IOPS are still too low

jsengupta

Hi,

We are running a 3-node Proxmox Ceph cluster and getting really low IOPS from the VMs, around 4,000 to 5,000.

Host 1:
Server Model: Dell R730xd
Ceph network: 10Gbps x2 (LACP configured)
SSD: Kingston DC500M 1.92TB x3
Storage Controller: PERC H730mini
RAM: 192GB
CPU: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz x2

Host2:
Server Model: Dell R730xd
Ceph network: 10Gbps x2 (LACP configured)
SSD: Kingston DC500M 1.92TB x3
Storage Controller: HBA330mini
RAM: 192GB
CPU: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz x2


Host3:
Server Model: Dell R730xd
Ceph network: 10Gbps x2 (LACP configured)
SSD: Kingston DC500M 1.92TB x2
Storage Controller: HBA330mini
RAM: 192GB
CPU: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz x2


We never see more than about 4Gbps of traffic on the 20Gbps LACP link.

Could anyone please guide us on what is going wrong here? Why are we not achieving at least 20K IOPS?

Please note that we are also running another Ceph pool on HDD drives. We have segregated the SSD pool using the class-based CRUSH rules that Proxmox provides.
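For reference, a class-restricted replicated rule like ours can be created with the standard Ceph CLI; the rule name below is just an example of the form (our actual rules are shown later in the thread):

ceph osd crush rule create-replicated only-SSD default host ssd # replicated rule limited to OSDs with device class ssd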


Thanks in advance.
 
This is what we are getting from the SSD pool:

root@host3:~# rados bench -p Ceph-SSD-Pool1 10 write --no-cleanup -b 4096 -t 10
hints = 1
Maintaining 10 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_host5_1675810
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 10 2412 2402 9.38231 9.38281 0.0104754 0.00415354
2 10 4882 4872 9.51481 9.64844 0.0041712 0.00409497
3 10 7426 7416 9.65529 9.9375 0.00369216 0.00403751
4 10 9956 9946 9.71187 9.88281 0.00335676 0.00401399
5 10 12335 12325 9.62787 9.29297 0.00353398 0.00405086
6 10 14871 14861 9.67408 9.90625 0.00415697 0.0040307
7 10 17374 17364 9.68866 9.77734 0.003889 0.00402553
8 10 19891 19881 9.70639 9.83203 0.00318856 0.00401828
9 10 21989 21979 9.53835 8.19531 0.00286752 0.00408926
10 3 24056 24053 9.39458 8.10156 0.00249983 0.00415232
Total time run: 10.0052
Total writes made: 24056
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 9.39199
Stddev Bandwidth: 0.692647
Max bandwidth (MB/sec): 9.9375
Min bandwidth (MB/sec): 8.10156
Average IOPS: 2404
Stddev IOPS: 177.318
Max IOPS: 2544
Min IOPS: 2074
Average Latency(s): 0.00415226
Stddev Latency(s): 0.00215413
Max latency(s): 0.0815412
Min latency(s): 0.00202285
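(As a sanity check on those numbers: 2,404 IOPS x 4 KiB per write ≈ 9.4 MB/s, which matches the reported bandwidth, so the benchmark is at least internally consistent; it is just slow.)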




I don't know why we are getting such low IOPS. Please guide us.
 
You may want to change the VM disk cache to "none". I got a significant increase in IOPS compared to writeback.

I also have WCE (write-cache enabled) on the SAS drives. Set it with "sdparm --set WCE --save /dev/sd[x]"
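If it helps, something like the following applies both changes; the VM ID, disk slot, and volume name are placeholders for your own setup, and you can read the current write-cache flag before changing anything:

qm set 100 --scsi0 your-storage:vm-100-disk-0,cache=none # set the disk cache mode on a VM disk
sdparm --get WCE /dev/sd[x] # check whether the drive's volatile write cache is currently enabled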
 
@jsengupta Hello, we have a similar setup and I am trying to find a solution to a similar problem. We are using 2x Ethernet 10G 2P X520 Adapter NICs, so we have four 10Gb ports: a 20Gb bond for the cluster network and a 20Gb bond for the public network. Each NIC handles half of each bond. One NIC sits on CPU1 and the other on CPU2, so they are on different NUMA nodes, which may be a problem. I am trying to find out whether this kind of setup is OK or whether we need to bond interfaces on the same NUMA node.
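For anyone checking the same thing, the NUMA node a NIC is attached to can be read from sysfs (the interface name below is just an example; a value of -1 means no NUMA locality is reported):

cat /sys/class/net/eno1/device/numa_node # NUMA node of the device behind this interface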
 
You may want to change the VM disk cache to "none". I got a significant increase in IOPS compared to writeback.

I also have WCE (write-cache enabled) on the SAS drives. Set it with "sdparm --set WCE --save /dev/sd[x]"
The rados benchmark is giving 2,500 IOPS, and that is before setting up any VM. Does it really depend on the settings you mentioned?
 
Well, I have pretty much the same situation and I don't think the IOPS are an issue for me. So is there anything that is really bothering you, or does the low IOPS number just make you wonder why?
 
And I noticed you have one H730 Mini and two HBA330 Minis. Did you put them all in HBA mode yet? RAID mode has bad performance for Ceph.
 
All of them are in HBA mode.

As you said, the only problem is the very low IOPS. That is why we cannot put any databases on this pool.

If you take a look at the specification of the SSD we are using here, you will see that each of the disks should give you 75,000 random write IOPS.

We are running 8 of these disks, and what are we getting? 2,500 write IOPS out of this 8-SSD pool! That is what shocks me.
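(Even allowing for 3x replication, a naive back-of-the-envelope figure would be 8 x 75,000 / 3 ≈ 200,000 aggregate write IOPS, so the raw drives should not be the limiting factor.)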

I wonder where the possible bottleneck could be?

Below is the IOPS specification of the disk:
[screenshot: Kingston DC500M datasheet IOPS figures]
 
PERC H730mini
The datasheet claims an LSI SAS 3108 (1200MT/s, dual-core PowerPC 476, 12Gbps); that's a modern chipset with very good performance.
Kingston DC500M
Those are reasonable enterprise drives, as you said, 75K random IOPS...

Your LACP-bonded 10Gbps NICs (20Gbps aggregate throughput) should be more than sufficient.

While performing these benchmarks, what kind of throughput does the GUI for Ceph display? (Datacentre > Node > Ceph)
Can you also retest while monitoring the network throughput? (apt update ; apt install bmon) and try bmon.
E5-2660 v3
Those are reasonable CPUs, but not the best. To be safe, verify that no individual cores are "pinned" at 100% usage ((apt update ; apt install htop) and use htop to view the cores). Consider that with Ceph you are not offloading computational tasks to that SAS 3108 chip; instead you are forcing the CPU and memory to do the processing, and it is much more involved than standard RAID. If your boot volumes are ZFS, also consider limiting the amount of memory consumed by the ARC cache; the default is currently half of host memory, even for small volumes that do not need it...
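For example, if ZFS were in use, the ARC can be capped via a module option (the 8 GiB value is only illustrative; the setting takes effect after the initramfs is rebuilt and the host rebooted):

echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf # cap ARC at 8 GiB
update-initramfs -u -k all # rebuild the initramfs so the option applies at boot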

In the meantime, I'll try to think about other issues. For now, I can't think of anything else obvious ¯\_(ツ)_/¯. Posting your Ceph config would be helpful, including the Proxmox GUI view of your Ceph config (the section where you configure the number of replicas, to be precise).
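A few standard commands that would capture the relevant bits (nothing Proxmox-specific beyond the config path):

cat /etc/pve/ceph.conf # the cluster-wide config Proxmox manages
ceph osd pool ls detail # size / min_size / pg_num per pool
ceph osd crush rule dump # the CRUSH rules in use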

Cheers,

Tmanok
 
Hi,

First of all, thank you for your kind and informative reply.

I have run the rados bench command once again.

During the execution of the command the Ceph dashboard gives the following output:
[screenshot: Ceph dashboard throughput during the benchmark]


However, I cannot figure out how to monitor the cores with htop because of the number of cores each server has. Can you please help? I am getting the following output during the benchmark:

[screenshot: htop output during the benchmark]


We are not running ZFS anywhere. The OS is on hardware RAID on Host1; on the other two hosts, the OS is on a single disk.


As you requested,

CRUSH map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class ssd
device 10 osd.10 class hdd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class hdd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host host1 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
id -13 class ssd # do not change unnecessarily
# weight 11.108
alg straw2
hash 0 # rjenkins1
item osd.1 weight 2.024
item osd.10 weight 1.819
item osd.0 weight 2.024
item osd.16 weight 1.747
item osd.17 weight 1.747
item osd.18 weight 1.747
}
host host2 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
id -14 class ssd # do not change unnecessarily
# weight 5.868
alg straw2
hash 0 # rjenkins1
item osd.2 weight 2.024
item osd.3 weight 2.024
item osd.8 weight 1.819
}
host host3 {
id -7 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
id -15 class ssd # do not change unnecessarily
# weight 9.917
alg straw2
hash 0 # rjenkins1
item osd.4 weight 2.024
item osd.5 weight 2.024
item osd.6 weight 2.024
item osd.7 weight 2.024
item osd.14 weight 1.819
}
host host5 {
id -9 # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
id -16 class ssd # do not change unnecessarily
# weight 3.493
alg straw2
hash 0 # rjenkins1
item osd.9 weight 1.747
item osd.11 weight 1.747
}
host host4 {
id -11 # do not change unnecessarily
id -12 class hdd # do not change unnecessarily
id -17 class ssd # do not change unnecessarily
# weight 5.240
alg straw2
hash 0 # rjenkins1
item osd.12 weight 1.747
item osd.13 weight 1.747
item osd.15 weight 1.747
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
id -18 class ssd # do not change unnecessarily
# weight 35.626
alg straw2
hash 0 # rjenkins1
item host1 weight 11.108
item host2 weight 5.868
item host3 weight 9.917
item host5 weight 3.493
item host4 weight 5.240
}

# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule only-HDD {
id 1
type replicated
min_size 1
max_size 10
step take default class hdd
step chooseleaf firstn 0 type host
step emit
}
rule only-SSD {
id 2
type replicated
min_size 1
max_size 10
step take default class ssd
step chooseleaf firstn 0 type host
step emit
}

# end crush map



Configuration:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 172.32.0.0/24
fsid = cf5312a1-c3ea-46ad-a06f-7bf662396aa3
mon_allow_pool_delete = true
mon_host = 172.32.0.1 172.32.0.2 172.32.0.3 172.32.0.4 172.32.0.5
ms_bind_ipv4 = true
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 172.32.0.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.host1]
public_addr = 172.32.0.1

[mon.host2]
public_addr = 172.32.0.2

[mon.host3]
public_addr = 172.32.0.3

[mon.host4]
public_addr = 172.32.0.4

[mon.host5]
public_addr = 172.32.0.5
 
I have additionally tested Proxmox Ceph on my laptop. The laptop has an NVMe WDC PC SN530 SDBPMPZ-512G-1101, and I am getting around 90,000 write IOPS from the installed Windows 11.

I have now spun up 3 Proxmox nodes using Oracle VirtualBox and created a Ceph cluster with a 10GB disk from each Proxmox node, all backed by that NVMe. I am getting 1,100 write IOPS.
[screenshot: benchmark result from the VirtualBox test cluster]

The Ceph network is connected using VirtualBox's host-only adapter.

The installed Windows 11 gives 90,000 write IOPS.
The 3-node Proxmox cluster gives 1,100 write IOPS.

I know that running 3 nodes on a Windows host generates load. Still, should I not get at least 20,000 write IOPS?

Why such a drastic drop in IOPS?
 
Do you really understand what you are benchmarking? You can't calculate Ceph IOPS from SSD IOPS.

Read this https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516/ .
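(To put rough numbers on it using your own rados bench output above: with -t 10 there are only 10 writes in flight, and the average latency is about 4 ms, so the ceiling is roughly 10 / 0.004 s ≈ 2,500 IOPS regardless of how fast the individual SSDs are.)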
There are 2 classes of drives in our environment, HDD and SSD, and 2 pools have been created based on class-based CRUSH replication rules. If I benchmark IOPS against the class-based pools, am I not doing it the correct way?

What I am doing so far is shown below:

rados bench -p Ceph-SSD-Pool1 10 write --no-cleanup -b 4096 -t 10 # For SSD based pool
rados bench -p ceph-storage1 10 write --no-cleanup -b 4096 -t 10 # For HDD based pool
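I could also rerun with a deeper queue to see whether the pool scales with concurrency rather than being latency-bound (same command, just a higher -t; I have not tried this yet):

rados bench -p Ceph-SSD-Pool1 10 write --no-cleanup -b 4096 -t 64 # SSD pool, 64 writes in flight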


Portion of the CRUSH rules where the pools are segregated based on device class:
rule only-HDD {
id 1
type replicated
min_size 1
max_size 10
step take default class hdd
step chooseleaf firstn 0 type host
step emit
}
rule only-SSD {
id 2
type replicated
min_size 1
max_size 10
step take default class ssd
step chooseleaf firstn 0 type host
step emit
}

[screenshot: Ceph pool configuration in the Proxmox GUI]

If I am not doing it the correct way, can you please guide me?

Our goal is to achieve good write IOPS with the SSDs. Is that not possible with the hardware we have? If we need to tweak any configuration, what would it be?
 
This is a year later, but I'm curious whether you ever figured out your issue. I have a similar collection of Dell R730s and have also been getting oddly slow IOPS... not even sure where to begin...
 
