Ceph reading and writing performance problems, fast reading and slow writing

Otter7721 · Feb 2, 2023

Hello, we need to migrate all cloud environments to Proxmox. At present, I am evaluating and testing Proxmox+Ceph+OpenStack.

But now we are facing the following difficulties:

After migrating from VMware vSAN to Ceph, I found that an hdd+ssd setup performs very poorly in Ceph, especially for writes. Performance is far below vSAN.
In an all-flash setup, Ceph's sequential write performance is lower than that of a single disk, even a single mechanical hard disk.
With an hdd+ssd setup using bcache, Ceph's sequential write performance is far lower than that of a single hard disk.

Please forgive my poor English.

Test server parameters (this is not important)

CPU：Dual Intel® Xeon® E5-2698Bv3

Memory： 8 x 16G DDR3

Dual 1 Gbit NIC：Realtek Semiconductor Co., Ltd. RTL8111/8168/8411

Disk：

1 x 500G NVME SAMSUNG MZALQ512HALU-000L1 (It is also the ssd-data Thinpool in PVE)

1 x 500G SATA WDC_WD5000AZLX-60K2TA0 (Physical machine system disk)

2 x 500G SATA WDC_WD5000AZLX-60K2TA0

1 x 1T SATA ST1000LM035-1RK172

PVE：pve-manager/7.3-4/d69b70d4 (running kernel: 5.15.74-1-pve)

Network Configure：

enp4s0 (OVS Port) -> vmbr0 (OVS Bridge) -> br0mgmt (192.168.1.3/24,192.168.1.1)

enp5s0 (OVS Port,MTU=9000) -> vmbr1 (OVS Bridge,MTU=9000)

vmbr2 (OVS Bridge,MTU=9000)

Test virtual machine parameters x 3 (three virtual machines are the same parameters)

CPU：32 (1 sockets, 32 cores) [host]

Memory：32G

Disk：

1 x local-lvm:vm-101-disk-0,iothread=1,size=32G

2 x ssd-data:vm-101-disk-0,iothread=1,size=120G

Network Device：

net0: bridge=vmbr0,firewall=1

net1: bridge=vmbr2,firewall=1,mtu=1 (Ceph Cluster/Public Network)

net2: bridge=vmbr0,firewall=1

net3: bridge=vmbr0,firewall=1

Network Configure：

ens18 (net0,OVS Port) -> vmbr0 (OVS Bridge) -> br0mgmt (10.10.1.11/24,10.10.1.1)

ens19 (net1,OVS Port,MTU=9000) -> vmbr1 (OVS Bridge,MTU=9000) -> br1ceph (192.168.10.1/24,MTU=9000)

ens20 (net2,Network Device,Active=No)

ens21 (net3,Network Device,Active=No)

Benchmarking tools

fio
fio-cdm (https://github.com/xlucn/fio-cdm)

When fio-cdm is run without parameters, it uses the fio job file shown below.

It can be printed with 'python fio-cdm -f -'.

Code:

[global]
ioengine=libaio
filename=.fio_testmark
directory=/root
size=1073741824.0
direct=1
runtime=5
refill_buffers
norandommap
randrepeat=0
allrandrepeat=0
group_reporting

[seq-read-1m-q8-t1]
rw=read
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall

[seq-write-1m-q8-t1]
rw=write
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall

[seq-read-1m-q1-t1]
rw=read
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[seq-write-1m-q1-t1]
rw=write
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[rnd-read-4k-q32-t16]
rw=randread
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall

[rnd-write-4k-q32-t16]
rw=randwrite
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall

[rnd-read-4k-q1-t1]
rw=randread
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[rnd-write-4k-q1-t1]
rw=randwrite
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

Environment construction steps

Code:

# prepare tools
root@pve01:~# apt update -y && apt upgrade -y
root@pve01:~# apt install fio git -y
root@pve01:~# git clone https://github.com/xlucn/fio-cdm.git

# create test block
root@pve01:~# rbd create test -s 20G
root@pve01:~# rbd map test
root@pve01:~# mkfs.xfs /dev/rbd0
root@pve01:~# mkdir /mnt/test
root@pve01:/mnt# mount /dev/rbd0 /mnt/test

# start test
root@pve01:/mnt/test# python3 ~/fio-cdm/fio-cdm

Environmental test

Network Bandwidth

Code:

root@pve01:~# apt install iperf3 -y
root@pve01:~# iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 10.10.1.12, port 52968
[  5] local 10.10.1.11 port 5201 connected to 10.10.1.12 port 52972
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.87 GBytes  16.0 Gbits/sec                 
[  5]   1.00-2.00   sec  1.92 GBytes  16.5 Gbits/sec                 
[  5]   2.00-3.00   sec  1.90 GBytes  16.4 Gbits/sec                 
[  5]   3.00-4.00   sec  1.90 GBytes  16.3 Gbits/sec                 
[  5]   4.00-5.00   sec  1.85 GBytes  15.9 Gbits/sec                 
[  5]   5.00-6.00   sec  1.85 GBytes  15.9 Gbits/sec                 
[  5]   6.00-7.00   sec  1.70 GBytes  14.6 Gbits/sec                 
[  5]   7.00-8.00   sec  1.75 GBytes  15.0 Gbits/sec                 
[  5]   8.00-9.00   sec  1.89 GBytes  16.2 Gbits/sec                 
[  5]   9.00-10.00  sec  1.87 GBytes  16.0 Gbits/sec                 
[  5]  10.00-10.04  sec  79.9 MBytes  15.9 Gbits/sec                 
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  18.6 GBytes  15.9 Gbits/sec                  receiver
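To compare the iperf3 figure with the MB/s columns that fio-cdm reports, the bitrate can be converted to a byte rate. This is a minimal sketch, assuming decimal units (1 Gbit = 10^9 bits, 1 MB = 10^6 bytes) and ignoring protocol overhead; the function name is made up for illustration.

```python
def gbit_to_mbyte_per_s(gbit_per_s: float) -> float:
    """Convert a link rate in Gbit/s to MB/s (decimal units, no protocol overhead)."""
    return gbit_per_s * 1000 / 8


# Average receiver bitrate from the iperf3 run above.
measured = gbit_to_mbyte_per_s(15.9)
# Nominal rate of a 10 Gbit link, for comparison.
ten_gbit = gbit_to_mbyte_per_s(10)

print(f"iperf3 result: {measured:.1f} MB/s")
print(f"10 Gbit link:  {ten_gbit:.1f} MB/s")
```

The iperf3 run above connects the 10.10.1.x addresses, so the same conversion would need to be applied to a separate measurement on the 192.168.10.x Ceph network to get the ceiling for Ceph traffic.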

Jumbo Frames

Code:

root@pve01:~# ping -M do -s 8000 192.168.10.2
PING 192.168.10.2 (192.168.10.2) 8000(8028) bytes of data.
8008 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=1.51 ms
8008 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=0.500 ms
^C
--- 192.168.10.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.500/1.007/1.514/0.507 ms
root@pve01:~#
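The "8000(8028)" in the ping output is the ICMP payload plus headers. A small sketch of that arithmetic, assuming IPv4 without IP options (20-byte header) and the standard 8-byte ICMP header; the helper names are illustrative only.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def packet_size(payload: int) -> int:
    """Total IPv4 packet size for a given `ping -s` payload."""
    return payload + IPV4_HEADER + ICMP_HEADER


def max_ping_payload(mtu: int) -> int:
    """Largest `ping -s` value that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


print(packet_size(8000))       # packet size for the test above
print(max_ping_payload(9000))  # largest payload for MTU 9000
print(max_ping_payload(1500))  # largest payload for the default MTU
```

Under these assumptions, a payload of 8000 already proves that frames larger than 1500 bytes pass, while a payload of 8972 would exercise the full 9000-byte MTU.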

Benchmark category

Physical Disk Benchmark
Single osd, single server benchmark
Multiple OSDs, single server benchmarks
Multiple OSDs, multiple server benchmarks

Benchmark results (Ceph and the system have not been tuned, and bcache acceleration has not been used)

1. Physical Disk Benchmark (Test sequence is 4)

step.

Code:

root@pve1:~# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0 465.8G  0 disk
├─sda1                         8:1    0  1007K  0 part
├─sda2                         8:2    0   512M  0 part /boot/efi
└─sda3                         8:3    0 465.3G  0 part
  ├─pve-root                 253:0    0    96G  0 lvm  /
  ├─pve-data_tmeta           253:1    0   3.5G  0 lvm 
  │ └─pve-data-tpool         253:3    0 346.2G  0 lvm 
  │   ├─pve-data             253:4    0 346.2G  1 lvm 
  │   └─pve-vm--100--disk--0 253:5    0    16G  0 lvm 
  └─pve-data_tdata           253:2    0 346.2G  0 lvm 
    └─pve-data-tpool         253:3    0 346.2G  0 lvm 
      ├─pve-data             253:4    0 346.2G  1 lvm 
      └─pve-vm--100--disk--0 253:5    0    16G  0 lvm 
sdb                            8:16   0 931.5G  0 disk
sdc                            8:32   0 465.8G  0 disk
sdd                            8:48   0 465.8G  0 disk
nvme0n1                      259:0    0 476.9G  0 disk
root@pve1:~# mkfs.xfs /dev/nvme0n1 -f
root@pve1:~# mkdir /mnt/nvme
root@pve1:~# mount /dev/nvme0n1 /mnt/nvme
root@pve1:~# cd /mnt/nvme/

result.

Code:

root@pve1:/mnt/nvme# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/nvme 3.4GiB/476.7GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     2361.95|     1435.48|
|SEQ1M Q1 T1 |     1629.84|     1262.63|
|RND4K Q32T16|      954.86|     1078.88|
|. IOPS      |   233119.53|   263398.08|
|. latency us|     2194.84|     1941.78|
|RND4K Q1 T1 |       55.56|      225.06|
|. IOPS      |    13565.49|    54946.21|
|. latency us|       72.76|       16.97|

2. Single osd, single server benchmark (Test sequence is 3)

Set osd_pool_default_min_size and osd_pool_default_size to 1 in ceph.conf, then run systemctl restart ceph.target and fix any errors that come up.

step.

Code:

root@pve01:/mnt/test# ceph osd pool get rbd size
size: 2
root@pve01:/mnt/test# ceph config set global  mon_allow_pool_size_one true
root@pve01:/mnt/test# ceph osd pool set rbd min_size 1
set pool 2 min_size to 1
root@pve01:/mnt/test# ceph osd pool set rbd size 1 --yes-i-really-mean-it
set pool 2 size to 1

result

Code:

root@pve01:/mnt/test# ceph -s
  cluster:
    id:     1f3eacc8-2488-4e1a-94bf-7181ee7db522
    health: HEALTH_WARN
            2 pool(s) have no replicas configured
 
  services:
    mon: 3 daemons, quorum pve01,pve02,pve03 (age 17m)
    mgr: pve01(active, since 17m), standbys: pve02, pve03
    osd: 6 osds: 1 up (since 19s), 1 in (since 96s)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 281 objects, 1.0 GiB
    usage:   1.1 GiB used, 119 GiB / 120 GiB avail
    pgs:     33 active+clean
 
root@pve01:/mnt/test# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         0.70312  root default                             
-3         0.23438      host pve01                           
 0    ssd  0.11719          osd.0       up   1.00000  1.00000
 1    ssd  0.11719          osd.1     down         0  1.00000
-5         0.23438      host pve02                           
 2    ssd  0.11719          osd.2     down         0  1.00000
 3    ssd  0.11719          osd.3     down         0  1.00000
-7         0.23438      host pve03                           
 4    ssd  0.11719          osd.4     down         0  1.00000
 5    ssd  0.11719          osd.5     down         0  1.00000
root@pve01:/mnt/test# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1153.07|      515.29|
|SEQ1M Q1 T1 |      447.35|      142.98|
|RND4K Q32T16|       99.07|       32.19|
|. IOPS      |    24186.26|     7859.91|
|. latency us|    21148.94|    65076.23|
|RND4K Q1 T1 |        7.47|        1.48|
|. IOPS      |     1823.24|      360.98|
|. latency us|      545.98|     2765.23|
root@pve01:/mnt/test#

3. Multiple OSDs, single server benchmarks (Test sequence is 2)

Change the CRUSH map rule from "step chooseleaf firstn 0 type host" to "step chooseleaf firstn 0 type osd".

OSD tree

Code:

root@pve01:/etc/ceph# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         0.70312  root default                             
-3         0.23438      host pve01                           
 0    ssd  0.11719          osd.0       up   1.00000  1.00000
 1    ssd  0.11719          osd.1       up   1.00000  1.00000
-5         0.23438      host pve02                           
 2    ssd  0.11719          osd.2     down         0  1.00000
 3    ssd  0.11719          osd.3     down         0  1.00000
-7         0.23438      host pve03                           
 4    ssd  0.11719          osd.4     down         0  1.00000
 5    ssd  0.11719          osd.5     down         0  1.00000

result

Code:

root@pve01:/mnt/test# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1376.59|      397.29|
|SEQ1M Q1 T1 |      442.74|      111.41|
|RND4K Q32T16|      114.97|       29.08|
|. IOPS      |    28068.12|     7099.90|
|. latency us|    18219.04|    72038.06|
|RND4K Q1 T1 |        6.82|        1.04|
|. IOPS      |     1665.27|      254.40|
|. latency us|      598.00|     3926.30|

4. Multiple OSDs, multiple server benchmarks (Test sequence is 1)

OSD tree

Code:

root@pve01:/etc/ceph# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         0.70312  root default                             
-3         0.23438      host pve01                           
 0    ssd  0.11719          osd.0       up   1.00000  1.00000
 1    ssd  0.11719          osd.1       up   1.00000  1.00000
-5         0.23438      host pve02                           
 2    ssd  0.11719          osd.2       up   1.00000  1.00000
 3    ssd  0.11719          osd.3       up   1.00000  1.00000
-7         0.23438      host pve03                           
 4    ssd  0.11719          osd.4       up   1.00000  1.00000
 5    ssd  0.11719          osd.5       up   1.00000  1.00000

result

Code:

tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1527.37|      296.25|
|SEQ1M Q1 T1 |      408.86|      106.43|
|RND4K Q32T16|      189.20|       43.00|
|. IOPS      |    46191.94|    10499.01|
|. latency us|    11068.93|    48709.85|
|RND4K Q1 T1 |        4.99|        0.95|
|. IOPS      |     1219.16|      232.37|
|. latency us|      817.51|     4299.14|
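The IOPS and latency rows in these tables are tied together by the number of outstanding I/Os (Little's law). The sketch below assumes that fio keeps iodepth x numjobs requests in flight for the whole run; the function name is illustrative, and the input numbers are the write columns from the tables above.

```python
def expected_latency_us(iops: float, iodepth: int, numjobs: int) -> float:
    """Mean latency implied by Little's law: outstanding I/Os / completion rate."""
    outstanding = iodepth * numjobs
    return outstanding / iops * 1_000_000


# (label, iodepth, numjobs, write IOPS, reported write latency in us)
rows = [
    ("NVMe RND4K Q32T16", 32, 16, 263398.08, 1941.78),
    ("Ceph RND4K Q32T16", 32, 16, 10499.01, 48709.85),
    ("Ceph RND4K Q1 T1", 1, 1, 232.37, 4299.14),
]

for label, iodepth, numjobs, iops, reported in rows:
    estimate = expected_latency_us(iops, iodepth, numjobs)
    print(f"{label}: estimated {estimate:.0f} us, reported {reported:.0f} us")
```

At queue depth 1 the IOPS figure is simply the inverse of the per-write latency, so about 232 IOPS corresponds to roughly 4.3 ms for each 4k write to be acknowledged.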

Conclusions

The gap between Ceph's write performance (106.43 MB/s at SEQ1M Q1 T1) and the physical disk's (1262.63 MB/s) is huge, and at RND4K Q1 T1 the Ceph write result is down at the level of a mechanical hard disk.
The number of OSDs and the number of machines have little impact on Ceph performance (my cluster may simply be too small).
Compared with the raw disk, the three-node Ceph cluster roughly halves read performance and cuts write performance to a quarter or less.
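The size of the gap can be put in numbers by dividing each Ceph result by the matching raw-disk result. A minimal sketch using the values from the physical disk table (section 1) and the three-node table (section 4); the helper name is illustrative.

```python
# (read MB/s, write MB/s) copied from the result tables above.
nvme = {
    "SEQ1M Q8 T1": (2361.95, 1435.48),
    "SEQ1M Q1 T1": (1629.84, 1262.63),
    "RND4K Q1 T1": (55.56, 225.06),
}
ceph = {
    "SEQ1M Q8 T1": (1527.37, 296.25),
    "SEQ1M Q1 T1": (408.86, 106.43),
    "RND4K Q1 T1": (4.99, 0.95),
}


def retained_pct(ceph_value: float, raw_value: float) -> float:
    """Share of raw-disk throughput that is still available through Ceph."""
    return 100 * ceph_value / raw_value


for name in nvme:
    read_pct = retained_pct(ceph[name][0], nvme[name][0])
    write_pct = retained_pct(ceph[name][1], nvme[name][1])
    print(f"{name}: read {read_pct:.1f}% of raw, write {write_pct:.1f}% of raw")
```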

APPENDIX

Because of the post length limit, the appendix is in the next post.

Finally, the question I want to know is:

How can the write performance problem in Ceph be fixed? Can Ceph match the performance of VMware vSAN?
The results show that the all-flash setup does not perform as well as hdd+ssd. If I do not use bcache, what should I do to fix Ceph's all-flash performance?
Is there a better solution for the hdd+ssd architecture?

Otter7721 · Feb 2, 2023

APPENDIX

1 - Some ssd benchmark results

Micron_1100_MTFDDAK1T0TB SCSI Disk Device

Code:

G:\fio>python "E:\Programing\PycharmProjects\fio-cdm\fio-cdm"
tests: 5, size: 1.0GiB, target: G:\fio 228.2GiB/953.8GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |      363.45|      453.54|
|SEQ1M Q1 T1 |      329.47|      404.09|
|RND4K Q32T16|      196.16|      212.42|
|. IOPS      |    47890.44|    51861.48|
|. latency us|    10677.71|     9862.74|
|RND4K Q1 T1 |       20.66|       65.44|
|. IOPS      |     5044.79|    15976.40|
|. latency us|      197.04|       61.07|

SAMSUNG MZALQ512HALU-000L1

Code:

root@pve1:/mnt/test# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 3.4GiB/476.7GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     2358.84|     1476.54|
|SEQ1M Q1 T1 |     1702.19|     1291.18|
|RND4K Q32T16|      955.34|     1070.17|
|. IOPS      |   233238.46|   261273.09|
|. latency us|     2193.90|     1957.79|
|RND4K Q1 T1 |       55.04|      229.99|
|. IOPS      |    13437.11|    56149.97|
|. latency us|       73.17|       16.65|

2 - bcache

Test results of the hdd+ssd Ceph setup accelerated by bcache. READ has improved significantly, but WRITE is still very poor.

Code:

tests: 5, size: 1.0GiB, target: /mnt/test 104.3MiB/10.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1652.93|      242.41|
|SEQ1M Q1 T1 |      552.91|       81.16|
|RND4K Q32T16|      429.52|       31.95|
|. IOPS      |   104862.76|     7799.72|
|. latency us|     4879.87|    65618.50|
|RND4K Q1 T1 |       13.10|        0.45|
|. IOPS      |     3198.16|      110.09|
|. latency us|      310.07|     9077.11|

Even multiple osds on one disk cannot solve the WRITE problem

Detailed test data: https://www.reddit.com/r/ceph/comments/xnse2j/comment/j6qs57g/?context=3

With VMware vSAN, I can easily bring hdd speed up to ssd level, and I can hardly notice that the hdds are there (I haven't compared it in detail; this is just my impression).

3 - Analysis of test reports from other sources

I analyzed and compared several reports, and the summary is as follows

Proxmox-VE_Ceph-Benchmark-201802.pdf

Proxmox-VE_Ceph-Benchmark-202009-rev2.pdf

Dell_R730xd_RedHat_Ceph_Performance_SizingGuide_WhitePaper.pdf

micron_9300_and_red_hat_ceph_reference_architecture.pdf

1) - pve 201802

According to the report, the test scale is 6 x Server, each server with 4 x Samsung SM863 Series, 2.5", 240 GB SSD, SATA-3 (6 Gb/s) MLC.

Code:

# Samsung SM863 Series, 2.5", 240 GB SSD
# from https://www.samsung.com/us/business/support/owners/product/sm863-series-240gb/
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ?M Q? T? |      520.00|      485.00|
|RND4K Q? T? |           ?|           ?|
|. IOPS      |    97000.00|    20000.00|

Report result display

Code:

# 3 Node Cluster/ 4 x Samsung SM863 as OSD per Node
# rados bench 60 write -b 4M -t 16
# rados bench 60 read -t 16 (uses 4M from write)
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ4M Q? T16|     1064.42|      789.12| (10 Gbit Network)
|SEQ4M Q? T16|     3087.82|     1011.63| (100 Gbit Network)

The impact of network bandwidth on performance is huge. Although performance on the 10 Gbit network is limited, read and write throughput at least approach the bandwidth limit. In my test results, however, WRITE is very bad (296.25MB/s).

2) - pve 202009

According to the report, the test scale is 3 x Server; Each server 4 x Micron 9300 Max 3.2 TB (MTFDHAL3T2TDR); 1 x 100 GbE DACs, in a full-mesh topology

Code:

# Micron 9300 Max 3.2 TB (MTFDHAL3T2TDR)
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ128KQ32T?|     3500.00|     3100.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|RND4K Q512T?|     3340.00|      840.00| (Estimate according to formula, throughput ~= iops * 4k / 1000)
|. IOPS      |   835000.00|   210000.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|------------|------------|------------|
|RND4K Q1 T1 |            |      205.82| (from the report)
|. IOPS      |            |    51000.00| (from the report)
|. latency ms|            |        0.02| (from the report)

Report result display

Code:

# MULTI-VM WORKLOAD (LINUX)
# I don't understand the difference between Thread and Job, and the queue depth is not identified in the document
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ4M Q? T1 |     7176.00|     2581.00| (SEQUENTIAL BANDWIDTH BY NUMBER OF JOBS)
|RND4K Q1 T1 |       86.00|       28.99| (Estimate according to formula)
|. IOPS      |    21502.00|     7248.00| (RANDOM IO/S BY NUMBER OF JOBS)

Similarly, the RND4K Q1 T1 WRITE result is very bad: only about 7k IOPS, while the physical disk reaches 51k IOPS, which I find unacceptable.
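The rows marked "Estimate according to formula" in the tables above use throughput ~= iops * 4k / 1000. A short sketch of that conversion, taking 4k as 4 kB and 1 MB as 1000 kB as the formula does; the function name is illustrative.

```python
def throughput_mb_s(iops: float, block_kb: float = 4) -> float:
    """Estimate throughput in MB/s as IOPS x block size (kB) / 1000."""
    return iops * block_kb / 1000


# Data sheet IOPS for the Micron 9300 Max 3.2 TB (read, write).
print(throughput_mb_s(835000), throughput_mb_s(210000))
# IOPS from the multi-VM workload in the 202009 report (read, write).
print(throughput_mb_s(21502), throughput_mb_s(7248))
```

If 4k is taken as 4096 bytes instead, the estimates come out about 2.4% higher, which does not change the comparison.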

3) - Dell R730xd report

According to the report, the test scale is 5 x Storage Server; Each Server 12HDD+3SSD, 3 x replication, 2 x 10GbE NIC

Code:

# Test results extracted from the report
# Figure 8  Throughput/server comparison by using different configurations
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ4M Q64T1 |     1150.00|      300.00|

In this case, the SEQ4M Q64T1 WRITE result is only about 300MB/s, roughly twice that of a single SAS disk, that is, 2 x 158.16 MB/s (4M blocks). I find this hard to believe. It is even faster than my NVMe result. Another important fact, however, is that 12 * 5 = 60 HDDs deliver only 300MB/s of sequential write speed. Isn't this performance loss too large?

4) - Micron report

According to the report, the test scale is 3 x Storage Server; Each Server 10 x micron 9300MAX 12.8T, 2 x replication, 2 x 100GbE NIC

Code:

# micron 9300MAX 12.8T (MTFDHAL12T8TDR-1AT1ZABYY) Physical disk benchmark
|Name        |  Read(MB/s)| Write(MB/s)| (? is the parameter not given)
|------------|------------|------------|
|SEQ?M Q? T? |    48360.00|           ?| (from the report)
|SEQ128KQ32T?|     3500.00|     3500.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|RND4K Q512T?|     3400.00|     1240.00| (Estimate according to formula)
|. IOPS      |   850000.00|   310000.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|. latency us|       86.00|       11.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|------------|------------|------------|
|RND4K Q? T? |     8397.77|     1908.11| (Estimate according to formula)
|. IOPS      |  2099444.00|   477029.00| (from the report, Executive Summary)
|. latency ms|        1.50|        6.70| (from the report, Executive Summary)

Report result display

Code:

# (Test results extracted from the report)
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|RND4KQ32T100|           ?|           ?|
|. IOPS      |  2099444.00|   477029.00|
|. latency ms|        1.52|        6.71|
# (I don't know if there is a problem reported on the official website. There is no performance loss here)

Micron's official test platform is too high-end for small and medium-sized enterprises like ours to afford. From the results, WRITE is close to the performance of a single physical disk. Does that mean that with only a single node and a single disk, WRITE performance would drop to 477k/30=15.9k IOPS? If so, that would be SATA SSD performance.
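The 477k/30 division can be written out with the replication factor included. This is only a rough sketch: it assumes client writes are spread evenly over all 30 disks and that every client write is written once per replica, and it ignores any extra metadata or journal writes on the OSDs. The function name is illustrative.

```python
def per_disk_write_iops(client_iops: float, disks: int, replicas: int) -> tuple[float, float]:
    """Return (client IOPS per disk, backend writes per disk including replicas)."""
    client_share = client_iops / disks
    backend_share = client_iops * replicas / disks
    return client_share, backend_share


# Micron report: 3 servers x 10 disks, 2 x replication, 477029 write IOPS.
client_share, backend_share = per_disk_write_iops(477029, disks=3 * 10, replicas=2)
print(f"client IOPS per disk:    {client_share:.0f}")
print(f"backend writes per disk: {backend_share:.0f}")
```

Under these assumptions each disk handles about twice the 15.9k figure in backend writes, but the sketch says nothing about what a single-node, single-disk cluster would actually deliver.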

4 - More of the same questions

Finally, the question I want to know is:

How can the write performance problem in Ceph be fixed? Can Ceph match the performance of VMware vSAN?
The results show that the all-flash setup does not perform as well as hdd+ssd. If I do not use bcache, what should I do to fix Ceph's all-flash performance?
Is there a better solution for the hdd+ssd architecture?

Otter7721 · Feb 11, 2023

Does anyone have the same problem as me?

spirit · Feb 11, 2023

Well, yes, ceph is slower for write than read.

The main bottleneck is CPU usage, so you need the highest clock frequencies possible (on the OSD nodes, but also on the client nodes where the VMs are running). Ceph has a project named "crimson", a rewrite of the OSD from scratch to reduce CPU usage, but I don't think it will be ready for another 2 years.

Personally, for 1 VM with 4k randwrite and queue depth=1, I'm able to reach 5-7k IOPS for writes (and of course it scales with more queues, or more VMs in parallel).

Some tuning:

You can enable writeback on the VM. If you have small writes to the same object, they will be pushed together, so the Ceph CRUSH algorithm runs only once, and it's really faster (it works very well with small sequential writes).

You can try creating multiple OSDs per disk (for NVMe, 2-4 OSDs per NVMe are recommended). I don't think it's possible with the Proxmox GUI, but it's possible on the command line with "ceph-volume".

How many PGs do you have? If your cluster is empty for your benchmark, don't use PG autoscaling, because it will reduce the number of PGs to the minimum.

Disable some debug logging in ceph.conf, on both client and OSD nodes (this reduces CPU usage, so it improves IOPS).

Code:

[global]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0

spirit · Feb 11, 2023

also, try to disable c-state on cpu to always force max frequencies

GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable processor.max_cstate=1"

this should work for both intel && amd cpu.

Otter7721 · Feb 11, 2023

spirit said:
also, try to disable c-state on cpu to always force max frequencies

GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable processor.max_cstate=1"

this should work for both intel && amd cpu.

Thank you for your detailed answer. I'll try it right away

Otter7721 · Feb 11, 2023

spirit said:
also, try to disable c-state on cpu to always force max frequencies

GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable processor.max_cstate=1"

this should work for both intel && amd cpu.

I destroyed the test bench and reinstalled Proxmox 7.3 and Ceph Quincy. For this experiment I built an all-in-one cluster using only one NVMe. Four OSDs were created, and debug logging and PG autoscaling were disabled. RND4K Q1 T1 write reached 283 IOPS.

I don't know what else to do to improve this terrible 4k WRITE. I have been struggling with this for more than a month. We used VMware vSAN before.

test result

Code:

root@pve:~# rbd create test
root@pve:~# rbd map test
/dev/rbd0
root@pve:~# mkfs.xfs /dev/rbd0
root@pve:~# mkdir -p /mnt/test
root@pve:~# mount /dev/rbd0 /mnt/test
root@pve:~# cd /mnt/test
root@pve:/mnt/test# fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1504.26|      432.16|
|SEQ1M Q1 T1 |      502.07|      165.83|
|RND4K Q32T16|      467.55|       82.70|
|. IOPS      |   114149.05|    20189.40|
|. latency us|     4482.50|    25340.41|
|RND4K Q1 T1 |       11.86|        1.16|
|. IOPS      |     2894.42|      283.89|
|. latency us|      343.16|     3517.51|

fio conf

Code:

[global]
ioengine=libaio
filename=.fio_testmark
# directory=/root
size=1073741824.0
direct=1
runtime=5
refill_buffers
norandommap
randrepeat=0
allrandrepeat=0
group_reporting

[seq-read-1m-q8-t1]
rw=read
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall

[seq-write-1m-q8-t1]
rw=write
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall

[seq-read-1m-q1-t1]
rw=read
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[seq-write-1m-q1-t1]
rw=write
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[rnd-read-4k-q32-t16]
rw=randread
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall

[rnd-write-4k-q32-t16]
rw=randwrite
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall

[rnd-read-4k-q1-t1]
rw=randread
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[rnd-write-4k-q1-t1]
rw=randwrite
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

ceph.conf

Code:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 192.168.1.3/24
fsid = c83e1e24-1a29-40b5-b4e1-fa227aeb6458
mon_allow_pool_delete = true
mon_host =  192.168.1.3
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 2
public_network = 192.168.1.3/24

debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0


[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve]
public_addr = 192.168.1.3

osd tree

Code:

root@pve:~# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME      STATUS  REWEIGHT  PRI-AFF
-1         0.46558  root default                          
-3         0.46558      host pve                          
 0    ssd  0.11638          osd.0      up   1.00000  1.00000
 1    ssd  0.11638          osd.1      up   1.00000  1.00000
 2    ssd  0.11638          osd.2      up   1.00000  1.00000
 3    ssd  0.11638          osd.3      up   1.00000  1.00000
root@pve:~#

ceph cluster status

Code:

root@pve:~# ceph -s
  cluster:
    id:     c83e1e24-1a29-40b5-b4e1-fa227aeb6458
    health: HEALTH_OK
 
  services:
    mon: 1 daemons, quorum pve (age 24m)
    mgr: pve(active, since 24m)
    osd: 4 osds: 4 up (since 24m), 4 in (since 32m)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 280 objects, 1.0 GiB
    usage:   2.6 GiB used, 474 GiB / 477 GiB avail
    pgs:     33 active+clean


aaron · Feb 13, 2023

As @spirit already suggested, one reason why the performance isn't great, can be the number of PGs being too low.

A single pool (we can ignore the .mgr pool), with 4 OSDs should have 128 PGs:

https://old.ceph.com/pgcalc/
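The 128 figure can be reproduced with the usual rule of thumb behind the PG calculator: (target PGs per OSD x number of OSDs x share of data) / replica size, rounded to a power of two. The sketch below is an approximation under stated assumptions, not the calculator's actual code: the target of 100 PGs per OSD, the replica size of 3, and the rule of rounding up when the lower power of two is more than 25% below the raw value are all assumptions, and the function name is made up.

```python
def suggested_pg_num(osds: int, pool_size: int,
                     target_per_osd: int = 100, data_share: float = 1.0) -> int:
    """Rule-of-thumb PG count for one pool, rounded to a power of two."""
    raw = osds * target_per_osd * data_share / pool_size
    if raw < 1:
        return 1
    # Largest power of two that does not exceed the raw value.
    lower = 1 << (int(raw).bit_length() - 1)
    # Assumed rounding rule: step up if `lower` is more than 25% below `raw`.
    return lower if lower >= raw * 0.75 else lower * 2


print(suggested_pg_num(osds=4, pool_size=3))
```

With pool_size=2, as in the ceph.conf posted earlier in the thread, the same assumed rule would suggest 256 rather than 128.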

Ideally, you configure a target ratio on the rbd pool. Doesn't matter how much it is, as it is a weight for all pools, and if you only set it for the one pool, then it will always be 100%.
This way, the autoscaler should decide that it should have 128 PGs.
The other option would be to set it manually, but also to have the autoscaler set to something other than "on".
Without a target size or ratio, the autoscaler only takes the current usage of the pool into account. This can lead to PG numbers that are way too low on empty pools.

The other thing that might also cause performance problems is that you seem to be testing on a single server, with the CRUSH rule changed so that the failure domain is the OSD instead of the host?
Ceph benefits a lot from spreading the load over many OSDs and hosts. Doing it all on a single host could mean that the host itself costs you performance. I don't know how much, since the few times I did such a setup were to test some functionality, never the performance.

Otter7721 · Feb 13, 2023

aaron said:
As @spirit already suggested, one reason why the performance isn't great, can be the number of PGs being too low.

A single pool (we can ignore the .mgr pool), with 4 OSDs should have 128 PGs:
https://old.ceph.com/pgcalc/

Ideally, you configure a target ratio on the rbd pool. Doesn't matter how much it is, as it is a weight for all pools, and if you only set it for the one pool, then it will always be 100%.
This way, the autoscaler should decide that it should have 128 PGs.
The other option would be to set it manually, but also to have the autoscaler set to something other than "on".
Without a target size or ratio, the autoscaler will only take the current usage of the pool into account. This can lead to way to low PG numbers on empty pools.

The other thing that might also cause performance problems, is that you seem to test it on a single server with the CRUSH rule changed so that the failure domain is the OSD instead of host?
Ceph benefits a lot by spreading the load over many OSDs and hosts. Doing it all on a single host could mean, that the host might cost you performance. I don't have any experience how much since the only few times I did such a setup, was to test some functionality, but never the performance.

I have adjusted the number of PGs many times, to 128, 256, 64 and 32. In my tests, the sequential write speed of a single NVMe or of three NVMes never exceeded 500MB/s.
In the last test, I used one NVMe and set --osds-per-device=4 in ceph-volume lvm create. On that setup I tested 128, 64 and 32 PGs, and the results were not much different.
PG32 comes from the following picture:

I think the only way to get past this problem is to add more machines. Perhaps the smallest cluster cannot deliver even half of Ceph's performance.
Thank you for your detailed answer.

DC-CA1 · Apr 2, 2023

@Otter7721

Did you disable cephx in any of your tests?

Otter7721 · Apr 18, 2023

DC-CA1 said:
@Otter7721

did you disabled CEPHX in any of your test ?

I'm very sorry, it took me so long to see your message.
I did not disable cephx because all the steps I took to build a ceph cluster were reflected in the article, and no other actions were taken.

skliarie · May 20, 2024

We had poor Ceph performance because consumer-grade Samsung 860 QVO SSDs don't have PLP (power-loss protection), which happens to be crucial for Ceph:
https://www.reddit.com/r/ceph/comments/14ccg87/comment/jolgkuh/

We moved the drives behind a PERC H730 RAID card (as a bunch of RAID0 virtual volumes, one per drive), so that the card's write cache mitigates the lack of PLP on the drives.
