Ceph read/write performance problems: fast reads, slow writes

Otter7721 (Jan 29, 2023)

Hello, we need to migrate all of our cloud environments to Proxmox. At the moment I am evaluating and testing Proxmox + Ceph + OpenStack.

But we are currently facing the following difficulties:

  1. When migrating from VMware vSAN to Ceph, I found that an hdd+ssd setup performs very poorly in Ceph; write performance in particular is far below vSAN.
  2. On an all-flash setup, Ceph's sequential write performance is worse than that of a single SSD, and even worse than a single mechanical hard disk.
  3. With an hdd+ssd setup using bcache, Ceph's sequential write performance is still far lower than that of a single disk.
Please forgive my poor English.



Test server parameters (this is not important)​

CPU: Dual Intel® Xeon® E5-2698B v3

Memory: 8 x 16G DDR3

Dual 1 Gbit NIC: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411

Disk:

1 x 500G NVME SAMSUNG MZALQ512HALU-000L1 (It is also the ssd-data Thinpool in PVE)

1 x 500G SATA WDC_WD5000AZLX-60K2TA0 (Physical machine system disk)

2 x 500G SATA WDC_WD5000AZLX-60K2TA0

1 x 1T SATA ST1000LM035-1RK172

PVE: pve-manager/7.3-4/d69b70d4 (running kernel: 5.15.74-1-pve)

Network configuration:

enp4s0 (OVS Port) -> vmbr0 (OVS Bridge) -> br0mgmt (192.168.1.3/24,192.168.1.1)

enp5s0 (OVS Port,MTU=9000) -> vmbr1 (OVS Bridge,MTU=9000)

vmbr2 (OVS Bridge,MTU=9000)



Test virtual machine parameters x 3 (all three virtual machines use the same parameters)

CPU: 32 (1 socket, 32 cores) [host]

Memory: 32G

Disk:

1 x local-lvm:vm-101-disk-0,iothread=1,size=32G

2 x ssd-data:vm-101-disk-0,iothread=1,size=120G

Network Device:

net0: bridge=vmbr0,firewall=1

net1: bridge=vmbr2,firewall=1,mtu=1 (Ceph Cluster/Public Network)

net2: bridge=vmbr0,firewall=1

net3: bridge=vmbr0,firewall=1

Network configuration:

ens18 (net0,OVS Port) -> vmbr0 (OVS Bridge) -> br0mgmt (10.10.1.11/24,10.10.1.1)

ens19 (net1,OVS Port,MTU=9000) -> vmbr1 (OVS Bridge,MTU=9000) -> br1ceph (192.168.10.1/24,MTU=9000)

ens20 (net2,Network Device,Active=No)

ens21 (net3,Network Device,Active=No)



Benchmarking tools​

  1. fio
  2. fio-cdm (https://github.com/xlucn/fio-cdm)
For fio-cdm, when run without parameters, the fio job file it generates is as follows (obtained with 'python fio-cdm -f -'):

Code:
[global]
ioengine=libaio
filename=.fio_testmark
directory=/root
size=1073741824.0
direct=1
runtime=5
refill_buffers
norandommap
randrepeat=0
allrandrepeat=0
group_reporting

[seq-read-1m-q8-t1]
rw=read
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall

[seq-write-1m-q8-t1]
rw=write
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall

[seq-read-1m-q1-t1]
rw=read
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[seq-write-1m-q1-t1]
rw=write
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[rnd-read-4k-q32-t16]
rw=randread
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall

[rnd-write-4k-q32-t16]
rw=randwrite
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall

[rnd-read-4k-q1-t1]
rw=randread
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[rnd-write-4k-q1-t1]
rw=randwrite
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall




Environment construction steps​

Code:
# prepare tools
root@pve01:~# apt update -y && apt upgrade -y
root@pve01:~# apt install fio git -y
root@pve01:~# git clone https://github.com/xlucn/fio-cdm.git

# create test block
root@pve01:~# rbd create test -s 20G
root@pve01:~# rbd map test
root@pve01:~# mkfs.xfs /dev/rbd0
root@pve01:~# mkdir /mnt/test
root@pve01:/mnt# mount /dev/rbd0 /mnt/test

# start test
root@pve01:/mnt/test# python3 ~/fio-cdm/fio-cdm


Environment tests

  1. Network Bandwidth​

Code:
root@pve01:~# apt install iperf3 -y
root@pve01:~# iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 10.10.1.12, port 52968
[  5] local 10.10.1.11 port 5201 connected to 10.10.1.12 port 52972
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.87 GBytes  16.0 Gbits/sec                 
[  5]   1.00-2.00   sec  1.92 GBytes  16.5 Gbits/sec                 
[  5]   2.00-3.00   sec  1.90 GBytes  16.4 Gbits/sec                 
[  5]   3.00-4.00   sec  1.90 GBytes  16.3 Gbits/sec                 
[  5]   4.00-5.00   sec  1.85 GBytes  15.9 Gbits/sec                 
[  5]   5.00-6.00   sec  1.85 GBytes  15.9 Gbits/sec                 
[  5]   6.00-7.00   sec  1.70 GBytes  14.6 Gbits/sec                 
[  5]   7.00-8.00   sec  1.75 GBytes  15.0 Gbits/sec                 
[  5]   8.00-9.00   sec  1.89 GBytes  16.2 Gbits/sec                 
[  5]   9.00-10.00  sec  1.87 GBytes  16.0 Gbits/sec                 
[  5]  10.00-10.04  sec  79.9 MBytes  15.9 Gbits/sec                 
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  18.6 GBytes  15.9 Gbits/sec                  receiver

  2. Jumbo Frames

Code:
root@pve01:~# ping -M do -s 8000 192.168.10.2
PING 192.168.10.2 (192.168.10.2) 8000(8028) bytes of data.
8008 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=1.51 ms
8008 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=0.500 ms
^C
--- 192.168.10.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.500/1.007/1.514/0.507 ms
root@pve01:~#


Benchmark categories

  1. Physical disk benchmark
  2. Single OSD, single server benchmark
  3. Multiple OSDs, single server benchmark
  4. Multiple OSDs, multiple server benchmark


Benchmark results (Ceph and the system have not been tuned, and bcache acceleration has not been used)​

1. Physical Disk Benchmark (run 4th in the test sequence)

Steps:

Code:
root@pve1:~# lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0 465.8G  0 disk
├─sda1                         8:1    0  1007K  0 part
├─sda2                         8:2    0   512M  0 part /boot/efi
└─sda3                         8:3    0 465.3G  0 part
  ├─pve-root                 253:0    0    96G  0 lvm  /
  ├─pve-data_tmeta           253:1    0   3.5G  0 lvm 
  │ └─pve-data-tpool         253:3    0 346.2G  0 lvm 
  │   ├─pve-data             253:4    0 346.2G  1 lvm 
  │   └─pve-vm--100--disk--0 253:5    0    16G  0 lvm 
  └─pve-data_tdata           253:2    0 346.2G  0 lvm 
    └─pve-data-tpool         253:3    0 346.2G  0 lvm 
      ├─pve-data             253:4    0 346.2G  1 lvm 
      └─pve-vm--100--disk--0 253:5    0    16G  0 lvm 
sdb                            8:16   0 931.5G  0 disk
sdc                            8:32   0 465.8G  0 disk
sdd                            8:48   0 465.8G  0 disk
nvme0n1                      259:0    0 476.9G  0 disk
root@pve1:~# mkfs.xfs /dev/nvme0n1 -f
root@pve1:~# mkdir /mnt/nvme
root@pve1:~# mount /dev/nvme0n1 /mnt/nvme
root@pve1:~# cd /mnt/nvme/


Result:

Code:
root@pve1:/mnt/nvme# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/nvme 3.4GiB/476.7GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     2361.95|     1435.48|
|SEQ1M Q1 T1 |     1629.84|     1262.63|
|RND4K Q32T16|      954.86|     1078.88|
|. IOPS      |   233119.53|   263398.08|
|. latency us|     2194.84|     1941.78|
|RND4K Q1 T1 |       55.56|      225.06|
|. IOPS      |    13565.49|    54946.21|
|. latency us|       72.76|       16.97|


2. Single OSD, single server benchmark (run 3rd in the test sequence)

Modify ceph.conf to set osd_pool_default_min_size and osd_pool_default_size to 1, then run 'systemctl restart ceph.target' and clear any resulting warnings.
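For reference, a minimal sketch of that change (on PVE the shared config is /etc/pve/ceph.conf; note that these defaults only affect pools created afterwards, which is why the existing rbd pool is also resized in the step below):

Code:
# sketch only -- defaults for newly created pools
[global]
osd_pool_default_min_size = 1
osd_pool_default_size = 1

# then restart the Ceph daemons on every node
systemctl restart ceph.target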

Steps:
Code:
root@pve01:/mnt/test# ceph osd pool get rbd size
size: 2
root@pve01:/mnt/test# ceph config set global  mon_allow_pool_size_one true
root@pve01:/mnt/test# ceph osd pool set rbd min_size 1
set pool 2 min_size to 1
root@pve01:/mnt/test# ceph osd pool set rbd size 1 --yes-i-really-mean-it
set pool 2 size to 1


Result:
Code:
root@pve01:/mnt/test# ceph -s
  cluster:
    id:     1f3eacc8-2488-4e1a-94bf-7181ee7db522
    health: HEALTH_WARN
            2 pool(s) have no replicas configured
 
  services:
    mon: 3 daemons, quorum pve01,pve02,pve03 (age 17m)
    mgr: pve01(active, since 17m), standbys: pve02, pve03
    osd: 6 osds: 1 up (since 19s), 1 in (since 96s)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 281 objects, 1.0 GiB
    usage:   1.1 GiB used, 119 GiB / 120 GiB avail
    pgs:     33 active+clean
 
root@pve01:/mnt/test# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         0.70312  root default                             
-3         0.23438      host pve01                           
 0    ssd  0.11719          osd.0       up   1.00000  1.00000
 1    ssd  0.11719          osd.1     down         0  1.00000
-5         0.23438      host pve02                           
 2    ssd  0.11719          osd.2     down         0  1.00000
 3    ssd  0.11719          osd.3     down         0  1.00000
-7         0.23438      host pve03                           
 4    ssd  0.11719          osd.4     down         0  1.00000
 5    ssd  0.11719          osd.5     down         0  1.00000
root@pve01:/mnt/test# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1153.07|      515.29|
|SEQ1M Q1 T1 |      447.35|      142.98|
|RND4K Q32T16|       99.07|       32.19|
|. IOPS      |    24186.26|     7859.91|
|. latency us|    21148.94|    65076.23|
|RND4K Q1 T1 |        7.47|        1.48|
|. IOPS      |     1823.24|      360.98|
|. latency us|      545.98|     2765.23|
root@pve01:/mnt/test#




3. Multiple OSDs, single server benchmark (run 2nd in the test sequence)

Edit the CRUSH map and change 'step chooseleaf firstn 0 type host' to 'step chooseleaf firstn 0 type osd', so that replicas may be placed on different OSDs within a single host.
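For reference, the usual round-trip for that edit looks roughly like this (a sketch; file names are arbitrary):

Code:
# export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: "step chooseleaf firstn 0 type host" -> "step chooseleaf firstn 0 type osd"
# recompile and inject it back
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin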

OSD tree
Code:
root@pve01:/etc/ceph# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         0.70312  root default                             
-3         0.23438      host pve01                           
 0    ssd  0.11719          osd.0       up   1.00000  1.00000
 1    ssd  0.11719          osd.1       up   1.00000  1.00000
-5         0.23438      host pve02                           
 2    ssd  0.11719          osd.2     down         0  1.00000
 3    ssd  0.11719          osd.3     down         0  1.00000
-7         0.23438      host pve03                           
 4    ssd  0.11719          osd.4     down         0  1.00000
 5    ssd  0.11719          osd.5     down         0  1.00000

Result:
Code:
root@pve01:/mnt/test# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1376.59|      397.29|
|SEQ1M Q1 T1 |      442.74|      111.41|
|RND4K Q32T16|      114.97|       29.08|
|. IOPS      |    28068.12|     7099.90|
|. latency us|    18219.04|    72038.06|
|RND4K Q1 T1 |        6.82|        1.04|
|. IOPS      |     1665.27|      254.40|
|. latency us|      598.00|     3926.30|



4. Multiple OSDs, multiple server benchmark (run 1st in the test sequence)

OSD tree

Code:
root@pve01:/etc/ceph# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         0.70312  root default                             
-3         0.23438      host pve01                           
 0    ssd  0.11719          osd.0       up   1.00000  1.00000
 1    ssd  0.11719          osd.1       up   1.00000  1.00000
-5         0.23438      host pve02                           
 2    ssd  0.11719          osd.2       up   1.00000  1.00000
 3    ssd  0.11719          osd.3       up   1.00000  1.00000
-7         0.23438      host pve03                           
 4    ssd  0.11719          osd.4       up   1.00000  1.00000
 5    ssd  0.11719          osd.5       up   1.00000  1.00000

Result:
Code:
tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1527.37|      296.25|
|SEQ1M Q1 T1 |      408.86|      106.43|
|RND4K Q32T16|      189.20|       43.00|
|. IOPS      |    46191.94|    10499.01|
|. latency us|    11068.93|    48709.85|
|RND4K Q1 T1 |        4.99|        0.95|
|. IOPS      |     1219.16|      232.37|
|. latency us|      817.51|     4299.14|

Conclusions​

  1. There is a huge gap between Ceph's write performance (106.43 MB/s) and the physical disk's write performance (1262.63 MB/s); in the RND4K Q1 T1 test the NVMe effectively drops to mechanical-hard-disk levels.
  2. Whether there is one OSD or several, and one server or several, makes little difference to Ceph here (possibly because my cluster is too small).
  3. The three-node Ceph cluster roughly halves the disk's read performance and reduces its write performance to a quarter or less.

APPENDIX​

Due to the post length limit, the appendix is in a follow-up post below.

Finally, the question I want to know is:​

  1. How can I fix Ceph's write performance problem? Can Ceph achieve the same performance as VMware vSAN?
  2. The results show that the all-flash setup is no better than hdd+ssd. If I do not use bcache, what should I do to fix the performance of an all-flash Ceph cluster?
  3. Is there a better solution for the hdd+ssd architecture?
 

APPENDIX​

1 - Some ssd benchmark results​

Micron_1100_MTFDDAK1T0TB SCSI Disk Device

Code:
G:\fio>python "E:\Programing\PycharmProjects\fio-cdm\fio-cdm"
tests: 5, size: 1.0GiB, target: G:\fio 228.2GiB/953.8GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |      363.45|      453.54|
|SEQ1M Q1 T1 |      329.47|      404.09|
|RND4K Q32T16|      196.16|      212.42|
|. IOPS      |    47890.44|    51861.48|
|. latency us|    10677.71|     9862.74|
|RND4K Q1 T1 |       20.66|       65.44|
|. IOPS      |     5044.79|    15976.40|
|. latency us|      197.04|       61.07|


SAMSUNG MZALQ512HALU-000L1​

Code:
root@pve1:/mnt/test# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 3.4GiB/476.7GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     2358.84|     1476.54|
|SEQ1M Q1 T1 |     1702.19|     1291.18|
|RND4K Q32T16|      955.34|     1070.17|
|. IOPS      |   233238.46|   261273.09|
|. latency us|     2193.90|     1957.79|
|RND4K Q1 T1 |       55.04|      229.99|
|. IOPS      |    13437.11|    56149.97|
|. latency us|       73.17|       16.65|


2 - bcache​

Test results of the hdd+ssd mixed-disk Ceph setup accelerated by bcache. We can see that READ has improved significantly, but WRITE is still very poor.

Code:
tests: 5, size: 1.0GiB, target: /mnt/test 104.3MiB/10.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1652.93|      242.41|
|SEQ1M Q1 T1 |      552.91|       81.16|
|RND4K Q32T16|      429.52|       31.95|
|. IOPS      |   104862.76|     7799.72|
|. latency us|     4879.87|    65618.50|
|RND4K Q1 T1 |       13.10|        0.45|
|. IOPS      |     3198.16|      110.09|
|. latency us|      310.07|     9077.11|
Even multiple OSDs on one disk cannot solve the WRITE problem.
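For context, a bcache-backed OSD of the kind tested here is typically assembled roughly like this (a sketch with example device names, not the exact commands used for this test):

Code:
# HDD as backing device, SSD/NVMe partition as cache
make-bcache -B /dev/sdb -C /dev/nvme0n1p1
echo writeback > /sys/block/bcache0/bcache/cache_mode
# build the OSD on top of the cached device
ceph-volume lvm create --data /dev/bcache0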

Detailed test data: https://www.reddit.com/r/ceph/comments/xnse2j/comment/j6qs57g/?context=3

With VMware vSAN, I can easily accelerate HDDs with an SSD cache to the point where I can hardly tell the HDDs are there (I haven't compared it in detail, this is just my impression).



3 - Analysis of third-party test reports

I analyzed and compared several published benchmark reports; the summary is as follows:

Proxmox-VE_Ceph-Benchmark-201802.pdf

Proxmox-VE_Ceph-Benchmark-202009-rev2.pdf

Dell_R730xd_RedHat_Ceph_Performance_SizingGuide_WhitePaper.pdf

micron_9300_and_red_hat_ceph_reference_architecture.pdf



1) - pve 201802​

According to the report, the test setup was 6 x servers, each with 4 x Samsung SM863 Series 2.5" 240 GB SSDs (SATA-3, 6 Gb/s, MLC).

Code:
# Samsung SM863 Series, 2.5", 240 GB SSD
# from https://www.samsung.com/us/business/support/owners/product/sm863-series-240gb/
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ?M Q? T? |      520.00|      485.00|
|RND4K Q? T? |           ?|           ?|
|. IOPS      |    97000.00|    20000.00|
Reported results:

Code:
# 3 Node Cluster/ 4 x Samsung SM863 as OSD per Node
# rados bench 60 write -b 4M -t 16
# rados bench 60 read -t 16 (uses 4M from write)
|Name        |  Read(MB/s)| Write(MB/s)|
# 10 Gbit Network
|------------|------------|------------|
|SEQ4M Q? T16|     1064.42|      789.12|
# 100 Gbit Network
|------------|------------|------------|
|SEQ4M Q? T16|     3087.82|     1011.63|
We can see that the impact of network bandwidth on performance is huge. Although performance on the 10 Gbit network is limited, at least the read and write throughput approaches the bandwidth limit. Looking at my own test results, however, WRITE is very bad (296.25 MB/s).



2) - pve 202009​

According to the report, the test setup was 3 x servers; each server had 4 x Micron 9300 MAX 3.2 TB (MTFDHAL3T2TDR) and 1 x 100 GbE DAC, in a full-mesh topology.

Code:
# Micron 9300 Max 3.2 TB (MTFDHAL3T2TDR)
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ128KQ32T?|     3500.00|     3100.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|RND4K Q512T?|     3340.00|      840.00| (Estimate according to formula, throughput ~= iops * 4k / 1000)
|. IOPS      |   835000.00|   210000.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|------------|------------|------------|
|RND4K Q1 T1 |            |      205.82| (from the report)
|. IOPS      |            |    51000.00| (from the report)
|. latency ms|            |        0.02| (from the report)
Reported results:

Code:
# MULTI-VM WORKLOAD (LINUX)
# I don't understand the difference between Thread and Job, and the queue depth is not identified in the document
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ4M Q? T1 |     7176.00|     2581.00| (SEQUENTIAL BANDWIDTH BY NUMBER OF JOBS)
|RND4K Q1 T1 |       86.00|       28.99| (Estimate according to formula)
|. IOPS      |    21502.00|     7248.00| (RANDOM IO/S BY NUMBER OF JOBS)
Similarly, the RND4K Q1 T1 WRITE result is very poor: only 7k IOPS, while the physical disk can do 51k IOPS, which I find unacceptable.



3) - Dell R730xd report​

According to the report, the test setup was 5 x storage servers; each server had 12 HDDs + 3 SSDs, with 3x replication and 2 x 10 GbE NICs.

Code:
# Test results extracted from the report
# Figure 8  Throughput/server comparison by using different configurations
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ4M Q64T1 |     1150.00|      300.00|
In this case the WRITE result in the SEQ4M Q64T1 test is only about 300 MB/s, roughly twice that of a single SAS drive (2 x 158.16 MB/s with 4M blocks). I find this hard to believe; it's even faster than my NVMe result. On the other hand, 12 * 5 = 60 HDDs delivering only 300 MB/s of sequential writes; isn't that an enormous performance loss?



4) - Micron report​

According to the report, the test setup was 3 x storage servers; each server had 10 x Micron 9300 MAX 12.8 TB, with 2x replication and 2 x 100 GbE NICs.

Code:
# micron 9300MAX 12.8T (MTFDHAL12T8TDR-1AT1ZABYY) Physical disk benchmark
|Name        |  Read(MB/s)| Write(MB/s)| (? is the parameter not given)
|------------|------------|------------|
|SEQ?M Q? T? |    48360.00|           ?| (from the report)
|SEQ128KQ32T?|     3500.00|     3500.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|RND4K Q512T?|     3400.00|     1240.00| (Estimate according to formula)
|. IOPS      |   850000.00|   310000.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|. latency us|       86.00|       11.00| (MTFDHAL12T8TDR-1AT1ZABYY-Micron-LBGA-2022.pdf)
|------------|------------|------------|
|RND4K Q? T? |     8397.77|     1908.11| (Estimate according to formula)
|. IOPS      |  2099444.00|   477029.00| (from the report, Executive Summary)
|. latency ms|        1.50|        6.70| (from the report, Executive Summary)
Reported results:

Code:
# (Test results extracted from the report)
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|RND4KQ32T100|           ?|           ?|
|. IOPS      |  2099444.00|   477029.00|
|. latency ms|        1.52|        6.71|
# (I don't know if there is a problem reported on the official website. There is no performance loss here)

It has to be said that Micron's official test platform is far too high-end for small and medium-sized enterprises like us to afford. From the results, WRITE is close to the performance of a single physical disk. Does that mean that with only a single node and a single disk, WRITE performance would drop to 477k/30 = 15.9k IOPS? If so, that would be SATA SSD territory.



4 - More of the same questions​

  1. https://forum.proxmox.com/threads/bad-rand-read-write-i-o-proxmox-ceph.68404/#post-530006
  2. https://forum.proxmox.com/threads/c...mple-hardware-slow-writing.96697/#post-529524
  3. https://forum.proxmox.com/threads/bad-rand-read-write-i-o-proxmox-ceph.68404/#post-529520
  4. https://forum.proxmox.com/threads/bad-rand-read-write-i-o-proxmox-ceph.68404/#post-529486
  5. https://www.reddit.com/r/ceph/comme.../?utm_source=share&utm_medium=web2x&context=3
  6. https://www.reddit.com/r/ceph/comme.../?utm_source=share&utm_medium=web2x&context=3



Well, yes, Ceph is slower for writes than for reads.

The main bottleneck is CPU usage, so you want the highest frequencies possible (on the OSD nodes, but also on the client nodes where the VMs are running). Ceph has a project named "crimson", a rewrite of the OSD from scratch to reduce CPU usage, but I don't think it will be ready for another two years.

Personally, for 1 VM with 4k randwrite and queue depth = 1, I'm able to reach 5-7k IOPS for writes (and of course it scales with more queues, or more VMs in parallel).

Some tuning:

You can enable writeback caching on the VM disk. If you have small writes to the same object, they will be merged, so the Ceph CRUSH algorithm only runs once and it's really much faster (this works very well with small sequential writes).
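A sketch of setting that per disk (the VM ID, disk slot, storage and volume names here are only examples; the same option is available in the GUI under the disk's Cache setting):

Code:
qm set 101 --scsi0 ceph-rbd:vm-101-disk-0,cache=writeback,iothread=1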

You can try running multiple OSDs per disk (for NVMe, 2-4 OSDs per device is recommended). I don't think it's possible from the Proxmox GUI, but on the command line it's possible with "ceph-volume", as sketched below.
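A sketch of that with ceph-volume (device name is an example):

Code:
# preview first, then create 4 OSDs on one NVMe device
ceph-volume lvm batch --report --osds-per-device 4 /dev/nvme0n1
ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1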

How many PGs do you have? If your cluster is empty for the benchmark, don't use PG autoscaling, because it will reduce the number of PGs to the minimum.
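To check the current values (assuming the pool is called rbd):

Code:
ceph osd pool get rbd pg_num
ceph osd pool autoscale-status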

Disable some of the debug options in ceph.conf, on both the client and the OSD nodes (this reduces CPU usage and therefore improves IOPS):

Code:
[global]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0
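If restarting every daemon is inconvenient, the same kind of settings can usually also be applied at runtime (a sketch showing the pattern for two of the options, not the full list):

Code:
# persist in the monitor config database
ceph config set osd debug_osd 0/0
ceph config set osd debug_ms 0/0
# or push into the running OSD daemons immediately
ceph tell osd.* injectargs '--debug_osd=0/0 --debug_ms=0/0'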
 
Also, try disabling C-states on the CPU to always force maximum frequencies:

GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable processor.max_cstate=1"

This should work for both Intel and AMD CPUs.
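A sketch of applying that on a GRUB-booted PVE host (hosts booting via systemd-boot keep the command line in /etc/kernel/cmdline and use proxmox-boot-tool refresh instead):

Code:
# edit /etc/default/grub and set:
#   GRUB_CMDLINE_LINUX="idle=poll intel_idle.max_cstate=0 intel_pstate=disable processor.max_cstate=1"
update-grub
reboot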
 
Thank you for your detailed answer. I'll try it right away
 

I tore down the test bench and reinstalled Proxmox 7.3 with Ceph Quincy. For this experiment I built an all-in-one (single-node) cluster using only one NVMe drive, created four OSDs on it, and disabled debug logging and automatic PG scaling. In RND4K Q1 T1 I got 283 write IOPS.

I don't know what else to do to improve this terrible 4k WRITE performance. I have been wrestling with this for more than a month; we used VMware vSAN before.

test result
Code:
root@pve:~# rbd create test
root@pve:~# rbd map test
/dev/rbd0
root@pve:~# mkfs.xfs /dev/rbd0
root@pve:~# mkdir -p /mnt/test
root@pve:~# mount /dev/rbd0 /mnt/test
root@pve:~# cd /mnt/test
root@pve:/mnt/test# fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 175.8MiB/20.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1504.26|      432.16|
|SEQ1M Q1 T1 |      502.07|      165.83|
|RND4K Q32T16|      467.55|       82.70|
|. IOPS      |   114149.05|    20189.40|
|. latency us|     4482.50|    25340.41|
|RND4K Q1 T1 |       11.86|        1.16|
|. IOPS      |     2894.42|      283.89|
|. latency us|      343.16|     3517.51|

fio conf
Code:
[global]
ioengine=libaio
filename=.fio_testmark
# directory=/root
size=1073741824.0
direct=1
runtime=5
refill_buffers
norandommap
randrepeat=0
allrandrepeat=0
group_reporting

[seq-read-1m-q8-t1]
rw=read
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall

[seq-write-1m-q8-t1]
rw=write
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall

[seq-read-1m-q1-t1]
rw=read
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[seq-write-1m-q1-t1]
rw=write
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[rnd-read-4k-q32-t16]
rw=randread
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall

[rnd-write-4k-q32-t16]
rw=randwrite
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall

[rnd-read-4k-q1-t1]
rw=randread
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[rnd-write-4k-q1-t1]
rw=randwrite
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

ceph.conf

Code:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 192.168.1.3/24
fsid = c83e1e24-1a29-40b5-b4e1-fa227aeb6458
mon_allow_pool_delete = true
mon_host =  192.168.1.3
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 2
public_network = 192.168.1.3/24

debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0


[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve]
public_addr = 192.168.1.3

osd tree
Code:
root@pve:~# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME      STATUS  REWEIGHT  PRI-AFF
-1         0.46558  root default                          
-3         0.46558      host pve                          
 0    ssd  0.11638          osd.0      up   1.00000  1.00000
 1    ssd  0.11638          osd.1      up   1.00000  1.00000
 2    ssd  0.11638          osd.2      up   1.00000  1.00000
 3    ssd  0.11638          osd.3      up   1.00000  1.00000
root@pve:~#

ceph cluster status
Code:
root@pve:~# ceph -s
  cluster:
    id:     c83e1e24-1a29-40b5-b4e1-fa227aeb6458
    health: HEALTH_OK
 
  services:
    mon: 1 daemons, quorum pve (age 24m)
    mgr: pve(active, since 24m)
    osd: 4 osds: 4 up (since 24m), 4 in (since 32m)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 280 objects, 1.0 GiB
    usage:   2.6 GiB used, 474 GiB / 477 GiB avail
    pgs:     33 active+clean

 
As @spirit already suggested, one reason why the performance isn't great can be that the number of PGs is too low.

A single pool (we can ignore the .mgr pool) with 4 OSDs should have 128 PGs (see https://old.ceph.com/pgcalc/).

Ideally, you configure a target ratio on the rbd pool. It doesn't matter how large it is, since it is a weight relative to all other pools, and if you only set it on the one pool it will always be 100%. This way the autoscaler should decide that the pool needs 128 PGs.
The other option is to set pg_num manually, but then also set the autoscaler to something other than "on".
Without a target size or ratio, the autoscaler only takes the current usage of the pool into account, which can lead to far too low PG numbers on empty pools. Both options are sketched below.
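A sketch of both options for the rbd pool:

Code:
# option 1: give the pool a target ratio so the autoscaler sizes it up-front
ceph osd pool set rbd target_size_ratio 1
# option 2: set pg_num manually and keep the autoscaler from shrinking it again
ceph osd pool set rbd pg_autoscale_mode warn
ceph osd pool set rbd pg_num 128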

The other thing that might cause performance problems is that you seem to be testing on a single server, with the CRUSH rule changed so that the failure domain is the OSD instead of the host? Ceph benefits a lot from spreading the load over many OSDs and hosts; doing it all on a single host may well cost you performance. I can't say how much, since the few times I built such a setup it was to test functionality, never performance.
 

I have adjusted the number of PGs many times, e.g. 128, 256, 64 and 32. I have also found that the sequential write speed of my single NVMe, or even of three NVMes, cannot exceed 500 MB/s.
In the last test I used one NVMe and set --osds-per-device=4 with ceph-volume lvm create. On that basis I tested 128, 64 and 32 PGs, and the results were not much different.
The PG count of 32 came from a screenshot attached to the original post (not reproduced here).

I think the only way to break through this is to add more machines. Perhaps the smallest possible cluster simply cannot deliver even half of Ceph's potential performance.
Thank you for your detailed answer.
 
Hi there, I have the exact same problem: read speed is good but write is super bad.
I use PM1643 3.84 TB SSDs as OSDs and a 970 EVO Plus 1 TB per node as DB/WAL, across 5 nodes.
I created 4 OSDs per SSD on the 5 nodes.
What can I do?
(My benchmark results were attached as a screenshot, not reproduced here.)
 

Remove your 970 EVO.

Don't use consumer SSDs, and above all don't use consumer SSDs for DB/WAL.

You already have an enterprise drive as the OSD (with fast fsync writes), so that setup doesn't make any sense.
Simply create your OSDs with the WAL/DB on the same PM1643 drive.
 
I second spirit's post on removing the 970 EVO.
Just destroy the whole OSD setup and recreate it with DB/WAL on the same "disk"; see the sketch below.
1 OSD per drive will also suffice.
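A sketch of that rebuild, one drive at a time (OSD ID and device name are examples; wait for the cluster to be healthy again before doing the next one):

Code:
# take the OSD out, stop it, destroy it, then recreate it with DB/WAL co-located
ceph osd out 12
systemctl stop ceph-osd@12.service
pveceph osd destroy 12 --cleanup
pveceph osd create /dev/sdb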

I had it set up exactly that way before this step; I only added the DB/WAL on NVMe to try to improve the speed, but it made no difference. I get exactly the same benchmark results with just the PM1643 SSDs.

I created 4 OSDs per drive, which means I have 20 OSDs in total. The only problem is write speed.

How can I troubleshoot it?
We are using PM1643 drives; we should at least be getting more write speed than this, even without a cache.

I really have no idea.
 
