Bad rand read/write I/O Proxmox + CEPH

Kenny Huynh

Member
Dec 26, 2018
Hello,

I am deploying 3 servers for Proxmox HCI (Ceph) and got the results below (see the image).

Linux:
Tested with 75% read / 25% write at the same time, 4k block size, iodepth=64:
image_2020_04_09T09_12_56_091Z.png

Sequential throughput is very high, but random read/write is really bad :(
random read 100%, 4k block size, iodepth=64: 30k IOPS
random write 100%, 4k block size, iodepth=64: 15k IOPS
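Roughly, the random tests correspond to an fio job like this (a sketch; the test file path, size and runtime here are placeholders, not the exact job file I used):

Code:
[global]
ioengine=libaio
filename=/root/fio-testfile   # hypothetical test file
size=4G
direct=1
runtime=60
time_based
group_reporting

[randread-4k-qd64]
rw=randread
bs=4k
iodepth=64
stonewall

[randwrite-4k-qd64]
rw=randwrite
bs=4k
iodepth=64
stonewall

[randrw-75-25-4k-qd64]
rw=randrw
rwmixread=75
bs=4k
iodepth=64
stonewall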

Windows:
picturemessage_d4gcrht1.elu.png

My Hardware:
3 x Dell R730
CPU: 2x E5-2630 v3 (8 cores, 16 threads each)
MEMORY: 96 GB DDR4-1866
RAID: H730 with 1 GB cache
STORAGE: 4x 1.92 TB Samsung PM883 enterprise SSD

Ceph:
RAID 0 for each disk
4 OSDs per server
Cluster and public network: 2x 10Gb Ethernet in an LACP bond on each server (see the sketch below)
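For reference, the LACP bond in /etc/network/interfaces looks roughly like this (a sketch; the interface names and address are placeholders, not my exact config):

Code:
# hypothetical interface names and address
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-miimon 100

auto vmbr0
iface vmbr0 inet static
    address 10.10.10.11/24
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0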

Please help me optimize random I/O or give me some advice. I have been stuck on this for a month :(
 
What is your level of replication? (x3?)

RAID 0 could also impact performance (it's better to use a passthrough controller). Have you disabled the cache of the H730? (write-through)

Have you tested from a Linux VM, or directly from the host?

(From a VM I can reach 80k IOPS randread 4k, but with a 3 GHz CPU, as 1 disk can't use more than 1 core. For writes I can reach around 30-40k IOPS, still CPU limited.)
 
You can also gain some more IOPS by disabling debug logging in /etc/ceph/ceph.conf:
Code:
[global]
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0
perf = true
mutex_perf_counter = false
throttler_perf_counter = false
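If you don't want to restart the daemons, the same settings can usually be injected at runtime (a sketch, only covering a few of the options above):

Code:
# apply to all running OSDs without a restart
ceph tell osd.* injectargs '--debug-osd=0/0 --debug-ms=0/0 --debug-filestore=0/0'
# and to the monitors
ceph tell mon.* injectargs '--debug-mon=0/0 --debug-paxos=0/0'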
 


Hi Spirit,

Thank you very much for replying to me.

- Replication = 2, min size = 1

- I have tested from a Linux VM; the results are as above, 30k/15k. I have also tested directly from the host with rados; from the host the I/O is 25k/9k.

- I configured the controller cache as:

Read: No read ahead
Write: Write Back
Disk cache policy: Enabled

1586764046196.png
 
First, try disabling write-back on your controller; you really don't need any cache with your SSDs, and it can increase latency.
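On a PERC H730 this can typically be switched per virtual disk with perccli (or storcli), along these lines (a sketch; the controller/VD numbers are assumptions):

Code:
# write-through and no read-ahead for all virtual disks on controller 0 (numbers are hypothetical)
perccli /c0/vall set wrcache=WT
perccli /c0/vall set rdcache=NoRA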

(Same for your VMs: use cache=none. If you use cache=writeback, reads will be twice as slow.)
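On the Proxmox side, the per-disk cache mode can be set like this (a sketch; VM ID, bus and volume name are hypothetical):

Code:
# switch an existing disk of VM 100 to cache=none
qm set 100 --scsi0 local-lvm:vm-100-disk-0,cache=none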

Maybe try to benchmark with fio directly on your host (with the rbd engine).

Here is a sample job for the rbd engine:
Code:
fio.conf

[write-4k]
description="write test with block size of 4k"
ioengine=rbd
clientname=admin
pool=rbd
rbdname=block...
rw=randwrite
bs=4k
iodepth=64
direct=1
 
Thanks Spirit,

I will try to follow what you say

Hope to have good results.

Thank you very much

Have a nice day.
 
Sorry for warming up this old thread, but it is relevant to my question.

My environment:
pve: 7.3-3
ceph: 17.2.5
disk: 3x 120 GB NVMe

Since the physical machine only has SATA SSDs, which would give low test results, I tested pve + Ceph under VMware nested virtualization as described below.
Test environment setup process:
1. Install the pve 7.3-3 image normally
2. Install Ceph
3. Reconfigure the CRUSH map: change "step chooseleaf firstn 0 type host" to "step chooseleaf firstn 0 type osd" (see the sketch after this list)
4. Upload a CentOS 7.9 image
5. Create a VM
6. Install fio, fio-cdm (https://github.com/xlucn/fio-cdm) and miniconda
7. Run python fio-cdm to start the test.
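The CRUSH map change in step 3 was done roughly as follows (a sketch; the file names are arbitrary):

Code:
# export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt:
#   step chooseleaf firstn 0 type host  ->  step chooseleaf firstn 0 type osd
# recompile and inject it back
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin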

The fio-cdm default test conf.
Code:
(base) [root@localhost fio-cdm]# ./fio-cdm -f test
(base) [root@localhost fio-cdm]# cat test
[global]
ioengine=libaio
filename=.fio_testmark
directory=/root/fio-cdm
size=1073741824.0
direct=1
runtime=5
refill_buffers
norandommap
randrepeat=0
allrandrepeat=0
group_reporting

[seq-read-1m-q8-t1]
rw=read
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall

[seq-write-1m-q8-t1]
rw=write
bs=1m
rwmixread=0
iodepth=8
numjobs=1
loops=5
stonewall

[seq-read-1m-q1-t1]
rw=read
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[seq-write-1m-q1-t1]
rw=write
bs=1m
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[rnd-read-4k-q32-t16]
rw=randread
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall

[rnd-write-4k-q32-t16]
rw=randwrite
bs=4k
rwmixread=0
iodepth=32
numjobs=16
loops=5
stonewall

[rnd-read-4k-q1-t1]
rw=randread
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

[rnd-write-4k-q1-t1]
rw=randwrite
bs=4k
rwmixread=0
iodepth=1
numjobs=1
loops=5
stonewall

Here are the results of my tests on this machine:

Code:
E:\Programing\PycharmProjects\fio-cdm>python fio-cdm
tests: 5, size: 1.0GiB, target: E:\Programing\PycharmProjects\fio-cdm 639.2GiB/931.2GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     2081.70|     1720.74|
|SEQ1M Q1 T1 |     1196.77|     1332.19|
|RND4K Q32T16|     1678.69|     1687.20|
|. IOPS      |   409835.70|   411913.22|
|. latency us|     1233.11|     1022.39|
|RND4K Q1 T1 |       40.58|       97.54|
|. IOPS      |     9908.40|    23813.04|
|. latency us|       99.94|       40.53|

Here are the test results for nested virtualization pve + Ceph:

Code:
(base) [root@localhost fio-cdm]# ./fio-cdm
tests: 5, size: 1.0GiB, target: /root/fio-cdm 2.0GiB/27.8GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1143.50|        4.81|
|SEQ1M Q1 T1 |      504.79|      193.58|
|RND4K Q32T16|      111.63|       28.25|
|. IOPS      |    27254.04|     6896.46|
|. latency us|    18699.01|    73813.12|
|RND4K Q1 T1 |        6.18|        2.10|
|. IOPS      |     1508.50|      512.79|
|. latency us|      658.65|     1943.99|

Looking at the random 4K, 32-queue, 16-thread results, the gap is huge.
I can't believe this is an NVMe disk.

Other info:
Code:
root@pve:~# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP  META     AVAIL    %USE  VAR   PGS  STATUS
 0    ssd  0.11719   1.00000  120 GiB  2.3 GiB  2.2 GiB   0 B  138 MiB  118 GiB  1.95  1.07   24      up
 1    ssd  0.11719   1.00000  120 GiB  2.5 GiB  2.3 GiB   0 B  158 MiB  117 GiB  2.09  1.14   25      up
 2    ssd  0.11719   1.00000  120 GiB  1.7 GiB  1.6 GiB   0 B  106 MiB  118 GiB  1.44  0.79   17      up
                       TOTAL  360 GiB  6.6 GiB  6.2 GiB   0 B  402 MiB  353 GiB  1.82                 
MIN/MAX VAR: 0.79/1.14  STDDEV: 0.28
root@pve:~# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME      STATUS  REWEIGHT  PRI-AFF
-1         0.35155  root default                          
-3         0.35155      host pve                          
 0    ssd  0.11719          osd.0      up   1.00000  1.00000
 1    ssd  0.11719          osd.1      up   1.00000  1.00000
 2    ssd  0.11719          osd.2      up   1.00000  1.00000

Even a SATA SSD can reach 20K+ IOPS with the same test configuration on VMware vSAN.
 
Well, I found that testing RBD performance directly does not seem to hit a bottleneck; it is the virtual machine reads and writes that fail to unlock the performance of the whole Ceph cluster.

Code:
rbd bench pool1/vm-100-disk-0 --io-type write --io-size 4K --io-pattern rand  --io-threads 32 --io-total 1G

I tested with the command above. From the results you can see that the read and write performance of the test environment is limited by the CPU.


1675070205499.png


Based on the results so far, the read and write performance of virtual machines in pve should be at this level (8k IOPS), but unfortunately I cannot find the reason for it. I think the bottleneck may not be in Ceph: as long as the CPU is strong enough, Ceph's built-in benchmarks can push up to NVMe bandwidth.
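For a cluster-level baseline next to rbd bench, rados bench can be used roughly like this (a sketch; the pool name and parameters are assumptions):

Code:
# 30 seconds of 4k writes with 32 concurrent ops against pool "pool1"
rados bench -p pool1 30 write -b 4096 -t 32 --no-cleanup
# sequential and random reads of the objects written above
rados bench -p pool1 30 seq -t 32
rados bench -p pool1 30 rand -t 32
# remove the benchmark objects afterwards
rados -p pool1 cleanup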
 
Use bcache.
3x HDD + 1x NVMe
Testing the bcache device directly (see the creation sketch after these results):
Code:
root@pve1:/mnt/sdb# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/sdb 6.5GiB/931.1GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |     1844.92|     1168.63|
|SEQ1M Q1 T1 |     1566.59|     1161.81|
|RND4K Q32T16|      955.06|      115.84|
|. IOPS      |   233168.50|    28280.66|
|. latency us|     2194.71|    18082.12|
|RND4K Q1 T1 |       53.63|       98.47|
|. IOPS      |    13093.18|    24040.59|
|. latency us|       75.71|       40.24|
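For reference, a bcache device like the one above is typically created along these lines (a sketch; the device names are hypothetical):

Code:
# NVMe partition as cache device, HDD as backing device; /dev/bcache0 appears afterwards
make-bcache -C /dev/nvme0n1p1 -B /dev/sdb
# optional: let sequential I/O go through the cache as well
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff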

Created an RBD image "test", then mounted it (see the sketch after these results):
Code:
root@pve1:/mnt/test# python3 ~/fio-cdm/fio-cdm
tests: 5, size: 1.0GiB, target: /mnt/test 104.3MiB/10.0GiB
|Name        |  Read(MB/s)| Write(MB/s)|
|------------|------------|------------|
|SEQ1M Q8 T1 |      234.20|       95.06|
|SEQ1M Q1 T1 |      149.89|       17.96|
|RND4K Q32T16|       74.99|       14.65|
|. IOPS      |    18308.44|     3576.15|
|. latency us|    27919.29|   142214.86|
|RND4K Q1 T1 |        0.78|        0.04|
|. IOPS      |      191.11|       10.03|
|. latency us|     5229.82|    99721.42|
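The RBD image above was created and mounted roughly like this (a sketch; pool name, size and filesystem are assumptions):

Code:
# create, map, format and mount a 10 GiB test image
rbd create rbd/test --size 10G
rbd map rbd/test          # returns a device such as /dev/rbd0
mkfs.ext4 /dev/rbd0
mkdir -p /mnt/test
mount /dev/rbd0 /mnt/test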

rbd bench
Code:
root@pve1:~# rbd bench --io-type write test --io-size 4K --io-pattern rand
bench  type write io_size 4096 io_threads 16 bytes 1073741824 pattern random
  SEC       OPS   OPS/SEC   BYTES/SEC
    1      1536   1347.24   5.3 MiB/s
    2      1600   738.582   2.9 MiB/s
    3      1664   546.881   2.1 MiB/s
    4      1728   417.629   1.6 MiB/s
    5      1776   345.683   1.4 MiB/s

that's too bad.
 
@Otter7721 did you manage to solve the issue regarding slow write speeds?

I recently set up something similar to your build: 6x MZ1LB960B MZ1LB9T80 drives rated at around 2000-3000 MB/s. After creating a pool with 2 OSDs per node across 3 servers, I am only getting around 200+ MB/s write, which is extremely bad, even slower than an SSD.

Appreciate some insight on this, thanks!
1677581732194.png

 
