Ceph Luminous with Bluestore - slow VM read

feriicko · New Member · Apr 20, 2018
Hi everyone,

We recently installed Proxmox with Ceph Luminous and Bluestore on our brand-new cluster, and we are experiencing a problem with slow reads inside VMs. We tried different settings on the Proxmox VMs, but the read speed stays the same: around 20-40 MB/s.

Here is our hardware configuration:
  • 3 nodes, each with 6 OSDs (classic 4 TB spinning disks, Toshiba MG04ACA400E) + an Intel SSD for the Ceph DB (created with the default settings when creating OSDs in the Proxmox GUI)
  • CPUs are Intel E5-2620 v4, 8 cores, in 2 sockets
  • 128 GB RAM
  • Ceph has its own 10G network
  • Ceph is the newest version, 12.2.4
The rados benchmark shows promising results, around 300 MB/s write and 800 MB/s read, but inside a Proxmox VM we get only 20-40 MB/s (read). The pool is replicated with a 3/2 rule and was created with 512 PGs.

Can you please help?

ceph.conf

Code:
[global]
auth client required = cephx 
auth cluster required = cephx 
auth service required = cephx 
cluster network = 10.100.210.0/24 
fsid = e99379a9-d2ac-494e-bcf9-a76a12e9835d 
keyring = /etc/pve/priv/$cluster.$name.keyring 
mon allow pool delete = true 
osd journal size = 5120 
osd pool default min size = 2 
osd pool default size = 3 
public network = 10.100.210.0/24 
osd deep scrub interval = 1209600 
osd scrub begin hour = 23 
osd scrub end hour = 5 
osd scrub sleep = 0.1 
debug ms = 0/0 

[osd] 
keyring = /var/lib/ceph/osd/ceph-$id/keyring 

[mon.node03] 
host = node03 
mon addr = 10.100.210.3:6789 

[mon.node01] 
host = node01 
mon addr = 10.100.210.1:6789 

[mon.node02] 
host = node02
mon addr = 10.100.210.2:6789
 
What cache mode do you use? The cache settings from QEMU are used by Ceph.
 
How do you benchmark it? rados bench uses multiple threads.

Try testing with fio, with iodepth=32 for example.
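
A minimal sketch of such a test inside the VM (the /root/fio-test path and 4G size are placeholders; adjust the block size and rw pattern to what you want to measure):

Code:
fio --name=seqread --filename=/root/fio-test --size=4G \
    --rw=read --bs=4M --ioengine=libaio --direct=1 \
    --iodepth=32 --runtime=60 --time_based --group_reporting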
 
With rados bench -p Ceph-test 20 write --no-cleanup:
Code:
Total time run:         20.455545
Total writes made:      971
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     189.875
Stddev Bandwidth:       27.0703
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 108
Average IOPS:           47
Stddev IOPS:            6
Max IOPS:               56
Min IOPS:               27
Average Latency(s):     0.336378
Stddev Latency(s):      0.227068
Max latency(s):         1.20641
Min latency(s):         0.04353
With rados bench -p Ceph-test 20 seq -t 8:
Code:
Total time run:       10.977036
Total reads made:     971
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   353.83
Average IOPS:         88
Stddev IOPS:          12
Max IOPS:             104
Min IOPS:             68
Average Latency(s):   0.0895998
Max latency(s):       0.497639
Min latency(s):       0.0275173
 
Do you also get such bad performance when you copy files within one drive of the VM?

The problem comes from network latency plus Ceph latency. If you copy one file sequentially with small blocks, it is iodepth=1 (the same goes for the dd command, for example).

For each block you pay the network latency; at 0.1 ms per request, for example, you can do at most 10,000 IOPS.
With 4k blocks, that gives you 40 MB/s at most.
With 4MB blocks, 40 GB/s.
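
To make the arithmetic explicit (using the illustrative 0.1 ms figure above):

Code:
IOPS at iodepth=1  =  1 / latency  =  1 / 0.0001 s  =  10,000
4 KB blocks:  10,000 x 4 KB  =  40 MB/s
4 MB blocks:  10,000 x 4 MB  =  40 GB/s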

There is also Ceph's own latency (the code that uses CPU to retrieve the block).
To improve this, you need high CPU frequency (on both the Ceph cluster and the Ceph client),
and you can reduce CPU usage by disabling debug output in ceph.conf on the client side:

[global]
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0


I'm currently able to reach 4,000 IOPS with a single queue depth (good network, 3.2 GHz CPU frequency).

(Of course, with many VMs and higher queue depths, I'm able to reach 600,000 IOPS.)
It's all about latency (network + Ceph).
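
If you want to measure that single-request latency yourself, a quick iodepth=1 random-read test inside the VM is enough; this is only a sketch (file path and size are placeholders):

Code:
fio --name=lat --filename=/root/fio-test --size=1G \
    --rw=randread --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=1 --runtime=30 --time_based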


One last thing:
cache=writeback improves one thing: sequential writes. (If you write sequentially with small blocks, the Ceph client collects them in a small cache and sends one big block to the Ceph cluster, so less latency and better throughput.)
But writeback increases read latency, so you'll get fewer IOPS on reads.

So unless you have a special workload that needs small sequential writes, keep cache=none.
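
For reference, the cache mode can be changed per disk with qm on the Proxmox host; the VM ID 100 and the scsi0 disk spec below are only examples, so check your own values with qm config first:

Code:
# show the current disk line for VM 100
qm config 100 | grep scsi0
# set cache=none on that disk (adjust storage/volume name to your output)
qm set 100 --scsi0 local-lvm:vm-100-disk-0,cache=none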
 
I tried many different tests, including read performance with 4k, 1M, and 4M blocks, but the result is still the same. Could something be wrong in the Ceph settings, or maybe the cluster was created badly with the default settings from Proxmox? What do you say?

northe, I tried copying a file to the same disk in the VM, and the read speed is still around 30 MB/s.
 
It is the same in my post:
https://forum.proxmox.com/threads/pve-5-1-46-ceph-bluestore-poor-performance-with-smal-files.42928/

Yes, I do have all spinning drives, but they deliver ~175 MB/s, and my fio test with 4k block size reports 133 MB/s.
However, this does not justify a bandwidth of 20-30 MB/s inside the VM.

Do you also get such bad performance when you copy files within one drive of the VM?
Sorry, I tried it today, and a direct copy within the VM runs at around 200 MB/s and more. I copied a 1.4 GB file.
 
feriicko, within one drive (GPT, >2 TB) I do not get rates higher than ~30-40 MB/s when I copy from directory to directory.

But I found out an interesting thing:
If you have several disks attached to your VM and try to back up with e.g. Acronis or ShadowProtect to a physical NAS, you get rates over 150 MB/s if you let the backup job handle all the disks together, saving them simultaneously. If the backup job saves drive by drive, I do not exceed 30-40 MB/s.

Perhaps it has to do with read-ahead settings, because this post describes my situation very well:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-November/014525.html
and according to the Ceph docs, this might be the cure for it:
http://docs.ceph.com/docs/jewel/rbd/rbd-config-ref/
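
Based on the RBD config reference linked above, the client-side read-ahead knobs go in ceph.conf like this; the values here are only illustrative, not tested recommendations:

Code:
[client]
# number of sequential requests needed to trigger read-ahead (default 10)
rbd readahead trigger requests = 10
# maximum read-ahead size (default 512 KB; raised to 4 MB as an experiment)
rbd readahead max bytes = 4194304
# stop read-ahead after this many bytes, letting the guest OS take over (default 50 MB)
rbd readahead disable after bytes = 52428800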
 
feriicko, did you solve this problem? I just set up a new cluster, my first with Bluestore, and I'm facing a similar issue. Write bandwidth and IOPS are quite good, but random reads are just 10 times slower than on my old Filestore clusters (Hammer).
The setup is:
- Ceph Luminous
- 3 nodes with 2 SSDs for RocksDB/WAL and 4 HDDs (4 TB)

I played with read-ahead without success :(
 
Sorry to bump this old thread. We also noticed this issue recently on Octopus; it worked normally before Nautilus. rados bench on the host or from the Ceph MDS shows OK values (read higher than write), but inside the VM sequential read tops out at 80 MB/s. Maybe it's related: during the test we noticed iowait rise to 50.
 
