Ceph Luminous with Bluestore - slow VM read

feriicko · New Member · Apr 20, 2018
Hi everyone,

We recently installed Proxmox with Ceph Luminous and Bluestore on our brand-new cluster, and we are experiencing a problem with slow reads inside VMs. We tried different settings on the Proxmox VMs, but the read speed stays the same: around 20-40 MB/s.

Here is our hardware configuration:
  • 3 nodes, each with 6 OSDs (classic 4 TB spinning disks, Toshiba MG04ACA400E) + an Intel SSD for the Ceph DB (created with the default settings when creating OSDs in the Proxmox GUI)
  • CPUs are Intel E5-2620 v4, 8 cores, in 2 sockets
  • 128 GB RAM
  • Ceph has its own 10G network
  • Ceph is the newest version, 12.2.4
The rados benchmark shows promising results, around 300 MB/s write and 800 MB/s read, but inside a Proxmox VM we get only 20-40 MB/s (read). The pool is replicated with a 3/2 rule and was created with 512 PGs.

Can you please help?

ceph.conf

Code:
[global]
auth client required = cephx 
auth cluster required = cephx 
auth service required = cephx 
cluster network = 10.100.210.0/24 
fsid = e99379a9-d2ac-494e-bcf9-a76a12e9835d 
keyring = /etc/pve/priv/$cluster.$name.keyring 
mon allow pool delete = true 
osd journal size = 5120 
osd pool default min size = 2 
osd pool default size = 3 
public network = 10.100.210.0/24 
osd deep scrub interval = 1209600 
osd scrub begin hour = 23 
osd scrub end hour = 5 
osd scrub sleep = 0.1 
debug ms = 0/0 

[osd] 
keyring = /var/lib/ceph/osd/ceph-$id/keyring 

[mon.node03] 
host = node03 
mon addr = 10.100.210.3:6789 

[mon.node01] 
host = node01 
mon addr = 10.100.210.1:6789 

[mon.node02] 
host = node02
mon addr = 10.100.210.2:6789
 
What cache mode do you use? The cache settings from QEMU are used by Ceph.
 
How do you benchmark it? rados bench uses multiple threads.

Try testing with fio, with iodepth=32 for example.
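
A minimal sketch of such a test inside the VM (the /root/fio-test path and 4G size are placeholders; adjust the block size and rw pattern to what you want to measure):

Code:
fio --name=seqread --filename=/root/fio-test --size=4G \
    --rw=read --bs=4M --ioengine=libaio --direct=1 \
    --iodepth=32 --runtime=60 --time_based --group_reporting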
 
With rados bench -p Ceph-test 20 write --no-cleanup:
Code:
Total time run:         20.455545
Total writes made:      971
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     189.875
Stddev Bandwidth:       27.0703
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 108
Average IOPS:           47
Stddev IOPS:            6
Max IOPS:               56
Min IOPS:               27
Average Latency(s):     0.336378
Stddev Latency(s):      0.227068
Max latency(s):         1.20641
Min latency(s):         0.04353
With rados bench -p Ceph-test 20 seq -t 8:
Code:
Total time run:       10.977036
Total reads made:     971
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   353.83
Average IOPS:         88
Stddev IOPS:          12
Max IOPS:             104
Min IOPS:             68
Average Latency(s):   0.0895998
Max latency(s):       0.497639
Min latency(s):       0.0275173
 
Do you also get such bad performance when you copy files within one drive of the VM?

The problem comes from network latency plus Ceph latency. If you copy one file sequentially with small blocks, it is iodepth=1 (the same goes for the dd command, for example).

For each block you pay the network latency; at 0.1 ms per request, for example, you can do at most 10,000 IOPS.
With 4k blocks, that gives you 40 MB/s at most.
With 4MB blocks, 40 GB/s.
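
To make the arithmetic explicit (using the illustrative 0.1 ms figure above):

Code:
IOPS at iodepth=1  =  1 / latency  =  1 / 0.0001 s  =  10,000
4 KB blocks:  10,000 x 4 KB  =  40 MB/s
4 MB blocks:  10,000 x 4 MB  =  40 GB/s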

There is also Ceph's own latency (the code that uses CPU to retrieve the block).
To improve this, you need high CPU frequency (on both the Ceph cluster and the Ceph client),
and you can reduce CPU usage by disabling debug output in ceph.conf on the client side:

[global]
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0


I'm currently able to reach 4,000 IOPS with a single queue depth (good network, 3.2 GHz CPU frequency).

(Of course, with many VMs and higher queue depths, I'm able to reach 600,000 IOPS.)
It's all about latency (network + Ceph).
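
If you want to measure that single-request latency yourself, a quick iodepth=1 random-read test inside the VM is enough; this is only a sketch (file path and size are placeholders):

Code:
fio --name=lat --filename=/root/fio-test --size=1G \
    --rw=randread --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=1 --runtime=30 --time_based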


One last thing:
cache=writeback improves one thing: sequential writes. (If you write sequentially with small blocks, the Ceph client collects them in a small cache and sends one big block to the Ceph cluster, so less latency and better throughput.)
But writeback increases read latency, so you'll get fewer IOPS on reads.

So unless you have a special workload that needs small sequential writes, keep cache=none.
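
For reference, the cache mode can be changed per disk with qm on the Proxmox host; the VM ID 100 and the scsi0 disk spec below are only examples, so check your own values with qm config first:

Code:
# show the current disk line for VM 100
qm config 100 | grep scsi0
# set cache=none on that disk (adjust storage/volume name to your output)
qm set 100 --scsi0 local-lvm:vm-100-disk-0,cache=none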
 
I tried many different tests, including read performance with 4k, 1M, and 4M blocks, but the result is still the same. Could something be wrong in the Ceph settings, or maybe the cluster was created badly with the default settings from Proxmox? What do you say?

northe, I tried copying a file to the same disk in the VM, and the read speed is still around 30 MB/s.
 
It is the same in my post:
https://forum.proxmox.com/threads/pve-5-1-46-ceph-bluestore-poor-performance-with-smal-files.42928/

Yes, I do have all spinning drives, but they deliver ~175 MB/s, and my fio test with 4k block size reports 133 MB/s.
However, this does not justify a bandwidth of 20-30 MB/s inside the VM.

Do you also get such bad performance when you copy files within one drive of the VM?
Sorry, I tried it today, and a direct copy within the VM runs at around 200 MB/s and more. I copied a 1.4 GB file.
 
feriicko, within one drive (GPT, >2 TB) I do not get rates higher than ~30-40 MB/s when I copy from directory to directory.

But I found out an interesting thing:
If you have several disks attached to your VM and try to back up with e.g. Acronis or ShadowProtect to a physical NAS, you get rates over 150 MB/s if you let the backup job handle all the disks together, saving them simultaneously. If the backup job saves drive by drive, I do not exceed 30-40 MB/s.

Perhaps it has to do with read-ahead settings, because this post describes my situation very well:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-November/014525.html
and according to the Ceph docs, this might be the cure for it:
http://docs.ceph.com/docs/jewel/rbd/rbd-config-ref/
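
Based on the RBD config reference linked above, the client-side read-ahead knobs go in ceph.conf like this; the values here are only illustrative, not tested recommendations:

Code:
[client]
# number of sequential requests needed to trigger read-ahead (default 10)
rbd readahead trigger requests = 10
# maximum read-ahead size (default 512 KB; raised to 4 MB as an experiment)
rbd readahead max bytes = 4194304
# stop read-ahead after this many bytes, letting the guest OS take over (default 50 MB)
rbd readahead disable after bytes = 52428800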
 
feriicko, did you solve this problem? I just set up a new cluster, my first with Bluestore, and I'm facing a similar issue. Write bandwidth and IOPS are quite good, but random reads are just 10 times slower than on my old Filestore clusters (Hammer).
The setup is:
- Ceph Luminous
- 3 nodes with 2 SSDs for RocksDB/WAL and 4 HDDs (4 TB)

I played with read-ahead without success :(
 
Sorry to bump this old thread. We also noticed this issue recently on Octopus; it worked normally before Nautilus. rados bench on the host or from the Ceph MDS shows OK values (read higher than write), but inside the VM sequential read tops out at 80 MB/s. Maybe it's related: during the test we noticed iowait rise to 50.
 
