CEPH read performance

rados bench uses 16 threads by default.
KVM provides a single IO thread.

So I ran rados bench with one thread, and its performance is nearly identical to what I see inside the VM.
It improves drastically if the data is already in the cache on the CEPH server.

Read first time = slow
Re-read = really fast

The slow initial read is caused by the latency involved in reading the data from the disk.
This latency can be masked by using multiple IO threads.
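For reference, this is roughly how I compared single-threaded and multi-threaded reads with rados bench (the pool name is just an example, adjust to your own):

# write some objects first so there is something to read back
rados bench -p rbd 60 write -t 16 --no-cleanup
# sequential read with the default 16 threads
rados bench -p rbd 60 seq -t 16
# the same read with a single thread - roughly what one VM IO thread sees
rados bench -p rbd 60 seq -t 1
# remove the benchmark objects afterwards
rados -p rbd cleanup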

I've thought of using multiple disks in the VM and doing a software RAID0 (rough sketch below). Using rados bench, three threads gives me about 100MB/sec read, so three disks in the VM should be adequate to get an acceptable level of performance.
But Proxmox backup is still limited to one thread and is slow.
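The RAID0 sketch I had in mind, assuming the extra virtual disks show up as /dev/vdb, /dev/vdc and /dev/vdd inside the VM (device names will differ on your setup):

# inside the VM: stripe the three virtual disks together
mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/vdb /dev/vdc /dev/vdd
mkfs.ext4 /dev/md0
mkdir -p /mnt/data
mount /dev/md0 /mnt/data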

Is there anything I could tune on the CEPH server that would help improve single-threaded IO?
An SSD cache tier to mask the problem?
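If a cache tier is the way to go, this is roughly what I had in mind (pool names, PG count and size limit are just placeholders, and the cache pool would need a CRUSH rule that puts it on the SSDs):

# create a small pool to act as the cache tier
ceph osd pool create cache-pool 128 128
# put it in front of the existing rbd pool in writeback mode
ceph osd tier add rbd cache-pool
ceph osd tier cache-mode cache-pool writeback
ceph osd tier set-overlay rbd cache-pool
# basic cache tuning - bloom hit set and a size limit
ceph osd pool set cache-pool hit_set_type bloom
ceph osd pool set cache-pool target_max_bytes 200000000000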

@spirit
One thing possible currently: assign the same disk multiple times, and do some multipathing inside the guest.

Like this I have been able to reach 90000 iops with 1 disk (3x virtio disks at 30000 iops each + iothreads + krbd).
That's from another thread: https://forum.proxmox.com/threads/vm-lockups-with-ceph.20348/#post-103774
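Roughly like this - an untested sketch, the VM ID, storage name and serial are just examples, and you need to edit the config by hand because the GUI will not let you attach the same volume twice:

# /etc/pve/qemu-server/100.conf - same rbd volume attached three times,
# each with its own iothread and the same serial so the guest sees one ID
virtio1: ceph_rbd:vm-100-disk-2,iothread=1,serial=mp0001
virtio2: ceph_rbd:vm-100-disk-2,iothread=1,serial=mp0001
virtio3: ceph_rbd:vm-100-disk-2,iothread=1,serial=mp0001

Then inside the guest install multipath-tools, make sure the vd* devices are not blacklisted in /etc/multipath.conf, and check with "multipath -ll" that the three paths get grouped into one /dev/mapper device; use that device instead of the individual disks.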

I've never done anything with multipath; could you share some additional details?
Is this safe to do?

Seems like a much better idea than software RAID0 in the VM, but it still leaves Proxmox backup with only a single slow thread.

this is because the VM uses one thread only - if you start rados bench with one thread only (default is 16) the result probably looks similar.
I was getting ready to post the above when you said this.
 
Is there anything I could tune on the CEPH server that would help improve single-threaded IO?
Disable cephx, disable debug, use high-frequency Intel CPUs.
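For example, something like this in the [global] section of ceph.conf on all nodes (then restart the daemons and clients; note that disabling cephx removes authentication, so only do it on a trusted network):

auth_cluster_required = none
auth_service_required = none
auth_client_required = none
debug_ms = 0/0
debug_osd = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_auth = 0/0
# ...and so on for the other debug_* subsystems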
Also, in Proxmox 4.0 we have made optimisations in qemu, which now uses the jemalloc memory allocator (around 30% lower latency).

About my bench: I was able to reach 70000 iops 4k read per qemu disk, with librbd and Ceph hammer, with 2x 10-core 3.1GHz CPUs (both client and server). I have benched up to 600000 iops 4k read with 1 VM, 10 iothreads and 10 disks.

For writes it's lower, but they use a lot more CPU on the Ceph nodes (I think I was around 100k iops 4k). The Ceph team is going to make big improvements on this for the next release.
 
I just want to point out that even with one thread,
About my bench: I was able to reach 70000 iops 4k read per qemu disk, with librbd and Ceph hammer, with 2x 10-core 3.1GHz CPUs (both client and server). I have benched up to 600000 iops 4k read with 1 VM, 10 iothreads and 10 disks.

You know, it's really hard to compare apples and strawberries with oranges and kiwis if you have no clue how many oranges and kiwis were used :)


In other words: how many nodes and disks, and what journal config, are you using to reach the benchmark numbers above?
Then we can put that in relation to his 3x7 disks with SSD journals.
I think a normal rados bench (example in post #9) would also shed some light on this.
 
Disable cephx, disable debug
Already did that.

in Proxmox 4.0 we have made optimisations in qemu, which now uses the jemalloc memory allocator (around 30% lower latency)
Yes, 4.0 is indeed slightly faster; I'll be moving to it soon.

Qemu multithreaded IO will greatly improve things.
I recall reading on pve-devel that multiple threads and backup do not work together; that's why we are limited to one thread.

Is it possible to enable more threads to experiment with if I don't use backup?
 
I just want to point out that even with one thread,


You know, it's really hard to compare apples and strawberries with oranges and kiwis if you have no clue how many oranges and kiwis were used :)


In other words: how many nodes and disks, and what journal config, are you using to reach the benchmark numbers above?
Then we can put that in relation to his 3x7 disks with SSD journals.
I think a normal rados bench (example in post #9) would also shed some light on this.

3 nodes with 2x 10-core Xeon 3.1GHz, 64GB RAM.

Each node:
1x SSD in RAID1 with OS + monitor (Intel S3700 100GB)
6x SSD passthrough for OSDs (Intel S3610 1.6TB), journal + data on each drive.

Pool: replication x3.

600k iops random read, 4k blocks (CPU limited)
100k iops random write, 4k blocks (CPU limited)

+- 4GB/s read/write with 1M blocks (bandwidth limited)
 
Already did that.


Yes, 4.0 is indeed slightly faster; I'll be moving to it soon.

Qemu multithreaded IO will greatly improve things.
I recall reading on pve-devel that multiple threads and backup do not work together; that's why we are limited to one thread.

Is it possible to enable more threads to experiment with if I don't use backup?
Qemu doesn't support multiple threads per disk currently. It's only possible to use one different iothread for each qemu disk (Proxmox 4).

Multiple iothreads per disk is coming, but don't expect it before one year, I think.
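To give each disk its own iothread on Proxmox 4, something like this (VM ID and storage/volume names are just examples):

qm set 100 -virtio0 ceph_rbd:vm-100-disk-1,iothread=1
qm set 100 -virtio1 ceph_rbd:vm-100-disk-2,iothread=1

or set iothread=1 directly on the disk lines in /etc/pve/qemu-server/100.conf.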
 
Yeah, THAT would do it.

Just remember, you are getting VM-bench results of 70K IOPS on a 3x 6 SSD cluster with beefed-up nodes.
100e's results are for a normal 3x7 assorted HDD-based cluster, in which SSDs take no part in read operations.

I think we need to add cranberries and watermelons to this fruit salad.
(I do get that you are trying to show the increase from 70k to 600k IOPS using the method described above in synthetic benchmarks.)

That's specifically why I have not posted the performance of our work clusters, as it does not relate to this "problem", and why I instead stuck to my home cluster, which is of a similar configuration to his.
 
