simulate qemu-ceph read pattern with fio

dirks

Hi,

I am running a Proxmox 5.3 cluster with ceph storage (5 x Dell 720xd, each with 12 x 4 TB spinners, 2 DC S3700 for WAL and DB, 2 x 10 Gb ovs-slb-balance) and experience poor read performance in the VMs (~50 MB/s sequential read), while write performance is fine (1xx-2xx MB/s). I see similar speeds with rbd bench and rados bench when using 1 thread and a 4 MB block size. If I understood correctly, qemu uses only one io-thread.
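
For reference, a single-threaded 4 MB read benchmark along these lines should reproduce those numbers (the pool and image names below are just placeholders):
Code:
# write some objects first so the read bench has something to read
rados -p bench bench 60 write -b 4M -t 1 --no-cleanup
rados -p bench bench 60 seq -t 1
rados -p bench cleanup
# roughly the same path through librbd
rbd bench --io-type read --io-size 4M --io-threads 1 --io-total 4G bench/testvol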

To test the disk and RAID controller performance without going through rados (yes, I know, not a good idea, but read on), I wrote a fio test that simulates the pattern that qemu/ceph are using, or at least I think it does. For the test, ceph is stopped on the node on which the testing is performed.

one_disk.fio.j2 checks a single disk of a node (the {{ values }} are variables filled in by an associated ansible playbook).
Code:
[global]
ioengine=libaio
rw=randread
bs=4M
#iodepth is for the complete job!
iodepth=1
direct=1
size=50G
runtime=60

[one-read]
filename={{ ceph_osd_block_choosen_partition }}

all_disks.fio.j2 checks all disks of a node
Code:
[global]
ioengine=libaio
rw=randread
bs=4M
#iodepth is for the complete job!
iodepth=1
direct=1
size=50G
runtime=60
#nice=-1


[all-read]
filename={{ ceph_osd_block_partitions.stdout }}
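
If someone wants to try this without ansible, rendering the template by hand and giving fio a colon-separated device list should be equivalent (the device names below are just examples):
Code:
# fio treats a colon-separated filename list as multiple files for the job
sed 's|{{ ceph_osd_block_partitions.stdout }}|/dev/sdb:/dev/sdc:/dev/sdd|' all_disks.fio.j2 > all_disks.fio
fio all_disks.fio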

I get OK values of around 130-136 MB/s per disk for the one-disk test and slightly higher aggregate values for all disks at once (occasionally there are drops). The reason that the combined read speed is not higher should be that iodepth 1 applies to the complete fio job.

Looking at iostat during a ceph benchmark, I see mostly 4096 rkB/s on the disks of all machines. Looking at it while running the all_disks.fio job, I see triple that value on the machine being benchmarked, and with the one_disk.fio job ~140000 rkB/s. I wonder whether the fio file is an adequate simulation of what ceph does on the disk layer? It feels as if I am missing something important here.
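
Something like the following iostat call should show the rkB/s column mentioned above (interval and device selection are arbitrary):
Code:
# extended per-device statistics every 5 seconds
iostat -dx 5 /dev/sd[b-m]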
 
5 x Dell 720xd, each with 12 x 4 TB spinners, 2 DC S3700 for WAL and DB, 2 x 10 Gb ovs-slb-balance
I suspect that adding another SSD and splitting the 12x into 3x4 could increase the read performance. Also, how are the two 10 GbE links saturated? It could be that they are loaded unevenly and limit the read bandwidth.

Looking at iostat during a ceph benchmark, I see mostly 4096 rkB/s on the disks of all machines. Looking at it while running the all_disks.fio job, I see triple that value on the machine being benchmarked, and with the one_disk.fio job ~140000 rkB/s. I wonder whether the fio file is an adequate simulation of what ceph does on the disk layer? It feels as if I am missing something important here.
As ceph replicates an object (by default 3x) on host level (the default), not all OSDs are reading the PG holding the object from disk. Further, the PG involves three different nodes and at least two (if not all) of the objects need to be transmitted over the network.

How are the VMs configured? What type of cache mode are they using?
 
Hi Alwin, thanks for your answer.

I suspect that adding another SSD and splitting the 12x into 3x4 could increase the read performance.

How so? The data itself is on the spinners; the SSDs hold the WAL and RocksDB. So I fail to see how this would help with cold data. Sure, when the data is hot, read performance is good. With a benchmark that does not flush the cache, e.g. running the same rados read bench twice in a row, the reads are served from the SSDs/caches and speeds are at several 100 MB/s; the same happens when reading data twice in a VM (even with the VM's cache flushed).

Also, how are the two 10 GbE links saturated? It could be that they are loaded unevenly and limit the read bandwidth.

Yes, they are usually saturated unevenly, but well below their bandwidth limit. In fact, OSD traffic mainly goes through one interface per node.

For example, during a rados bench run like this one:

Code:
#!/bin/bash
# single-threaded 4M write bench followed by a sequential read bench on the same objects;
# note: caches on the other cluster nodes are not touched here
threads=1
run_name=benchmark_$(date +%Y%m%d-%H%M%S)
pool=bench
echo started on $(date) >> /var/log/${run_name}.log
rados -p $pool bench --run-name $run_name --no-cleanup -b 4M 60 write -t "$threads" >>/var/log/${run_name}.log 2>&1
# flush dirty pages first, then drop the page cache so the read bench has to hit the disks
sync && echo 3 > /proc/sys/vm/drop_caches
rados -p $pool bench --run-name $run_name 60 seq -t 1 >>/var/log/${run_name}.log 2>&1
rados -p $pool cleanup --run-name $run_name >>/var/log/${run_name}.log 2>&1
echo finished on $(date) >> /var/log/${run_name}.log

iftop shows RX rates of ~2 Gbit/s on one of the physical interfaces of each node during writes, but only tens to hundreds of Mbit/s during reads. If I increase the number of threads, performance goes up and the bandwidth on the busy interfaces rises to 7-10 Gbit/s, so at least bandwidth should not be limiting reads. And yes, we are considering switching away from ovs-slb-balance to either LACP or splitting the two interfaces into dedicated public and cluster interfaces, but again, I do not see the read bottleneck here.
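
For completeness, the per-interface numbers above come from something like this (the interface name is just an example):
Code:
# watch per-connection bandwidth on one physical interface, without DNS lookups
iftop -i eno1 -n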

As ceph replicates an object (by default 3x) on host level (the default), not all OSDs are reading the PG holding the object from disk. Further, the PG involves three different nodes and at least two (if not all) of the objects need to be transmitted over the network.

Not sure if I am reading you correctly here. You are not saying that each read has to be issued to all three members of a PG? My naive perspective:

- the client (in our case rados bench, rbd bench or qemu with librbd) asks the mons for a map
- for each block the client asks an OSD (daemon) for the data; I think by default the primary member of the PG is chosen (a way to check this is sketched after this list)
- the OSD receives the request, translates the object within the PG to its actual location on disk and issues the read to the kernel/hardware
- the hardware does its thing and the data travels back via the OSD to the client
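
A quick way to check which OSD is primary for a given object is ceph osd map (the pool and object names below are placeholders):
Code:
# prints the PG and the acting set for the object; the first OSD in the acting set is the primary that serves the reads
ceph osd map bench benchmark_data_node1_12345_object0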

As we have 5 nodes, 4/5 of the read requests actually have to travel the network, maybe less if ceph is smart about preferring local OSDs over remote ones (IIRC there was a ceph option for this, but setting it did not really change the results). Let us just assume that traveling the network is what most data has to do.

- ICMP RTT is mostly < 0.1 ms (e.g. rtt min/avg/max/mdev = 0.054/0.067/0.118/0.015 ms)
- disk latency from a local fio job on the node is ~30 ms (e.g. lat (msec): min=1, max=189, avg=27.90, stdev=9.13)
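
As a rough cross-check of these numbers, ceph itself reports per-OSD latencies and the read bench prints an average latency in its summary (pool name is a placeholder):
Code:
# per-OSD commit/apply latency in ms
ceph osd perf
# average latency of single-threaded 4M reads is in the summary at the end
# (assumes objects from an earlier write bench with --no-cleanup are still there)
rados -p bench bench 30 seq -t 1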

I assume that multiple network round trips are needed for each request (does ceph create a new TCP session per request or does it keep using an established connection?), but the gap between disk and network latency is quite substantial and I do not yet see the network as the bottleneck.

How are the VMs configured? What type of cache mode are they using?

Mostly writeback, but I tested different settings and, if I recall correctly, writeback and the default performed best and there was hardly a difference. I would really like to isolate the cause of the problem; there are just too many components involved to start changing hardware or software parameters.

If it is not the network (see my reasoning above), what could be the reason for the substantially slower (50 vs. 140 MB/s) reads via rados compared to local fio?

I should probably have asked this sooner, but is my assumption correct that reading from inside a qemu-vm is similar to running a rados bench with one thread?
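
For what it is worth, the sequential read test inside the VM looks roughly like this (the device path is just an example, a large file works too):
Code:
# one job, one outstanding 4M read at a time, bypassing the guest page cache
fio --name=vm-seqread --ioengine=libaio --rw=read --bs=4M --iodepth=1 --direct=1 --size=10G --filename=/dev/vdb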
 
Hi,
do you get better read performance with a higher read-ahead value inside the VM?
Like with such a udev rule:
Code:
#99-ceph.rules
SUBSYSTEM=="block", ATTR{queue/rotational}=="1", ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{bdi/read_ahead_kb}="16384", ATTR{queue/read_ahead_kb}="16384", ATTR{queue/scheduler}="deadline"
If this raises the performance, the bottleneck is less the network and more the spinners (and their latency).
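
To try the rule without rebooting, reloading udev and re-triggering the block devices should be enough (a sketch; sda is just an example):
Code:
udevadm control --reload-rules
udevadm trigger --subsystem-match=block --action=change
cat /sys/block/sda/queue/read_ahead_kb   # should now show 16384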

Udo
 
I am using a good deal of assumptions to simplify the moving parts in ceph. In the end it has to be tested/monitored.

How so? The data itself is on the spinners; the SSDs hold the WAL and RocksDB. So I fail to see how this would help with cold data. Sure, when the data is hot, read performance is good. With a benchmark that does not flush the cache, e.g. running the same rados read bench twice in a row, the reads are served from the SSDs/caches and speeds are at several 100 MB/s; the same happens when reading data twice in a VM (even with the VM's cache flushed).
Yes, the object data is saved on the spinner, but its metadata is put into RocksDB and the WAL. So 6 OSDs share the IO/s of one SSD. The OSD memory cache can have the effect that a subsequent read is faster (the default is 4 GB with ceph >= 12.2.9).

How big is your WAL/DB partition per OSD? If the DB stores more data than the partition can hold, it will spill over onto the spinner.
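
If I am not mistaken, the bluefs counters of an OSD show whether that has already happened; a sketch with osd.0 as an example:
Code:
# slow_used_bytes > 0 would indicate that RocksDB has spilled over onto the data device
ceph daemon osd.0 perf dump | grep -E 'db_total_bytes|db_used_bytes|slow_used_bytes'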

iftop shows RX rates of ~2 Gbit/s on one of the physical interfaces of each node during writes, but only tens to hundreds of Mbit/s during reads. If I increase the number of threads, performance goes up and the bandwidth on the busy interfaces rises to 7-10 Gbit/s, so at least bandwidth should not be limiting reads. And yes, we are considering switching away from ovs-slb-balance to either LACP or splitting the two interfaces into dedicated public and cluster interfaces, but again, I do not see the read bottleneck here.
A rados bench uses 16 threads by default. The thread count that you used in your test can be taken as the number of VMs that can run on that node while using the maximum bandwidth. I suppose VMs are running on all nodes in the cluster, so the count on a single node might be lower.
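
So for a like-for-like comparison with the defaults, the read bench could be repeated with 16 threads (pool name as in the script above):
Code:
rados -p bench bench 60 seq -t 16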

Not sure if I am reading you correctly here. You are not saying that each read has to be issued to all three members of a PG? My naive perspective:

- the client (in our case rados bench, rbd bench or qemu with librbd) asks the mons for a map
- for each block the client asks an OSD (daemon) for the data; I think by default the primary member of the PG is chosen
- the OSD receives the request, translates the object within the PG to its actual location on disk and issues the read to the kernel/hardware
- the hardware does its thing and the data travels back via the OSD to the client

As we have 5 nodes, 4/5 of the read requests actually have to travel the network, maybe less if ceph is smart about preferring local OSDs over remote ones (IIRC there was a ceph option for this, but setting it did not really change the results). Let us just assume that traveling the network is what most data has to do.
Yeah, my text was not clear on that. :oops: I meant your fio test reading from all disks on a node. By default the replication is done on node level. So with a size of 3, one OSD on each of three different nodes gets a copy of that object; not all OSDs on a node would read at once.

Sadly ceph doesn't have read locality; the primary OSD of the PG is contacted. But reads should be quicker than writes, as for writes the primary OSD takes care of copying the object and only sends the ACK back once all copies are written. How do your rados bench read and write tests look in comparison?

- ICMP RTT is mostly < 0.1 ms (e.g. rtt min/avg/max/mdev = 0.054/0.067/0.118/0.015 ms)
- disk latency from a local fio job on the node is ~30 ms (e.g. lat (msec): min=1, max=189, avg=27.90, stdev=9.13)

I assume that multiple network round trips are needed for each request (does ceph create a new TCP session per request or does it keep using an established connection?), but the gap between disk and network latency is quite substantial and I do not yet see the network as the bottleneck.
I suppose the latency will be even higher going through the different ceph layers. You can find some comparisons in our Ceph benchmark paper: https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

AFAIK a client (e.g. qemu) usually keeps a socket open to the OSDs it uses; if they are not used anymore (15 min timeout), they are closed and those messages should be visible in the ceph logs.
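
One way to see those sessions from a client node is to list the established TCP connections towards the ceph public network (10.10.10.0/24 below is just a placeholder for that network):
Code:
# established connections from this node into the ceph public network
ss -tn state established dst 10.10.10.0/24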

Is the bandwidth of a link maxed out while running your tests, when you get the ~50MB/s?

Mostly writeback, but I tested different settings and, if I recall correctly, writeback and the default performed best and there was hardly a difference. I would really like to isolate the cause of the problem; there are just too many components involved to start changing hardware or software parameters.
What does your ceph config look like? I am thinking of the cache settings in particular.

If it is not the network (see my reasoning above), what could be the reason for the substantially slower (50 vs. 140 MB/s) reads via rados compared to local fio?
Yes, good question; the network is only a part of it. How are the disks for the OSDs configured (HBA/RAID, ...)? And in general, could you please write more about your hardware?

I should probably have asked this sooner, but is my assumption correct that reading from inside a qemu-vm is similar to running a rados bench with one thread?
Similar enough. Yes, by default qemu uses only one I/O thread for all configured disks. But it can use caching and talks to the cluster through librbd.
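
On the Proxmox side, the cache mode and a dedicated iothread can be set per disk, for example (VMID, storage and volume name are placeholders; iothread needs the virtio-scsi-single controller):
Code:
qm set 100 --scsihw virtio-scsi-single --scsi0 ceph-rbd:vm-100-disk-1,cache=writeback,iothread=1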
 
Hi,

I assume that multiple network round trips are needed for each request (does ceph create a new TCP session per request or does it keep using an established connection?), but the gap between disk and network latency is quite substantial and I do not yet see the network as the bottleneck.

With low iodepth, it is indeed the latency that limits performance.

We have two kinds of latency:
- network latency
- but also CPU latency (all the code in the ceph servers and the ceph clients).

To improve CPU latency:

- use CPUs with high clock frequencies (a quick governor check is sketched after the settings below)

- disable cephx:
auth_cluster_required = none
auth_service_required = none
auth_client_required = none

- disable debug in the server and client ceph.conf:
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0
mutex_perf_counter = false
throttler_perf_counter = false
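
For the fast-CPU point above, it may also help to make sure the cores are not clocking down; a quick check, assuming the standard cpufreq sysfs paths:
Code:
# show the current governor and pin all cores to "performance"
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor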



Also, if you use writeback, it slows down parallel reads with a bigger iodepth a lot (around twice as slow). (There is some kind of big mutex; I think it'll be fixed in nautilus.)
 
