Hi,
I am running a Proxmox 5.3 cluster with Ceph storage (5 x Dell 720xd, each with 12 x 4 TB spinners, 2 x DC S3700 for WAL and DB, 2 x 10 Gb ovs-slb-balance) and I am seeing poor sequential read performance in the VMs (~50 MB/s), while write performance is fine (1xx-2xx MB/s). I see similar speeds with rbd bench and rados bench when using 1 thread and a 4 MB block size. If I understood correctly, qemu uses only one I/O thread.
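For reference, the single-threaded benchmarks were along these lines (sketch; pool and image names are placeholders, and the seq read needs a prior write with --no-cleanup to have objects to read):
Code:
# 1 thread, 4 MB objects; leave the objects in place for the read test
rados bench -p <pool> 60 write -t 1 -b 4M --no-cleanup
rados bench -p <pool> 60 seq -t 1
# single-threaded 4 MB reads against an RBD image
rbd bench --io-type read --io-size 4M --io-threads 1 --io-total 10G <pool>/<image>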
To test the disk and RAID controller performance (yes, I know, not a good idea, but read on) without going through RADOS, I wrote a fio job that simulates the pattern qemu/ceph use, or at least I think it does. For the test, Ceph is stopped on the node being benchmarked.
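Roughly what I do around the fio runs (sketch; assumes the standard systemd units on a Proxmox/Ceph node):
Code:
ceph osd set noout                 # keep the cluster from rebalancing while OSDs are down
systemctl stop ceph-osd.target     # stop all OSDs on this node
# ... run the fio jobs ...
systemctl start ceph-osd.target
ceph osd unset noout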
one_disk.fio.j2 checks a single disk of a node (the {{ values }} placeholders are variables filled in by an associated Ansible playbook).
Code:
[global]
ioengine=libaio
rw=randread
bs=4M
#iodepth is for the complete job!
iodepth=1
direct=1
size=50G
runtime=60
[one-read]
filename={{ ceph_osd_block_choosen_partition }}
all_disks.fio.j2 checks all disks of a node at once.
Code:
[global]
ioengine=libaio
rw=randread
bs=4M
#iodepth is for the complete job!
iodepth=1
direct=1
size=50G
runtime=60
#nice=-1
[all-read]
filename={{ ceph_osd_block_partitions.stdout }}
I get OK values of around 130-136 MB/s per disk in the one-disk test, but only slightly higher aggregate values for all disks at once (with occasional drops). The combined read speed probably does not scale because iodepth=1 applies to the complete fio job.
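To check whether the aggregate read rate scales when each disk gets its own queue, something like this should work (untested sketch; in fio each [job] section is an independent job with its own iodepth, and the device names here are just examples):
Code:
[global]
ioengine=libaio
rw=randread
bs=4M
iodepth=1
direct=1
size=50G
runtime=60
[disk-sdb]
filename=/dev/sdb
[disk-sdc]
filename=/dev/sdc
# ...one section per OSD data disk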
Looking at iostat during a Ceph benchmark I see mostly rkB/s values around 4096 on the disks of all machines. When running the all_disks.fio job I see triple those values on the machine being benchmarked, and with the one_disk.fio job ~140000 rkB/s. Does the fio file adequately simulate what Ceph does at the disk layer? It feels as if I am missing something important here.