Ceph and KVM terrible disk IO

pashadee

Hi guys,

Need some bright minds in the Proxmox community :)

I set up a storage cluster using Ceph (as the title suggests). It's a fairly large cluster consisting of over 200 OSDs. When I bench the cluster using rados bench, I get exactly the kind of performance I was expecting: doing dd to a single drive on a node I get 205MB/s on average, and writing to the cluster via rados I get 400-500MB/s. Writes in my VMs need to be at least in that area.

I installed Ubuntu 16.04 x64 in KVM with a 4TB disk formatted with ext4. With the same dd test there I was able to get 25MB/s write on a 1G chunk and 15-20MB/s (varying) on a 2G chunk. I am more interested in sustained writes, as that's what my usage is going to be. I experimented with different cache settings on the disk, and with every setting other than No Cache the write speed dropped further.
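The test inside the VM was roughly this (exact flags from memory, so treat them as approximate):

Code:
# 1G and 2G sequential writes inside the guest
dd if=/dev/zero of=testfile bs=1G count=1 oflag=direct
dd if=/dev/zero of=testfile bs=2G count=1 oflag=direct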

A few details about the setup:
Cluster Network: 10.20.20.0/24 -> 10Gb
Public Network: 10.1.101.0/24 -> 1Gb

3 Monitor/VM hosts:
vmh1
vmh2
vmh3

8 Storage Nodes:
stor1
stor2
stor3
stor4
stor5
stor6
stor7
stor8

I installed Proxmox VE on all the nodes, and Ceph on all the nodes for easier management. My VMs (of which I won't have that many) will all run on the vmh nodes. I noticed last night that my /etc/hosts file had the public network IPs for vmh1, vmh2 and vmh3, and the cluster network IPs for the storage nodes. Not sure how Proxmox compiles the hosts file, maybe someone can give me some insight into that. My thought is that maybe the Ubuntu VM was writing to the cluster through the 1Gb network instead of the 10Gb... but even then 25MB/s is only 200Mbit, so any suggestions on steps to isolate where the bottleneck is would be greatly appreciated.
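If it helps with isolating this, the public/cluster split the clients use comes from ceph.conf, so on my side I can check for example:

Code:
# which subnets ceph treats as public vs. cluster
grep -i network /etc/ceph/ceph.conf
# addresses the monitors actually listen on (clients always talk to these)
ceph mon dump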

I just downloaded an Ubuntu 14.04 LXC and will deploy that to do the same test there just to see if there is any difference.

Thanks in advance!

Pasha
 
Here's another test to add, from the same host the Ubuntu VM is running on:

root@vmh1:/# rbd create test02 --pool backup --size 20000
root@vmh1:/# rbd map test02 --pool backup
/dev/rbd0
root@vmh1:/# mkfs.ext4 /dev/rbd0
mke2fs 1.42.12 (29-Aug-2014)
Discarding device blocks: done
Creating filesystem with 5120000 4k blocks and 1281120 inodes
Filesystem UUID: 22358c35-ff59-4b3c-88b8-e8b43acf9a9e
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

root@vmh1:/# mount /dev/rbd0 /mnt/tmp
root@vmh1:/# cd /mnt/tmp/
root@vmh1:/mnt/tmp# ls
lost+found
root@vmh1:/mnt/tmp# dd if=/dev/zero of=here bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.47215 s, 729 MB/s
root@vmh1:/mnt/tmp# dd if=/dev/zero of=here10 bs=10G count=1 oflag=direct
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 3.8484 s, 558 MB/s
root@vmh1:/mnt/tmp# dd if=/dev/zero of=here10 bs=9G count=1 oflag=direct
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 3.93635 s, 546 MB/s
root@vmh1:/mnt/tmp# dd if=/dev/zero of=here10 bs=2G count=5 oflag=direct
dd: warning: partial read (2147479552 bytes); suggest iflag=fullblock
0+5 records in
0+5 records out
10737397760 bytes (11 GB) copied, 13.7985 s, 778 MB/s

I guess dd has a ~2GB per-block read limitation I didn't know about (hence the partial-read warning), but either way, the cluster performance is there... something is going wrong between the host and the VM.
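For reference, the safer way to push a full 10G through dd is to keep the block size at or below 2G and raise the count (iflag=fullblock avoids the short reads):

Code:
# write 10 GiB in 1 GiB blocks, bypassing the page cache
dd if=/dev/zero of=/mnt/tmp/here10 bs=1G count=10 oflag=direct iflag=fullblock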
 
I installed Proxmox VE on all the nodes, and Ceph on all the nodes for easier management. My VMs (of which I won't have that many) will all run on the vmh nodes. I noticed last night that my /etc/hosts file had the public network IPs for vmh1, vmh2 and vmh3, and the cluster network IPs for the storage nodes. Not sure how Proxmox compiles the hosts file, maybe someone can give me some insight into that. My thought is that maybe the Ubuntu VM was writing to the cluster through the 1Gb network instead of the 10Gb... but even then 25MB/s is only 200Mbit, so any suggestions on steps to isolate where the bottleneck is would be greatly appreciated.
Hi Pasha,
of course the VM uses the 1Gb link, because that is the public network!
The cluster network is for writing replicas only - the traffic from the VM to the MONs and from the VM to the OSDs goes via the public network.

Perhaps you should use VLAN tagging to carry both networks on the 10Gb link (and bond it with the 1Gb link for failover?).
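On a Proxmox node that could look roughly like this in /etc/network/interfaces (NIC name, VLAN IDs and addresses are only examples, and the vlan package / 8021q module must be available):

Code:
# both ceph networks as tagged VLANs on the 10Gb NIC (eth2 is just an example)
auto eth2.101
iface eth2.101 inet static
        address 10.1.101.11
        netmask 255.255.255.0

auto eth2.20
iface eth2.20 inet static
        address 10.20.20.11
        netmask 255.255.255.0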

About the write speed: I guess you don't have the journal on fast SSDs?
If you try to compare with rados bench, use the switch -t 1 (for a single thread) to see how the performance looks there.

Udo
 
Thanks for your response, Udo!

Hi Pasha,
of course the VM uses the 1Gb link, because that is the public network!
The cluster network is for writing replicas only - the traffic from the VM to the MONs and from the VM to the OSDs goes via the public network.

I was under the impression that everything Ceph was on the private network and that the public network was used for actual client access to the VMs. So for instance VM --> VirtIO --> librbd --> Mon (private) --> Stor (private)... and the reverse on the way back.
That way, when I do a dd to the VM disk, I am using the 10Gb links, since it's mounted in Proxmox (the Ceph client) via 10.20.20.0 (private)? Is that not the case?

Perhaps you should use VLAN tagging to carry both networks on the 10Gb link (and bond it with the 1Gb link for failover?).
About the write speed: I guess you don't have the journal on fast SSDs?

I was definitely planning on using a LAG on the 1Gb links; I'm just waiting for my second switch to arrive so I can stack them.

For my use case I am not going to have a workload with a lot of small files, but rather fewer, larger files, so I don't need the SSDs.

If you try to compare with rados bench, use the switch -t 1 (for a single thread) to see how the performance looks there.

I will try that as well, but it still doesn't explain why, when mounting the rbd volume on the Proxmox host locally, the transfer rate was roughly 30x faster than on the same host through QEMU; that's the part that confuses me at the moment. I posted the commands I ran in the second message of this thread.

Thanks for your help!
 
You cannot rely on dd for a speed test of any kind. To measure sustained read or write speed, use bonnie++. To measure random read and write speed, use iozone. To measure read and write IOPS, use fio.
 
Thanks for the pointer, mir. I used fio and the results are not any better, that's for sure.
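The job was a mixed 4k random read/write test, roughly like this (parameters from memory, so treat them as approximate):

Code:
# 75% random reads / 25% random writes, 4k blocks, queue depth 64
fio --name=test --filename=/root/test.fio --size=4G --bs=4k \
    --rw=randrw --rwmixread=75 --ioengine=libaio --direct=1 \
    --iodepth=64 --group_reporting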

test: (groupid=0, jobs=1): err= 0: pid=19544: Thu Nov 3 18:14:11 2016
read : io=3071.7MB, bw=10929KB/s, iops=2732, runt=287815msec
write: io=1024.4MB, bw=3644.5KB/s, iops=911, runt=287815msec
cpu : usr=1.21%, sys=5.31%, ctx=837272, majf=0, minf=9
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=786347/w=262229/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: io=3071.7MB, aggrb=10928KB/s, minb=10928KB/s, maxb=10928KB/s, mint=287815msec, maxt=287815msec
WRITE: io=1024.4MB, aggrb=3644KB/s, minb=3644KB/s, maxb=3644KB/s, mint=287815msec, maxt=287815msec

Disk stats (read/write):
vda: ios=786572/262330, merge=0/57, ticks=4755948/13615028, in_queue=18374964, util=100.00%
 
Thanks for your response, Udo!



I was under the impression that everything Ceph was on the private network and that the public network was used for actual client access to the VMs. So for instance VM --> VirtIO --> librbd --> Mon (private) --> Stor (private)... and the reverse on the way back.
That way, when I do a dd to the VM disk, I am using the 10Gb links, since it's mounted in Proxmox (the Ceph client) via 10.20.20.0 (private)? Is that not the case?
no - ceph is a little bit more complicated.

To write:
1. VM -> VirtIO -> librbd
2. librbd -> MON (public) # ask the MON which placement group (OSD) the block to write belongs to - the crushmap
3. MON -> librbd (public) # answer
4. librbd -> OSD-primary (public) # write the data to the primary OSD of this placement group (PG) (into the journal first!)
5. OSD-primary -> OSDs-secondaries (private!!) # write the data again - first copying the data from the journal (memory) over the network
6. OSDs-secondaries -> OSD-primary (private) # successful write (to journal)
7. OSD-primary -> librbd (public) # write ok
8. librbd -> virtio -> VM # write ok

After the data reach the journal, they are written to the OSD - this is asynchronous (for small writes), but it influences read speed!
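By the way, you can watch the per-OSD commit and apply latency (the journal effect) with:

Code:
ceph osd perf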

To read:
1. VM -> VirtIO -> librbd
2. librbd -> MON (public) # ask the MON which placement group (OSD) the block to read belongs to - the crushmap
3. MON -> librbd (public) # answer
4. librbd -> OSD-primary (public) # ask for the data
5. OSD-primary -> librbd (public) # answer - the data (all data are in 4MB chunks)
6. librbd -> virtio -> VM

You see, there are a few more places where you pick up latency and so on...
For my use case I am not going to have a workload with a lot of small files, but rather fewer, larger files, so I don't need the SSDs.
hmm, I would not think so!
Look at the write path above - a write is only finished when the data is written to all journals - and with a fast SSD (like an Intel DC S3600/S3700) this is much faster than on HDDs.
Small files? Even if only a few bytes change, the whole chunk (4MB) must be written, and librbd can put multiple writes together (to speed things up).
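If you add SSD journals later, the journal device is given when creating the OSD - on Proxmox roughly like this (device names are examples only, see man pveceph):

Code:
# HDD as data disk, SSD (or SSD partition) as journal
pveceph createosd /dev/sdd -journal_dev /dev/sdb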
I will try that as well, but it still doesn't explain why, when mounting the rbd volume on the Proxmox host locally, the transfer rate was roughly 30x faster than on the same host through QEMU; that's the part that confuses me at the moment. I posted the commands I ran in the second message of this thread.
I don't trust the values from your "benchmark" here...

To compare, do the following with one thread first and 16 threads after that:
Code:
rados bench 60 write -p rbd -t 1 --no-cleanup

rados bench 60 write -p rbd -t 16 --no-cleanup
I guess your values will be much smaller then.

Udo
 