Thanks for posting this benchmark!
While I appreciate having the test results as a reference point when implementing PVE with Ceph, I am not sure they are representative of what people actually need/care about in real-life workloads.
High throughput numbers are cool eye-catchers when you want to boost your marketing material (over 9000Mbps! lol), but real performance is about IOPS and latency numbers IMHO.
Typical real-life workloads rarely need 1GB/s write speeds. But high IOPS certainly make a difference - especially during night hours when guest backups for multiple VMs run at the same time (I have no control over that, as these are client VMs with no access to the guest OS).
I tried the rados bench tests in my lab using a 4K block size to measure IOPS performance, and while reads reach up to 20k IOPS, writes can barely go over 5k IOPS.
Though I am not sure either result is any good, since each disk on its own can do well over 25k IOPS in writes and 70k IOPS in reads (just a little lower than the advertised specs for the disks used for testing) according to storagereview.com benchmarks (can't post a direct link due to being a new user).
Or, going by your own benchmark of the SM863 240GB disks, they can do 17k write IOPS using fio with a 4K block size.
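If I remember correctly, that test was a single-threaded direct sync write, something along these lines (the device path is a placeholder, and careful: this writes to the raw disk, so only run it on an empty drive):
Code:
# WARNING: writes directly to the raw device and destroys its contents
# /dev/sdX is a placeholder for the SSD under test
fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4K --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --name=ssd-4k-write-test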
My lab setup is a 3-node cluster consisting of 3x HP DL360p G8, each with 4x Samsung SM863 960GB (1 OSD per physical drive), a Xeon E5-2640, and 32GB ECC RAM.
The HP SmartArray P420i onboard controllers are set to HBA mode so the disks are presented directly to PVE without any RAID handling/overhead.
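In case anyone wants to reproduce this: if memory serves, I switched the controller with HP's CLI tool, roughly like this (the slot number depends on the server, and a reboot is needed before the disks show up as plain devices):
Code:
# hpssacli ctrl slot=0 modify hbamode=on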
The networking is based on InfiniBand (40G) in 'connected' mode with a 65520-byte MTU and active/passive bonding. I get a maximum of 23Gbit/s raw network transfer speed between the 3 nodes over IPoIB (measured with iperf), which is good enough for testing (or at least 2x+ better than 10GbE).
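For completeness, the relevant part of my network config looks roughly like this sketch (interface names and the address are examples, not my exact values):
Code:
# /etc/network/interfaces (excerpt)
auto bond0
iface bond0 inet static
    address 10.10.10.1
    netmask 255.255.255.0
    bond-slaves ib0 ib1
    bond-mode active-backup
    bond-miimon 100
    mtu 65520
    pre-up echo connected > /sys/class/net/ib0/mode
    pre-up echo connected > /sys/class/net/ib1/mode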
Here are the package versions on my test PVE nodes:
Code:
# pveversion -v
proxmox-ve: 5.1-41 (running kernel: 4.13.13-6-pve)
pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
pve-kernel-4.13.13-6-pve: 4.13.13-41
pve-kernel-4.13.13-5-pve: 4.13.13-38
ceph: 12.2.2-pve1
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-common-perl: 5.0-28
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-17
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-2
novnc-pve: 0.6-4
openvswitch-switch: 2.7.0-2
proxmox-widget-toolkit: 1.0-11
pve-cluster: 5.0-20
pve-container: 2.0-19
pve-docs: 5.1-16
pve-firewall: 3.0-5
pve-firmware: 2.0-3
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.9.1-9
pve-xtermjs: 1.0-2
qemu-server: 5.0-22
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
And here are my rados bench results.
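For reference, I ran the tests roughly like this (the pool name is a placeholder; 16 concurrent ops is the rados bench default):
Code:
# throughput tests (4MB blocks); write with --no-cleanup so the seq read test has objects to read
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
rados bench -p testpool 60 seq -t 16
rados -p testpool cleanup
# IOPS tests (4K blocks)
rados bench -p testpool 60 write -b 4K -t 16 --no-cleanup
rados bench -p testpool 60 seq -t 16
rados -p testpool cleanup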
Throughput test (4MB block size) - WRITES
Code:
Total time run: 60.041188
Total writes made: 14215
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 947.017
Stddev Bandwidth: 122.447
Max bandwidth (MB/sec): 1060
Min bandwidth (MB/sec): 368
Average IOPS: 236
Stddev IOPS: 30
Max IOPS: 265
Min IOPS: 92
Average Latency(s): 0.0675757
Stddev Latency(s): 0.0502804
Max latency(s): 0.966638
Min latency(s): 0.0166262
Throughput test (4MB block size) - READS
Code:
Total time run: 21.595730
Total reads made: 14215
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 2632.93
Average IOPS: 658
Stddev IOPS: 10
Max IOPS: 685
Min IOPS: 646
Average Latency(s): 0.0233569
Max latency(s): 0.158183
Min latency(s): 0.0123441
IOPS test (4K block size) - WRITES
Code:
Total time run: 60.002736
Total writes made: 315615
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 20.5469
Stddev Bandwidth: 0.847211
Max bandwidth (MB/sec): 23.3555
Min bandwidth (MB/sec): 16.7188
Average IOPS: 5260
Stddev IOPS: 216
Max IOPS: 5979
Min IOPS: 4280
Average Latency(s): 0.00304033
Stddev Latency(s): 0.000755765
Max latency(s): 0.0208767
Min latency(s): 0.00156849
IOPS test (4K block size) - READS
Code:
Total time run: 15.658241
Total reads made: 315615
Read size: 4096
Object size: 4096
Bandwidth (MB/sec): 78.7362
Average IOPS: 20156
Stddev IOPS: 223
Max IOPS: 20623
Min IOPS: 19686
Average Latency(s): 0.000779536
Max latency(s): 0.00826032
Min latency(s): 0.000374155
Any ideas why the IOPS performance is so low in the 4K bs tests compared to using the disks standalone without Ceph?
I understand that there will definitely be some slowdown due to the nature/overhead of any software-defined storage solution, but are there any suggestions to improve these results, since there are plenty of spare resources left unutilized?
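One thing I noticed while staring at the numbers: assuming the default 16 concurrent ops, the 4K write result looks almost exactly latency-bound, since 16 / 0.003s average latency ≈ 5300 IOPS, which matches the ~5260 IOPS I measured (the same math holds for the reads: 16 / 0.00078s ≈ 20500). So per-operation latency, rather than raw disk or network throughput, seems to be what caps these tests.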
Or to put it another way, how can I find the bottleneck in my tests (since both the network and the disks can handle far more than what I am currently getting)?
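I guess I could try to isolate the layers one by one, e.g. bench a single OSD directly (bypassing the client/network path) and run a single-threaded rados bench to see the pure per-op latency - something like this (pool name is again a placeholder):
Code:
# bench one OSD directly; 12288000 bytes is the default cap for a 4K block size
ceph tell osd.0 bench 12288000 4096
# single-threaded 4K writes to see per-op latency without parallelism
rados bench -p testpool 30 write -b 4K -t 1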
Thanks, and apologies for the long post!