Proxmox VE Ceph Benchmark 2018/02

flaf · May 26, 2018

Hi Alwin and thanks for your advice concerning the principle of a baseline comparison.

In fact, I don't know exactly the performance I can expect but I have made another fio benchs (I have read some docs about fio) and I have the feeling that my new benchs are not really good as regard my underlying hardware (see below).

So, I have tested a simple SSD Intel S3520 (a "OSD" disk) as you said with this fio command (the SSD was formated in EXT4):

Code:

fio --bs=4k --size=4G --numjobs=3 --time_based --runtime=40 --group_reporting --name myjob \
    --rw=randwrite --ioengine=libaio --iodepth=8 --direct=1 --fsync=1

and I obtain approximatively 6261 iops.

After that, I have tested exactly the same fio bench in a Debian Stretch VM (which uses the Ceph storage of course, with a VirtIO disk and 512MB for the RAM) and I obtain approximatively 362 iops.

1. It seems to me not very good as regard my hardware configuration (I have 24 SSD ODS!). Am I wrong?
2. Which another tests/benchs can I try?

Just for memory, here is my configuration and the complete output of the fio benchs in the attachment file promox-5.2-ceph-luminous-conf.txt.

Thx in advance for your help.

mada · May 27, 2018

Is that possible to doing drive recovery at different spirt private network? Instead of killing the performance while in production?

I’m testing ceph with intel 900p as journal will update update later with the testing.

Alwin · May 28, 2018

flaf said:
So, I have tested a simple SSD Intel S3520 (a "OSD" disk) as you said with this fio command (the SSD was formated in EXT4):

Code:

fio --bs=4k --size=4G --numjobs=3 --time_based --runtime=40 --group_reporting --name myjob \ --rw=randwrite --ioengine=libaio --iodepth=8 --direct=1 --fsync=1

and I obtain approximatively 6261 iops.

After that, I have tested exactly the same fio bench in a Debian Stretch VM (which uses the Ceph storage of course, with a VirtIO disk and 512MB for the RAM) and I obtain approximatively 362 iops.

@flaf, what are you trying to measure? The filesystem on the device introduces a layer in between that doesn't represent the capabilities of the underlying disk (especially that ext4 is doing badly with sync writes). Further, you should revise the fio options, eg. why to use randwrite or iodepth/numjobs and what to expect from it. See the fio tests in the benchmark pdf for comparison.

For your benchmarks, you need to look at the big picture and go from top-down or bottom-up through the whole stack. This means, that you test the hardware and software of your cluster. The different layers also work differently, eg. ceph uses 4 MB objects (left alone the network), while disks may use 512 KB (depends on the disks blocksize). So your tests of these layers need to be different too.

In this thread, you will find different types of hardware to compare your results against. Use the benchmark paper as guideline for a good comparison of the results.

flaf · May 28, 2018

Hi,

Thx for your answer @Alwin.

In fact, 1) I just would like to know the random write iops I could reach in the VMs of my Proxmox cluster and 2) I would like to understand why the option --fsync=1 lowers the perfs to this point.

I admit, I probably need to learn more about fio (and I/O on Linux). I try but it's difficult to find good documentation.

I am a little "annoyed" because I have a colleague who has made a vmware/vSan cluster exactly with the same hardware configuration. With vSan, the storage mechanism is different than Ceph (it's based on a kind of network RAID1 based on dedicated SSD cache disks, the 3710 SSDs) but the perfs are better in a VM (~1200 iops in an identical Debian VM with the same fio test).

I will dig...

Alwin · May 29, 2018

fsync makes sure the data is written to disk and doesn't land in any cache. It is to the underlying system to honor the fsync or ignore it. As the vSAN is using a dedicated cache disk and works differently then ceph, the starting points are not equal. Also you are using Intel S3520, they are slower then the S3710. To compete against your colleague, you may need to learn both storage technologies.

mada · Jun 19, 2018

Do some test

Dual E5-2660
75 GB RAM
SM863 OS Host
Dual port Mellanox 56Gb/s
3 x OSD 5TB Hard drive Per server 9 total OSD
1 x P3700 Journal per node 3 total

Code:

osd commit_latency(ms) apply_latency(ms)
  8                 65                65
  7                 74                74
  6                 52                52
  3                  0                 0
  5                214               214
  0                 70                70
  1                 85                85
  2                 76                76
  4                196               196

the test result with rados bench -p test 60 write --no-cleanup

Code:

Total time run:         60.319902
Total writes made:      3802
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     252.122
Stddev Bandwidth:       19.7516
Max bandwidth (MB/sec): 284
Min bandwidth (MB/sec): 212
Average IOPS:           63
Stddev IOPS:            4
Max IOPS:               71
Min IOPS:               53
Average Latency(s):     0.253834
Stddev Latency(s):      0.131711
Max latency(s):         1.10938
Min latency(s):         0.0352605

rados bench -p rbd -t 16 60 seq

Code:

Total time run:       14.515122
Total reads made:     3802
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1047.73
Average IOPS:         261
Stddev IOPS:          10
Max IOPS:             277
Min IOPS:             239
Average Latency(s):   0.0603855
Max latency(s):       0.374936
Min latency(s):       0.0161585

rados bench -p rbd -t 16 60 rand

Code:

Total time run:       60.076015
Total reads made:     19447
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1294.83
Average IOPS:         323
Stddev IOPS:          20
Max IOPS:             364
Min IOPS:             259
Average Latency(s):   0.0488371
Max latency(s):       0.468844
Min latency(s):       0.00179505

iperf -c

Code:

 iperf -c 10.1.1.17
------------------------------------------------------------
Client connecting to 10.1.1.17, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 10.1.1.16 port 54442 connected with 10.1.1.17 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  24.5 GBytes  21.1 Gbits/sec

Datanat · Jul 3, 2018

Hi guys,

Here's a short summary of some tests led in our lab.

Hyperconverged setup.

Server platform is 4x :

Lenovo SR650
2x Intel Silver 4114 10 cores
256 GB RAM @2666Mhz
1X Embedded 2x10Gbps Base-T LOM (x722 Intel) #CEPH
1X PCI-E 2x10Gbps Base-T adpater (x550 Intel) #VMBR0
For each server disk subsystem is :
2x32Gb NVMe on M2 RAID 1 Adapter for OS
3x 430-8i HBA's (8 HDD/HBA)
8x 480GB Micron 5100 Pro SSD
16x 2,4TB 10K RPM Seagate Exos

Switches are 2x Netgear M4300-24x10Gbps Base-T (4x10Gbps Stacking)

SETUP SPECIFICS

CEPH network is a 2x10Gbps LACP
CEPHX is disabled
DEBUG is disabled
Mon on nodes 1,2,3

SSD gathered in a pool as writeback cache for the SAS POOL.

No specific TCP Tuning, No JUMBO FRAMES

TEST PURPOSE : Test VM HDD subsystem itself with a quick and dirty vm (there are tons of them out there)?

Sample VM : Worst case scenario
8 vCPU
4096 MB RAM
Ubuntu Linux 16.04.03 updated
No LVM
HDD with NO CACHE, Size 200GB
All in one partition ext4

TEST ENGINE : fio

For baselining :

One VM issuing I/O without any other VMs running.

BLOC SIZE 4K

60K 4K Iops randread submillisecond
30K 4K Iops randwrite submillisecond

TROUGHPUT TEST 1M Blocksize write
Approximately 985MB/s steady on a 120GB test file

For reference :
4K writes are uncommon pattern, usually we see apps that writes way larger blocks, so we tested it with a IO size of 32K against a 64GB file :
30K iops rand read (10Gbps link is the bottleneck, LACP will not apply with just one vm issuing I/O on a single pipeline)
20K iops rand write for 620MB/s

12 CLONES TEST :
At this time LACP kicked in to break the 10Gbps single link speed

3x VM on each host issuing the same fio test

So with 12 VM's issuing I/O concurrently, and because of caching we have a small ramp time of less than a minute :

I/O are issued against a 64GB file on each VM

rapidly 300K randread IOPS, peaks at 466K, 344K AVG for 4K
instant 40K 4K randwrite IOPS, peaks at 70K, 66K AVG

Again we tested against 32K randwrites and had an average of 44K iops for 1,31 GB/s of troughput

We are currently gatering more data and tweaking the platform.

Regards,

Knuuut · Jul 5, 2018

Datanat said:
SSD gathered in a pool as writeback cache for the SAS POOL

How did you do that?

Confusing:

Datanat said:
8x 480GB Micron 5100 Pro SSD
16x 2,4TB 10K RPM Seagate Exos

I wonder if the Micron SSDs have been setup as seperate wal/db devices

for the Seagate OSDs.

- or -

The Micron SSDs have been setup as OSDs in a seperate Pool?

Regards

Datanat · Jul 5, 2018

I did it like this :

one pool with all the SSD one pool with all the HDD

Then assign the SSD pool named cache to the HDD pool named data

Code:

ceph osd tier add data cache

Assign cache policy :

Code:

ceph osd tier cache-mode cache writeback

To issue I/O from the SSD pool :

Code:

ceph osd tier set-overlay data cache

And set

Code:

ceph osd pool set cache hit_set_type bloom

The SSD are not set as wal/db devices for the spinning drives, maybe it would help, i should try. The HDD are hybrid so i don't know if it will help much. Anyway it is worth a try i guess.

Regards,

Alwin · Jul 5, 2018

@Datanat, do you have any rados bench results?

Datanat · Jul 5, 2018

@Alwin no i just tested the vm. Il will use the same commands you used in the official benchmark and post the results.

Datanat · Jul 6, 2018

So here are the results :

I/Os are issued on SSD pools :

Code:

rados bench -p cache 60 write -b 4M -t 16

 sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      16     16251     16235   1082.18      1084   0.0415562   0.0590952
Total time run:         60.048231
Total writes made:      16251
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1082.53
Stddev Bandwidth:       29.4887
Max bandwidth (MB/sec): 1156
Min bandwidth (MB/sec): 988
Average IOPS:           270
Stddev IOPS:            7
Max IOPS:               289
Min IOPS:               247
Average Latency(s):     0.0591144
Stddev Latency(s):      0.0183519
Max latency(s):         0.29716
Min latency(s):         0.02596
Cleaning up (deleting benchmark objects)
Removed 16251 objects
Clean up completed and total clean up time :2.970829

Now we issue I/O directly to the data pool

Code:

rados bench -p data  60 write -b 4M -t 16

As intended IOs are redirected to the writeback pool :

Code:

ceph osd pool stats
pool cache id 3
  client io 1050 MB/s wr, 0 op/s rd, 525 op/s wr
  cache tier io 262 op/s promote

pool data id 4
  nothing is going on

So same results here :

Code:

 2018-07-06 08:25:37.224862 min lat: 0.0261885 max lat: 0.293642 avg lat: 0.0585165
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      16     16414     16398   1093.04      1084   0.0575738   0.0585165
Total time run:         60.069431
Total writes made:      16415
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1093.07
Stddev Bandwidth:       28.9973
Max bandwidth (MB/sec): 1156
Min bandwidth (MB/sec): 1016
Average IOPS:           273
Stddev IOPS:            7
Max IOPS:               289
Min IOPS:               254
Average Latency(s):     0.0585415
Stddev Latency(s):      0.0181863
Max latency(s):         0.293642
Min latency(s):         0.0261885
Cleaning up (deleting benchmark objects)
Removed 16415 objects
Clean up completed and total clean up time :3.163250

It seems like we can only saturate one link wereas muliple vms issuing ios concurrently can break this barrier.

Any experience on that ?

Alwin · Jul 6, 2018

The caching pool has good performance. The real write throughput will show once there is I/O on the cluster. As with VMs most data seems to be hot data and never leaf the caching pool, hence not freeing up space for caching new I/O.

Datanat · Jul 6, 2018

Yes @Alwin you are right,

We will need to tweak this to get a more 'real life' scenario.

In fact it is written 'cache' in Ceph's documentation but it looks more like a tiering system.

By default it seems that there is no dirty object evictions until the cache pool is full. So eventually, with SSD fully filled, the performance dropdown would be absurd.

I will toy around with this : http://docs.ceph.com/docs/jewel/rados/operations/cache-tiering/

Best regards,

Alwin · Jul 6, 2018

See here, the text changed quiet through the releases. There are additional notes about RBD now.
http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/

Datanat · Jul 6, 2018

Thanks @Alwin.
I will read this carefully

Datanat · Jul 6, 2018

@Alwin here are the raw perfs of the 64xSAS pool

The real write throughput will show once there is I/O on the cluster

Destroyed the pools, created only an HDD pool

Issued a lot of write threads :

Code:

 rados bench -p testsas  180 write -b 4M -t 1024 --no-cleanup

Code:

2018-07-06 14:51:45.695910 min lat: 3.33414 max lat: 4.06629 avg lat: 3.67097
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
  180     928     50156     49228   1093.79      1488     3.33424     3.67097
Total time run:         180.169509
Total writes made:      50156
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1113.53
Stddev Bandwidth:       173.51
Max bandwidth (MB/sec): 1488
Min bandwidth (MB/sec): 0
Average IOPS:           278
Stddev IOPS:            43
Max IOPS:               372
Min IOPS:               0
Average Latency(s):     3.63559
Stddev Latency(s):      0.301269
Max latency(s):         4.06629
Min latency(s):         0.170923

Otherwise it is is near SSD perfs.

Code:

rados bench -p testsas  60 write -b 4M -t 16 --no-cleanup

Code:

2018-07-06 15:05:22.944918 min lat: 0.0238251 max lat: 0.305691 avg lat: 0.0673414
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      16     14261     14245   949.533       932   0.0568388   0.0673414
Total time run:         60.075153
Total writes made:      14261
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     949.544
Stddev Bandwidth:       28.6206
Max bandwidth (MB/sec): 1004
Min bandwidth (MB/sec): 876
Average IOPS:           237
Stddev IOPS:            7
Max IOPS:               251
Min IOPS:               219
Average Latency(s):     0.0673869
Stddev Latency(s):      0.0214755
Max latency(s):         0.305691
Min latency(s):         0.0238251

Anyway we put a few terabytes of data on the platform so we are still short stroking the disks and it will not be the same as many concurrent io patterns issued by vms.

I filled a HDD @80% to test the performance degradation.

With the hybrid system helping we see like 625 - 1800 randwrite iops per drive and with larger IOs we can peak @270MB/s per drive.

Filled at 80%, it provide only 144MB/s and a max of 320 iops and latency rises up a lot.

Regards,

Alwin · Jul 6, 2018

You may try to use the SSDs asl DB/WAL devices, so a multiple of HDDs can share a SSD. This will benefit the small writes, as they go to the SSD.

wimp9849 · Sep 5, 2018

Cha0s said:
Thanks for posting this benchmark!

While I appreciate the test results to have them as a reference point when implementing PVE with Ceph, I am not sure that are representative of what people actually need/care for in real-life workloads.

High throughput numbers are cool eye catchers when you want to boost your marketing material (over 9000Mbps! lol), but real performance is at IOPs and latency numbers IMHO.
Typical real life workloads rarely need 1GB/s write speeds. But high IOPs certainly make a difference - especially during night hours when multiple (guest) backups for multiple VMs run at the same time (no control over it as these are client VMs with no access to the guest OS).

I tried the rados bench tests on my lab by using a 4K block size to measure the IOPs performance and while reads reach up to 20k IOPs, writes can barely go over 5k IOPs.
Though I am not sure if both results are any good since each disk on its own can do well over 25k IOPs in writes and 70k IOPs in reads (just a little lower than the advertised specs for the disks used for testing) according to storagereview.com benchmarks (can't post a direct link due to being a new user).
Or according to your benchmark of your SM863 240GB disks they can do 17k write IOPs using fio with 4k bs.

My lab setup is a 3 node cluster consisting of 3x HP DL360p G8 with 4 SAMSUNG SM863 960GB each (1osd per physical drive) and Xeon E5-2640 with 32GB ECC RAM.

The HP SmartArray P420i onboard controllers are set to HBA mode so the disks are presented directly to PVE without any RAID handling/overhead.

The networking is based on Infiniband (40G) in 'connected' mode with 65520 bytes MTU with active/passive bonding, and I get a maximum of 23Gbit raw networking transfer speeds (iperf measured) between the 3 nodes with IPoIB, which is good enough for testing (or at least two times+ better than 10GbE).

Here are my test PVE nodes packages versions:

Code:

# pveversion -v proxmox-ve: 5.1-41 (running kernel: 4.13.13-6-pve) pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4) pve-kernel-4.13.13-6-pve: 4.13.13-41 pve-kernel-4.13.13-5-pve: 4.13.13-38 ceph: 12.2.2-pve1 corosync: 2.4.2-pve3 criu: 2.11.1-1~bpo90 glusterfs-client: 3.8.8-1 ksm-control-daemon: not correctly installed libjs-extjs: 6.0.1-2 libpve-access-control: 5.0-8 libpve-common-perl: 5.0-28 libpve-guest-common-perl: 2.0-14 libpve-http-server-perl: 2.0-8 libpve-storage-perl: 5.0-17 libqb0: 1.0.1-1 lvm2: 2.02.168-pve6 lxc-pve: 2.1.1-2 lxcfs: 2.0.8-2 novnc-pve: 0.6-4 openvswitch-switch: 2.7.0-2 proxmox-widget-toolkit: 1.0-11 pve-cluster: 5.0-20 pve-container: 2.0-19 pve-docs: 5.1-16 pve-firewall: 3.0-5 pve-firmware: 2.0-3 pve-ha-manager: 2.0-5 pve-i18n: 1.0-4 pve-libspice-server1: 0.12.8-3 pve-qemu-kvm: 2.9.1-9 pve-xtermjs: 1.0-2 qemu-server: 5.0-22 smartmontools: 6.5+svn4324-1 spiceterm: 3.0-5 vncterm: 1.5-3

And here are my rados bench results:

Throughput test (4MB block size) - WRITES

Code:

Total time run: 60.041188 Total writes made: 14215 Write size: 4194304 Object size: 4194304 Bandwidth (MB/sec): 947.017 Stddev Bandwidth: 122.447 Max bandwidth (MB/sec): 1060 Min bandwidth (MB/sec): 368 Average IOPS: 236 Stddev IOPS: 30 Max IOPS: 265 Min IOPS: 92 Average Latency(s): 0.0675757 Stddev Latency(s): 0.0502804 Max latency(s): 0.966638 Min latency(s): 0.0166262

Throughput test (4MB block size) - READS

Code:

Total time run: 21.595730 Total reads made: 14215 Read size: 4194304 Object size: 4194304 Bandwidth (MB/sec): 2632.93 Average IOPS: 658 Stddev IOPS: 10 Max IOPS: 685 Min IOPS: 646 Average Latency(s): 0.0233569 Max latency(s): 0.158183 Min latency(s): 0.0123441

IOPs test (4K block size) - WRITES

Code:

Total time run: 60.002736 Total writes made: 315615 Write size: 4096 Object size: 4096 Bandwidth (MB/sec): 20.5469 Stddev Bandwidth: 0.847211 Max bandwidth (MB/sec): 23.3555 Min bandwidth (MB/sec): 16.7188 Average IOPS: 5260 Stddev IOPS: 216 Max IOPS: 5979 Min IOPS: 4280 Average Latency(s): 0.00304033 Stddev Latency(s): 0.000755765 Max latency(s): 0.0208767 Min latency(s): 0.00156849

IOPs test (4K block size) - READS

Code:

Total time run: 15.658241 Total reads made: 315615 Read size: 4096 Object size: 4096 Bandwidth (MB/sec): 78.7362 Average IOPS: 20156 Stddev IOPS: 223 Max IOPS: 20623 Min IOPS: 19686 Average Latency(s): 0.000779536 Max latency(s): 0.00826032 Min latency(s): 0.000374155

Any ideas why the IOPs performance is so low on the 4K bs tests compared to using the disks standalone without Ceph?
I understand that there will definitely be a slowdown in performance due to the nature/overhead of any software defined storage solution, but are there any suggestions to make these results better since there are too much spare resources to be utilized?

Or to put it another way, how can I find what is the bottleneck in my tests (since the network and the disks can handle way more than what I am currently getting) ?

Thanks and apologies for the long post

If you out your p420 in hba, from what are you booting? Where did you install proxmox (p420 in hba cannot boot from It)

Cha0s · Sep 5, 2018

tuonoazzurro said:
If you out your p420 in hba, from what are you booting? Where did you install proxmox (p420 in hba cannot boot from It)

I installed Proxmox on the first drive and then manually installed the bootloader onto a USB stick.
Then I configured the server to boot from the USB stick since it cannot boot from any drive on the controller when in HBA mode.

This was a lab setup so I didn't bother with Software RAID1 for proxmox installation.

Proxmox VE Ceph Benchmark 2018/02

New Member

Attachments

Member

Proxmox Retired Staff

New Member

Proxmox Retired Staff

Member

New Member

Member

New Member

Proxmox Retired Staff

New Member

New Member

Proxmox Retired Staff

New Member

Proxmox Retired Staff

New Member

New Member

Proxmox Retired Staff

Well-Known Member

Well-Known Member

We value your privacy