Proxmox VE Ceph Benchmark 2018/02

Hi Alwin and thanks for your advice concerning the principle of a baseline comparison.

In fact, I don't know exactly what performance I can expect, but I have run some more fio benchmarks (I have read some docs about fio) and I have the feeling that my new benchmarks are not really good with regard to my underlying hardware (see below).

So, I have tested a single Intel S3520 SSD (an "OSD" disk) as you suggested, with this fio command (the SSD was formatted with ext4):

Code:
fio --bs=4k --size=4G --numjobs=3 --time_based --runtime=40 --group_reporting --name myjob \
    --rw=randwrite --ioengine=libaio --iodepth=8 --direct=1 --fsync=1

and I obtain approximately 6261 IOPS.

After that, I ran exactly the same fio benchmark in a Debian Stretch VM (which uses the Ceph storage of course, with a VirtIO disk and 512 MB of RAM) and I obtain approximately 362 IOPS.

1. This seems not very good to me with regard to my hardware configuration (I have 24 SSD OSDs!). Am I wrong?
2. Which other tests/benchmarks can I try?

For reference, my configuration and the complete output of the fio benchmarks are in the attached file promox-5.2-ceph-luminous-conf.txt.


Thx in advance for your help.
 


Is it possible to do drive recovery over a different, separate private network, instead of killing the performance while in production?

I'm testing Ceph with an Intel 900P as journal and will update later with the results.
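For reference, Ceph can keep replication/recovery traffic off the client-facing network by defining a separate cluster network in ceph.conf, roughly like this (the subnets are placeholders, not taken from this thread):

Code:
[global]
    # client / MON traffic
    public network  = 10.1.1.0/24
    # OSD replication, recovery and heartbeat traffic
    cluster network = 10.2.2.0/24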
 
@flaf, what are you trying to measure? The filesystem on the device introduces a layer in between that doesn't represent the capabilities of the underlying disk (especially since ext4 does badly with sync writes). Further, you should revisit the fio options, e.g. why use randwrite or that iodepth/numjobs, and what to expect from them. See the fio tests in the benchmark PDF for comparison.

For your benchmarks, you need to look at the big picture and go top-down or bottom-up through the whole stack. This means that you test both the hardware and the software of your cluster. The different layers also work differently, e.g. Ceph uses 4 MB objects (leaving the network aside), while disks may use 512 B or 4 KB sectors (depending on the disk's block size). So your tests of these layers need to be different too.

In this thread, you will find different types of hardware to compare your results against. Use the benchmark paper as a guideline for a good comparison of the results.
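For a disk baseline closer to the one in the benchmark paper, a single-threaded sync-write test against the raw device can be used. The parameters below are only a sketch from memory, so check the paper for the exact command, and note that writing to the raw device destroys its data:

Code:
fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting --name=journal-test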
 
Hi,

Thx for your answer @Alwin.

In fact, 1) I would just like to know the random write IOPS I can reach in the VMs of my Proxmox cluster, and 2) I would like to understand why the --fsync=1 option lowers performance to this point.

I admit I probably need to learn more about fio (and about I/O on Linux). I'm trying, but it's difficult to find good documentation. ;)

I am a little "annoyed" because a colleague of mine has built a VMware/vSAN cluster with exactly the same hardware configuration. With vSAN the storage mechanism is different from Ceph (it's a kind of network RAID 1 built on dedicated SSD cache disks, the S3710 SSDs), but the performance is better in a VM (~1200 IOPS in an identical Debian VM with the same fio test).

I will dig... ;)
 
fsync makes sure the data is written to disk and doesn't land in any cache. It is up to the underlying system to honor the fsync or ignore it. As vSAN uses a dedicated cache disk and works differently than Ceph, the starting points are not equal. Also, you are using Intel S3520s, which are slower than the S3710. To compete against your colleague, you may need to learn both storage technologies. ;)
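To see how much fsync costs on a given setup, the same fio job can be run once with and once without it; the parameters below are only an illustration, the absolute numbers depend entirely on the hardware:

Code:
# every write is followed by an fsync -> measures durable write performance
fio --name=durable --filename=test.file --size=1G --bs=4k --rw=randwrite \
    --ioengine=libaio --iodepth=1 --direct=1 --fsync=1 --runtime=30 --time_based

# same job without fsync -> writes may be absorbed by caches along the path
fio --name=cached --filename=test.file --size=1G --bs=4k --rw=randwrite \
    --ioengine=libaio --iodepth=1 --direct=1 --runtime=30 --time_based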
 
Ran some tests.

Dual E5-2660
75 GB RAM
SM863 host OS drive
Dual-port Mellanox 56Gb/s
3x 5TB hard drive OSDs per server, 9 OSDs total
1x P3700 journal per node, 3 total

Code:
osd commit_latency(ms) apply_latency(ms)
  8                 65                65
  7                 74                74
  6                 52                52
  3                  0                 0
  5                214               214
  0                 70                70
  1                 85                85
  2                 76                76
  4                196               196


The test result with rados bench -p test 60 write --no-cleanup:

Code:
Total time run:         60.319902
Total writes made:      3802
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     252.122
Stddev Bandwidth:       19.7516
Max bandwidth (MB/sec): 284
Min bandwidth (MB/sec): 212
Average IOPS:           63
Stddev IOPS:            4
Max IOPS:               71
Min IOPS:               53
Average Latency(s):     0.253834
Stddev Latency(s):      0.131711
Max latency(s):         1.10938
Min latency(s):         0.0352605

rados bench -p rbd -t 16 60 seq

Code:
Total time run:       14.515122
Total reads made:     3802
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1047.73
Average IOPS:         261
Stddev IOPS:          10
Max IOPS:             277
Min IOPS:             239
Average Latency(s):   0.0603855
Max latency(s):       0.374936
Min latency(s):       0.0161585

rados bench -p rbd -t 16 60 rand

Code:
Total time run:       60.076015
Total reads made:     19447
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1294.83
Average IOPS:         323
Stddev IOPS:          20
Max IOPS:             364
Min IOPS:             259
Average Latency(s):   0.0488371
Max latency(s):       0.468844
Min latency(s):       0.00179505


iperf -c

Code:
 iperf -c 10.1.1.17
------------------------------------------------------------
Client connecting to 10.1.1.17, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 10.1.1.16 port 54442 connected with 10.1.1.17 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  24.5 GBytes  21.1 Gbits/sec
 
Hi guys,

Here's a short summary of some tests run in our lab.

Hyperconverged setup.

Server platform is 4x :

Lenovo SR650
2x Intel Silver 4114 10 cores
256 GB RAM @2666Mhz
1x embedded 2x10Gbps Base-T LOM (Intel X722) #CEPH
1x PCI-E 2x10Gbps Base-T adapter (Intel X550) #VMBR0
For each server, the disk subsystem is:
2x 32GB NVMe on an M.2 RAID 1 adapter for the OS
3x 430-8i HBAs (8 HDDs per HBA)
8x 480GB Micron 5100 Pro SSD
16x 2.4TB 10K RPM Seagate Exos

Switches are 2x Netgear M4300-24x10Gbps Base-T (4x10Gbps Stacking)

SETUP SPECIFICS

CEPH network is a 2x10Gbps LACP
CEPHX is disabled
DEBUG is disabled (see the ceph.conf sketch below)
Mon on nodes 1,2,3

SSDs gathered in a pool used as a writeback cache for the SAS pool.

No specific TCP Tuning, No JUMBO FRAMES
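Disabling cephx and debug logging is usually done in ceph.conf along these lines (a sketch, not necessarily the exact settings used here):

Code:
[global]
    auth_cluster_required = none
    auth_service_required = none
    auth_client_required  = none
    # silence most debug logging
    debug_ms   = 0/0
    debug_osd  = 0/0
    debug_auth = 0/0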


TEST PURPOSE: test the VM disk subsystem itself with a quick and dirty VM (there are tons of them out there).

Sample VM : Worst case scenario
8 vCPU
4096 MB RAM
Ubuntu Linux 16.04.03 updated
No LVM
HDD with NO CACHE, Size 200GB
All in one partition ext4


TEST ENGINE : fio

For baselining :

One VM issuing I/O without any other VMs running.

BLOCK SIZE 4K

60K 4K IOPS randread, sub-millisecond latency
30K 4K IOPS randwrite, sub-millisecond latency

THROUGHPUT TEST, 1M block size writes
Approximately 985MB/s steady on a 120GB test file


For reference:
4K writes are an uncommon pattern; usually we see apps that write much larger blocks, so we also tested with an I/O size of 32K against a 64GB file:
30K IOPS random read (the 10Gbps link is the bottleneck; LACP does not help with just one VM issuing I/O over a single connection)
20K IOPS random write, for 620MB/s
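The exact fio jobs are not included here; a plausible reconstruction of the in-VM random-write test would be something like this (the iodepth and file path are assumptions):

Code:
fio --name=vm-randwrite --filename=/root/fio.file --size=64G \
    --bs=4k --rw=randwrite --ioengine=libaio --iodepth=32 \
    --direct=1 --runtime=60 --time_based --group_reporting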



12 CLONES TEST :
At this point LACP kicked in to break the single 10Gbps link speed.

3x VM on each host issuing the same fio test

So with 12 VMs issuing I/O concurrently, and because of caching, we have a small ramp-up time of less than a minute:

I/O are issued against a 64GB file on each VM

rapidly 300K 4K randread IOPS, peaking at 466K, 344K average
instantly 40K 4K randwrite IOPS, peaking at 70K, 66K average

Again we tested 32K randwrites and had an average of 44K IOPS, for 1.31 GB/s of throughput.

We are currently gathering more data and tweaking the platform.

Regards,
 
I did it like this:

one pool with all the SSDs, one pool with all the HDDs

Then I assigned the SSD pool, named "cache", as a tier of the HDD pool, named "data":

Code:
ceph osd tier add data cache

Assign the cache policy:

Code:
ceph osd tier cache-mode cache writeback

To have client I/O served from the SSD pool, set it as the overlay:

Code:
ceph osd tier set-overlay data cache

And set the hit set type:

Code:
ceph osd pool set cache hit_set_type bloom

The SSDs are not set as WAL/DB devices for the spinning drives; maybe that would help, I should try. The HDDs are hybrid, so I don't know if it would help much. Anyway, it is worth a try I guess.

Regards,
 
@Datanat, do you have any rados bench results?
 
So here are the results:

I/O is issued on the SSD pool:

Code:
rados bench -p cache 60 write -b 4M -t 16

 sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      16     16251     16235   1082.18      1084   0.0415562   0.0590952
Total time run:         60.048231
Total writes made:      16251
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1082.53
Stddev Bandwidth:       29.4887
Max bandwidth (MB/sec): 1156
Min bandwidth (MB/sec): 988
Average IOPS:           270
Stddev IOPS:            7
Max IOPS:               289
Min IOPS:               247
Average Latency(s):     0.0591144
Stddev Latency(s):      0.0183519
Max latency(s):         0.29716
Min latency(s):         0.02596
Cleaning up (deleting benchmark objects)
Removed 16251 objects
Clean up completed and total clean up time :2.970829

Now we issue I/O directly to the data pool

Code:
rados bench -p data  60 write -b 4M -t 16

As intended, I/O is redirected to the writeback pool:

Code:
ceph osd pool stats
pool cache id 3
  client io 1050 MB/s wr, 0 op/s rd, 525 op/s wr
  cache tier io 262 op/s promote

pool data id 4
  nothing is going on

So same results here :


Code:
 2018-07-06 08:25:37.224862 min lat: 0.0261885 max lat: 0.293642 avg lat: 0.0585165
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      16     16414     16398   1093.04      1084   0.0575738   0.0585165
Total time run:         60.069431
Total writes made:      16415
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1093.07
Stddev Bandwidth:       28.9973
Max bandwidth (MB/sec): 1156
Min bandwidth (MB/sec): 1016
Average IOPS:           273
Stddev IOPS:            7
Max IOPS:               289
Min IOPS:               254
Average Latency(s):     0.0585415
Stddev Latency(s):      0.0181863
Max latency(s):         0.293642
Min latency(s):         0.0261885
Cleaning up (deleting benchmark objects)
Removed 16415 objects
Clean up completed and total clean up time :3.163250


It seems like we can only saturate one link, whereas multiple VMs issuing I/O concurrently can break through this barrier.

Any experience with that?
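(For context, with LACP a single TCP connection only ever uses one slave link; a Linux bond configured roughly like this at least lets different connections hash onto different links. This is a sketch, not necessarily the exact config used here.)

Code:
auto bond0
iface bond0 inet static
    address 10.1.1.16
    netmask 255.255.255.0
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-miimon 100
    # hash on IP+port so separate connections can use separate links
    bond-xmit-hash-policy layer3+4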
 
The caching pool has good performance. The real write throughput will show once there is I/O on the cluster. As with VMs, most data seems to be hot data and never leaves the caching pool, hence not freeing up space for caching new I/O.
 
Yes @Alwin you are right,

We will need to tweak this to get a more 'real life' scenario.

In fact it is called a 'cache' in Ceph's documentation, but it looks more like a tiering system.

By default it seems that there is no dirty object eviction until the cache pool is full. So eventually, with the SSDs completely filled, the performance drop would be dramatic.

I will toy around with this : http://docs.ceph.com/docs/jewel/rados/operations/cache-tiering/
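The knobs on that page that control flushing and eviction look roughly like this (the values are placeholders for illustration, not recommendations):

Code:
# cap the cache pool so eviction starts before the SSDs are full
ceph osd pool set cache target_max_bytes 1099511627776
# start flushing dirty objects at 40% of the target, evict at 80% full
ceph osd pool set cache cache_target_dirty_ratio 0.4
ceph osd pool set cache cache_target_full_ratio 0.8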


Best regards,
 
@Alwin, here are the raw performance figures of the 64x SAS pool.

The real write throughput will show once there is I/O on the cluster

Destroyed the pools, created only an HDD pool

Issued a lot of write threads :

Code:
 rados bench -p testsas  180 write -b 4M -t 1024 --no-cleanup

Code:
2018-07-06 14:51:45.695910 min lat: 3.33414 max lat: 4.06629 avg lat: 3.67097
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
  180     928     50156     49228   1093.79      1488     3.33424     3.67097
Total time run:         180.169509
Total writes made:      50156
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1113.53
Stddev Bandwidth:       173.51
Max bandwidth (MB/sec): 1488
Min bandwidth (MB/sec): 0
Average IOPS:           278
Stddev IOPS:            43
Max IOPS:               372
Min IOPS:               0
Average Latency(s):     3.63559
Stddev Latency(s):      0.301269
Max latency(s):         4.06629
Min latency(s):         0.170923


Otherwise it is near SSD performance.

Code:
rados bench -p testsas  60 write -b 4M -t 16 --no-cleanup

Code:
2018-07-06 15:05:22.944918 min lat: 0.0238251 max lat: 0.305691 avg lat: 0.0673414
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   60      16     14261     14245   949.533       932   0.0568388   0.0673414
Total time run:         60.075153
Total writes made:      14261
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     949.544
Stddev Bandwidth:       28.6206
Max bandwidth (MB/sec): 1004
Min bandwidth (MB/sec): 876
Average IOPS:           237
Stddev IOPS:            7
Max IOPS:               251
Min IOPS:               219
Average Latency(s):     0.0673869
Stddev Latency(s):      0.0214755
Max latency(s):         0.305691
Min latency(s):         0.0238251



Anyway, we only put a few terabytes of data on the platform, so we are still short-stroking the disks, and it will not be the same as the many concurrent I/O patterns issued by VMs.

I filled an HDD to 80% to test the performance degradation.

With the hybrid system helping, we see roughly 625-1800 randwrite IOPS per drive, and with larger I/Os we can peak at 270MB/s per drive.

Filled at 80%, it provides only 144MB/s and a maximum of 320 IOPS, and latency rises a lot.

Regards,
 
You may try to use the SSDs as DB/WAL devices, so that multiple HDDs can share one SSD. This will benefit the small writes, as they go to the SSD.
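With BlueStore on Luminous that is done per OSD, for example with ceph-volume, roughly like this (device names are placeholders; the WAL lives on the DB device unless specified separately):

Code:
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1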
 
Thanks for posting this benchmark!

While I appreciate having the test results as a reference point when implementing PVE with Ceph, I am not sure they are representative of what people actually need/care about in real-life workloads.

High throughput numbers are cool eye catchers when you want to boost your marketing material (over 9000Mbps! lol), but real performance is about IOPS and latency numbers, IMHO.
Typical real-life workloads rarely need 1GB/s write speeds. But high IOPS certainly make a difference - especially during night hours when multiple (guest) backups for multiple VMs run at the same time (no control over it, as these are client VMs with no access to the guest OS).

I tried the rados bench tests in my lab using a 4K block size to measure IOPS performance, and while reads reach up to 20k IOPS, writes can barely go over 5k IOPS.
Though I am not sure whether either result is any good, since each disk on its own can do well over 25k IOPS in writes and 70k IOPS in reads (just a little lower than the advertised specs for the disks used for testing) according to storagereview.com benchmarks (can't post a direct link due to being a new user).
Or, according to your benchmark of the SM863 240GB disks, they can do 17k write IOPS using fio with 4k bs.

My lab setup is a 3-node cluster consisting of 3x HP DL360p G8, each with 4x SAMSUNG SM863 960GB (1 OSD per physical drive), a Xeon E5-2640, and 32GB ECC RAM.

The HP SmartArray P420i onboard controllers are set to HBA mode so the disks are presented directly to PVE without any RAID handling/overhead.

The networking is based on InfiniBand (40G) in 'connected' mode with a 65520-byte MTU and active/passive bonding, and I get a maximum of 23Gbit/s raw network transfer speed between the 3 nodes with IPoIB (measured with iperf), which is good enough for testing (or at least twice as fast as 10GbE).
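For reference, connected mode and the large MTU are set per IPoIB interface, roughly like this on Debian/Proxmox (a sketch; interface names are placeholders):

Code:
auto ib0
iface ib0 inet manual
    pre-up echo connected > /sys/class/net/ib0/mode
    pre-up ip link set ib0 mtu 65520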

Here are my test PVE nodes packages versions:
Code:
# pveversion -v
proxmox-ve: 5.1-41 (running kernel: 4.13.13-6-pve)
pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
pve-kernel-4.13.13-6-pve: 4.13.13-41
pve-kernel-4.13.13-5-pve: 4.13.13-38
ceph: 12.2.2-pve1
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: not correctly installed
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-common-perl: 5.0-28
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-17
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-2
novnc-pve: 0.6-4
openvswitch-switch: 2.7.0-2
proxmox-widget-toolkit: 1.0-11
pve-cluster: 5.0-20
pve-container: 2.0-19
pve-docs: 5.1-16
pve-firewall: 3.0-5
pve-firmware: 2.0-3
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.9.1-9
pve-xtermjs: 1.0-2
qemu-server: 5.0-22
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3

And here are my rados bench results:

Throughput test (4MB block size) - WRITES
Code:
Total time run:         60.041188
Total writes made:      14215
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     947.017
Stddev Bandwidth:       122.447
Max bandwidth (MB/sec): 1060
Min bandwidth (MB/sec): 368
Average IOPS:           236
Stddev IOPS:            30
Max IOPS:               265
Min IOPS:               92
Average Latency(s):     0.0675757
Stddev Latency(s):      0.0502804
Max latency(s):         0.966638
Min latency(s):         0.0166262

Throughput test (4MB block size) - READS
Code:
Total time run:       21.595730
Total reads made:     14215
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2632.93
Average IOPS:         658
Stddev IOPS:          10
Max IOPS:             685
Min IOPS:             646
Average Latency(s):   0.0233569
Max latency(s):       0.158183
Min latency(s):       0.0123441

IOPs test (4K block size) - WRITES
Code:
Total time run:         60.002736
Total writes made:      315615
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     20.5469
Stddev Bandwidth:       0.847211
Max bandwidth (MB/sec): 23.3555
Min bandwidth (MB/sec): 16.7188
Average IOPS:           5260
Stddev IOPS:            216
Max IOPS:               5979
Min IOPS:               4280
Average Latency(s):     0.00304033
Stddev Latency(s):      0.000755765
Max latency(s):         0.0208767
Min latency(s):         0.00156849

IOPs test (4K block size) - READS
Code:
Total time run:       15.658241
Total reads made:     315615
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   78.7362
Average IOPS:         20156
Stddev IOPS:          223
Max IOPS:             20623
Min IOPS:             19686
Average Latency(s):   0.000779536
Max latency(s):       0.00826032
Min latency(s):       0.000374155

Any ideas why the IOPS performance is so low in the 4K bs tests compared to using the disks standalone, without Ceph?
I understand that there will definitely be a slowdown due to the nature/overhead of any software-defined storage solution, but are there any suggestions to improve these results, since there are plenty of spare resources left to be utilized?

Or, to put it another way, how can I find out what the bottleneck is in my tests (since the network and the disks can handle way more than what I am currently getting)?
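A couple of commands that can help isolate individual layers (the pool/image names below are placeholders, and the RBD image must already exist):

Code:
# raw write speed of a single OSD daemon (writes 1 GB in 4 MB chunks by default)
ceph tell osd.0 bench

# fio directly against an RBD image, bypassing the VM and filesystem layers
fio --name=rbd-4k --ioengine=rbd --clientname=admin --pool=rbd --rbdname=fio-test \
    --bs=4k --rw=randwrite --iodepth=32 --runtime=60 --time_based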

Thanks and apologies for the long post :)


If you put your P420 in HBA mode, what are you booting from? Where did you install Proxmox? (The P420 in HBA mode cannot boot from it.)
 
I installed Proxmox on the first drive and then manually installed the bootloader onto a USB stick.
Then I configured the server to boot from the USB stick, since it cannot boot from any drive on the controller while in HBA mode.

This was a lab setup, so I didn't bother with software RAID 1 for the Proxmox installation.
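For anyone wanting to reproduce this, it boils down to installing GRUB onto the stick while the generated config keeps pointing at /boot on the first drive (the device name is a placeholder, double-check it before running):

Code:
grub-install /dev/sdX   # the USB stick
update-grub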
 
