Proxmox VE Ceph Benchmark 2018/02

I just ran a comparison with the benchmark running on just 1 node, and then the benchmark running on all 4 nodes to simulate heavy workloads across the entire cluster. Not only did the average IOPS drop as you'd expect, but the average latency jumped due to queueing.
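For reference, the benchmark here is presumably the same rados write bench shown in the follow-up post further down, run once from a single node and then from all four nodes in parallel - something along these lines (the pool name is a placeholder):

Code:
rados -p <pool> bench 60 write -b 4M -t 16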

Code:
1 x bench over 10GbE

Max bandwidth (MB/sec): 1476
Min bandwidth (MB/sec): 1280
Average IOPS:           344
Stddev IOPS:            9.61719
Max IOPS:               369
Min IOPS:               320
Average Latency(s):     0.0463861

Code:
4 x bench over 10GbE

Max bandwidth (MB/sec): 1412
Min bandwidth (MB/sec): 412
Average IOPS:           132
Stddev IOPS:            38.3574
Max IOPS:               353
Min IOPS:               103
Average Latency(s):     0.120387

I hope that using 40G will provide each node with enough bandwidth so that we don't see contention on the fabric. That should let each node run heavily loaded without degrading storage performance. I should have the hardware tomorrow, so I'll post the results early next week.

David
...
 
I thought I'd share some further results as I think they're interesting and they may be of use to someone else. These are the results of the benchmark running over a 40GbE switched network (OM3 fibre). This is the same equipment as my post on 15 Oct with the network moved from 10GbE to 40GbE so you can see the direct comparison (4 node cluster, 4 x 2TB Intel P4510 NVMe drives per node).

We've added 2 x 40GbE ports per server and we're running them in an active/standby bond for the public network. The cluster network is still running over 10GbE. During the write tests the cluster network ran at between 5 and 8 Gbps, so it shouldn't be impacting performance.

There is an improvement in both sets of numbers, but as Alwin mentioned, moving from 10 to 40 doesn't decrease latency so you don't see anything like 4 times the performance. Below are the numbers from running a single instance of the benchmark.

Code:
# rados -p ceph_1 bench 60 write -b 4M -t 16

Total time run:         60.0269
Total writes made:      24970
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1663.92
Stddev Bandwidth:       49.3114
Max bandwidth (MB/sec): 1752
Min bandwidth (MB/sec): 1496
Average IOPS:           415
Stddev IOPS:            12.3279
Max IOPS:               438
Min IOPS:               374
Average Latency(s):     0.0384613
Stddev Latency(s):      0.0142218
Max latency(s):         0.248035
Min latency(s):         0.0150251


# rados -p ceph_1 bench 180 rand -t 16

Total time run:       180.049
Total reads made:     93945
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   2087.1
Average IOPS:         521
Stddev IOPS:          15.8213
Max IOPS:             554
Min IOPS:             445
Average Latency(s):   0.0299043
Max latency(s):       0.272099
Min latency(s):       0.00322598


I was already pretty happy with the 10GbE numbers from a performance perspective, but running multiple instances of the benchmark at the same time (to simulate lots of VMs generating load in parallel) reduced the numbers dramatically. A single benchmark instance using NVMe drives can saturate the 10GbE link, so any additional load just decreased overall performance.

Here are the results from running 3 random read benchmarks at the same time over the 40GbE fabric. The numbers are only about 15% lower than running just 1 benchmark process, so we see roughly 3 times the data volume at about the same per-instance rate. The network port sat at around 35 Gbps during the test. That's what I was hoping to prove with this testing: moving to 40GbE won't increase raw performance, but it'll let you run a lot more load before you see any significant degradation in performance.
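The parallel runs can be started with something as simple as the following (a sketch, using the same pool and bench parameters as above):

Code:
# start three read benches in parallel on one node; output goes to bench_1..3.log
for i in 1 2 3; do
    rados -p ceph_1 bench 180 rand -t 16 > bench_$i.log &
done
wait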

Code:
# rados -p ceph_1 bench 180 rand -t 16

Total time run:       180.053
Total reads made:     78676
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1747.84
Average IOPS:         436
Stddev IOPS:          17.8279
Max IOPS:             470
Min IOPS:             379
Average Latency(s):   0.0356483
Max latency(s):       0.67529
Min latency(s):       0.00471686


And below is a graph of the switch port utilisation. The peaks from left to right are from 1 x write bench, 2 x write bench, 1 x read bench, 2 x read bench, and finally 3 x read bench running simultaneously on the node.

[Attachment: Screen Shot 2019-10-23 at 1.23.42 pm.png - switch port utilisation graph]



Thanks

David
...
 
as Alwin mentioned, moving from 10 to 40 doesn't decrease latency
Yes, it does. I was talking about going from 25 to 40 GbE; there you most likely won't see a latency change.

Please also keep in mind that the cluster network carries the replication traffic, so it can become the limiting factor for Ceph.
Put the cluster network onto the 40 GbE links as well; this will lower latency further and increase throughput. To get the best out of the network, you can use the standby link of the bond for this. Since you can create a bond on top of VLAN interfaces, you can use both links actively. In a disaster case where one link is dead, all traffic would end up on the same link but everything would still work. Both switches would need to trunk all VLANs between each other.

As an example:
Code:
eth0.10 --> bond0 (primary) --> Ceph public
eth1.10 --> bond0

eth0.20 --> bond1
eth1.20 --> bond1 (primary) --> Ceph cluster
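A minimal sketch of how this could look in /etc/network/interfaces on a node (the interface names, VLAN IDs and addresses are assumptions following the example above; the vlan and ifenslave packages are required):

Code:
auto eth0.10
iface eth0.10 inet manual

auto eth1.10
iface eth1.10 inet manual

# Ceph public network, primary link on eth0
auto bond0
iface bond0 inet static
    address 10.10.10.11/24
    bond-slaves eth0.10 eth1.10
    bond-mode active-backup
    bond-primary eth0.10
    bond-miimon 100

auto eth0.20
iface eth0.20 inet manual

auto eth1.20
iface eth1.20 inet manual

# Ceph cluster network, primary link on eth1
auto bond1
iface bond1 inet static
    address 10.10.20.11/24
    bond-slaves eth0.20 eth1.20
    bond-mode active-backup
    bond-primary eth1.20
    bond-miimon 100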
 
Just stopping in to share something that might be useful.

I rolled the dice on 16 x 2TB Kingston DC500 SSDs for our new cluster at work. They claim to have full data path protection with capacitors on the board, and they were the lowest-priced drives in this class. No expectation of high-end performance here (not needed for this application); I'm just hoping they'll be able to handle being their own db/journal device in a Ceph environment and give decent performance.

I disabled the write cache and did what I believe is a 4K write sync performance test here:

----------------------

journal-test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=136MiB/s][w=34.7k IOPS][eta 00m:00s]
journal-test: (groupid=0, jobs=1): err= 0: pid=27035: Mon Feb 17 16:22:56 2020
write: IOPS=34.7k, BW=135MiB/s (142MB/s)(8128MiB/60001msec); 0 zone resets
clat (usec): min=27, max=420, avg=28.44, stdev= 1.94
lat (usec): min=27, max=420, avg=28.50, stdev= 1.94
clat percentiles (nsec):
| 1.00th=[27776], 5.00th=[27776], 10.00th=[28032], 20.00th=[28032],
| 30.00th=[28032], 40.00th=[28032], 50.00th=[28288], 60.00th=[28288],
| 70.00th=[28288], 80.00th=[28544], 90.00th=[28800], 95.00th=[30080],
| 99.00th=[31104], 99.50th=[36608], 99.90th=[41216], 99.95th=[79360],
| 99.99th=[97792]
bw ( KiB/s): min=134104, max=139896, per=100.00%, avg=138709.74, stdev=744.84, samples=119
iops : min=33526, max=34974, avg=34677.46, stdev=186.27, samples=119
lat (usec) : 50=99.92%, 100=0.08%, 250=0.01%, 500=0.01%
cpu : usr=2.26%, sys=9.30%, ctx=2080674, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,2080669,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=135MiB/s (142MB/s), 135MiB/s-135MiB/s (142MB/s-142MB/s), io=8128MiB (8522MB), run=60001-60001msec

Disk stats (read/write):
sdb: ios=58/2076904, merge=0/0, ticks=11/54915, in_queue=0, util=99.91%

------------------------
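For reference, the exact fio invocation isn't quoted above; a 4K sync write test like this is typically run along these lines (a sketch only - the device path is taken from the disk stats above, and the hdparm call for toggling the write cache is an assumption):

Code:
# disable the drive's volatile write cache (re-enable with -W 1)
hdparm -W 0 /dev/sdb

# single-job 4K synchronous write test, 60 seconds
fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test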

If I understand the results correctly, that looks like 142 MB/s and ~35K IOPS.

I re-enabled the write cache and ran the test again... performance actually dropped to ~57 MB/s. Does that make sense?

Either way, it's an order of magnitude better than my homelab cluster full of consumer drives (kilobytes per second on sync write performance). I think these drives may prove to be decent value after all (they were ~$300 each).
 
Just wondering if anyone is experiencing this issue on their all-flash Ceph pools, where Windows VMs are slow during copy operations.

Hi,

We run a three-node PVE/Ceph cluster with 10x Samsung SM883 1.92 TB per node (very "standard" - replication of 3, one large Ceph pool connected via RBD, dedicated 10 Gbit/s mesh network for Ceph). Inside there are a couple of Windows Server 2012 R2 VMs. What kind of performance test would you like to see? :)

Greets
Stephan
 

Hi Stephan.
Is this a hyper-converged setup, or just Ceph managed through Proxmox? The list I linked shows that two Ceph users experience a transfer rate drop while copying 2 GB+ files, around the 1-1.5 GB mark (a significant drop to 25 MB/s), which only happens on Windows guests and not Linux. Would you be able to try to reproduce it on your setup?

What are your CrystalDiskMark numbers in those Windows-based KVM guests? Would you share your OSD node specs? Did you do any specific kernel/Ceph/drive tuning before putting that setup into production?

We're considering a similar compact setup to yours (5-node mesh, but fewer OSDs, 3-4) and were thinking about the Micron 5300 series. How do you like the SM883s (SATA)?

Thanks
 
Hi,

I would recommend trying to update to Ceph Octopus and enabling writeback; performance has improved greatly.
(Before Octopus, writeback slowed down reads.)


Code:
Here are some IOPS results with 1 VM, 1 disk, 4k blocks, iodepth=64, librbd, no iothread.



                        nautilus-cache=none     nautilus-cache=writeback          octopus-cache=none     octopus-cache=writeback
          
randread 4k                  62.1k                     25.2k                            61.1k                     60.8k
randwrite 4k                 27.7k                     19.5k                            34.5k                     53.0k
seqwrite 4k                  7850                      37.5k                            24.9k                     82.6k


Some Windows benchmarks with Octopus, with and without writeback (10 OSDs with Intel S3610, replication x3).
 

Attachments

  • ceph-cache-writeback.JPG
    ceph-cache-writeback.JPG
    62.9 KB · Views: 47
  • ceph-cache-none.JPG
    ceph-cache-none.JPG
    65.6 KB · Views: 45
Is this a hyper-converged setup?
Yes, it is.

The list I linked shows that two Ceph users experience a transfer rate drop while copying 2 GB+ files, around the 1-1.5 GB mark (a significant drop to 25 MB/s), which only happens on Windows guests and not Linux. Would you be able to try to reproduce it on your setup?

I just copied a single file of about 6 GB where source and destination
a) are on the same vdisk
b) are on different vdisks (on the same VM)

My results: no I/O drops, "constant" rates (as constant as a Windows progress bar can be :D) with a) about 200 MB/s and b) about 350 MB/s.

What are your CrystalDiskMark numbers in those Windows-based KVM guests?
[Attachment: win2012r2-chrystal.png - CrystalDiskMark results]

Would you share your OSD node specs?
What exactly do you want to see? This is what Ceph -> OSD looks like:
[Attachment: osd-specs.png - Ceph -> OSD overview]

Did you do any specific kernel/Ceph/drive tuning before putting that setup into production?
Nope - very boring (and very well running) setup! :cool:

How do you like the SM883s (SATA)?
Nothing to complain about so far (the setup has been running for six months).
Oh wait, to be honest: there was one faulty SSD. But Ceph handled it as expected, the server vendor replaced it quickly, and the replacement itself also worked like a charm.
In retrospect I'm happy that this incident happened, because now I know how it works and that it's no reason to panic. :)

Greets
Stephan

Edit: Keep in mind that these tests and benchmarks ran during normal production (no heavy load, but constant "background noise" from about 60 VMs).
 
update to Ceph Octopus
Thanks for sharing this!
Our policy is to be "as standard as possible", so for the Ceph version we take what the Proxmox repos give us (at the moment: Nautilus 14.2.9).

enable writeback, performance has greatly improved.
That's interesting, because the recommendation I got from Proxmox support was to keep "Default (no cache)".
Maybe a Proxmox staff member can say something about it? What are the options, and what are the trade-offs?

Thanks and greets
Stephan
 

Thank you for taking the time to run those tests and provide such a detailed answer - it's great information.

I meant the specs of the node hardware, CPU/RAM. Sorry for not making that clear. Also, is your 10 Gb mesh used for both the private and public networks, or are you splitting them? I'm thinking of doing a 25 Gb mesh for private/public.

I may change our disk choice and go with the Samsungs instead of the Microns. How heavy is the workload on this three-node cluster? Number of VMs, memory utilisation?

And yes, it looks like Octopus is definitely an improvement, but I'd also wait until it's in the official Proxmox repo.

I was also under the impression that 'no cache' was the desired option.
 
Thank you for taking the time to run those tests and provide such a detailed answer - it's great information.
You're welcome! :) I'm so happy with this setup so far that it's a lot of fun sharing our experiences.

I meant the specs of the node hardware, CPU/RAM.
Ah, of course! Each node is based on this hardware and equipped with

2x AMD Epyc 7351
384 GB RAM (DDR4, 2667 MHz)
10x Samsung SM883 1.92 TB connected to a
Broadcom HBA 9300-8i
2x Intel X520-DA2

Also, is your 10 Gb mesh used for both the private and public networks
Yes, it is.

I'm thinking of doing a 25 Gb mesh for private/public.
I think that this is a very good idea. I read that latency improves noticeably going from 10 Gbit to 25 Gbit (but not so much from 25 to 40).

How heavy is the workload on this three-node cluster? Number of VMs, memory utilisation?
We run about 65 VMs (Windows and Linux; databases, file, mail, ... KVM only)
Our cluster summary looks like this:
[Attachment: 1593888296582.png - cluster summary]
(most of the CPU utilization goes to the World Community Grid). :D:D

This is one week of "Load" on the node that hosts our critical databases:
[Attachment: 1593888603111.png - one week of load on the database node]

And yes, it looks like Octopus is definitely an improvement
Great to hear! I'm curious to try it when it comes to the Proxmox repos.

Greets
Stephan
 

Once again, an excellent piece of information - those are some beefy servers you've got there. I'm trying to repurpose 5 older dual E5-26xx R430s with 256 GB RAM, and hopefully they'll do the trick since our workload isn't too crazy (around 20 VMs, 99% Windows-based) and we hope to have some headroom in case we double it in the next year or two. We will start with 3, maybe 4, OSDs per node.

Looks like you are still within the 80-85% full safe limit, although I've read that it's better to keep it around 60-70% for recovery purposes. Also, it doesn't look like those servers are breaking a sweat. Thanks again!
 
I'm trying to repurpose 5 older dual E5-26xx R430s with 256 GB RAM, and hopefully they'll do the trick since our workload isn't too crazy (around 20 VMs, 99% Windows-based) and we hope to have some headroom in case we double it in the next year or two. We will start with 3, maybe 4, OSDs per node.
Sounds like a very good setup, too!
Do you have Windows Server Datacenter licences? If yes, keep in mind that you have to pay per CPU and even per physical core. Our thinking was: use fewer nodes, save on licensing costs, and put those savings into the hardware of the three nodes. This is also why we chose 24x 2.5" hot-swap chassis - we want to scale up instead of scaling out. Of course scaling out is better for Ceph performance, but with our setup I'm sure we won't run into any performance trouble in the medium term. :cool:

Looks like you are still within the 80-85%
Yes, we always try to stay below 85%.

Good luck with your project, and if you have any further questions to me, feel free to send me a PM.

Greets
Stephan
 

We're on Standard, which makes it quite a job to ensure we're compliant, so we will definitely invest in Datacenter once our workload goes up. Thanks for the tip on the MS licensing, but it always makes my head spin, lol.

We'd probably be fine with 3 nodes too, but the hardware is on the older side, so we decided to up the node count to 5 for a more resilient failure domain and to pretty much max out the scale-out aspect of our configuration. With 8-bay servers our only expansion path will be to scale up, and that should meet our requirements for the next few years.

I'll definitely reach out once we start building it out! Thanks!
 
Hi @spirit, is this statement valid only for Windows VMs, or also for Linux VMs?
Octopus has a new caching mode called write-around. It is basically a write-only cache: reads bypass the cache and go directly to the cluster. This applies regardless of the OS inside the VM.
 
Octopus has a new caching mode called write-around. It is basically a write-only cache: reads bypass the cache and go directly to the cluster. This applies regardless of the OS inside the VM.
Thanks @Alwin, but for it to be active I need to change the cache of the disk in the VM config from 'none' to 'writeback' - is that right? Or is there a cache under the hood regardless of the VM config?
 
Thanks @Alwin, but for it to be active I need to change the cache of the disk in the VM config from 'none' to 'writeback' - is that right?
Correct.
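For example, from the CLI it could look like this (the VM ID, bus/device and storage volume below are just placeholders; the same setting is available in the GUI under the disk's Cache option):

Code:
qm set 100 --scsi0 ceph_vm:vm-100-disk-0,cache=writeback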
 
