Proxmox VE Ceph Benchmark 2018/02

a. A container with a virtual disk stored on CephFS, with the benchmark running on its local /tmp: bandwidth of approx. 450 MB/s
b. The same container with a bind mount exported by the host and the benchmark running on this shared folder: bandwidth of approx. 70 MB/s
This is exactly the reason why we don't recommend CephFS as directory storage for VM/CT at the moment. Workloads with small updates suffer greatly from the introduced latency. Use RBD instead if you don't need a shared filesystem.
 
This is exactly the reason why we don't recommend CephFS as directory storage for VM/CT at the moment. Workloads with small updates suffer greatly from the introduced latency. Use RBD instead if you don't need a shared filesystem.
That's exactly the point. We need shared filesystems. We are currently comparing CephFS performance against GPFS and Nutanix NDFS.

That being said, I am afraid I did not understand your point.

In both case a. and case b. above we are accessing CephFS. The difference is that in case a. the container's virtual disk is stored on CephFS and the benchmark reads/writes inside this virtual disk (stored on CephFS), while in case b. the container accesses the underlying CephFS directly via a bind mount.
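For reference, the bind mount in case b. was added on the host with something along these lines (the VMID and paths here are placeholders, not my actual ones):

Code:
# expose a host directory (located on CephFS) inside container 101 as /mnt/shared
pct set 101 -mp0 /mnt/pve/cephfs/shared,mp=/mnt/shared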

The benchmark script is the same.

So why do you say that in case b. the benchmark suffers from the latency introduced by CephFS, while in case a. it doesn't?

To be honest, I would have expected the opposite result: b. being faster than a.

Any further detail will be greatly appreciated.
 
In both case a. and case b. above we are accessing CephFS. The difference is that in case a. the container's virtual disk is stored on CephFS and the benchmark reads/writes inside this virtual disk (stored on CephFS), while in case b. the container accesses the underlying CephFS directly via a bind mount.
This is exactly the difference. While in case a. you have (depending on your test) one big open file that is written to / read from, the bind mount puts all read/write operations directly onto CephFS. CephFS then needs to translate every filesystem operation into objects (usually 4 MB).

This is why we didn't release it as a general storage.

You can try to tune CephFS, e.g. by activating the experimental inline data [0] feature or cache tiering [1]. Both need to be tested carefully, and there is no guarantee that they have the desired effect.

Please note that this setup is not supported by us, hence no enterprise support [2] (where eligible) can be given.

[0] https://docs.ceph.com/docs/nautilus/cephfs/experimental-features/#inline-data
[1] https://docs.ceph.com/docs/nautilus/rados/operations/cache-tiering/
[2] https://www.proxmox.com/en/proxmox-ve/pricing
 
Understood.
I shall now replace all HDDs with SSDs, perform the same test again to see if anything changes significantly, and publish the results here. Do you expect any performance improvement in CephFS in any of its future releases? Has anything been put on the roadmap yet?
 
Do you expect any performance improvement in CephFS in any of its future releases? Has anything been put on the roadmap yet?
CephFS was the original idea behind Ceph but was the last part to become production-ready. So yes, there are improvements with every Ceph release [0]. See their experimental feature list [1] to get a feel for what is still coming (e.g. LazyIO).

[0] https://docs.ceph.com/docs/master/releases/nautilus/
[1] https://docs.ceph.com/docs/nautilus/cephfs/experimental-features/
 
Hello!
Has anyone been able to compare the performance between Proxmox 5.x (Ceph Luminous) and Proxmox 6 (Ceph Nautilus)?
Is there an improvement in performance?
 
We have completed the initial build of our new cluster with 3 nodes:
CPU: dual AMD EPYC 7551 32-core processors
RAM: 512 GB
Disks: 10 × 2 TB Samsung NVMe drives
Case: 1RU Supermicro box with 10 NVMe bays
Network: 4 × 10 Gb fibre NICs into a pair of stacked Juniper 4600s; one pair for the Ceph cluster network, one pair for Ceph public traffic
2 × 10 Gb copper for uplinks to LAN/VM traffic

Ceph is running pretty well, and the network is the bottleneck, as expected. I suppose we could have upgraded to 100G if we had needed it and had the budget.
Before we really start using this in anger, does anyone have any tips for tuning Ceph or Proxmox for speed, reliability, etc.?
Happy to take any input, or do some more benchmarks.
 
If you need a well-working shared storage, I recommend using croit.io on dedicated Ceph nodes instead.
External pools can easily be integrated into PVE, even erasure-coded pools. (I finally figured out how to do it for multiple pools.)
 
I'm interested in the 3-node mesh network setup. How do you connect? Do you bond two NICs on each node, or give each one a separate IP?
 
Hi,

I just ran the first fio command mentioned in the benchmark PDF document on a new Samsung SM883 1.92 TB (MZ7KH1T9). The output looks like this:

Code:
fio: (groupid=0, jobs=1): err= 0: pid=44035: Mon Oct  7 15:40:04 2019
  write: IOPS=26.4k, BW=103MiB/s (108MB/s)(6180MiB/60001msec); 0 zone resets
    slat (nsec): min=2800, max=1027.0k, avg=4854.19, stdev=954.43
    clat (nsec): min=660, max=370745, avg=32476.96, stdev=838.11
     lat (usec): min=36, max=1062, avg=37.44, stdev= 1.32
    clat percentiles (nsec):
     |  1.00th=[31616],  5.00th=[31872], 10.00th=[31872], 20.00th=[32128],
     | 30.00th=[32384], 40.00th=[32384], 50.00th=[32384], 60.00th=[32384],
     | 70.00th=[32640], 80.00th=[32640], 90.00th=[33024], 95.00th=[33024],
     | 99.00th=[34048], 99.50th=[35072], 99.90th=[38144], 99.95th=[39680],
     | 99.99th=[51456]
   bw (  KiB/s): min=101704, max=106144, per=99.99%, avg=105464.86, stdev=637.76, samples=119
   iops        : min=25426, max=26536, avg=26366.20, stdev=159.45, samples=119
  lat (nsec)   : 750=0.01%
  lat (usec)   : 10=0.01%, 20=0.01%, 50=99.99%, 100=0.01%, 250=0.01%
  lat (usec)   : 500=0.01%
  cpu          : usr=3.49%, sys=11.28%, ctx=3164141, majf=7, minf=53
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1582141,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=103MiB/s (108MB/s), 103MiB/s-103MiB/s (108MB/s-108MB/s), io=6180MiB (6480MB), run=60001-60001msec

Disk stats (read/write):
  sdc: ios=22/1578893, merge=0/0, ticks=5/54413, in_queue=0, util=99.89%

Looks as expected to me, what do you think?
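In case it helps, the command was along these lines (reconstructed from the output above, so please check the benchmark paper for the exact flags; the device name is of course specific to my box):

Code:
# 4k sync writes, queue depth 1, single job, 60 seconds, directly against the raw device
fio --ioengine=libaio --filename=/dev/sdc --direct=1 --sync=1 --rw=write \
    --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=fio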
And dear Proxmox staff: Feel free to put this into your document.

Greets
Stephan
 
Here's another benchmark result for the record. This is a 4-node test platform, using 4 × 2 TB Intel P4510 NVMe drives per node. Each drive is configured with 4 OSDs and the pool keeps 3 copies of the data. It's configured with 4096 PGs based on the results of the PG calculator, but I'm happy to take advice on that number. Other than that it's a standard Ceph configuration from the PVE GUI; trying to keep it simple, so no SPDK or anything tricky. The network is 10GbE with separate active/standby bonds for the public and cluster networks. Running 16 OSDs, a manager, and a monitor per node consumes about 20 GB of RAM on each node as a base usage before bringing up any VM workloads.
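For what it's worth, the 4096 figure falls out of the usual rule of thumb the PG calculator applies (a target of roughly 100 PGs per OSD; the numbers below are just that rule applied to this cluster):

Code:
# 4 nodes * 4 NVMe drives * 4 OSDs per drive = 64 OSDs, pool size 3
echo $(( 64 * 100 / 3 ))   # ~2133 -> rounded up to the next power of two = 4096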

Code:
# rados bench 60 write -b 4M -t 16 --no-cleanup

Total time run:         60.044
Total writes made:      20974
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1397.24
Stddev Bandwidth:       39.2492
Max bandwidth (MB/sec): 1480
Min bandwidth (MB/sec): 1300
Average IOPS:           349
Stddev IOPS:            9.81231
Max IOPS:               370
Min IOPS:               325
Average Latency(s):     0.0458033
Stddev Latency(s):      0.018851
Max latency(s):         0.19313
Min latency(s):         0.0137688


# rados bench 60 rand -t 16

Total time run:       60.0489
Total reads made:     21852
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1455.61
Average IOPS:         363
Stddev IOPS:          9.83817
Max IOPS:             382
Min IOPS:             344
Average Latency(s):   0.043261
Max latency(s):       0.348086
Min latency(s):       0.00317715
 
Network is 10GbE with separate active / standby bonds for the public and cluster networks
The network might be the bottleneck; as you can see in the benchmark paper, 4 SSD OSDs per node can already max out the 10 GbE.
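As a rough sanity check (the ~400 MB/s sustained write per SATA SSD OSD used below is an assumption, not a measured value):

Code:
# 4 SSD OSDs per node vs. one 10 GbE link (~1250 MB/s usable)
echo "$(( 4 * 400 )) MB/s"   # 1600 MB/s, already more than 10 GbE can carry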
 
It's interesting that with 4 nodes on a 10GbE network our numbers are significantly higher than the ones in the benchmark report. If the network is maxed out maybe our 10 Gigs is quicker than your 10 Gigs :)

I don't have our normal monitoring on this gear yet. I'll get that in place tomorrow so we can see what the switch ports are doing during the benchmark. I'm already pretty impressed with these numbers, given the distributed nature and resilience Ceph offers.


David
...
 
It's interesting that with 4 nodes on a 10GbE network our numbers are significantly higher than the ones in the benchmark report. If the network is maxed out maybe our 10 Gigs is quicker than your 10 Gigs :)
Easy: you are using Intel DC P4510 NVMe drives, compared to Samsung SM863 SSDs, and your cluster/public traffic is separated (not so in the test).
 
Here we go:
3 nodes, 7 × Samsung 1.92 TB SM883 OSDs per node, and a mesh network for Ceph traffic (Intel X520 10 Gbit, 100-150 meters of fibre between the nodes):
Code:
# rados bench -p ceph-pool0 60 write -b 4M -t 16 --no-cleanup
Total time run:         60.0513
Total writes made:      17375
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     1157.34
Stddev Bandwidth:       36.1428
Max bandwidth (MB/sec): 1224
Min bandwidth (MB/sec): 1080
Average IOPS:           289
Stddev IOPS:            9.0357
Max IOPS:               306
Min IOPS:               270
Average Latency(s):     0.0552977
Stddev Latency(s):      0.0199578
Max latency(s):         0.165578
Min latency(s):         0.0196849
Code:
#  rados bench -p ceph-pool0 60 seq -t 16
Total time run:       36.3202
Total reads made:     17375
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   1913.54
Average IOPS:         478
Stddev IOPS:          21.0896
Max IOPS:             529
Min IOPS:             438
Average Latency(s):   0.0324742
Max latency(s):       0.216693
Min latency(s):       0.0137038
Code:
# rados bench -p ceph-pool0 60 write -b 4k -t 16 --no-cleanup
Total time run:         60.0026
Total writes made:      333893
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     21.7369
Stddev Bandwidth:       0.0759789
Max bandwidth (MB/sec): 21.8906
Min bandwidth (MB/sec): 21.5547
Average IOPS:           5564
Stddev IOPS:            19.4506
Max IOPS:               5604
Min IOPS:               5518
Average Latency(s):     0.00287467
Stddev Latency(s):      0.000459568
Max latency(s):         0.00820139
Min latency(s):         0.00135645
Code:
# rados bench -p ceph-pool0 60 rand -t 16
Total time run:       60.0003
Total reads made:     2277521
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   148.275
Average IOPS:         37958
Stddev IOPS:          2685.27
Max IOPS:             45532
Min IOPS:             32601
Average Latency(s):   0.000417614
Max latency(s):       0.00527385
Min latency(s):       0.0001299
Do you think this is as expected (maybe the 4k write could go faster)? For 2000 bucks more we could do a 40 Gbit mesh network. Do you think it's worth trying?

Thanks and greets
Stephan
 
We have some spare 40GbE switch ports, so I've ordered some NICs for our servers. Early next week I should have a benchmark using 40G to compare with the 10G benchmark from the other day. It should be interesting, as I was maxing out the public Ceph network, as Alwin thought I would.


David
...
 
Do you think this is as expected (maybe the 4k write could go faster)?
A 4 KB object size will generate a lot of objects, and that will be counterproductive, not only on the network. And if you have an MTU of 9000, roughly 5 KB of each frame would be wasted. For RBD workloads the 4 MB object size is a good average, but you can set the object size per disk image and experiment (needs manual creation; see the sketch below).
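A sketch of what that manual creation could look like (the pool name, image name, size, and the 8M object size below are just examples):

Code:
# create an RBD image with a non-default object size (default is 4M),
# then attach it to the guest config by hand
rbd create --size 100G --object-size 8M mypool/vm-disk-test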

For 2000 bucks more we could do a 40 Gbit mesh network. Do you think it's worth trying?
The cluster reached ~2 GB/s (~17 Gbit) with the rados read test; reads are done in parallel. If the SM883 do only ~200 MB/s each, then a single node with 7 OSDs could do a combined write of ~1400 MB/s (~11.7 Gbit). At least an upgrade to 25 GbE is recommended, and it is probably cheaper than 40 GbE. Latency doesn't get better above 25 GbE; only the bandwidth increases.
 
Do you think this is as expected (maybe the 4k write could go faster)? For 2000 bucks more we could do a 40 Gbit mesh network. Do you think it's worth trying?

From my last bench with fio on RBD,
I'm able to reach around 150,000 IOPS with 4k randwrite; the limitation is the CPU of the Ceph nodes (3 nodes with 24 cores at 3 GHz, 100% CPU).

For reads, I'm around 700,000 IOPS with 4k randread; the CPU is limiting there too (3 Ceph nodes at 100% CPU, and 2 client nodes with the same CPU config also at 100% CPU).
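For reference, such a test can be run with fio's rbd engine along these lines (the pool/image names and the queue depth / job count are placeholders, not my exact values):

Code:
fio --ioengine=rbd --pool=mypool --rbdname=myimage \
    --rw=randwrite --bs=4k --iodepth=64 --numjobs=4 \
    --runtime=60 --time_based --group_reporting --name=rbd-4k-randwrite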
 
