Ceph Performance within VMs on Crucial MX500

holr

Hello,
I think I have not set things up correctly on my 4-node Proxmox-with-Ceph cluster. I'm using consumer Crucial MX500 SSDs in HP DL380 Gen9 servers (1 TB RAM). In each server, seven 2 TB drives are set up as individual RAID 0 volumes for Ceph (I couldn't find a pure HBA option in the server config), with one more 2 TB SSD for the Proxmox OS. That is 28 2 TB MX500s for Ceph in total. Ceph runs on a dedicated 40 Gbps network; Proxmox runs on a 10 Gbps network.

I ran some benchmarks using CrystalDiskMark in a Windows 10 VM: first with the VM disk on the Proxmox OS SSD LVM (left in the attached image), then after moving it to the Ceph cluster (right in the attached image). I know the Crucial MX500 is not the best choice for a server setup, and that virtual disk performance naturally degrades when going from a local LVM disk to a Ceph cluster, but are such vast differences in speed to be expected?

Many thanks for any insight you can provide!
 

Attachments

  • ceph.png
Code:
ceph tell osd.0 bench

What is your raid controller on that G9?
Hello, it is a p440ar, running the command gives:
bytes written: 1073741824
blocksize: 4194304
elapsed_sec: 2.426355
bytes_per_sec: 442532805.034562
iops: 105.508043
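As a quick sanity check on the figures above, the reported rate and IOPS follow directly from the byte count, block size, and elapsed time. A minimal Python sketch (the variable names are only for illustration; the values are copied from the output):

```python
# Values copied from the `ceph tell osd.0 bench` output above.
bytes_written = 1073741824   # 1 GiB written in total
blocksize = 4194304          # 4 MiB per write
elapsed_sec = 2.426355

# Throughput is total bytes over elapsed time; IOPS is throughput over block size.
bytes_per_sec = bytes_written / elapsed_sec
iops = bytes_per_sec / blocksize
mib_per_sec = bytes_per_sec / 2**20

print(f"{mib_per_sec:.0f} MiB/s, {iops:.1f} writes/s of {blocksize // 2**20} MiB each")
```

So this is a large-block write figure of roughly 422 MiB/s for a single OSD. It probably says little about the small-block random I/O that the 4K tests in CrystalDiskMark exercise.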
 
Are you sure the P440ar doesn't support HBA mode without needing the RAID 0 method? My teammate has MX500s in a G8 (P420) in RAID 0, and performance is good even with the performance hit due to the age of the G8 servers.
 
I used lshw to identify the controller as a P440ar, and yes, you are correct: the drives are in individual RAID 0. One disk is for the OS, the other 7 are Ceph OSDs, one OSD per disk, with journalling also on each disk (no dedicated journal disks).
 
I've since replaced all the Crucial MX500s (SATA 6 Gbps, rated at 560 MB/s read, 510 MB/s write) with Samsung PM1643s (12 Gbps SAS, rated at 2100 MB/s read, 1700 MB/s write) and get the following CrystalDiskMark results:
(default nocache)

No Cache Screen Shot 2019-11-21 at 3.10.43 PM.png

(writeback cache):
Write Back Cache Screen Shot 2019-11-21 at 2.48.47 PM.png

Considering the rated speed of the new Samsung PM1643 drives, are the speeds above all one can reasonably expect to get out of an all-SSD Ceph array? I put the old MX500s in a RAID 5 server (running CentOS as an NFS server), moved my test VM across, and got the following result:
Screen Shot 2019-11-21 at 6.18.29 PM.png
 
Could I kindly ask what figures your teammate gets? I.e., what is regarded as good performance, so that I can compare?
 
What settings are you using for the CEPH Pool?

Are you using filestore or bluestore?
 
I'm using the default as created in Proxmox 5.4 (bluestore). The CEPH pool settings are as follows:

Screen Shot 2019-11-23 at 6.58.46 AM.png

[global]
auth client required = none
auth cluster required = none
auth service required = none
cluster network = 10.x.x.0/24
debug_asok = 0/0
debug_auth = 0/0
debug_buffer = 0/0
debug_client = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_filer = 0/0
debug_filestore = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_journal = 0/0
debug_journaler = 0/0
debug_lockdep = 0/0
debug_mon = 0/0
debug_monc = 0/0
debug_ms = 0/0
debug_objclass = 0/0
debug_objectcatcher = 0/0
debug_objecter = 0/0
debug_optracker = 0/0
debug_osd = 0/0
debug_paxos = 0/0
debug_perfcounter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_rgw = 0/0
debug_throttle = 0/0
debug_timer = 0/0
debug_tp = 0/0
fsid = xxxxxxxx-557e-4a17-8263-xxxxxxxx
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 192.168.x.0/24

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd
device 19 osd.19 class ssd
device 20 osd.20 class ssd
device 21 osd.21 class ssd
device 22 osd.22 class ssd
device 23 osd.23 class ssd
device 24 osd.24 class ssd
device 25 osd.25 class ssd
device 26 osd.26 class ssd
device 27 osd.27 class ssd
device 28 osd.28 class ssd
device 29 osd.29 class ssd
device 30 osd.30 class ssd
device 31 osd.31 class ssd
device 32 osd.32 class ssd
device 33 osd.33 class ssd
device 34 osd.34 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host pve2 {
id -3 # do not change unnecessarily
id -4 class ssd # do not change unnecessarily
# weight 12.225
alg straw2
hash 0 # rjenkins1
item osd.0 weight 1.746
item osd.2 weight 1.746
item osd.3 weight 1.746
item osd.4 weight 1.746
item osd.5 weight 1.746
item osd.6 weight 1.746
item osd.34 weight 1.746
}
host pve3 {
id -5 # do not change unnecessarily
id -6 class ssd # do not change unnecessarily
# weight 12.225
alg straw2
hash 0 # rjenkins1
item osd.7 weight 1.746
item osd.8 weight 1.746
item osd.9 weight 1.746
item osd.10 weight 1.746
item osd.11 weight 1.746
item osd.12 weight 1.746
item osd.13 weight 1.746
}
host pve4 {
id -7 # do not change unnecessarily
id -8 class ssd # do not change unnecessarily
# weight 12.225
alg straw2
hash 0 # rjenkins1
item osd.14 weight 1.746
item osd.15 weight 1.746
item osd.16 weight 1.746
item osd.17 weight 1.746
item osd.18 weight 1.746
item osd.19 weight 1.746
item osd.20 weight 1.746
}
host pve1 {
id -9 # do not change unnecessarily
id -10 class ssd # do not change unnecessarily
# weight 12.225
alg straw2
hash 0 # rjenkins1
item osd.21 weight 1.746
item osd.22 weight 1.746
item osd.23 weight 1.746
item osd.24 weight 1.746
item osd.25 weight 1.746
item osd.26 weight 1.746
item osd.27 weight 1.746
}
host pve5 {
id -11 # do not change unnecessarily
id -12 class ssd # do not change unnecessarily
# weight 9.596
alg straw2
hash 0 # rjenkins1
item osd.1 weight 0.250
item osd.28 weight 1.819
item osd.29 weight 1.819
item osd.30 weight 1.819
item osd.31 weight 1.819
item osd.32 weight 1.819
item osd.33 weight 0.250
}
root default {
id -1 # do not change unnecessarily
id -2 class ssd # do not change unnecessarily
# weight 58.498
alg straw2
hash 0 # rjenkins1
item pve2 weight 12.225
item pve3 weight 12.225
item pve4 weight 12.225
item pve1 weight 12.225
item pve5 weight 9.596
}

# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map
 
First, I noticed you're using 2/1. This is strongly discouraged, especially with consumer-grade SSDs: you could have total data loss from a single SSD having a corrupted bit.
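To make the difference between 2/1 and the 3/2 defaults shown in the config above more concrete, here is a rough Python sketch of how the replica count affects usable capacity and the number of device writes per client write. It assumes the raw capacity is the root weight from the CRUSH map (58.498, taken here to be TiB) and ignores BlueStore overhead and full-ratio headroom, so the capacity figures are upper bounds. The function name and dictionary keys are made up for illustration.

```python
RAW_TIB = 58.498  # weight of root "default" in the CRUSH map (assumed to be TiB)

def replicated_pool(size, min_size, raw_tib=RAW_TIB):
    """Rough figures for a replicated pool with the given size/min_size."""
    return {
        # Every object is stored `size` times, so usable space shrinks accordingly.
        "usable_tib": raw_tib / size,
        # Each client write is written to `size` OSDs before it is acknowledged.
        "device_writes_per_client_write": size,
        # The pool keeps accepting I/O while at least `min_size` copies are up.
        "min_copies_while_writable": min_size,
    }

for size, min_size in [(2, 1), (3, 2)]:
    info = replicated_pool(size, min_size)
    print(f"{size}/{min_size}: ~{info['usable_tib']:.1f} TiB usable, "
          f"{info['device_writes_per_client_write']} device writes per client write, "
          f"still writable with {info['min_copies_while_writable']} copy/copies")
```

With 2/1 the pool stays writable when only one copy of the data is left, which is where the risk described above comes from. Moving to 3/2 costs usable capacity and adds a third write per client write.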

I also noticed you're running at quite high usage. Ceph definitely slows down as usage gets higher, and consumer SSDs also normally slow down as they have less free space for TRIM / garbage collection.

Apart from that I see no other issues. As others have said, Ceph will hit your SSDs with different I/O than any other benchmark will, so the performance you're seeing is probably what you should expect to get.
 
Thank you for your reply, and insight. I'll be migrating to a 3/2 in the near future, once I'm able to offload a number of the active VMs on the cluster; so that is sound advice.

Do you think the benchmarks with the enterprise-level Samsung drives (PM1643) above look sensible? Those drives are rated as 4x faster than the consumer Crucial MX500 drives, but as the CrystalDiskMark images above show, there doesn't appear to be as big a boost as one might expect...
 
During the benchmark, have you checked top / nmon on some of the Ceph nodes? See if you can spot any point of saturation or heavy I/O wait on the OSDs.

Have you got a spare SSD that you can benchmark directly on the same hardware platform, to make sure you can at least get the throughput you expect from the raw SSD and don't need to look into firmware / drivers?
 
Maybe a silly question, but do the PM1643s sit on a 12 Gbit/s SAS bus? Is the performance shown single-thread performance? Normally SSDs shine with multiple threads and only reach their maximum throughput that way. What about a read test of one of those disks?
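For reference on the bus question, here is a rough Python sketch of per-lane payload ceilings. It assumes 8b/10b line encoding (as used by SATA 6 Gbit/s and SAS 12 Gbit/s) and ignores all protocol overhead, so real-world figures will be lower. The rated speeds are the ones quoted in the thread; the function name is made up for illustration.

```python
def lane_ceiling_mb_s(line_rate_gbit, encoded_bits_per_byte=10):
    """Upper bound on payload MB/s for one lane, assuming 8b/10b encoding."""
    return line_rate_gbit * 1000 / encoded_bits_per_byte

links = {"SATA 6 Gbit/s": 6, "SAS 12 Gbit/s": 12}
rated_read_mb_s = {"MX500": 560, "PM1643": 2100}  # rated figures quoted in the thread

for name, gbit in links.items():
    ceiling = lane_ceiling_mb_s(gbit)
    print(f"{name}: at most ~{ceiling:.0f} MB/s per lane")
    for drive, rated in rated_read_mb_s.items():
        verdict = "fits within" if rated <= ceiling else "exceeds"
        print(f"  {drive} rated read {rated} MB/s {verdict} one lane")
```

Under these assumptions a single 12 Gbit/s lane tops out around 1200 MB/s, which suggests the 2100 MB/s rating presumes more than one lane to the drive. If the controller negotiates a slower link, the ceiling drops further.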

For Ceph performance benchmark comparisons, it's best to use rados bench or fio in order to get results that can be compared to thousands of others on the web.
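A minimal sketch of what such a rados bench run could look like, written as a small Python helper that only builds the command lines and prints them (it does not execute anything). The pool name `testpool`, the 60 second duration, and the 16 concurrent operations are placeholders, not values from this thread.

```python
import shlex

def rados_bench_cmd(pool, seconds, mode, concurrent_ops=16, no_cleanup=False):
    """Build a `rados bench` argument list; mode is 'write', 'seq' or 'rand'."""
    if mode not in ("write", "seq", "rand"):
        raise ValueError(f"unsupported mode: {mode}")
    cmd = ["rados", "bench", "-p", pool, str(seconds), mode, "-t", str(concurrent_ops)]
    if no_cleanup:
        # Keep the benchmark objects so the read tests have data to read.
        cmd.append("--no-cleanup")
    return cmd

steps = [
    rados_bench_cmd("testpool", 60, "write", no_cleanup=True),  # write test first
    rados_bench_cmd("testpool", 60, "seq"),                     # sequential reads
    rados_bench_cmd("testpool", 60, "rand"),                    # random reads
    ["rados", "-p", "testpool", "cleanup"],                     # remove benchmark objects
]

for step in steps:
    print(shlex.join(step))
```

Running the write test with `--no-cleanup` first matters because the `seq` and `rand` read tests read back the objects it leaves behind. A dedicated test pool keeps the benchmark objects away from production data.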
 
