I can't seem to figure out what is going on here.
9-node cluster of Cisco UCS C240 M3s. (2) x Xeon E5-2667 @ 3.30 GHz. 128 GB RAM per host. PVE 8.13, Ceph Reef 18.2.0.
Nodes 1 through 5 each have (10) Seagate rotational drives, some SATA, some SAS, ranging from 10 to 16 TB. As I've gone down the testing rabbit hole, I've made two pools of these drives using device classes: hdd = hosts 2 through 5, 40 drives; hdd2 = host 1, 10 drives. I've changed the CRUSH map to allow replication 3 on the same host.
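For reference, the device-class and CRUSH changes were done roughly like this (the rule name is a placeholder, and osd.5 stands in for each of node 1's OSD ids):
Code:
# move node 1's OSDs into their own device class (repeated per OSD id)
ceph osd crush rm-device-class osd.5
ceph osd crush set-device-class hdd2 osd.5
# replicated rule whose failure domain is "osd", so all 3 copies may land on one host
ceph osd crush rule create-replicated hdd2_osd default osd hdd2
# point the pool at that rule
ceph osd pool set pool_hdd2 crush_rule hdd2_osd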
Nodes 6 through 9 each have (22) 1.92 TB SSDs in a device class named "ssd". Monitors are on nodes 1, 3, 5, 7, and 9.
Networking is dual 40 Gb Ethernet (LAG/MLAG to a pair of Arista 40 Gb switches).
This all started with me migrating a Hyper-V VM to this cluster. I mounted that Hyper-V server using mount.cifs and started a qm import to the hdd pool. That part works fine, but it is very slow: less than 100 MB/s. So I've spent the last 3 hours testing every component, working my way up, and I'm still not sure what to make of it.
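The import itself was along these lines (server name, share, VMID, file name, and storage ID are placeholders):
Code:
# mount the Hyper-V host's export share over CIFS
mount.cifs //hyperv-host/exports /mnt/hyperv -o username=administrator
# import the exported disk into an existing VM, targeting the hdd-backed RBD storage
qm importdisk 101 /mnt/hyperv/vm-disk.vhdx pool_hdd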
I am focused on the hdd2 class, which again is the (10) rotational drives on node 1. I tested all ten drives with fio as direct devices:
Code:
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
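Each run was aimed at one of the raw disks; per drive, the invocation was roughly the following (the device path is just an example):
Code:
# same job, targeting a single raw drive and bypassing the page cache
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m \
    --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1 \
    --filename=/dev/sdb --direct=1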
All 10 of them return the following, plus/minus a couple percent:
Code:
WRITE: bw=143MiB/s (150MB/s), 8653KiB/s-10.7MiB/s (8861kB/s-11.2MB/s), io=11.8GiB (12.6GB), run=60598-84202msec
So the drives are fast and consistent. I make OSDs from them and run "ceph tell osd.N bench" on all of them. Again, all come in around 75 MB/s (plus/minus 3 MB/s) and around 18 IOPS. No outliers. Example:
Code:
root@ceph1-hyp:/mnt/pve/cephfs_hdd2# ceph tell osd.5 bench
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 14.066783756,
    "bytes_per_sec": 76331721.779828295,
    "iops": 18.198900647122453
}
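To cover the whole node in one go, the same bench can be looped over every OSD under node 1's CRUSH host bucket, something like:
Code:
# run the built-in OSD bench on every OSD under the ceph1-hyp host bucket
for id in $(ceph osd ls-tree ceph1-hyp); do
    ceph tell osd.$id bench
done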
So I create some CephFS pools, mount them on the local hypervisor, and do more fio:
ssd  (88 SSDs) | WRITE: bw=12.6GiB/s (13.5GB/s), 721MiB/s-916MiB/s (756MB/s-960MB/s), io=777GiB (834GB), run=61731-61782msec
hdd  (40 HDDs) | WRITE: bw=11.5GiB/s (12.3GB/s), 482MiB/s-881MiB/s (506MB/s-923MB/s), io=735GiB (789GB), run=63391-63967msec
hdd2 (10 HDDs) | WRITE: bw=10.5GiB/s (11.3GB/s), 277MiB/s-1144MiB/s (291MB/s-1200MB/s), io=790GiB (848GB), run=72933-74883msec
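(Those were the same fio job as above, just aimed at each CephFS mount; for the hdd2 one, roughly:)
Code:
# identical job profile, writing into the CephFS mount backed by the hdd2 OSDs
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m \
    --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1 \
    --directory=/mnt/pve/cephfs_hdd2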
Pretty zippy. So then I create RBD pools, create a VM, and plop it on the hdd2 pool (the 10 drives on node 1). Debian base install, then install fio:
Code:
root@junkbox2:~# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
[...]
Run status group 0 (all jobs):
WRITE: bw=30.0MiB/s (31.5MB/s), 1726KiB/s-2303KiB/s (1768kB/s-2359kB/s), io=2451MiB (2570MB), run=71822-81624msec
Wow, slow! So then I follow an example online and run a rados bench on hdd2:
Code:
rados bench -p pool_hdd2 30 write --no-cleanup
Total time run: 30.3039
Total writes made: 1506
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 198.786
Stddev Bandwidth: 21.2007
Max bandwidth (MB/sec): 236
Min bandwidth (MB/sec): 132
Average IOPS: 49
Stddev IOPS: 5.30018
Max IOPS: 59
Min IOPS: 33
Average Latency(s): 0.320658
Stddev Latency(s): 0.190432
Max latency(s): 1.76071
Min latency(s): 0.0456256
and then an rbd bench:
Code:
root@ceph1-hyp:/mnt/rbd-mounts# rbd bench --io-type write image01 --pool=pool_hdd2
bench type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 21456 21132.1 83 MiB/s
2 41168 20631.5 81 MiB/s
3 58288 19355.6 76 MiB/s
4 74544 18657.1 73 MiB/s
5 90384 18093 71 MiB/s
6 106416 17004.2 66 MiB/s
7 119824 15717.3 61 MiB/s
8 135696 15505.1 61 MiB/s
9 151568 15342.1 60 MiB/s
10 170160 15865 62 MiB/s
11 188576 16483.4 64 MiB/s
12 204352 16917.7 66 MiB/s
13 220832 17012.2 66 MiB/s
14 236800 17113.4 67 MiB/s
15 254256 16912.5 66 MiB/s
elapsed: 15 ops: 262144 ops/sec: 16627.9 bytes/sec: 65 MiB/s
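(For comparison, rbd bench defaults to 4K sequential writes; a run matched to the 64k random-write fio job from inside the VM would be something like this:)
Code:
# rough equivalent of the in-VM fio job: 64k random writes, 16 threads, 1 GiB total
rbd bench --io-type write --io-size 64K --io-pattern rand --io-threads 16 \
    --io-total 1G image01 --pool=pool_hdd2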
I cannot figure out for the life of me why fio is so slow INSIDE the VM, and why qm import is so slow on the same pool that supports orders-of-magnitude faster writes in other respects. It's like I am missing something ridiculously basic.