ceph performance is really poor

alexskysilk

I have a cluster of 6 nodes, each containing 8x Intel SSDSC2BB016T7R for a total of 48 OSDs. Each node has 384 GB RAM and 40 logical CPUs. For some reason, this cluster's performance is really low in comparison to other deployments. Deploying the GitLab template took well over 5 minutes:

extracting archive '/mnt/pve/template/template/cache/debian-8-turnkey-gitlab_14.2-1_amd64.tar.gz'
Total bytes read: 2216007680 (2.1GiB, 6.4MiB/s)
To rule out the template source, I ran tar -tf on the template file:
Code:
time tar tf /mnt/pve/template/template/cache/debian-8-turnkey-gitlab_14.2-1_amd64.tar.gz

real    0m14.582s
user    0m17.900s
sys     0m5.854s
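
To separate the gzip/decompression cost from the Ceph write path, the same archive could also be extracted to a purely local target and timed. This is just a sketch; /tmp/tmpl-test is an arbitrary scratch directory, not something from the original post:
Code:
# extract to a local, non-Ceph directory to time decompression plus local writes only
mkdir -p /tmp/tmpl-test
time tar xzf /mnt/pve/template/template/cache/debian-8-turnkey-gitlab_14.2-1_amd64.tar.gz -C /tmp/tmpl-test
rm -rf /tmp/tmpl-test

If that finishes in well under a minute, the slowdown sits in the storage side of the deployment rather than in the archive itself.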

Next, I benchmarked the setup using rados bench:
Code:
 rados bench -p scbench 10 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       177       161   643.884       644    0.070158   0.0905813
    2      16       344       328   655.885       668   0.0706853   0.0922631
    3      16       508       492   655.887       656   0.0812616    0.094763
    4      16       684       668   667.891       704   0.0619202   0.0935216
    5      16       842       826   660.697       632   0.0662235   0.0950198
    6      16      1010       994   662.567       672    0.132235   0.0950342
Total time run:       6.819637
Total reads made:     1134
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   665.138
Average IOPS:         166
Stddev IOPS:          6
Max IOPS:             176
Min IOPS:             158
Average Latency(s):   0.095415
Max latency(s):       0.477642
Min latency(s):       0.0129592

The bandwidth looks reasonable, but the IOPS are approximately 1/10th of what I see on my other clusters.
OSD latency is 1 ms or less.
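
Since the complaint is IOPS rather than throughput, a small-block write bench is closer to the workload a template restore generates than the default 4 MB sequential reads, and per-OSD latency can be sampled directly. A quick sketch against the same test pool (the 60 s runtime, 4 KiB block size and 16 threads are just example values):
Code:
# 4 KiB writes, 16 concurrent ops against the test pool
rados bench -p scbench 60 write -b 4096 -t 16
# per-OSD commit/apply latency, to back up the "1 ms or less" observation
ceph osd perf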

Code:
pveversion -v
proxmox-ve: 5.1-42 (running kernel: 4.15.15-1-pve)
pve-manager: 5.1-49 (running version: 5.1-49/1e427a54)
pve-kernel-4.13: 5.1-44
pve-kernel-4.15: 5.1-3
pve-kernel-4.15.15-1-pve: 4.15.15-6
pve-kernel-4.13.16-2-pve: 4.13.16-47
ceph: 12.2.4-pve1
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-30
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-18
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-2
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
openvswitch-switch: 2.7.0-2
proxmox-widget-toolkit: 1.0-14
pve-cluster: 5.0-24
pve-container: 2.0-21
pve-docs: 5.1-17
pve-firewall: 3.0-7
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-2
qemu-server: 5.0-24
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.7-pve1~bpo9

Code:
ceph.conf
[global]
         auth client required = cephx
         auth cluster required = cephx
         auth service required = cephx
         cluster network = 10.18.21.0/24
         fsid = a4b0bc0a-cf15-44f3-8410-f3816c155685
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon allow pool delete = true
         osd journal size = 5120
         osd pool default min size = 2
         osd pool default size = 3
         public network = 10.18.21.0/24

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.sky36]
         host = sky36
         mon addr = 10.18.21.36:6789

[mon.sky32]
         host = sky32
         mon addr = 10.18.21.32:6789

[mon.sky31]
         host = sky31
         mon addr = 10.18.21.31:6789

[mon.sky33]
         host = sky33
         mon addr = 10.18.21.33:6789

[client]
        rbd cache = true
        rbd cache size = 268435456
        rbd cache max dirty = 134217728
        rbd cache max dirty age = 5

What am I missing?
 
Leaving all other hardware aside, I want to concentrate on the disks first.

Did you test the Intel SSDSC2BB016T7R with fio? What were your results? From the data sheets and our fio test (DC S3510 series, 120 GB), the SSD should reach ~60 MB/s on a 4K fio job like the one below.

Code:
root@ella:~# fio --filename=/dev/sdd --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

write: io=6162.8MB, bw=105177KB/s, iops=26294, runt= 60000msec
clat (usec): min=24, max=3266, avg=37.48, stdev=10.75
lat (usec): min=24, max=3266, avg=37.61, stdev=10.75
Our results from an Intel S3610, for comparison. I assume you have 10 GbE (judging from the latency in the rados bench results).

8 x 60 MB/s = 480 MB/s (bandwidth if all disks are written to simultaneously on one host)
6 x 480 MB/s = 2880 MB/s (if all disks are written simultaneously)
2880 MB/s / 3 = 960 MB/s (if replica of the pool is 3)

This is only a rough calculation, and the 960 MB/s is without accounting for latency (avg. 95 ms). And I guess the cluster is also busy, which drops the results further.

How do your other clusters compare?

For reference: https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
 
Leaving all other hardware aside, I want to concentrate on the disks first.

Did you test the Intel SSDSC2BB016T7R with fio? What were your results?

Code:
write: io=1450.6MB, bw=24755KB/s, iops=6188, runt= 60001msec
clat (usec): min=137, max=6295, avg=158.62, stdev=32.83
lat (usec): min=137, max=6296, avg=158.98, stdev=32.83

So it's substantially slower than your observed I/O for an S3610, but that doesn't explain the 6 minutes to write a 2 GB image.

This is only a rough calculation, and the 960 MB/s is without accounting for latency (avg. 95 ms). And I guess the cluster is also busy, which drops the results further.
That's just it: the cluster is mostly idle at the moment, since it was only brought online a week ago. I'm really stumped.
 
Alwin, I think we got hung up on the wrong thing. Yes, these drives are oddly slow (I have the same drive with an HP part number which performs normally), but that doesn't explain the slow deployment.

Since then I've been able to reproduce the really slow (5 MB/s) template deployment on another cluster. It is mostly felt on larger templates such as Gentoo, but it still boggles my mind why it takes so long. Why does a template that takes 15 s to extract take over 5 minutes to deploy?
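
One generic way to narrow this down (not something tried in this thread, just a sketch) would be to watch the node while a large template deploys and see whether it is CPU-bound, I/O-bound, or simply waiting on something else:
Code:
# run in separate shells on the node while the deployment is in progress
iostat -xm 1    # per-device utilization and await times
top -d 1        # CPU usage of the extraction process, and anything else eating cycles
ceph -s         # cluster-wide client I/O rate during the restore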
 
Did you ever solve this? My average write IOPS are 90 using rados bench. I have 24x 1 TB spinning drives across 4 servers with dual 10 GbE connections. Clearly something is wrong with my config.
 
Yes, I did. It ended up not being Ceph at all; this cluster had Sophos AV running, and it slows many disk operations to a crawl. If you can rule out any in-memory process that could be slowing you down, it's probably just your slow disks. Spinning disks are only capable of delivering ~100 IOPS per disk, which is all you'll get for writes per logical target (see the rough numbers below).
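
As a rough upper bound for a setup like yours (assuming a pool size of 3 and journal/DB on the same spinners):

24 x ~100 IOPS = ~2400 IOPS (all spinners combined)
2400 / 3 = ~800 IOPS (with size=3, every client write lands on three OSDs)
~800 / 2 = ~400 IOPS (roughly, if the journal/WAL+DB shares the same disk)

So a few hundred write IOPS is about the ceiling before latency and other overhead, and measured results usually land well below that.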
 
That's true; spinners are only really usable with the WAL/DB offloaded to NVMe (which means losing every OSD whose DB sits on that NVMe if it fails).
It's important to use high-end SSDs with Ceph; there are lots of tests in the ceph-users mailing list archives. A minimal example of creating such an OSD is sketched below.
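
For reference, the DB/WAL placement is decided at OSD creation time. A minimal ceph-volume sketch, with placeholder device names (/dev/sdb as the HDD data device and /dev/nvme0n1p1 as a partition reserved for this OSD's DB+WAL):
Code:
# BlueStore OSD with its RocksDB/WAL on a faster device
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1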
 
