CEPH performance

Volker Lieder

Well-Known Member
Nov 6, 2017
Hi,
we set up a new environment with 3 nodes: Debian Stretch, Proxmox 5.1, Ceph Luminous.
Each node has 4 SSDs for OSDs, 12 OSDs in total.
Based on PGCalc we set a pg_num of 512 for the Ceph pool.
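For reference, that matches the usual PGCalc rule of thumb:
(12 OSDs x 100 target PGs per OSD) / 3 replicas = 400 -> next power of two = 512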

The Ceph network is connected via InfiniBand. If we install a VM on the Ceph storage and run dd inside it, we only get around 175-200 MB/s.

rados bench also gives around 170-200 MB/s.

iperf shows a bandwidth of around 6 Gbit/s.

Any links/hints on what we can do to get better performance with InfiniBand, Stretch and Proxmox 5.1?

Regards,
Volker
 
I am not a Ceph expert, but I found out the hard way that it is best to use drives that are recommended and have already been tested by others.
Do you know if that is the case with those drives?

Also post your ceph.conf and pveversion -v output; then hopefully a Ceph specialist will have more info to help evaluate.
 
Hmmm, you have the standard SSD, not the high-endurance version (3 DWPD)!

The Toshiba HK4 is a 2.5" enterprise SATA SSD that comes in two models: a read-intensive model (HK4R) and a high-endurance model (HK4E). The drives have capacities ranging from 120GB to 1.92TB (depending on model type) and utilize Toshiba's next-gen 15nm NAND as well as Toshiba controllers. This would make it one of the first SATA drives to hit the 2TB capacity point. The drives come with a 5-year warranty and are designed for a variety of use cases including mixed workloads, web servers, file servers, media streaming, video-on-demand, search engines and warm data storage.
 
root@cloud-node11:/mnt# cat /etc/ceph/ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.16.70.0/24
fsid = 09fbf10f-836d-4bc2-b678-b78897966984c1
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 172.16.65.0/24

# Disable in-memory logs
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.cloud-node12]
host = cloud-node12
mon addr = 172.16.65.112:6789


root@cloud-node11:/mnt# pveversion -v
proxmox-ve: 5.1-26 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-26
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: not correctly installed
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
ceph: 12.2.1-pve3
 
Try adding the following to your ceph.conf:

[global]
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0

It should improve performance.
There is also a bug in current Luminous with debug ms; it will be set to 0/0 by default in the next Ceph release.
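To apply e.g. the debug ms change at runtime without restarting the OSDs, something like this should work (a sketch; the ceph.conf entries are still needed to make it permanent):

ceph tell osd.* injectargs '--debug_ms 0/0'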


Also, for rados bench, try "-t 64" to increase the number of threads (16 by default).
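For example, something like this (the pool name "ceph" is just a placeholder, use your actual pool; --no-cleanup keeps the written objects so the seq read test has something to read):

rados bench -p ceph 60 write -t 64 --no-cleanup
rados bench -p ceph 60 seq -t 64
rados -p ceph cleanup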
 
1] don't use dd as a benchmark, it is useless; use fio (see the fio sketch after this list)
2] use fio for read and write tests
3] use fio for a readwrite (mixed) test
4] 170-200 MB/s = 1.36-1.6 Gbps x 3 (replicas) = 4.1-4.8 Gbps

5] what CPU is used? Is HT enabled? What is the % load per core?
6] what is the OSD config? ext4? XFS? Bluestore?

etc, etc
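A fio sketch for those tests (device, block size and job counts are placeholders to adapt; note that writing to a raw /dev/sdX destroys its data, so only use a disk or test file you can afford to lose):

fio --name=randread --filename=/dev/sdX --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting
fio --name=randwrite --filename=/dev/sdX --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting
fio --name=randrw --filename=/dev/sdX --ioengine=libaio --direct=1 --rw=randrw --rwmixread=70 --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting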
 
We used dd to ensure the single disk is running fast enough on the node itself. Furthermore, we used dd inside a VM, which shouldn't be that useless, or am I wrong?

But: we also used fio:

root@cloud-node11:/mnt# fio --filename=/dev/sdd --direct=1 --sync=1 --rw=write --bs=4k --numjobs=6 --iodepth=2 --runtime=60 --time_based --group_reporting --name=journal-test
journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=2
...
fio-2.16
Starting 6 processes
Jobs: 6 (f=6): [W(6)] [100.0% done] [0KB/107.4MB/0KB /s] [0/27.5K/0 iops] [eta 00m:00s]
journal-test: (groupid=0, jobs=6): err= 0: pid=27991: Fri Nov 24 15:27:37 2017
write: io=6382.7MB, bw=108929KB/s, iops=27232, runt= 60001msec
clat (usec): min=146, max=1426, avg=219.74, stdev=20.98
lat (usec): min=146, max=1426, avg=219.81, stdev=20.98
clat percentiles (usec):
| 1.00th=[ 191], 5.00th=[ 195], 10.00th=[ 199], 20.00th=[ 207],
| 30.00th=[ 211], 40.00th=[ 213], 50.00th=[ 217], 60.00th=[ 221],
| 70.00th=[ 225], 80.00th=[ 231], 90.00th=[ 243], 95.00th=[ 258],
| 99.00th=[ 278], 99.50th=[ 286], 99.90th=[ 298], 99.95th=[ 302],
| 99.99th=[ 596]
lat (usec) : 250=93.15%, 500=6.84%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%
cpu : usr=0.69%, sys=8.31%, ctx=3268083, majf=0, minf=156
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1633955/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
WRITE: io=6382.7MB, aggrb=108928KB/s, minb=108928KB/s, maxb=108928KB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
sdd: ios=0/3263394, merge=0/0, ticks=0/319752, in_queue=319380, util=97.22%

CPU is: 2 x Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
HT is enabled.
OSD config is XFS with Bluestore.
% load is only noticeable on cores 1-5: #1 has ~15-16%, #2-5 have ~6-8%.
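If you want to double-check whether the OSDs really run Bluestore (XFS is normally only relevant for Filestore data partitions), the OSD metadata should show it, for example for osd.0:

ceph osd metadata 0 | grep osd_objectstore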
 
Is your InfiniBand network 10, 40 or 56 Gbit?
Which network cards? Which firmware?
Which switches? Which firmware?

We have Mellanox, and our first step was to bump the firmware up to the latest release, also on our Mellanox switch :)
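To check what the link actually negotiates, ibstat from the infiniband-diags package should show the firmware version and the per-port rate, e.g.:

ibstat | grep -E 'Firmware version|Rate'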

please use code-tags for cut&paste content ...
 
The problem with dd as a benchmark is that it is effectively iodepth=1 and sequential,
so you'll be limited by latency (network + CPU frequency).
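As a rough illustration (made-up numbers, not measurements from this thread): at queue depth 1 the throughput is roughly block size / per-op latency, so

1 MiB block / 5 ms end-to-end write latency ≈ 200 MB/s

no matter how fast the individual SSDs are.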

With 18 SSD OSDs, 3x replication and big 2x12-core 3.1 GHz CPUs, I'm able to reach around 700k IOPS randread 4K and 150-200k randwrite 4K
(fio, iodepth=64, numjobs=20), running fio-rbd on the host (the host also has 2x12 3.1 GHz cores).
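For reference, an fio rbd-engine run along those lines might look like this (pool and image names are placeholders; use a dedicated test image, since write tests will overwrite it):

fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=fio-test --name=rbd-randread --rw=randread --bs=4k --iodepth=64 --numjobs=20 --direct=1 --runtime=60 --time_based --group_reporting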


QEMU also has per-disk limits, because a single disk is not multithreaded. I'm able to reach 70k randread 4K with 1 disk.
If you want to scale inside 1 VM, add more disks with the iothread option (a sketch follows below).
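A hedged sketch of what that could look like on the Proxmox side (VM ID, storage and volume names are placeholders; iothread needs the virtio-scsi-single controller or virtio-blk):

qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi1 ceph-storage:vm-100-disk-2,iothread=1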
 
