Ceph low performance (especially 4k)

sander93

Hello,

We have separate Ceph and Proxmox clusters (separate server nodes). I want to know whether the performance we get is normal; my impression is that it could be much better with the hardware we are using.

So is there any way we can improve with configuration changes?

The performance we get from inside the virtual machines is about:
Sequential: read 642.6 MB/s, write 459.8 MB/s
4K single thread: read 4.342 MB/s, write 15.45 MB/s

At 4K I only get 406 IOPS write and 835 IOPS read.

The hardware we are using:
4 x OSD nodes, per node:
- 96GB RAM
- 2 x 6-core CPU (with HT), 2.6 GHz
- 6 x SM863 960GB (single BlueStore OSD per SSD)
- 2 x 10Gb SFP+ (1 x 10Gb for storage, 1 x 10Gb for replication)

3 x monitor nodes, per node:
- 4GB RAM
- dual-core CPU (with HT)
- single 120GB Intel enterprise SSD
- 2 x 1Gb network (active/backup)

Replication/size: 2
Ceph Version: 12.2.8
Jumbo Frames enabled
Logging options from Ceph disabled in ceph.conf (this improves it a little bit)
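For reference, the kind of ceph.conf overrides meant here look roughly like this (a sketch; the exact list of debug subsystems is just an example, not necessarily what we disabled):
Code:
[global]
debug_ms = 0/0
debug_osd = 0/0
debug_auth = 0/0
debug_bluestore = 0/0
debug_rocksdb = 0/0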

All Proxmox nodes are connected with 1 x 10Gb SFP+.

Is there any configuration or setting we can change to improve performance? Or is this the maximum we can get with this hardware? Especially the 4K reads/writes are slow.

I have also been thinking: would it help to add 2 OSD nodes, each with a fast NVMe SSD, and use them as a cache pool in front of the normal SSD pool? Or will this make it even slower?

Thanks in advance,

Kind regards,

Sander
 
That seems a bit slow; I'm able to reach around 3000-4000 IOPS at 4K with a single thread / iodepth=1:

Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test --filename=test --bs=4k --iodepth=1 --size=1G --readwrite=randread

(using 3 GHz CPUs on the Proxmox nodes and the Ceph servers, with logging and cephx disabled)

and a fast Mellanox switch:
rtt min/avg/max/mdev = 0.011/0.016/2.657/0.009 ms, ipg/ewma 0.029/0.017 ms
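For reference, disabling cephx boils down to these ceph.conf settings (a minimal sketch; all daemons need a restart, and running clients/VMs keep their old connections until they are restarted):
Code:
[global]
auth_cluster_required = none
auth_service_required = none
auth_client_required = none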


Cache pool will not help.

Mainly, for single thread / iodepth=1, you have to account for network latency + Ceph latency (client/server CPU frequency helps).
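As a rough worked example: at queue depth 1 the client waits for each op to complete before sending the next, so IOPS ≈ 1 / (latency per op). With ~0.1 ms of network round trips and ~0.15 ms of Ceph + CPU time, that is ~0.25 ms per 4K op, or about 1 / 0.00025 s = 4000 IOPS, regardless of how fast the SSDs are.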
 
@spirit, can you add some hardware details? That would put the 3000-4000 IO/s into perspective. The network latency looks like 100GbE. ;)

@sander93, an SM863 should get around ~17k IO/s with fio.
Code:
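# WARNING: with --rw=write this writes straight to the raw device and destroys whatever is on /dev/sdx (including an OSD)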
fio --ioengine=libaio --filename=/dev/sdx --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=fio-kvm --output-format=terse,json,normal --output=fio-kvm.log --bandwidth-log

I guess your cluster is already in production, so all the other IO on the cluster is interfering with your benchmark. Cache settings inside and outside the VM will affect this test further. As @spirit said, try to reduce the latency to and on your MONs; this can improve the IO/s.

A word of warning: don't use size 2 for your Ceph pools, especially in small clusters, as consecutive hardware failures can result in lost PGs (data).
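If you do move to size 3, it is a one-liner per pool (the pool name below is a placeholder; expect backfill traffic while the third replica is created):
Code:
ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2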
 
Thank you for your responses!

Can I do the fio test on an existing OSD disk without losing data?

Yes, this cluster is already in production.

I can add 10GbE cards to the monitor nodes; do you think this will help?
 
@Alwin

I'm using Mellanox SN2100 switches (25G ports).
The SSDs are Intel S3610, or NVMe Intel P4600.

I know the Ceph guys are working on the msgr2 protocol for Nautilus; it should help reduce latency with cephx, debug, and other things related to the monitors.
There are also some projects for DPDK/SPDK with NVMe drives, which should help too.

But yes, single thread / low iodepth is not easy with network storage.
 
And is it useful to disable cephx? If I'm right, we need to reboot all the VMs to do this, right?
 
OK, thanks. Can you tell me if I can run the fio benchmark on a disk with a running OSD, without losing data / connectivity, etc.?
 
Also,

osd_enable_op_tracker = false

should reduce latency.
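For reference, a sketch of how that is usually set, in the [osd] section of ceph.conf, or injected at runtime (some options only fully apply after an OSD restart):
Code:
[osd]
osd_enable_op_tracker = false

# or at runtime:
ceph tell osd.* injectargs '--osd_enable_op_tracker=false'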

I set osd_enable_op_tracker = false, and I also found on the internet the suggestion to set throttler_perf_counter = false.
Sequential read really improved, from 642 MB/s to 1016.5 MB/s, but write and 4K read/write are almost the same, unfortunately.

Are there any more configuration options I can change to improve this?
 
Are you already in production? If not, I think you could move your monitors onto the same servers as the OSDs.

(Dedicated monitor nodes are only needed for really big clusters.)

This should help reduce latency between OSD <-> MON, and 10GbE will help with latency between client <-> MON.
 
Yes, it is a production cluster...

OK, and is it otherwise useful to add 10GbE to the monitor nodes?
 
It's not about bandwidth, but generally, 10GbE switches have bigger ASICs with lower latency.

(Try a "ping -f" to compare between your Proxmox <-> OSD and Proxmox <-> MON links.)
 
Also, on the Ceph nodes, if they are not shared with VMs, you can try disabling the Spectre/Meltdown mitigations (I don't know which kernel you use):

In /etc/default/grub:

Code:
GRUB_CMDLINE_LINUX=" pti=off spectre_v2=off l1tf=off spec_store_bypass_disable=off"

Also, you can try to pin your CPU frequency to always max, plus the noop elevator:

Code:
GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=disable quiet intel_idle.max_cstate=0 processor.max_cstate=1 elevator=noop"
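These only take effect after regenerating the grub config and rebooting (on Debian-based nodes):
Code:
update-grub
reboot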
 
I understand it is for lower latency. There is almost no traffic; I get the following with ping -f:

From Proxmox to OSD node:
678456 packets transmitted, 678456 received, 0% packet loss, time 60305ms
rtt min/avg/max/mdev = 0.027/0.070/1.756/0.017 ms, ipg/ewma 0.088/0.067 ms

From Proxmox to Mon node:
569877 packets transmitted, 569876 received, 0% packet loss, time 59519ms
rtt min/avg/max/mdev = 0.040/0.086/0.776/0.020 ms, ipg/ewma 0.104/0.092 ms
 
