Ceph low performance (especially 4k)

sander93

Hello,

We have separate Ceph and Proxmox clusters (separate server nodes). I want to know whether the performance we get is normal; my impression is that it could be much better with the hardware we are using.

So is there any way we can improve with configuration changes?

The performance we get from inside the virtual machines is about:
Sequential: read 642.6 MB/s, write 459.8 MB/s
4K single thread: read 4.342 MB/s, write 15.45 MB/s

At 4K I only get 406 IOPS write and 835 IOPS read.

The hardware we are using:
4 x OSD nodes, per node:
- 96GB RAM
- 2 x 6-core CPU (with HT), 2.6 GHz
- 6 x SM863 960GB (single BlueStore OSD per SSD)
- 2 x 10Gb SFP+ (1 x 10Gb for storage, 1 x 10Gb for replication)

3 x monitor nodes, per node:
- 4GB RAM
- dual-core CPU (with HT)
- single 120GB Intel enterprise SSD
- 2 x 1Gb network (active/backup)

Replication/size: 2
Ceph Version: 12.2.8
Jumbo Frames enabled
Logging options from Ceph disabled in ceph.conf (this improves it a little bit)
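For reference, the kind of ceph.conf overrides meant here look roughly like this (a sketch; the exact list of debug subsystems is just an example, not necessarily what we disabled):
Code:
[global]
debug_ms = 0/0
debug_osd = 0/0
debug_auth = 0/0
debug_bluestore = 0/0
debug_rocksdb = 0/0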

All Proxmox nodes are connected with 1 x 10Gb SFP+.

Is there any configuration or setting we can change to improve performance? Or is this the maximum we can get with this hardware? Especially the 4K reads/writes are slow.

I have also been thinking: would it help to add 2 OSD nodes, each with a fast NVMe SSD, and use them as a cache pool in front of the normal SSD pool? Or will this make it even slower?

Thanks in advance,

Kind regards,

Sander
 
That seems a bit slow; I'm able to reach around 3000-4000 IOPS at 4K with a single thread / iodepth=1:

Code:
fio --randrepeat=1 --ioengine=libaio --direct=1 --name=test --filename=test --bs=4k --iodepth=1 --size=1G --readwrite=randread

(using 3 GHz CPUs on the Proxmox nodes and the Ceph servers, with logging and cephx disabled)

and a fast Mellanox switch:
rtt min/avg/max/mdev = 0.011/0.016/2.657/0.009 ms, ipg/ewma 0.029/0.017 ms
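For reference, disabling cephx boils down to these ceph.conf settings (a minimal sketch; all daemons need a restart, and running clients/VMs keep their old connections until they are restarted):
Code:
[global]
auth_cluster_required = none
auth_service_required = none
auth_client_required = none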


Cache pool will not help.

Mainly, for single thread / iodepth=1, you have to account for network latency + Ceph latency (client/server CPU frequency helps).
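As a rough worked example: at queue depth 1 the client waits for each op to complete before sending the next, so IOPS ≈ 1 / (latency per op). With ~0.1 ms of network round trips and ~0.15 ms of Ceph + CPU time, that is ~0.25 ms per 4K op, or about 1 / 0.00025 s = 4000 IOPS, regardless of how fast the SSDs are.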
 
@spirit, can you add some hardware details? That would put the 3000-4000 IO/s into perspective. The network latency looks like 100GbE. ;)

@sander93, an SM863 should get around ~17k IO/s with fio.
Code:
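# WARNING: with --rw=write this writes straight to the raw device and destroys whatever is on /dev/sdx (including an OSD)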
fio --ioengine=libaio --filename=/dev/sdx --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=fio-kvm --output-format=terse,json,normal --output=fio-kvm.log --bandwidth-log

I guess your cluster is already in production, so all the other IO on the cluster is interfering with your benchmark. Cache settings inside and outside the VM will affect this test further. As @spirit said, try to reduce the latency to and on your MONs; this can improve the IO/s.

A word of warning: don't use size 2 for your Ceph pools, especially in small clusters, as consecutive hardware failures can result in lost PGs (data).
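If you do move to size 3, it is a one-liner per pool (the pool name below is a placeholder; expect backfill traffic while the third replica is created):
Code:
ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2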
 
Thank you for your responses!

Can I do the fio test on an existing OSD disk without losing data?

Yes, this cluster is already in production.

I can add 10GbE cards to the monitor nodes; do you think this will help?
 
@Alwin

I'm using Mellanox SN2100 switches (25G ports).
The SSDs are Intel S3610, or NVMe Intel P4600.

I know the Ceph guys are working on the msgr2 protocol for Nautilus; it should help reduce latency with cephx, debug, and other things related to the monitors.
There are also some projects for DPDK/SPDK with NVMe drives, which should help too.

But yes, single thread / low iodepth is not easy with network storage.
 
And is it useful to disable cephx? If I'm right, we need to reboot all the VMs to do this, right?
 
OK, thanks. Can you tell me if I can run the fio benchmark on a disk with a running OSD, without losing data / connectivity, etc.?
 
Also,

osd_enable_op_tracker = false

should reduce latency.
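For reference, a sketch of how that is usually set, in the [osd] section of ceph.conf, or injected at runtime (some options only fully apply after an OSD restart):
Code:
[osd]
osd_enable_op_tracker = false

# or at runtime:
ceph tell osd.* injectargs '--osd_enable_op_tracker=false'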

I set osd_enable_op_tracker = false, and I also found on the internet the suggestion to set throttler_perf_counter = false.
Sequential read really improved, from 642 MB/s to 1016.5 MB/s, but write and 4K read/write are almost the same, unfortunately.

Are there any more configuration options I can change to improve this?
 
Are you already in production? If not, I think you could move your monitors onto the same servers as the OSDs.

(Dedicated monitor nodes are only needed for really big clusters.)

This should help reduce latency between OSD <-> MON, and 10GbE will help with latency between client <-> MON.
 
Yes, it is a production cluster...

OK, and is it otherwise useful to add 10GbE to the monitor nodes?
 
It's not about bandwidth, but generally, 10GbE switches have bigger ASICs with lower latency.

(Try a "ping -f" to compare between your Proxmox <-> OSD and Proxmox <-> MON links.)
 
Also, on the Ceph nodes, if they are not shared with VMs, you can try disabling the Spectre/Meltdown mitigations (I don't know which kernel you use):

In /etc/default/grub:

Code:
GRUB_CMDLINE_LINUX=" pti=off spectre_v2=off l1tf=off spec_store_bypass_disable=off"

Also, you can try to pin your CPU frequency to always max, plus the noop elevator:

Code:
GRUB_CMDLINE_LINUX_DEFAULT="intel_pstate=disable quiet intel_idle.max_cstate=0 processor.max_cstate=1 elevator=noop"
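These only take effect after regenerating the grub config and rebooting (on Debian-based nodes):
Code:
update-grub
reboot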
 
I understand it is for lower latency. There is almost no traffic; I get the following with ping -f:

From Proxmox to OSD node:
678456 packets transmitted, 678456 received, 0% packet loss, time 60305ms
rtt min/avg/max/mdev = 0.027/0.070/1.756/0.017 ms, ipg/ewma 0.088/0.067 ms

From Proxmox to Mon node:
569877 packets transmitted, 569876 received, 0% packet loss, time 59519ms
rtt min/avg/max/mdev = 0.040/0.086/0.776/0.020 ms, ipg/ewma 0.104/0.092 ms
 
