Is Ceph too slow and how to optimize it?

fcukinyahoo · Dec 1, 2016

The setup is 3 clustered Proxmox nodes for compute and 3 clustered Ceph storage nodes:

ceph01 8*150GB ssds (1 used for OS, 7 for storage)
ceph02 8*150GB ssds (1 used for OS, 7 for storage)
ceph03 8*250GB ssds (1 used for OS, 7 for storage)

When I create a VM on a Proxmox node using Ceph storage, I get the speeds below (network bandwidth is NOT the bottleneck).

Writing inside a VM whose disk is on Ceph:

Code:

[root@localhost ~]# dd if=/dev/zero of=./here bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 46.7814 s, 23.0 MB/s

[root@localhost ~]# dd if=/dev/zero of=./here bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 15.5484 s, 69.1 MB/s

For comparison, below is a VM on Proxmox local storage, same SSD model.

Writing inside a VM whose disk is on Proxmox:

Code:

[root@localhost ~]# dd if=/dev/zero of=./here bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.301 s, 104 MB/s

[root@localhost ~]# dd if=/dev/zero of=./here bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 7.22211 s, 149 MB/s
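As a sanity check on the numbers above: the rate dd reports is bytes copied divided by elapsed time, in decimal megabytes (10^6 bytes). A minimal sketch that recomputes the four results and the slowdown for the bs=1M case; the function name `mb_per_s` and the run labels are made up for illustration.

```python
def mb_per_s(total_bytes: int, seconds: float) -> float:
    """Throughput in decimal MB/s (1 MB = 10**6 bytes), as dd reports it."""
    return total_bytes / seconds / 1e6

TOTAL = 1073741824  # bytes written in every run above (1 GiB)

# Elapsed seconds copied from the dd output above
runs = {
    "ceph bs=1M": 46.7814,
    "ceph bs=1G": 15.5484,
    "local bs=1M": 10.301,
    "local bs=1G": 7.22211,
}

rates = {label: mb_per_s(TOTAL, secs) for label, secs in runs.items()}
for label, rate in rates.items():
    print(f"{label}: {rate:.1f} MB/s")

# How much slower the Ceph-backed disk is for the bs=1M case
slowdown = rates["local bs=1M"] / rates["ceph bs=1M"]
print(f"bs=1M slowdown on Ceph: {slowdown:.1f}x")
```

For bs=1M the Ceph-backed disk comes out roughly 4.5 times slower than the local one.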

This is my Ceph pool:

Code:

size/min = 3/2
pg_num = 2048
ruleset = 0

Running 3 monitors on the same hosts. Each journal is stored on its own OSD.
Running the latest Proxmox with Ceph Hammer.

Any suggestions on where we should look for improvements? Is it the Ceph pool? Is it the journals? Does it matter if the journal is on the same drive as the OS (/dev/sda) or on the OSD (/dev/sdX)?

czechsys · Dec 1, 2016

150MBps is very poor for SSD. I can do this with standard HDD. What is your HW?
Anyway, ceph isn't targeted for performance.

fcukinyahoo · Dec 2, 2016

@czechsys I would be happy if it was 150MBps, it is much less than that. ~23MBps for bs=1M count=1024

czechsys said:
Anyway, ceph isn't targeted for performance.

What is the best performance network storage for Proxmox? I thought it was Ceph...

Hardware below,
Dell R210
CPU: 8 * X3460 @ 2.80GHz
Mem: 4GB
HDD per Ceph node: 8
Network: 2 nic bond cat 6 cable

spirit · Dec 2, 2016

The problem with benchmarking with dd is that it simulates a single stream, so latency is really important for this benchmark.

(If you bench with fio, for example with iodepth=128 so you have more parallel access, it'll be a lot faster.)

But for your benchmark, here are some tips:

1) Use the highest-frequency CPUs you can for your Ceph cluster and client.

2) In your Ceph cluster's ceph.conf file, disable cephx auth:

Code:

[global]
auth_cluster_required = none
auth_service_required = none
auth_client_required = none

(This change requires a restart of the whole Ceph cluster and all VMs.)

3) Disable debug features on the Ceph client.

- Create /etc/ceph/ceph.conf on your KVM host with this content:

Code:

[global]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0

4) If you do sequential writes without direct I/O, you can enable cache=writeback.
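To make the single-stream point at the top of this post concrete: with one request in flight at a time, throughput is roughly block size divided by per-request latency. The sketch below derives the average per-request latency from the dd run in the first post (1024 requests of 1 MiB in 46.7814 s) and then shows an idealized upper bound for deeper queues. The function names are illustrative, and the assumption that latency stays constant as queue depth grows is a simplification, so treat the larger numbers as ceilings rather than predictions.

```python
def per_request_latency_ms(elapsed_s: float, requests: int) -> float:
    """Average time per request when requests are issued one at a time."""
    return elapsed_s / requests * 1000

def ideal_throughput_mb_s(block_bytes: int, latency_ms: float, iodepth: int) -> float:
    """Idealized bound: iodepth requests in flight, each taking latency_ms.

    Assumes latency does not grow with queue depth, which a real cluster
    will not honor once disks, CPU, or network saturate.
    """
    return iodepth * block_bytes / (latency_ms / 1000) / 1e6

BLOCK = 1024 * 1024  # dd bs=1M

# dd run on the Ceph-backed disk: 1024 requests in 46.7814 s
latency = per_request_latency_ms(46.7814, 1024)
print(f"average latency per 1 MiB write: {latency:.1f} ms")

for depth in (1, 4, 16):
    bound = ideal_throughput_mb_s(BLOCK, latency, depth)
    print(f"iodepth={depth}: at most ~{bound:.0f} MB/s (idealized)")
```

At a queue depth of 1 this reproduces the roughly 23 MB/s that dd reported, which is why a single-stream test mostly measures latency.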

tschanness · Dec 2, 2016

Use 10G Ethernet
Use more than 4 GB RAM

Are your storage and host networks on the same NICs?

tom · Dec 2, 2016

What kind of SSD do you use? Please add the specification of your SSDs to this thread.

hansm · Dec 2, 2016

A Dell R210 can have two 2.5" disks and the R210 II can have four. How can you have 8 SSDs in it? As clarification, the R210 is a 1-socket server, so I assume you have one X3460 with 4 cores/8 threads. 4GB RAM is far too little: plan at least 1GB per OSD, and better use 16GB for better performance.

I think your network IS your bottleneck. You have 2x 1 Gbit/s? What bond mode?

Please clarify your hardware and configuration, be thorough in describing it, we want to help, but you NEED to tell us everything about your setup.

spirit · Dec 2, 2016

Also, what is your SSD model? Consumer or enterprise?

You need enterprise drives for the Ceph journal, for fast sync writes.

fcukinyahoo · Dec 2, 2016

@spirit
I will try your suggestions. I am still setting it up so it is not in production yet. I am doing all my testing on test VMs. So restarting will not be a problem. Thank you.

@tschanness

tschanness said:
Use more than 4 GB RAM

I can increase the RAM. I will give it a shot.

tschanness said:
Are your storage and host networks on the same NICs?

All NICs are bonded. However, that doesn't increase throughput, it gives reliability.

tschanness said:
Use 10G Ethernet

Below is my network benchmark from one Ceph server to the Proxmox KVM host:

Code:

root@ceph01:~# iperf -c 192.168.1.10
------------------------------------------------------------
Client connecting to 192.168.1.10, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.11 port 40690 connected with 192.168.1.10 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.09 GBytes 940 Mbits/sec

@tom
SSD Models are
ceph01: INTEL SSDSC2BW24
ceph02: INTEL SSDSA2M160
ceph03: INTEL SSDSA2M160

@hansm

hansm said:
A Dell R210 can have two 2.5" disks and the R210 II can have four. How can you have 8 SSDs in it?

We bought a PCI RAID controller and a 2*4 data cable attached to it: DELL PERC H700 SAS RAID controller with 512MB cache.

hansm said:
better use 16GB for better performance.

I will increase the RAM, as suggested by someone else as well. Thank you.

hansm said:
I think your network IS your bottleneck. You have 2x 1 Gbit/s?

All NICs are bonded. My network throughput from one Ceph node to the Proxmox host node is the iperf result posted above (940 Mbits/sec).

hansm said:
What bond mode?

802.3ad

hansm said:
we want to help, but you NEED to tell us everything about your setup.

Please let me know if you need more information. I would like to have this setup in production as fast as possible with current equipment if possible.

@spirit

spirit said:
also, what is your ssd model ?

SSD models are
ceph01: INTEL SSDSC2BW24
ceph02: INTEL SSDSA2M160
ceph03: INTEL SSDSA2M160

spirit said:
you need enterprise drive for ceph journal, for fast sync write

For the journals, I kept the default, so each one is written on its own OSD. I also see 7 OSD daemons running on each server, one for each drive. Is that normal and expected? I created the OSDs in the Proxmox interface, so I am assuming yes.

Thanks a lot for all your help.

mir · Dec 2, 2016

The network is too slow. For anything but a home setup or testing purposes, 10 Gb is the absolute minimum.
SSDSC2BW24 and SSDSA2M160 are not datacenter-grade disks.

hansm · Dec 2, 2016

Your SSDs are consumer grade and not fit for the journal job. See https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/ and look for your Intel 520. The 9 MB/s is really slow. You have 7 OSDs (and OSD daemons) per server, so 7 x 9 MB/s = 63 MB/s is the maximum performance per Ceph node. I'm not sure, but I suppose we need to divide this value by 2 because of the double write (1 for the journal and 1 for the real data). You end up with 31.5 MB/s performance for your Ceph cluster. This is close to your test results.

Your network isn't separated into host and cluster networks, so your bandwidth is shared. If you have 940 Mbit/s you need to divide it by 2 = 470 Mbit/s, then by 8 = 58.75 MB/s max throughput to your Ceph cluster. This is because your host writes to Ceph and Ceph replicates your data to the other nodes over the same link.
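The two estimates above can be written out as a short calculation. This is only a sketch of the post's back-of-envelope reasoning, not a measurement: the function names are illustrative, the 9 MB/s sync-write figure is the one quoted from the linked blog post, and the 940 Mbit/s is the iperf result posted earlier in the thread.

```python
def journal_bound_mb_s(osds_per_node: int, sync_write_mb_s: float) -> float:
    """Per-node write ceiling when journal and data share each OSD's SSD.

    Every write lands twice on the same drive (journal, then data),
    so the aggregate sync-write rate is halved.
    """
    return osds_per_node * sync_write_mb_s / 2

def shared_network_bound_mb_s(link_mbit_s: float) -> float:
    """Client write ceiling when client and replication traffic share a link.

    Follows the reasoning above: halve the measured bandwidth, then
    convert megabits to megabytes (divide by 8).
    """
    return link_mbit_s / 2 / 8

disk_bound = journal_bound_mb_s(osds_per_node=7, sync_write_mb_s=9)
net_bound = shared_network_bound_mb_s(link_mbit_s=940)

print(f"journal-limited: {disk_bound} MB/s per node")
print(f"network-limited: {net_bound} MB/s")
print(f"tighter bound:   {min(disk_bound, net_bound)} MB/s")
```

Under these assumptions the journal-limited figure is the tighter of the two, which matches the post's conclusion that the SSDs are the first thing to fix.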

This setup will never perform the way you would like it to do.

Besides that I'm very curious on how you put 8 disks in the R210, your RAID controller can handle the number of disks but your server case can't as far as I know ;-)

Mihai · Dec 18, 2017

spirit said:
The problem with benchmarking with dd is that it simulates a single stream, so latency is really important for this benchmark.

(If you bench with fio, for example with iodepth=128 so you have more parallel access, it'll be a lot faster.)

But for your benchmark, here are some tips:

1) Use the highest-frequency CPUs you can for your Ceph cluster and client.

2) In your Ceph cluster's ceph.conf file, disable cephx auth:

Code:

[global]
auth_cluster_required = none
auth_service_required = none
auth_client_required = none

(This change requires a restart of the whole Ceph cluster and all VMs.)

3) Disable debug features on the Ceph client.

- Create /etc/ceph/ceph.conf on your KVM host with this content:

Code:

[global]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0

4) If you do sequential writes without direct I/O, you can enable cache=writeback.

When disabling cephx, can I restart each host one by one, or does the entire cluster need to be off and then on again to get this to work?

aderumier · Dec 19, 2017

Mihai said:
When disabling cephx, can I restart each host one by one, or does the entire cluster need to be off and then on again to get this to work?

you need to restart your ceph cluster (mon/osd), and all the vms.

Mihai · Dec 19, 2017

aderumier said:
you need to restart your ceph cluster (mon/osd), and all the vms.

Thank you.

Otter7721 · Feb 11, 2023

I'm sorry to revive this thread, but I have the same problem. Do you have a solution? I use four 25 Gb network cards, and the storage hardware is SATA SSD. My test result is even worse than yours: rados bench shows that sequential write performance is only 60MB/s.

spirit · Feb 11, 2023

Otter7721 said:
I'm sorry to revive this thread, but I have the same problem. Do you have a solution? I use four 25 Gb network cards, and the storage hardware is SATA SSD. My test result is even worse than yours: rados bench shows that sequential write performance is only 60MB/s.

SSD model?

Otter7721 · Feb 11, 2023

spirit said:
SSD model?

Micron_1100_MTFDDAK1T0TB SCSI Disk Device
My thread is here, thank you for your time.
https://forum.proxmox.com/threads/c...roblems-fast-reading-and-slow-writing.122023/
