Is Ceph too slow and how to optimize it?

fcukinyahoo

New Member
Nov 29, 2012
27
0
1
The setup is 3 clustered Proxmox nodes for compute and 3 clustered Ceph storage nodes:

ceph01 8*150GB ssds (1 used for OS, 7 for storage)
ceph02 8*150GB ssds (1 used for OS, 7 for storage)
ceph03 8*250GB ssds (1 used for OS, 7 for storage)

When I create a VM on a Proxmox node using Ceph storage, I get the speeds below (network bandwidth is NOT the bottleneck).

Writing to a VM whose disk is on Ceph
Code:
[root@localhost ~]# dd if=/dev/zero of=./here bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 46.7814 s, 23.0 MB/s

[root@localhost ~]# dd if=/dev/zero of=./here bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 15.5484 s, 69.1 MB/s
For comparison, below is a VM on local Proxmox storage, same SSD model.

Writing to a VM whose disk is on local Proxmox storage
Code:
[root@localhost ~]# dd if=/dev/zero of=./here bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.301 s, 104 MB/s

[root@localhost ~]# dd if=/dev/zero of=./here bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 7.22211 s, 149 MB/s
This is my Ceph pool:
Code:
size/min = 3/2
pg_num = 2048
ruleset = 0
I am running 3 monitors on the same hosts, and each journal is stored on its own OSD.
I am running the latest Proxmox with Ceph Hammer.

Any suggestions on where we should look for improvements? Is it the Ceph pool? Is it the journals? Does it matter whether the journal is on the same drive as the OS (/dev/sda) or on the OSD (/dev/sdX)?
 

czechsys

Member
Nov 18, 2015
166
4
18
150 MB/s is very poor for an SSD. I can get that from a standard HDD. What is your hardware?
Anyway, Ceph isn't targeted at performance.
 

fcukinyahoo

New Member
Nov 29, 2012
27
0
1
@czechsys I would be happy if it were 150 MB/s; it is much less than that, about 23 MB/s for bs=1M count=1024.

Anyway, Ceph isn't targeted at performance.
What is the best-performing network storage for Proxmox? I thought it was Ceph...

Hardware is below:
Dell R210
CPU: 8 * X3460 @ 2.80GHz
Mem: 4GB
HDD per Ceph node: 8
Network: 2 NICs bonded, Cat 6 cable
 

spirit

Well-Known Member
Apr 2, 2010
3,527
156
63
www.odiso.com
The problem with benchmarking with dd is that it simulates a single stream, so latency is really important for this benchmark.

(If you benchmark with fio, for example, with iodepth=128 so that you have more parallel access, it will be a lot faster.)
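
To make the single-stream point concrete, here is a rough back-of-the-envelope sketch in Python using the dd numbers from the first post. The function names are made up for illustration, and it assumes each 1 MiB request waits for the previous one to complete, which is what a single dd process with oflag=direct gives you. Under that assumption, throughput is simply block size divided by per-request latency:

```python
# Per-request latency implied by the dd runs in the first post
# (bs=1M, count=1024, oflag=direct). With one synchronous stream,
# throughput = block size / per-request latency.

BLOCK_BYTES = 1024 * 1024   # bs=1M
REQUESTS = 1024             # count=1024


def per_request_latency_ms(elapsed_s, requests=REQUESTS):
    """Average time one request took, in milliseconds."""
    return elapsed_s / requests * 1000.0


def throughput_mb_s(latency_ms, block_bytes=BLOCK_BYTES):
    """Single-stream throughput in MB/s (10^6 bytes, as dd reports it)."""
    return block_bytes / (latency_ms / 1000.0) / 1e6


ceph_ms = per_request_latency_ms(46.7814)   # VM disk on Ceph
local_ms = per_request_latency_ms(10.301)   # VM disk on local SSD

print(f"Ceph:  {ceph_ms:.1f} ms per 1 MiB write, {throughput_mb_s(ceph_ms):.1f} MB/s")
print(f"Local: {local_ms:.1f} ms per 1 MiB write, {throughput_mb_s(local_ms):.1f} MB/s")
```

So each 1 MiB write takes roughly 46 ms on Ceph versus roughly 10 ms on the local SSD, and a single stream cannot go faster than that latency allows.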

But for your benchmark, here are some tips:


1) Use the highest-frequency CPUs you can for your Ceph cluster and client.

2) In your Ceph cluster's ceph.conf file, disable cephx auth:
Code:
[global]
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
(This change requires a restart of the whole Ceph cluster and all VMs.)

3) Disable the debug features on the Ceph client.

Create /etc/ceph/ceph.conf on your KVM host with this content:

Code:
[global]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0
4) If you do sequential writes without direct I/O, you can enable cache=writeback.
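
Going back to the fio suggestion at the top of this post, here is a minimal sketch of what such a run could look like. It is written as a small Python helper that only prints the fio command lines so they can be pasted into the VM's shell. The helper name, the job name "seqwrite" and the target file ./here are placeholders chosen for illustration; the block size and total size are picked to mirror the dd test, and ioengine=libaio assumes a Linux guest:

```python
import shlex


def build_fio_cmd(target, iodepth=128):
    """Build a fio command for 1 GiB of direct, sequential 1 MiB writes."""
    return [
        "fio",
        "--name=seqwrite",        # job name (arbitrary)
        f"--filename={target}",   # file to write inside the VM
        "--rw=write",             # sequential write, like the dd test
        "--bs=1M",
        "--size=1G",
        "--direct=1",             # bypass the page cache, like oflag=direct
        "--ioengine=libaio",      # async engine so iodepth takes effect
        f"--iodepth={iodepth}",   # number of requests kept in flight
    ]


# Print the commands instead of running them.
print(shlex.join(build_fio_cmd("./here", iodepth=1)))    # dd-like baseline
print(shlex.join(build_fio_cmd("./here", iodepth=128)))  # parallel I/O
```

Comparing the iodepth=1 run with the iodepth=128 run should show how much of the gap is latency per request rather than raw bandwidth.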
 

tschanness

Member
Oct 30, 2016
291
21
18
29
Use 10G Ethernet.
Use more than 4 GB of RAM.

Are your storage and host networks on the same NICs?
 

tom

Proxmox Staff Member
Staff member
Aug 29, 2006
13,696
435
83
What kind of SSD do you use? Please add the specification of your SSDs to this thread.
 

hansm

Member
Feb 27, 2015
61
3
8
A Dell R210 can hold two 2.5" disks and the R210 II can hold four. How can you have 8 SSDs in it? To clarify, the R210 is a single-socket server, so I assume you have one X3460 with 4 cores/8 threads. 4 GB of RAM is far too little. You need at least 1 GB per OSD; better use 16 GB for better performance.

I think your network IS your bottleneck. You have 2x 1 Gbit/s? What bond mode?

Please clarify your hardware and configuration and be thorough in describing it. We want to help, but you NEED to tell us everything about your setup.
 

fcukinyahoo

New Member
Nov 29, 2012
27
0
1
@spirit
I will try your suggestions. I am still setting it up so it is not in production yet. I am doing all my testing on test VMs. So restarting will not be a problem. Thank you.

@tschanness
Use more than 4 GB of RAM.
I can increase the RAM. I will give it a shot.
Are your storage and host networks on the same NICs?
All NICs are bonded. However, that doesn't increase throughput; it only adds reliability.
Use 10G Ethernet.
Below is my network benchmark from one Ceph server to the Proxmox KVM host:
Code:
root@ceph01:~# iperf -c 192.168.1.10
------------------------------------------------------------
Client connecting to 192.168.1.10, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.11 port 40690 connected with 192.168.1.10 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.09 GBytes 940 Mbits/sec
@tom
SSD Models are
ceph01: INTEL SSDSC2BW24
ceph02: INTEL SSDSA2M160
ceph03: INTEL SSDSA2M160

@hansm
A Dell R210 can have 2 2,5" disks and the R210 II can have 4. How can you have 8 ssd's in it?
We bought a PCI RAID controller (Dell PERC H700 SAS RAID controller with 512 MB cache) and attached a 2*4 data cable to it.
better use 16GB for better performance.
I will increase the RAM, as someone else suggested as well. Thank you.
I think your network IS your bottleneck. You have 2x 1 Gbit/s?
All NICs are bonded. The network throughput from one Ceph node to the Proxmox host node is the iperf result already shown above in my reply to @tschanness (940 Mbits/sec).
What bond mode?
802.3ad
we want to help, but you NEED to tell us everything about your setup.
Please let me know if you need more information. I would like to get this setup into production as fast as possible, with the current equipment if possible.

@spirit
also, what is your ssd model ?
SSD models are
ceph01: INTEL SSDSC2BW24
ceph02: INTEL SSDSA2M160
ceph03: INTEL SSDSA2M160
you need enterprise drive for ceph journal, for fast sync write
For the journals, I kept the default, which writes each journal to its own OSD. I also see 7 OSD daemons running on each server, one per drive. Is that normal and expected? I created the OSDs in the Proxmox interface, so I assume yes.

Thanks a lot for all your help.
 

mir

Well-Known Member
Apr 14, 2012
3,489
97
48
Copenhagen, Denmark
The network is too slow. For anything but a home setup or testing, 10 Gb is the absolute minimum.
The SSDSC2BW24 and SSDSA2M160 are not datacenter-quality disks.
 

hansm

Member
Feb 27, 2015
61
3
8
Your SSDs are consumer grade and not fit for the journal job. See https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ and look for your Intel 520. The 9 MB/s is really slow. You have 7 OSDs (and OSD daemons) per server, so 7 x 9 MB/s = 63 MB/s is the maximum performance per Ceph node. I'm not sure, but I suppose we need to divide this value by 2 because of the double write (one for the journal and one for the real data). You end up with 31.5 MB/s for your Ceph cluster. This is close to your test results.

Your network isn't separated into host and cluster networks, so your bandwidth is shared. If you have 940 Mbit/s, you need to divide it by 2 = 470 Mbit/s, and 470 / 8 = 58.75 MB/s max throughput to your Ceph cluster. This is because your host is writing to Ceph while Ceph is replicating your data to the other nodes.
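
The two estimates above can be written out as a quick Python check. The function and constant names are made up for illustration, the inputs are just the figures quoted in this thread, and the divide-by-2 steps are the guesses stated above rather than measured values:

```python
# Rough ceilings from the estimates above. All inputs are figures quoted
# in this thread; the halving steps are assumptions, not measurements.

OSDS_PER_NODE = 7
JOURNAL_SYNC_MB_S = 9.0     # per-SSD sync write speed quoted for the Intel 520
IPERF_MBIT_S = 940.0        # measured from ceph01 to the Proxmox host


def journal_ceiling_mb_s(osds, per_osd_mb_s):
    """Per-node ceiling with journal and data on the same SSD (double write)."""
    return osds * per_osd_mb_s / 2


def network_ceiling_mb_s(link_mbit_s):
    """Ceiling when client and replication traffic share one link."""
    return link_mbit_s / 2 / 8   # halve for sharing, then bits to bytes


print(journal_ceiling_mb_s(OSDS_PER_NODE, JOURNAL_SYNC_MB_S))  # 31.5
print(network_ceiling_mb_s(IPERF_MBIT_S))                      # 58.75
```

The smaller of the two numbers is the one that matters, and here that is the journal estimate.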

This setup will never perform the way you would like it to.

Besides that, I'm very curious how you put 8 disks in the R210. Your RAID controller can handle that number of disks, but your server case can't, as far as I know ;-)
 

Mihai

Member
Dec 22, 2015
55
3
8
34
2) In your Ceph cluster's ceph.conf file, disable cephx auth:
Code:
[global]
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
(This change requires a restart of the whole Ceph cluster and all VMs.)

When disabling cephx, can I restart each host one by one, or does the entire cluster need to be off and then on again to get this to work?
 

aderumier

Member
May 14, 2013
203
18
18
When disabling cephx, can I restart each host one by one, or does the entire cluster need to be off and then on again to get this to work?
You need to restart your Ceph cluster (mon/osd) and all the VMs.
 
