Ceph performance and latency

udo

Famous Member
Apr 22, 2009
5,935
184
83
Ahrensburg; Germany
Hi,
I ran some rados benchmarks and I'm seeing high max latency (rados -p test bench -b 4194304 60 write -t 32 --no-cleanup).

Does anybody know how to find and isolate the cause of the high latency?
Code:
                          -b 4194304     -b 4096
Total time run:           61.186561      60.008194
Total writes made:        7045           100939
Write size:               4194304        4096
Bandwidth (MB/sec):       460.559        6.571

Stddev Bandwidth:         98.5743        5.50003
Max bandwidth (MB/sec):   568            15.4883
Min bandwidth (MB/sec):   0              0
Average Latency:          0.277103       0.0190213
Stddev Latency:           0.227484       0.120132
Max latency:              4.81041        4.09047
Min latency:              0.067239       0.001231
One thing I found is that the number of threads matters: throughput improves with more threads. Even on a host with only 8 cores, performance with -t 32 is much better than with -t 8. The max latency also rose, but only slightly.
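A rough way to sanity-check these numbers is Little's law: with a fixed number of writes in flight, bandwidth is roughly threads × block size / average latency. The sketch below plugs in the values from the two runs above. It assumes that rados bench counts one MB as 2^20 bytes, and the function name is made up for illustration.

```python
def expected_bandwidth_mib(threads: int, block_bytes: int, avg_latency_s: float) -> float:
    """Little's law: throughput = operations in flight / average latency."""
    ops_per_second = threads / avg_latency_s
    # Assumption: "MB" in the benchmark summary means 2**20 bytes.
    return ops_per_second * block_bytes / (1024 * 1024)


# Values copied from the two runs above, both with -t 32.
print(round(expected_bandwidth_mib(32, 4194304, 0.277103), 1))  # 4 MB blocks -> 461.9
print(round(expected_bandwidth_mib(32, 4096, 0.0190213), 2))    # 4 KB blocks -> 6.57
```

Both estimates land within about 1% of the reported 460.559 and 6.571 MB/sec. That suggests the extra threads raise throughput simply by keeping more writes in flight. It says nothing about what causes the rare multi-second outliers.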
 

e100

Renowned Member
Nov 6, 2010
1,267
40
68
Columbus, Ohio
ulbuilder.wordpress.com
I don't have an answer to your question but do have some related comments.

We just set up a four-node Ceph cluster using old SATA disks with SMART errors that we had lying around. Each node has 16 GB RAM and 10G InfiniBand.

I also noticed that a higher number of threads resulted in better performance.

I wish I had four SSDs lying around so I could benchmark with the journal on SSD.

I am disappointed in the read speeds of Ceph. I suspect that network communication is what introduces the unexpected latency.
 

udo

Hi e100,
you don't get good read performance even with an InfiniBand connection?

I searched a bit and found a hint on the Ceph mailing list: SSDs are sometimes very slow as journal devices, because Ceph writes the journal with dsync.
We use three different SSD models so that they don't all die at the same time.

Here are the results from my SSDs:
Code:
dd if=/root/randfile of=/mnt/test bs=350k count=10000 oflag=direct,dsync

# ssd1: Corsair Force GS
161 MB/s

# ssd2: INTEL SSDSC2CW12
126 MB/s

# ssd3: Samsung SSD 840
52.6 MB/s
The Samsung is much too slow... I have ordered a new Corsair SSD and will report whether the latency changes.
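For anyone who wants to repeat this kind of test without dd, here is a minimal Python sketch of the same idea: write fixed-size blocks with O_DSYNC so that every write waits for stable storage. It is not an exact equivalent of the dd command above. It leaves out O_DIRECT (which needs aligned buffers), the function name and the scratch file path are hypothetical, and it assumes a POSIX system.

```python
import os
import tempfile
import time


def dsync_write_mib_per_s(path: str, block_size: int = 350 * 1024, count: int = 100) -> float:
    """Write `count` blocks with O_DSYNC and return the throughput in MiB/s."""
    # O_DSYNC makes each write() wait until the data is on stable storage,
    # similar to dd's oflag=dsync. Fall back to O_SYNC where O_DSYNC is missing.
    sync_flag = getattr(os, "O_DSYNC", os.O_SYNC)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | sync_flag, 0o600)
    block = os.urandom(block_size)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return block_size * count / elapsed / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical target: a scratch file in the temp directory.
    # Point the path at a filesystem on the SSD under test instead.
    with tempfile.TemporaryDirectory() as tmp:
        rate = dsync_write_mib_per_s(os.path.join(tmp, "dsync-test.bin"))
        print(f"{rate:.1f} MiB/s")
```

The absolute numbers will differ from dd because of Python overhead and the missing O_DIRECT, so this is only useful for comparing devices against each other under the same conditions.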

Udo
 

e100

One thing I learned: putting the public and private Ceph networks on different IB connections gave me the best performance.

I was getting 187 MB/sec sequential read inside a Windows VM (virtio 0.1-52). I can get 1200 MB/sec reading from my Areca array in this same VM, so I know the bottleneck is not the VM itself.
Spread across 12 OSDs, that is only about 15 MB/sec each.

What really puzzled me is that during the sequential read the Ceph servers were not reading from the disks; they were getting the data from the cache.
To only get 187 MB/sec when reading from the cache of 4 nodes seems rather low, even with my crappy hardware.
I even tested with 16 OSDs and got the same speed. I'm not sure what is preventing more performance.

I've checked bandwidth using iperf, the IB network is working fine.
Any thoughts/suggestions?
 

udo

Hi,
what values do you get on the host with
Code:
rados -p test bench -b 4194304 60 write -t 32 --no-cleanup
rados -p test bench -b 4194304 60 seq -t 32 --no-cleanup
(you have to create the pool test first: "ceph osd pool create test 1600")

Udo
 

e100

rados -p test bench -b 4194304 60 write -t 32 --no-cleanup
Code:
Total time run:         61.299273
Total writes made:      1232
Write size:             4194304
Bandwidth (MB/sec):     80.392 

Stddev Bandwidth:       22.1143
Max bandwidth (MB/sec): 112
Min bandwidth (MB/sec): 0
Average Latency:        1.58592
Stddev Latency:         0.889294
Max latency:            4.17732
Min latency:            0.273837
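
When comparing several runs like this, it can help to turn the summary into numbers instead of reading the text by eye. The following is a small sketch that parses the "Label: value" lines of a pasted summary. The function name is made up, and the format is assumed to match the output shown in this thread.

```python
import re


def parse_rados_bench(summary: str) -> dict[str, float]:
    """Turn the 'Label: value' lines of a rados bench summary into a dict."""
    results = {}
    for line in summary.splitlines():
        # Label, colon, whitespace, then a plain decimal number up to end of line.
        match = re.match(r"\s*(.+?):\s+([0-9.]+)\s*$", line)
        if match:
            results[match.group(1)] = float(match.group(2))
    return results


# Excerpt of the write run above.
sample = """\
Total time run:         61.299273
Total writes made:      1232
Write size:             4194304
Bandwidth (MB/sec):     80.392

Average Latency:        1.58592
Max latency:            4.17732
"""

stats = parse_rados_bench(sample)
print(stats["Bandwidth (MB/sec)"], stats["Max latency"])
```

As a cross-check, 1232 writes × 4 MB / 61.299273 s gives about 80.39 MB/sec, which matches the reported bandwidth.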

rados -p test bench -b 4194304 60 seq -t 32 --no-cleanup
Code:
Total time run:        6.391866
Total reads made:     1232
Read size:            4194304
Bandwidth (MB/sec):    770.980 

Average Latency:       0.165004
Max latency:           0.753776
Min latency:           0.058182
OK, now that looks like what I was expecting. That's about 64 MB/sec per OSD, which is about the average speed of these old disks.
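
The per-OSD figure is just the total read bandwidth divided by the number of OSDs. A quick sketch of that arithmetic, assuming the 12 OSDs mentioned earlier in the thread and an even spread of reads across them (the helper name is made up):

```python
def per_osd_mb_s(total_mb_s: float, osd_count: int) -> float:
    """Average bandwidth per OSD, assuming reads are spread evenly."""
    return total_mb_s / osd_count


print(round(per_osd_mb_s(770.980, 12), 1))  # rados seq read with -t 32 -> 64.2
print(round(per_osd_mb_s(187.0, 12), 1))    # sequential read seen in the VM -> 15.6
```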

Inside the VM I get:
Code:
Sequential Read  :   186.845 MB/s
Sequential Write :    71.298 MB/s

The write speed seems about right, but the read speed in the VM is far below what rados achieves.

Same VM, different disk that is on a local Areca array:
Code:
Sequential Read  :  1301.366 MB/s
Sequential Write :   985.813 MB/s

Both the RBD and LVM disks in the VM have the same settings and driver versions.
I've tested the writethrough, writeback, directsync, and none cache modes; the sequential read on RBD always hovers around 187 MB/sec.
 

udo

You ran the benchmark with 32 threads (-t 32). A VM has only a single IO thread.

Hi,
but this confuses me completely...

With a single thread I get very low values with rados (block size 4M):
write: 93.180 MB/s
read: 43.655 MB/s

Inside a VM I get this with dd (1 MB block size on a filesystem):
write: 214 MB/s
read: 168 MB/s

How can the VM be faster than the host? Nevertheless, IMHO the VM speed is not high enough.

With many threads the read speed is the same as (or a little higher than) the write speed, but in the VM (and with rados using 1 thread) the read speed is much lower.
Is this normal?

Udo
 

dietmar

Proxmox Staff Member
Staff member
Apr 28, 2005
17,137
532
133
Austria
www.proxmox.com
I have no real idea. But I think it would be better to ask those questions on the ceph mailing lists.
 

e100

With a single rados thread I got:
Code:
Total time run:        24.016654
Total reads made:     1232
Read size:            4194304
Bandwidth (MB/sec):    205.191 

Average Latency:       0.0194893
Max latency:           0.026378
Min latency:           0.012442

That corresponds to what I see in the VM itself.

Seems the single IO thread in KVM is the limitation here.
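
One way to check this conclusion is to rearrange Little's law and estimate how many reads were in flight on average: bandwidth / block size × average latency. The sketch below uses the numbers posted in this thread. The function name is made up. Treating the VM reads as 4 MB requests with the same latency as the single-thread rados run is an assumption, not something that was measured.

```python
def ops_in_flight(bandwidth_mib_s: float, avg_latency_s: float, block_mib: float = 4.0) -> float:
    """Little's law rearranged: concurrency = operations per second * latency."""
    return bandwidth_mib_s / block_mib * avg_latency_s


print(round(ops_in_flight(770.980, 0.165004), 1))   # rados seq with -t 32 -> 31.8
print(round(ops_in_flight(205.191, 0.0194893), 2))  # rados seq, single thread -> 1.0
# Assumption: VM reads behave like 4 MB requests at the single-thread latency.
print(round(ops_in_flight(186.845, 0.0194893), 2))  # read inside the VM -> 0.91
```

The 32-thread run keeps about 32 reads in flight, while the VM figure works out to roughly one. That is consistent with a single IO thread being the limit, under the stated assumptions.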
 

udo

Hi,
with the new SSD for the journal I get better performance (610 MB/s write with 4 MB blocks instead of 460 MB/s), but the max latencies are sometimes no better...

Udo
 
