Ceph Performance

troycarpenter

I am running PVE 5.0 with Ceph Luminous from the test repository. I have a 13-node Ceph & PVE cluster with three storage nodes; each storage node has four 2 TB OSDs backed by BlueStore, plus an SSD partitioned for the journals (block.db). The nodes communicate over a 10 Gb network.

I have benchmarked the network between all nodes with iperf, and it consistently shows around 9.5 Gbps. I ran Ceph cluster benchmarks from various nodes following the page here: http://tracker.ceph.com/projects/ceph/wiki/Benchmark_Ceph_Cluster_Performance. The average from various clients in the Ceph cluster is 430 MB/s for writes and 650 MB/s for both sequential and random reads. From what I can tell, those benchmarks are what I would expect.
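For reference, the tests on that page boil down to roughly the following (the pool name is just the one I use, and the durations are examples):
Code:
# raw network throughput between two nodes
iperf -s                     # on one node
iperf -c 192.168.201.245     # from another node

# Ceph-level throughput: write objects first so the read tests have data
rados bench -p lab-vm-pool 60 write --no-cleanup
rados bench -p lab-vm-pool 60 seq
rados bench -p lab-vm-pool 60 rand
rados -p lab-vm-pool cleanup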

However, the problem I'm having is that in practice the Ceph cluster feels slow. Moving a 150 GB disk image from local storage to the Ceph storage took over 1.5 hours. The Proxmox Ceph performance page registers read and write speeds in KBps.

Any advice as to what the problem may be? How do I determine why the underlying Ceph cluster seems to be performing as expected while the Proxmox system reads and writes to it so slowly? I can't move other images to the Ceph storage until I can be sure the performance is there.
 
Followup...
When trying to do some rbd commands, I get this sometimes:

7f19cf41d700 0 -- 192.168.201.236:0/4227972101 >> 192.168.201.245:6800/11670 conn(0x7f19b8017eb0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER

If I abort the command and issue it again, the command will work. For instance:
Code:
root@ajax:/var/lib/ceph# rbd rm -p lab-vm-pool vm-1361-disk-1
2017-10-09 13:46:01.756819 7f19cf41d700  0 -- 192.168.201.236:0/4227972101 >> 192.168.201.245:6800/1634 conn(0x7f19b800dab0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-10-09 13:46:01.757025 7f19cf41d700  0 -- 192.168.201.236:0/4227972101 >> 192.168.201.245:6800/1634 conn(0x7f19b800dab0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-10-09 13:46:01.757610 7f19d041f700  0 -- 192.168.201.236:0/4227972101 >> 192.168.201.239:6800/14725 conn(0x55c7a7441760 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-10-09 13:46:01.757752 7f19d041f700  0 -- 192.168.201.236:0/4227972101 >> 192.168.201.239:6800/14725 conn(0x55c7a7441760 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER^C
root@ajax:/var/lib/ceph# rbd rm -p lab-vm-pool vm-1361-disk-1
Removing image: 100% complete...done.
root@ajax:/var/lib/ceph#
Might that be an indicator of the problem? How do I fix it?
 
Yes, all servers are synchronized to the proper time.
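(For what it's worth, clock skew is a classic cause of BADAUTHORIZER, so it's worth confirming on every node; the check amounts to something like this, assuming systemd is handling time sync:)
Code:
# is this node's clock synchronized?
timedatectl status
# does Ceph itself see any monitor clock skew?
ceph time-sync-status
ceph status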

I did find that when I deleted all the monitors except the ones on the three storage nodes, the errors went away. I think something did not happen (or happened incorrectly) when I created the extra monitors on the other nodes.
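In case it helps anyone else, removing an extra monitor amounts to something like this (the monitor ID below is a placeholder; on PVE 5 I believe pveceph destroymon does the PVE-side cleanup, while ceph mon remove is the plain Ceph equivalent):
Code:
# list the monitors and their IDs
ceph mon dump
# remove the extra monitor (ID is a placeholder)
pveceph destroymon <monid>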
 
An interesting test would be to copy that file to a single disk over the 10 Gbit link. The replicas take time and resources.
 
Yes, that is interesting. Even though iperf shows 10G bi-directional, copying the 30 GB file with scp transfers at about the same 35 MB/s rate. I'll look into what may be happening and report back.
 
AFAIK, the move disk option moves the image sequentially in 4K blocks, so it won't be faster than a single disk write plus network latency.

I'm not sure the write journal is helping much here.
Is the source drive configured as writeback? Configuring the target Ceph disk as writeback could help the migration too.
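As a sketch, the cache mode is set per disk on the VM; the VM ID, bus and volume below are only examples (the disk line has to match what is already in the VM config, with cache=writeback appended):
Code:
# switch an existing disk to writeback cache (example values)
qm set 1361 --scsi0 local-lvm:vm-1361-disk-1,cache=writeback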
 
What do you use - (lib)rbd or krbd?
SCP can be slow if you use a cipher your CPU doesn't accelerate (some CPUs don't have AES-NI).
How many 10G links do you have per Proxmox/Ceph node?
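A quick way to check both points (host address and file path are just examples; netcat flag syntax varies between variants):
Code:
# does the CPU have AES-NI?
grep -m1 -o aes /proc/cpuinfo

# raw TCP copy without ssh crypto, to compare against scp
nc -l -p 5001 > /dev/null                                 # on the receiver
dd if=/path/to/big.img bs=1M | nc 192.168.201.245 5001    # on the sender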
 
When trying to do some rbd commands, I sometimes get BADAUTHORIZER errors (full output in my earlier post). Might that be an indicator of the problem? How do I fix it?
Just test first with rados bench.
Example from a 56 Gbit/s cluster:
Code:
rados bench -p rbd 300 write --no-cleanup -t 256
...
Total time run:         300.048707
Total writes made:      266600
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     3554.09
Stddev Bandwidth:       149.486
Max bandwidth (MB/sec): 4172
Min bandwidth (MB/sec): 2504
Average IOPS:           888
Stddev IOPS:            37
Max IOPS:               1043
Min IOPS:               626
Average Latency(s):     0.287971
Stddev Latency(s):      0.0180675
Max latency(s):         0.88252
Min latency(s):         0.0210762


I'd expect around 655 MB/s for your 10 GigE setup.
 
AFAIK, the move disk option moves the image sequentially in 4K blocks, so it won't be faster than a single disk write plus network latency.

I'm not sure the write journal is helping much here.
Is the source drive configured as writeback? Configuring the target Ceph disk as writeback could help the migration too.
Moving the disk is also tied to CPU usage, since it appears to use the qemu-img command to convert the image onto the Ceph storage. That also seems to limit the speed when moving a disk from local storage to Ceph.
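For reference, my understanding is that the move boils down to something like the following under the hood (paths and names are only illustrative, not literally what PVE runs):
Code:
# convert/copy a local raw image into an RBD image (illustrative names)
qemu-img convert -p -f raw -O raw \
    /var/lib/vz/images/1361/vm-1361-disk-1.raw \
    rbd:lab-vm-pool/vm-1361-disk-1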

What do you use - (lib)rbd or krbd?
SCP can be slow if you use a cipher your CPU doesn't accelerate (some CPUs don't have AES-NI).
How many 10G links do you have per Proxmox/Ceph node?
I don't know how to answer your first question. I've installed ceph from the proxmox repositories.

For the 10G network connections, I only have one 10G connection from each node to the network switch, but the actual network configuration is slightly more complicated. There are two physical switches involved: four nodes are on one switch, one node is on a second switch, and the remaining seven nodes are in a chassis-based system with a single 10G interface shared among them. The Ceph network is encapsulated in a VLAN that spans all three switches (the chassis is considered to have a switch in its backplane, which also needs to be configured to support this).
 
Everything is stable right now, but it still doesn't feel as fast as it should. All the benchmarks show the speed is there, but in practice the transfers as I move my images to the Ceph storage still feel slow. One of the problems I had was playing around with an MTU of 9000; that got messy, and I've reverted everything back to the standard 1500.
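If I ever go back to jumbo frames, the sanity check worth doing first is to confirm the whole path actually passes 9000-byte frames (the address is just one of my nodes):
Code:
# 8972 = 9000 MTU minus 20-byte IP header and 8-byte ICMP header
ping -M do -s 8972 -c 3 192.168.201.245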

I'm also finding that some of my VM images still have residual internal issues from the previous NAS failures (which led me to pursue this solution), even though all the disk-checking tools I've used, both inside the VM with SystemRescueCd and externally with qemu-img, say there are no issues with the images. When moving a few of the disk images, I hit an error about accessing a sector out of range. That has to be a local disk error, but fsck doesn't find any issues on the local partitions.
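(The external check I mean is along these lines; the path is an example, and qemu-img check only applies to formats with metadata such as qcow2:)
Code:
# report image-level inconsistencies; add -r leaks / -r all to attempt repair
qemu-img check /var/lib/vz/images/1361/vm-1361-disk-1.qcow2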
 
Everything is stable right now, but it still doesn't feel as fast as it should. All the benchmarks show the speed is there, but in practice the transfers as I move my images to the Ceph storage still feel slow.

The main problem with the move disk option is that qemu moves the data sequentially in small 4K blocks.
You can reduce latency by disabling cephx auth, and also by disabling all debug output in ceph.conf (on the Ceph nodes, but also on the client nodes); a sketch of the cephx settings follows the debug block below.

Code:
[global]
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0
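As for cephx, these are the usual knobs as far as I know; be aware that this removes authentication cluster-wide, needs matching settings on the clients, and requires restarting the daemons, so only do it on a trusted, isolated network:
Code:
[global]
# disable cephx authentication entirely (trusted networks only)
auth cluster required = none
auth service required = none
auth client required = none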
 
That is an excellent point, aderumier. I thought they had worked out some of the kinks with that in the latest release. For example, backing up a Ceph image over NFS used to be painfully slow. Now, with 12.2, the backup to local storage is very fast. What is your rate when you back up a guest image to your local storage?

As for the disk corruption, yes, I bet the images have some issues. I had an Exchange 2010 server with a corrupt VMDK image, but it still worked. It could not be backed up or cloned except from within the guest itself. In the end, I just used VMware's converter inside the guest to migrate it off. The corruption came from a supposedly rare "raid blocks lost" issue on an EqualLogic 6000. I have seen a few of those already. Ceph's checksums should help in that regard at least.
 
