Is Ceph too slow and how to optimize it?

Discussion in 'Proxmox VE: Installation and configuration' started by fcukinyahoo, Dec 1, 2016.

  1. fcukinyahoo

    fcukinyahoo New Member

    Joined:
    Nov 29, 2012
    Messages:
    27
    Likes Received:
    0
    The setup is 3 clustered Proxmox nodes for compute and 3 clustered Ceph storage nodes:

    ceph01 8*150GB ssds (1 used for OS, 7 for storage)
    ceph02 8*150GB ssds (1 used for OS, 7 for storage)
    ceph03 8*250GB ssds (1 used for OS, 7 for storage)

    When I create a VM on a Proxmox node using Ceph storage, I get the speeds below (network bandwidth is NOT the bottleneck).

    Writing inside a VM whose disk is on Ceph
    Code:
    [root@localhost ~]# dd if=/dev/zero of=./here bs=1M count=1024 oflag=direct
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB) copied, 46.7814 s, 23.0 MB/s
    
    [root@localhost ~]# dd if=/dev/zero of=./here bs=1G count=1 oflag=direct
    1+0 records in
    1+0 records out
    1073741824 bytes (1.1 GB) copied, 15.5484 s, 69.1 MB/s
    
    For comparison, below is a VM on local Proxmox storage (same SSD model).

    Writing inside a VM whose disk is on local Proxmox storage
    Code:
    [root@localhost ~]# dd if=/dev/zero of=./here bs=1M count=1024 oflag=direct
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB) copied, 10.301 s, 104 MB/s
    
    [root@localhost ~]# dd if=/dev/zero of=./here bs=1G count=1 oflag=direct
    1+0 records in
    1+0 records out
    1073741824 bytes (1.1 GB) copied, 7.22211 s, 149 MB/s
    
    I have the Ceph pool below:
    Code:
    size/min = 3/2
    pg_num = 2048
    ruleset = 0
    
    Running 3 monitors on the same hosts; journals are stored on each OSD's own drive.
    Running the latest Proxmox with Ceph Hammer.

    Any suggestions on where we should look for improvements? Is it the Ceph pool? The journals? Does it matter whether the journal is on the same drive as the OS (/dev/sda) or on the OSD drive (/dev/sdX)?
     
  2. czechsys

    czechsys Member

    Joined:
    Nov 18, 2015
    Messages:
    143
    Likes Received:
    3
    150MBps is very poor for an SSD; I can get that from a standard HDD. What is your hardware?
    Anyway, Ceph isn't aimed at raw single-client performance.
     
  3. fcukinyahoo

    fcukinyahoo New Member

    Joined:
    Nov 29, 2012
    Messages:
    27
    Likes Received:
    0
    @czechsys I would be happy if it were 150MBps; it is much less than that: ~23MBps for bs=1M count=1024.

    What is the best performance network storage for Proxmox? I thought it was Ceph...

    Hardware below,
    Dell R210
    CPU: 8 * X3460 @ 2.80GHz
    Mem: 4GB
    HDD per Ceph node: 8
    Network: 2-NIC bond, Cat 6 cabling
     
  4. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,323
    Likes Received:
    135
    The problem with benchmarking via dd is that it simulates a single stream, so latency dominates the result.

    (If you bench with fio instead, for example with iodepth=128 so there is more parallel access, it will be a lot faster.)

    But for your benchmark, here are some tips:


    1) Use the fastest-frequency CPU you can for your Ceph cluster and clients.

    2) In your Ceph cluster's ceph.conf, disable cephx auth:
    Code:
    [global]
    auth_cluster_required = none
    auth_service_required = none
    auth_client_required = none
    
    (this change needs a restart of the whole Ceph cluster and of all VMs)

    3) Disable debug output on the Ceph client: create a /etc/ceph.conf on your KVM host with this content:

    Code:
    [global]
     debug asok = 0/0
     debug auth = 0/0
     debug buffer = 0/0
     debug client = 0/0
     debug context = 0/0
     debug crush = 0/0
     debug filer = 0/0
     debug filestore = 0/0
     debug finisher = 0/0
     debug heartbeatmap = 0/0
     debug journal = 0/0
     debug journaler = 0/0
     debug lockdep = 0/0
     debug mds = 0/0
     debug mds balancer = 0/0
     debug mds locker = 0/0
     debug mds log = 0/0
     debug mds log expire = 0/0
     debug mds migrator = 0/0
     debug mon = 0/0
     debug monc = 0/0
     debug ms = 0/0
     debug objclass = 0/0
     debug objectcacher = 0/0
     debug objecter = 0/0
     debug optracker = 0/0
     debug osd = 0/0
     debug paxos = 0/0
     debug perfcounter = 0/0
     debug rados = 0/0
     debug rbd = 0/0
     debug rgw = 0/0
     debug throttle = 0/0
     debug timer = 0/0
     debug tp = 0/0
    
    
    4) If you do sequential writes without direct I/O, you can enable cache=writeback.
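    The single-stream point above can be sketched numerically: with one outstanding request (which is what dd does), throughput is roughly block size divided by per-request latency. The 43 ms figure below is an illustrative assumption chosen to match the observed ~23 MB/s, not a measurement from this thread:

```python
# Single-stream throughput is bounded by per-request latency.
# Illustrative numbers, not measurements.

def single_stream_mb_s(block_size_mb, latency_s):
    """One synchronous stream: one block completes per round trip."""
    return block_size_mb / latency_s

# A ~43 ms round trip per 1 MB direct write would explain ~23 MB/s:
print(round(single_stream_mb_s(1, 0.043), 1))  # 23.3

# With many requests in flight (e.g. fio iodepth=128) the same latency
# overlaps across requests, so aggregate throughput is far higher until
# some other limit (disk, network, CPU) is hit.
```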
     
  5. tschanness

    tschanness Member

    Joined:
    Oct 30, 2016
    Messages:
    291
    Likes Received:
    21
    Use 10G Ethernet.
    Use more than 4GB of RAM.

    Are your storage traffic and host traffic on the same NICs?
     
  6. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,564
    Likes Received:
    408
    What kind of SSD do you use? Please add the specification of your SSDs to this thread.
     
  7. hansm

    hansm Member

    Joined:
    Feb 27, 2015
    Messages:
    57
    Likes Received:
    3
    A Dell R210 can hold 2 2.5" disks and the R210 II can hold 4; how can you have 8 SSDs in it? For clarification, the R210 is a single-socket server, so I assume you have one X3460 with 4 cores/8 threads. 4GB of RAM is far too little: plan at least 1GB per OSD, better 16GB for better performance.

    I think your network IS your bottleneck. You have 2x 1Gbit/s? What bond mode?

    Please clarify your hardware and configuration and be thorough in describing it; we want to help, but you need to tell us everything about your setup.
     
  8. spirit

    spirit Well-Known Member

    Joined:
    Apr 2, 2010
    Messages:
    3,323
    Likes Received:
    135
    Also, what is your SSD model? Consumer or enterprise?

    You need enterprise drives for the Ceph journal, for fast sync writes.
     
  9. fcukinyahoo

    fcukinyahoo New Member

    Joined:
    Nov 29, 2012
    Messages:
    27
    Likes Received:
    0
    @spirit
    I will try your suggestions. I am still setting it up so it is not in production yet. I am doing all my testing on test VMs. So restarting will not be a problem. Thank you.

    @tschanness
    I can increase the RAM; I will give it a shot.
    All NICs are bonded; however, the bond doesn't increase single-stream throughput, it only adds reliability.
    Below is my network benchmark from one Ceph server to a Proxmox KVM host:
    Code:
    root@ceph01:~# iperf -c 192.168.1.10
    ------------------------------------------------------------
    Client connecting to 192.168.1.10, TCP port 5001
    TCP window size: 85.0 KByte (default)
    ------------------------------------------------------------
    [ 3] local 192.168.1.11 port 40690 connected with 192.168.1.10 port 5001
    [ ID] Interval Transfer Bandwidth
    [ 3] 0.0-10.0 sec 1.09 GBytes 940 Mbits/sec
    
    @tom
    SSD Models are
    ceph01: INTEL SSDSC2BW24
    ceph02: INTEL SSDSA2M160
    ceph03: INTEL SSDSA2M160

    @hansm
    We bought a PCI RAID controller with a 2*4 data cable attached: a DELL PERC H700 SAS RAID controller with 512MB cache.
    I will increase the RAM as suggested by someone else as well. Thank you.
    All NICs are bonded in 802.3ad mode; the throughput from a Ceph node to a Proxmox host is the same iperf result shown above (~940 Mbit/s).
    Please let me know if you need more information. I would like to have this setup in production as fast as possible with current equipment if possible.

    @spirit
    The SSD models are the same as listed above for @tom.
    For the journals, I kept the default: they are written on each OSD's own drive. I also see 7 OSD daemons running, one per drive on each server; is that normal and expected? I created the OSDs through the Proxmox interface, so I assume yes.

    Thanks a lot for all your help.
     
  10. mir

    mir Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 14, 2012
    Messages:
    3,481
    Likes Received:
    96
    The network is too slow. For anything but a home setup or testing purposes, 10Gb is the absolute minimum.
    The SSDSC2BW24 and SSDSA2M160 are not DC-quality disks.
     
  11. hansm

    hansm Member

    Joined:
    Feb 27, 2015
    Messages:
    57
    Likes Received:
    3
    Your SSDs are consumer grade and not fit for the journal job. See https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/ and look for your Intel 520. 9MB/s is really slow; you have 7 OSDs (and OSD daemons) per server, so 7 x 9MB/s = 63MB/s maximum performance per Ceph node. I'm not sure, but I suppose we need to divide this value by 2 because of the double write (one for the journal and one for the real data). You end up with 31.5MB/s performance to your Ceph cluster, which is close to your test results.

    Your network isn't separated into a host network and a cluster network, so your bandwidth is shared. If you have 940Mbit/s, you need to divide it by 2 = 470Mbit/s, then by 8 = 58.75MB/s max throughput to your Ceph cluster. This is because your host is writing to Ceph at the same time Ceph is replicating your data to the other nodes.
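    The back-of-envelope numbers above can be reproduced directly (the ~9 MB/s per-SSD sync-write figure comes from the linked benchmark page, and the 940 Mbit/s from the iperf run earlier in the thread):

```python
# Ceiling estimates for this cluster, using figures quoted in the thread.

# Journal side: 7 OSDs per node at ~9 MB/s sync write each, halved because
# filestore writes every object twice (once to the journal, once to data).
osds_per_node = 7
journal_mb_s = 9
journal_ceiling = osds_per_node * journal_mb_s / 2
print(journal_ceiling)   # 31.5 MB/s per node

# Network side: one shared 1 Gbit link carries both client writes and
# replication traffic, so halve it, then convert Mbit/s to MB/s.
measured_mbit = 940
network_ceiling = measured_mbit / 2 / 8
print(network_ceiling)   # 58.75 MB/s
```

    Both ceilings sit in the same range as the ~23-69 MB/s dd results, so either limit alone would explain the numbers.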

    This setup will never perform the way you would like it to do.

    Besides that, I'm very curious how you got 8 disks into the R210; your RAID controller can handle that number of disks, but your server case can't as far as I know ;-)
     
  12. Mihai

    Mihai Member

    Joined:
    Dec 22, 2015
    Messages:
    51
    Likes Received:
    2
    When disabling cephx, can I restart each host one by one, or does the entire cluster need to be off and then on again to get this to work?
     
  13. aderumier

    aderumier Member

    Joined:
    May 14, 2013
    Messages:
    203
    Likes Received:
    18
    You need to restart your Ceph cluster (mons/OSDs) and all the VMs.
     
  14. Mihai

    Mihai Member

    Joined:
    Dec 22, 2015
    Messages:
    51
    Likes Received:
    2
    Thank you.
     