Proxmox, Ceph and local storage performance

Discussion in 'Proxmox VE: Installation and configuration' started by mateusz, Mar 13, 2017.

  1. mateusz

    mateusz New Member

    Joined:
    Aug 21, 2014
    Messages:
    11
    Likes Received:
    0
    Hello,
    In our environment I am seeing some performance issues; maybe someone can help me find where the problem is.
    We have 6 servers on PVE 4.4 with ca. 200 VMs (Windows and Linux). All VM disks (rbd) are stored on a separate Ceph cluster (10 servers, 20 SSD OSDs as cache tier and 48 HDD OSDs).
    I ran some I/O tests using fio from a Linux VM (Linux test01 3.13.0-110-generic #157-Ubuntu SMP Mon Feb 20 11:54:05 UTC 2017 x86_64 x86_64 x86_64):
    • VM disk stored on ceph:
      • READ: io=540704KB, aggrb=6199KB/s, minb=481KB/s, maxb=3018KB/s, mint=34097msec, maxt=87216msec
      • WRITE: io=278496KB, aggrb=8167KB/s, minb=479KB/s, maxb=14077KB/s, mint=18622msec, maxt=34097msec
    • VM disk stored on one (raid0) SATA drive
      • READ: io=540704KB, aggrb=736KB/s, minb=333KB/s, maxb=361KB/s, mint=49232msec, maxt=733843msec
      • WRITE: io=278496KB, aggrb=5656KB/s, minb=332KB/s, maxb=11234KB/s, mint=23334msec, maxt=49232msec
    • VM disk stored on one (raid0) SAS drive (15k)
      • READ: io=540704KB, aggrb=1597KB/s, minb=498KB/s, maxb=782KB/s, mint=32905msec, maxt=338542msec
      • WRITE: io=278496KB, aggrb=8463KB/s, minb=496KB/s, maxb=39390KB/s, mint=6655msec, maxt=32905msec

    VM config is:
    agent: 1
    balloon: 0
    boot: c
    bootdisk: virtio0
    cores: 4
    cpu: host
    hotplug: 0
    ide2: none,media=cdrom
    memory: 8192
    name: test
    net0: virtio=32:65:61:xx:xx:xx,bridge=vmbr0,tag=2027
    numa: 1
    ostype: l26
    virtio0: ceph01:vm-2027003-disk-3,cache=none,size=10G
    virtio1: ceph01:vm-2027003-disk-2,cache=none,size=10G
    scsihw: virtio-scsi
    smbios1: uuid=8c947036-c62c-4e72-8e4f-f8d1xxxxxxxx
    sockets: 2
    Proxmox Version (pveversion -v):
    proxmox-ve: 4.4-82 (running kernel: 4.4.40-1-pve)
    pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
    pve-kernel-4.2.6-1-pve: 4.2.6-36
    pve-kernel-4.4.16-1-pve: 4.4.16-64
    pve-kernel-4.4.40-1-pve: 4.4.40-82
    lvm2: 2.02.116-pve3
    corosync-pve: 2.4.2-1
    libqb0: 1.0-1
    pve-cluster: 4.0-48
    qemu-server: 4.0-109
    pve-firmware: 1.1-10
    libpve-common-perl: 4.0-92
    libpve-access-control: 4.0-23
    libpve-storage-perl: 4.0-76
    pve-libspice-server1: 0.12.8-2
    vncterm: 1.3-1
    pve-docs: 4.4-3
    pve-qemu-kvm: 2.7.1-4
    pve-container: 1.0-94
    pve-firewall: 2.0-33
    pve-ha-manager: 1.0-40
    ksm-control-daemon: 1.2-1
    glusterfs-client: 3.5.2-2+deb8u3
    lxc-pve: 2.0.7-3
    lxcfs: 2.0.6-pve1
    criu: 1.6.0-1
    novnc-pve: 0.5-8
    smartmontools: 6.5+svn4324-1~pve80
    zfsutils: 0.6.5.9-pve15~bpo80
    ceph: 9.2.1-1~bpo80+1​

    Network interfaces (1 Gbit/s) are never utilized more than 30%.
    Is this an issue with PVE or a mistake in the Ceph configuration? What config should I post here to give you more info?
    Best regards,
    Mateusz
     
  2. czechsys

    czechsys Member

    Joined:
    Nov 18, 2015
    Messages:
    138
    Likes Received:
    3
    Well, it would be better to post the fio result lines containing "read" or "write" and "iops=". Anyway, I suspect a problem on the PVE side, because the "maxb" READ on every storage (especially the local ones!) is slower than the WRITE. Having read << write performance like that is crazy.
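    For example, something like this pulls out just those summary lines (the log filename is only a placeholder):
    Code:
    grep -E '(read|write) *:.*iops=' fio_output.log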

    What does fio show on the hypervisor side? Any tuning that could affect the read cache?
     
  3. mateusz

    mateusz New Member

    Joined:
    Aug 21, 2014
    Messages:
    11
    Likes Received:
    0
    There is no tuning on Proxmox; it was upgraded from 4.0 last week (but 4.0 showed the same symptoms).
    fio 2.1.11 on local storage (SATA) from the hypervisor:

    bgwriter: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
    queryA: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=mmap, iodepth=1
    queryB: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=mmap, iodepth=1
    bgupdater: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
    fio-2.1.11
    Starting 4 processes
    queryA: Laying out IO file(s) (1 file(s) / 256MB)
    queryB: Laying out IO file(s) (1 file(s) / 256MB)
    bgupdater: Laying out IO file(s) (1 file(s) / 32MB)
    Jobs: 1 (f=1): [_(2),r(1),_(1)] [99.4% done] [3452KB/0KB/0KB /s] [863/0/0 iops] [eta 00m:02s]
    bgwriter: (groupid=0, jobs=1): err= 0: pid=30387: Mon Mar 13 15:16:48 2017
    write: io=262144KB, bw=2355.4KB/s, iops=588, runt=111297msec
    slat (usec): min=7, max=183, avg=24.83, stdev= 4.40
    clat (msec): min=1, max=1503, avg=54.31, stdev=52.23
    lat (msec): min=1, max=1503, avg=54.34, stdev=52.23
    clat percentiles (msec):
    | 1.00th=[ 4], 5.00th=[ 7], 10.00th=[ 10], 20.00th=[ 16],
    | 30.00th=[ 23], 40.00th=[ 30], 50.00th=[ 39], 60.00th=[ 49],
    | 70.00th=[ 63], 80.00th=[ 84], 90.00th=[ 120], 95.00th=[ 157],
    | 99.00th=[ 245], 99.50th=[ 285], 99.90th=[ 388], 99.95th=[ 433],
    | 99.99th=[ 562]
    bw (KB /s): min= 1181, max= 2648, per=100.00%, avg=2358.99, stdev=154.17
    lat (msec) : 2=0.01%, 4=1.09%, 10=8.97%, 20=16.32%, 50=34.62%
    lat (msec) : 100=24.53%, 250=13.54%, 500=0.90%, 750=0.02%, 2000=0.01%
    cpu : usr=0.75%, sys=2.28%, ctx=63568, majf=0, minf=147
    IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
    submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
    issued : total=r=0/w=65536/d=0, short=r=0/w=0/d=0
    latency : target=0, window=0, percentile=100.00%, depth=32
    queryA: (groupid=0, jobs=1): err= 0: pid=30388: Mon Mar 13 15:16:48 2017
    read : io=262144KB, bw=832342B/s, iops=203, runt=322506msec
    clat (usec): min=100, max=216863, avg=4912.76, stdev=7368.91
    lat (usec): min=100, max=216864, avg=4913.08, stdev=7368.88
    clat percentiles (usec):
    | 1.00th=[ 274], 5.00th=[ 338], 10.00th=[ 1224], 20.00th=[ 1832],
    | 30.00th=[ 2384], 40.00th=[ 2928], 50.00th=[ 3472], 60.00th=[ 3984],
    | 70.00th=[ 4512], 80.00th=[ 5280], 90.00th=[ 8512], 95.00th=[12864],
    | 99.00th=[38144], 99.50th=[51456], 99.90th=[88576], 99.95th=[102912],
    | 99.99th=[150528]
    bw (KB /s): min= 113, max= 2751, per=49.14%, avg=815.66, stdev=472.12
    lat (usec) : 250=0.49%, 500=6.15%, 750=0.31%, 1000=0.82%
    lat (msec) : 2=15.31%, 4=36.99%, 10=32.80%, 20=4.20%, 50=2.40%
    lat (msec) : 100=0.46%, 250=0.06%
    cpu : usr=0.28%, sys=0.75%, ctx=65554, majf=65536, minf=46
    IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued : total=r=65536/w=0/d=0, short=r=0/w=0/d=0
    latency : target=0, window=0, percentile=100.00%, depth=1
    queryB: (groupid=0, jobs=1): err= 0: pid=30389: Mon Mar 13 15:16:48 2017
    read : io=262144KB, bw=824210B/s, iops=201, runt=325688msec
    clat (usec): min=97, max=335631, avg=4959.28, stdev=8171.98
    lat (usec): min=98, max=335631, avg=4959.74, stdev=8171.98
    clat percentiles (usec):
    | 1.00th=[ 241], 5.00th=[ 306], 10.00th=[ 1208], 20.00th=[ 1816],
    | 30.00th=[ 2384], 40.00th=[ 2896], 50.00th=[ 3440], 60.00th=[ 3952],
    | 70.00th=[ 4448], 80.00th=[ 5152], 90.00th=[ 8384], 95.00th=[12864],
    | 99.00th=[40704], 99.50th=[58112], 99.90th=[100864], 99.95th=[119296],
    | 99.99th=[164864]
    bw (KB /s): min= 78, max= 3417, per=48.69%, avg=808.28, stdev=498.22
    lat (usec) : 100=0.01%, 250=1.50%, 500=4.93%, 750=0.32%, 1000=1.17%
    lat (msec) : 2=15.51%, 4=37.55%, 10=32.05%, 20=4.02%, 50=2.26%
    lat (msec) : 100=0.58%, 250=0.10%, 500=0.01%
    cpu : usr=0.31%, sys=0.72%, ctx=65543, majf=65536, minf=35
    IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued : total=r=65536/w=0/d=0, short=r=0/w=0/d=0
    latency : target=0, window=0, percentile=100.00%, depth=1
    bgupdater: (groupid=0, jobs=1): err= 0: pid=30390: Mon Mar 13 15:16:48 2017
    read : io=16416KB, bw=112303B/s, iops=27, runt=149684msec
    slat (usec): min=8, max=95, avg=21.72, stdev= 5.18
    clat (usec): min=152, max=183931, avg=6610.63, stdev=12333.95
    lat (usec): min=169, max=183957, avg=6632.88, stdev=12334.56
    clat percentiles (usec):
    | 1.00th=[ 219], 5.00th=[ 282], 10.00th=[ 924], 20.00th=[ 1640],
    | 30.00th=[ 2352], 40.00th=[ 3024], 50.00th=[ 3696], 60.00th=[ 4256],
    | 70.00th=[ 5024], 80.00th=[ 7072], 90.00th=[11840], 95.00th=[23168],
    | 99.00th=[68096], 99.50th=[84480], 99.90th=[125440], 99.95th=[158720],
    | 99.99th=[183296]
    bw (KB /s): min= 4, max= 468, per=7.66%, avg=127.18, stdev=149.98
    write: io=16352KB, bw=111865B/s, iops=27, runt=149684msec
    slat (usec): min=9, max=94, avg=24.15, stdev= 4.39
    clat (msec): min=1, max=1308, avg=29.98, stdev=75.57
    lat (msec): min=1, max=1308, avg=30.00, stdev=75.57
    clat percentiles (msec):
    | 1.00th=[ 3], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 5],
    | 30.00th=[ 6], 40.00th=[ 7], 50.00th=[ 8], 60.00th=[ 9],
    | 70.00th=[ 12], 80.00th=[ 17], 90.00th=[ 77], 95.00th=[ 174],
    | 99.00th=[ 363], 99.50th=[ 482], 99.90th=[ 709], 99.95th=[ 824],
    | 99.99th=[ 1303]
    bw (KB /s): min= 4, max= 469, per=6.57%, avg=122.25, stdev=151.44
    lat (usec) : 250=1.79%, 500=2.27%, 750=0.50%, 1000=1.01%
    lat (msec) : 2=7.14%, 4=22.09%, 10=42.26%, 20=11.44%, 50=4.55%
    lat (msec) : 100=2.58%, 250=2.94%, 500=1.18%, 750=0.20%, 1000=0.02%
    lat (msec) : 2000=0.01%
    cpu : usr=0.27%, sys=0.20%, ctx=8197, majf=0, minf=9
    IO depths : 1=99.8%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.0%, >=64=0.0%
    submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued : total=r=4104/w=4088/d=0, short=r=0/w=0/d=0
    latency : target=0, window=0, percentile=100.00%, depth=16

    Run status group 0 (all jobs):
    READ: io=540704KB, aggrb=1660KB/s, minb=109KB/s, maxb=812KB/s, mint=149684msec, maxt=325688msec
    WRITE: io=278496KB, aggrb=1860KB/s, minb=109KB/s, maxb=2355KB/s, mint=111297msec, maxt=149684msec

    Disk stats (read/write):
    dm-0: ios=135160/70370, merge=0/0, ticks=669476/3928668, in_queue=4601776, util=100.00%, aggrios=135176/70297, aggrmerge=0/73, aggrticks=669364/3901808, aggrin_queue=4571056, aggrutil=100.00%
    sda: ios=135176/70297, merge=0/73, ticks=669364/3901808, in_queue=4571056, util=100.00%​
     
  4. czechsys

    czechsys Member

    Joined:
    Nov 18, 2015
    Messages:
    138
    Likes Received:
    3
    Can you test fio from a non-PVE live Linux? Is the problem present on all hypervisors (same HW, basic info)?

    Your results are hard to read, but compare yours:
    Code:
    read : io=262144KB, bw=832342B/s, iops=203, runt=322506msec
    read : io=262144KB, bw=824210B/s, iops=201, runt=325688msec
    read : io=16416KB, bw=112303B/s, iops=27, runt=149684msec
    
    write: io=262144KB, bw=2355.4KB/s, iops=588, runt=111297msec
    write: io=16352KB, bw=111865B/s, iops=27, runt=149684msec
    with (SAS 10k 2x300GB, P410i, HP DL1xx G6, pve-manager/4.4-12/e71b7a74 (running kernel: 4.4.44-1-pve)):
    Code:
    fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
    
      read : io=3071.7MB, bw=3780.6KB/s, iops=945, runt=831985msec
      write: io=1024.4MB, bw=1260.8KB/s, iops=315, runt=831985msec
    
    What are your iowait and average load when you run the read tests?
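    For example, while the fio read jobs are running, something like this (a rough sketch) shows both:
    Code:
    # per-device utilisation, await and %iowait, every 10 seconds
    iostat -x 10
    # run queue, blocked processes and the wa (iowait) column
    vmstat 10
    # 1/5/15 minute load averages
    uptime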
     
  5. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,822
    Likes Received:
    158
    Hi,
    approx. 30 VMs per server, each with a 1 Gbit connection, to a Ceph cluster?
    Does this mean an EC pool - with the journals on the SSDs too, or are the SSDs for the cache tier only?
    EC pools are not the fastest... especially if the data isn't in the cache...
    How I/O-saturated is your Ceph cluster?
    How does the test look with a 4 MB block size?

    That write is faster than read is IMHO quite normal - AFAIK the rbd driver combines small writes into bigger ones.

    Udo
     
  6. mateusz

    mateusz New Member

    Joined:
    Aug 21, 2014
    Messages:
    11
    Likes Received:
    0
    Tested from a live CD on my laptop, using fio with this config:
    Code:
    [global]
    ioengine=rbd 
    clientname=admin
    pool=sata
    rbdname=fio_test
    invalidate=0    # mandatory
    rw=randwrite
    bs=4k
    
    [rbd_iodepth32]
    iodepth=32
    
    Result:
    Code:
    write: io=2048.0MB, bw=7717.5KB/s, iops=1929, runt=271742msec
    
    So it looks like a Ceph problem.
    Next I ran the tests again on local storage on the main hypervisors.
    Command for bs=4M:
    Code:
    fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/var/lib/vz/images/fio_4M_test --bs=4M --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
    
    Command for bs=4K
    Code:
    fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/var/lib/vz/images/fio_4k_test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
    
    Results are in the attached files libaio_bs_4k_size_2G_randrw.png and libaio_bs_4M_size_8G_randrw.png.
    Values for iowait, avgqu-sz and await are from the command iostat -x 10 5.

    Finally, I ran the bs=4M test inside the guest with fio:
    Code:
    [global]
    rw=randread
    size=256m
    directory=/root/fio-testing/data
    ioengine=libaio
    iodepth=4
    invalidate=1
    direct=1
    bs=4M
    
    [bgwriter]
    rw=randwrite
    iodepth=32
    
    [queryA]
    iodepth=1
    ioengine=mmap
    direct=0
    thinktime=3
    
    [queryB]
    iodepth=1
    ioengine=mmap
    direct=0
    thinktime=5
    
    [bgupdater]
    rw=randrw
    iodepth=16
    thinktime=40
    size=32m
    
    and result:
    Code:
    bgwriter: write: io=262144KB, bw=45104KB/s, iops=11, runt=  5812msec
    queryA:   read : io=262144KB, bw=1741.9KB/s, iops=0, runt=150499msec
    queryB:   read : io=262144KB, bw=1726.6KB/s, iops=0, runt=151829msec
    bgupdater:   read : io=16384KB, bw=6246.3KB/s, iops=1, runt=  2623msec
    
    The write speed for bs=4M inside the guest is OK, so the problem is with bs=4k.
    How can I improve the speed of 4k writes on PVE?
     


  7. mateusz

    mateusz New Member

    Joined:
    Aug 21, 2014
    Messages:
    11
    Likes Received:
    0
    Yes, but there is at most 30% network device utilization.
    This is a replicated (replica 3) pool with cache tier and journal on SSD. All SSD drives are Intel SSDSC2BX200G4.
    How can I check this?
    A 4 MB block size on local storage gets 69 MB/s read and 20 MB/s write on SATA.

    OK
     
  8. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,822
    Likes Received:
    158
    So, an SSD DC S3610 - this should be OK.
    Replica 3 with cache tier?? Are you sure? It sounds to me like an EC pool with a cache tier. But this shouldn't change anything for your speed tests.
    atop is a nice tool.
    Yes - but what values do you get on Ceph?

    Because of the 4k access: with small blocks, latency has a much higher impact - and 1 Gbit Ethernet has much higher latency than 10 Gbit Ethernet.
    This is one reason why many people use 10 Gbit NICs for Ceph/iSCSI and so on.
    And normally the test is 4k for IOPS and 4M for throughput.
    Even when you buy an SSD, the datasheet says something like "> 500 MB/s + 50k IOPS" - meaning more than 500 MB/s with 4 MB blocks (which is > 125 IOPS), but with 4k at 50k IOPS you only get about 195 MB/s.
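    Spelled out, that arithmetic is:
    Code:
    4M blocks:  500 MB/s / 4 MB       ≈ 125 IOPS
    4k blocks:  50,000 IOPS * 4 KiB   = 200,000 KiB/s ≈ 195 MiB/s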

    You can look with "ceph -w" how many data your ceph-cluster provide - or ceph-dash as an gui (or other tools).

    Udo
     
  9. Q-wulf

    Q-wulf Active Member

    Joined:
    Mar 3, 2013
    Messages:
    593
    Likes Received:
    28
    Can you post the following, please?
    • Ceph server specs (cpu, ram, networking)
    • Ceph config (including networking)
    • Do you use the 20 cache-tier SSDs purely as cache, or also for journals?
    • Ceph crush map.
    preferably wrapped in code/quote BB-code.

    You can put a pool as cache on any other pool, even a cache-pool.
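    For reference, attaching one pool as a cache tier on top of another is roughly this (pool names are placeholders):
    Code:
    ceph osd tier add <storage-pool> <cache-pool>
    ceph osd tier cache-mode <cache-pool> writeback
    ceph osd tier set-overlay <storage-pool> <cache-pool>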
     
    #9 Q-wulf, Mar 16, 2017
    Last edited: Mar 16, 2017
  10. mateusz

    mateusz New Member

    Joined:
    Aug 21, 2014
    Messages:
    11
    Likes Received:
    0
    Code:
    ceph10: 2x E5504  @ 2.00GHz, 32GB RAM, 4x NetXtreme II BCM5709  Gigabit Ethernet (2 active)
    ceph15: 2x E5504  @ 2.00GHz, 32GB RAM, 4x NetXtreme II BCM5709  Gigabit Ethernet (2 active)
    ceph20: 2x E5410  @ 2.33GHz, 32GB RAM, 4x 82571EB Gigabit Ethernet Controller (2 active)
    ceph25: 2x E5620  @ 2.40GHz, 32GB RAM, 4x NetXtreme II BCM5709 Gigabit Ethernet (2 active)
    ceph30: 1x E5530  @ 2.40GHz, 32GB RAM, 2x 82571EB Gigabit Ethernet Controller (1 active), 4x  NetXtreme II BCM5709 Gigabit Ethernet (1 active)
    ceph35: 2x E5540  @ 2.53GHz, 32GB RAM, 4x NetXtreme II BCM5709 Gigabit Ethernet (2 active)
    ceph40: 2x X5670  @ 2.93GHz, 32GB RAM, 4x NetXtreme II BCM5709 Gigabit Ethernet (2 active)
    ceph45: 2x X5670  @ 2.93GHz, 32GB RAM, 4x NetXtreme II BCM5709 Gigabit Ethernet (2 active)
    ceph50: 2x X5670  @ 2.93GHz, 32GB RAM, 4x NetXtreme II BCM5709 Gigabit Ethernet (2 active)
    ceph55: 2x X5670  @ 2.93GHz, 32GB RAM, 4x NetXtreme II BCM5709 Gigabit Ethernet (2 active)
    
    Network configuration:
    Code:
    #ceph10:
     auto lo
     iface lo inet loopback
     auto em1
     iface em1 inet static
             address 10.20.8.10
             netmask 255.255.252.0
             network 10.20.8.0
             broadcast 10.20.11.255
             gateway 10.20.8.1
             dns-nameservers 8.8.8.8
     auto em4
     iface em4 inet static
             address 10.20.4.10
             netmask 255.255.252.0
             network 10.20.4.0
             broadcast 10.20.7.255
    
    #ceph15:
     auto lo
     iface lo inet loopback
     auto em1
     iface em1 inet static
             address 10.20.8.15
             netmask 255.255.252.0
             network 10.20.8.0
             broadcast 10.20.11.255
             gateway 10.20.8.1
             dns-nameservers 8.8.8.8 8.8.4.4
     auto em4
     iface em4 inet static
             address 10.20.4.15
             netmask 255.255.252.0
             network 10.20.4.0
             broadcast 10.20.7.255
    #ceph20:
     auto lo
     iface lo inet loopback
     auto eth0
     iface eth0 inet static
             address 10.20.8.20
             netmask 255.255.252.0
             network 10.20.8.0
             broadcast 10.20.11.255
             gateway 10.20.8.1
             dns-nameservers 8.8.8.8 8.8.4.4
     auto eth2
     iface eth2 inet static
             address 10.20.4.20
             netmask 255.255.252.0
             network 10.20.4.0
             broadcast 10.20.7.255
    
    #ceph25:
     auto lo
     iface lo inet loopback
     auto em1
     iface em1 inet static
             address 10.20.8.25
             netmask 255.255.252.0
             network 10.20.8.0
             broadcast 10.20.11.255
             gateway 10.20.8.1
             dns-nameservers 8.8.8.8 8.8.4.4
     auto em4
     iface em4 inet static
             address 10.20.4.25
             netmask 255.255.252.0
             network 10.20.4.0
             broadcast 10.20.7.255
    
    #ceph30:
     auto lo
     iface lo inet loopback
     auto em1
     iface em1 inet static
             address 10.20.8.30
             netmask 255.255.252.0
             network 10.20.8.0
             broadcast 10.20.11.255
             gateway 10.20.8.1
             dns-nameservers 8.8.8.8 8.8.4.4
     auto p4p2
     iface p4p2 inet static
             address 10.20.4.30
             netmask 255.255.252.0
             network 10.20.4.0
             bracast 10.20.7.255
    
    #ceph35:
     auto lo
     iface lo inet loopback
     auto em1
     iface em1 inet static
             address 10.20.8.35
             netmask 255.255.252.0
             network 10.20.8.0
             broadcast 10.20.11.255
             gateway 10.20.8.1
             dns-nameservers 8.8.8.8 8.8.4.4
     auto em4
     iface em4 inet static
             address 10.20.4.35
             netmask 255.255.252.0
             network 10.20.4.0
             broadcast 10.20.7.255
    
    #ceph40:
     auto lo
     iface lo inet loopback
     auto em1
     iface em1 inet static
             address 10.20.8.40
             netmask 255.255.252.0
             network 10.20.8.0
             broadcast 10.20.11.255
             gateway 10.20.8.1
             dns-nameservers 8.8.8.8
     auto em3
     iface em3 inet static
             address 10.20.4.40
             netmask 255.255.252.0
             network 10.20.4.0
             broadcast 10.20.7.255
    
    #ceph45
     auto lo
     iface lo inet loopback
     auto em1
     iface em1 inet static
             address 10.20.8.45
             netmask 255.255.252.0
             network 10.20.8.0
             broadcast 10.20.11.255
             gateway 10.20.8.1
             dns-nameservers 8.8.8.8
     auto em3
     iface em3 inet static
             address 10.20.4.45
             netmask 255.255.252.0
             network 10.20.4.0
             broadcast 10.20.7.255
    
    #ceph50:
     auto lo
     iface lo inet loopback
     auto em1
     iface em1 inet static
             address 10.20.8.50
             netmask 255.255.252.0
             network 10.20.8.0
             broadcast 10.20.11.255
             gateway 10.20.8.1
             dns-nameservers 8.8.8.8
     auto em3
     iface em3 inet static
             address 10.20.4.50
             netmask 255.255.252.0
             network 10.20.4.0
             broadcast 10.20.7.255
    
    #ceph55:
     auto lo
     iface lo inet loopback
     auto em1
     iface em1 inet static
             address 10.20.8.55
             netmask 255.255.252.0
             network 10.20.8.0
             broadcast 10.20.11.255
             gateway 10.20.8.1
             dns-nameservers 8.8.8.8
     auto em3
     iface em3 inet static
             address 10.20.4.55
             netmask 255.255.252.0
             network 10.20.4.0
             broadcast 10.20.7.255
    
    Ceph.conf:
    Code:
    [global]
    
    fsid=some_uuid
    
    mon initial members =ceph55, ceph50, ceph45, ceph40, ceph35, ceph30, ceph25, ceph20, ceph15, ceph10
    mon host = 10.20.8.55, 10.20.8.50, 10.20.8.45, 10.20.8.40, 10.20.8.35, 10.20.8.30, 10.20.8.25, 10.20.8.20, 10.20.8.15, 10.20.8.10
    
    
    public network = 10.20.8.0/22
    cluster network = 10.20.4.0/22
    
    filestore xattr use omap = true
    filestore max sync interval = 30
    
    
    osd journal size = 10240
    osd mount options xfs = "rw,noatime,inode64,allocsize=4M"
    osd pool default size = 3
    osd pool default min size = 1
    osd pool default pg num = 2048
    osd pool default pgp num = 2048
    osd disk thread ioprio class = idle
    osd disk thread ioprio priority = 7
    osd crush update on start = false
    
    osd crush chooseleaf type = 1
    osd recovery max active = 1
    osd recovery op priority = 1
    osd max backfills = 1
    
    auth cluster required = cephx
    auth service required = cephx
    auth client required = cephx
    
    rbd default format = 2
    
    ##ceph35 osds
    [osd.0]
    cluster addr = 10.20.4.35
    public addr = 10.20.8.35
    [osd.1]
    cluster addr = 10.20.4.35
    public addr = 10.20.8.35
    [osd.2]
    cluster addr = 10.20.4.35
    public addr = 10.20.8.35
    [osd.3]
    cluster addr = 10.20.4.35
    public addr = 10.20.8.35
    [osd.4]
    cluster addr = 10.20.4.35
    [osd.5]
    cluster addr = 10.20.4.35
    [osd.66]
    cluster addr = 10.20.4.35
    [osd.67]
    cluster addr = 10.20.4.35
    
    ##ceph25 osds
    [osd.6]
    cluster addr = 10.20.4.25
    [osd.7]
    cluster addr = 10.20.4.25
    [osd.8]
    cluster addr = 10.20.4.25
    [osd.9]
    cluster addr = 10.20.4.25
    [osd.10]
    cluster addr = 10.20.4.25
    [osd.11]
    cluster addr = 10.20.4.25
    [osd.62]
    cluster addr = 10.20.4.25
    [osd.63]
    cluster addr = 10.20.4.25
    
    ##ceph15 osds
    [osd.12]
    cluster addr = 10.20.4.15
    [osd.13]
    cluster addr = 10.20.4.15
    [osd.14]
    cluster addr = 10.20.4.15
    [osd.15]
    cluster addr = 10.20.4.15
    [osd.58]
    cluster addr = 10.20.4.15
    [osd.59]
    cluster addr = 10.20.4.15
    
    
    ##ceph30 osds
    [osd.16]
    cluster addr = 10.20.4.30
    [osd.17]
    cluster addr = 10.20.4.30
    [osd.18]
    cluster addr = 10.20.4.30
    [osd.19]
    cluster addr = 10.20.4.30
    [osd.20]
    cluster addr = 10.20.4.30
    [osd.21]
    cluster addr = 10.20.4.30
    [osd.64]
    cluster addr = 10.20.4.30
    [osd.65]
    cluster addr = 10.20.4.30
    
    ##ceph20 osds
    [osd.22]
    cluster addr = 10.20.4.20
    [osd.23]
    cluster addr = 10.20.4.20
    [osd.24]
    cluster addr = 10.20.4.20
    [osd.25]
    cluster addr = 10.20.4.20
    [osd.26]
    cluster addr = 10.20.4.20
    [osd.27]
    cluster addr = 10.20.4.20
    [osd.60]
    cluster addr = 10.20.4.20
    [osd.61]
    cluster addr = 10.20.4.20
    
    ##ceph10 osd
    [osd.28]
    cluster addr = 10.20.4.10
    [osd.29]
    cluster addr = 10.20.4.10
    [osd.30]
    cluster addr = 10.20.4.10
    [osd.31]
    cluster addr = 10.20.4.10
    [osd.56]
    cluster addr = 10.20.4.10
    [osd.57]
    cluster addr = 10.20.4.10
    
    #ceph40 osd
    [osd.32]
    cluster addr = 10.20.4.40
    [osd.33]
    cluster addr = 10.20.4.40
    [osd.34]
    cluster addr = 10.20.4.40
    [osd.35]
    cluster addr = 10.20.4.40
    [osd.36]
    cluster addr = 10.20.4.40
    [osd.52]
    cluster addr = 10.20.4.40
    
    #ceph45 osd
    [osd.37]
    cluster addr = 10.20.4.45
    [osd.38]
    cluster addr = 10.20.4.45
    [osd.39]
    cluster addr = 10.20.4.45
    [osd.40]
    cluster addr = 10.20.4.45
    [osd.41]
    cluster addr = 10.20.4.45
    [osd.54]
    cluster addr = 10.20.4.45
    
    #ceph50 osd
    [osd.42]
    cluster addr = 10.20.4.50
    [osd.43]
    cluster addr = 10.20.4.50
    [osd.44]
    cluster addr = 10.20.4.50
    [osd.45]
    cluster addr = 10.20.4.50
    [osd.46]
    cluster addr = 10.20.4.50
    [osd.53]
    cluster addr = 10.20.4.50
    
    #ceph55 osd
    [osd.47]
    cluster addr = 10.20.4.55
    [osd.48]
    cluster addr = 10.20.4.55
    [osd.49]
    cluster addr = 10.20.4.55
    [osd.50]
    cluster addr = 10.20.4.55
    [osd.51]
    cluster addr = 10.20.4.55
    [osd.55]
    cluster addr = 10.20.4.55
    
    
    
    [mon.ceph35]
    host = ceph35
    mon addr = 10.20.8.35:6789
    [mon.ceph30]
    host = ceph30
    mon addr = 10.20.8.30:6789
    [mon.ceph20]
    host = ceph20
    mon addr = 10.20.8.20:6789
    [mon.ceph15]
    host = ceph15
    mon addr = 10.20.8.15:6789
    [mon.ceph25]
    mon addr = 10.20.8.25:6789
    [mon.ceph10]
    host = ceph10
    mon addr = 10.20.8.10:6789
    [mon.ceph40]
    host = ceph40
    mon addr = 10.20.8.40:6789
    [mon.ceph45]
    host = ceph45
    mon addr = 10.20.8.45:6789
    [mon.ceph50]
    host = ceph50
    mon addr = 10.20.8.50:6789
    [mon.ceph55]
    host = ceph55
    mon addr = 10.20.8.55:6789
    
    The SSDs are also used for journals. The system disk is an SSD with a partition for the system, 6 or 8 partitions (10 GB each) for the OSD journals, and the remaining free space (ca. 100 GB) as an OSD in the cache-tier pool; the second SSD is used only as an OSD for the cache tier.
    Code:
    # begin crush map
    tunable choose_local_tries 0
    tunable choose_local_fallback_tries 0
    tunable choose_total_tries 50
    tunable chooseleaf_descend_once 1
    tunable chooseleaf_vary_r 1
    
    # devices
    device 0 osd.0
    device 1 osd.1
    device 2 osd.2
    device 3 osd.3
    device 4 osd.4
    device 5 osd.5
    device 6 osd.6
    device 7 osd.7
    device 8 osd.8
    device 9 osd.9
    device 10 osd.10
    device 11 osd.11
    device 12 osd.12
    device 13 osd.13
    device 14 osd.14
    device 15 osd.15
    device 16 osd.16
    device 17 osd.17
    device 18 osd.18
    device 19 osd.19
    device 20 osd.20
    device 21 osd.21
    device 22 osd.22
    device 23 osd.23
    device 24 osd.24
    device 25 osd.25
    device 26 osd.26
    device 27 osd.27
    device 28 osd.28
    device 29 osd.29
    device 30 osd.30
    device 31 osd.31
    device 32 osd.32
    device 33 osd.33
    device 34 osd.34
    device 35 osd.35
    device 36 osd.36
    device 37 osd.37
    device 38 osd.38
    device 39 osd.39
    device 40 osd.40
    device 41 osd.41
    device 42 osd.42
    device 43 osd.43
    device 44 osd.44
    device 45 osd.45
    device 46 osd.46
    device 47 osd.47
    device 48 osd.48
    device 49 osd.49
    device 50 osd.50
    device 51 osd.51
    device 52 osd.52
    device 53 osd.53
    device 54 osd.54
    device 55 osd.55
    device 56 osd.56
    device 57 osd.57
    device 58 osd.58
    device 59 osd.59
    device 60 osd.60
    device 61 osd.61
    device 62 osd.62
    device 63 osd.63
    device 64 osd.64
    device 65 osd.65
    device 66 osd.66
    device 67 osd.67
    
    # types
    type 0 osd
    type 1 host
    type 2 chassis
    type 3 rack
    type 4 row
    type 5 pdu
    type 6 pod
    type 7 room
    type 8 datacenter
    type 9 region
    type 10 root
    
    # buckets
    host ceph30 {
            id -5           # do not change unnecessarily
            # weight 5.000
            alg straw
            hash 0  # rjenkins1
            item osd.19 weight 0.910
            item osd.17 weight 0.910
            item osd.20 weight 0.680
            item osd.16 weight 0.680
            item osd.18 weight 0.910
            item osd.21 weight 0.910
    }
    host ceph20 {
            id -6           # do not change unnecessarily
            # weight 3.890
            alg straw
            hash 0  # rjenkins1
            item osd.22 weight 0.550
            item osd.24 weight 0.680
            item osd.25 weight 0.680
            item osd.26 weight 0.680
            item osd.27 weight 0.680
            item osd.23 weight 0.620
    }
    host ceph10 {
            id -7           # do not change unnecessarily
            # weight 3.639
            alg straw
            hash 0  # rjenkins1
            item osd.28 weight 0.910
            item osd.30 weight 0.910
            item osd.31 weight 0.910
            item osd.29 weight 0.909
    }
    rack skwer {
            id -10          # do not change unnecessarily
            # weight 12.529
            alg straw
            hash 0  # rjenkins1
            item ceph30 weight 5.000
            item ceph20 weight 3.890
            item ceph10 weight 3.639
    }
    host ceph35 {
            id -2           # do not change unnecessarily
            # weight 5.410
            alg straw
            hash 0  # rjenkins1
            item osd.0 weight 0.900
            item osd.1 weight 0.900
            item osd.2 weight 0.900
            item osd.3 weight 0.900
            item osd.5 weight 0.900
            item osd.4 weight 0.910
    }
    host ceph25 {
            id -3           # do not change unnecessarily
            # weight 4.310
            alg straw
            hash 0  # rjenkins1
            item osd.6 weight 0.680
            item osd.7 weight 0.680
            item osd.8 weight 0.680
            item osd.9 weight 0.680
            item osd.11 weight 0.680
            item osd.10 weight 0.910
    }
    host ceph15 {
            id -4           # do not change unnecessarily
            # weight 3.640
            alg straw
            hash 0  # rjenkins1
            item osd.12 weight 0.910
            item osd.13 weight 0.910
            item osd.14 weight 0.910
            item osd.15 weight 0.910
    }
    rack nzoz {
            id -20          # do not change unnecessarily
            # weight 13.360
            alg straw
            hash 0  # rjenkins1
            item ceph35 weight 5.410
            item ceph25 weight 4.310
            item ceph15 weight 3.640
    }
    root default {
            id -1           # do not change unnecessarily
            # weight 25.889
            alg straw
            hash 0  # rjenkins1
            item skwer weight 12.529
            item nzoz weight 13.360
    }
    host ceph40-ssd {
            id -16          # do not change unnecessarily
            # weight 0.296
            alg straw
            hash 0  # rjenkins1
            item osd.32 weight 0.171
            item osd.52 weight 0.125
    }
    host ceph50-ssd {
            id -19          # do not change unnecessarily
            # weight 0.296
            alg straw
            hash 0  # rjenkins1
            item osd.42 weight 0.171
            item osd.53 weight 0.125
    }
    rack skwer-ssd {
            id -9           # do not change unnecessarily
            # weight 0.592
            alg straw
            hash 0  # rjenkins1
            item ceph40-ssd weight 0.296
            item ceph50-ssd weight 0.296
    }
    host ceph45-ssd {
            id -17          # do not change unnecessarily
            # weight 0.296
            alg straw
            hash 0  # rjenkins1
            item osd.37 weight 0.171
            item osd.54 weight 0.125
    }
    host ceph55-ssd {
            id -22          # do not change unnecessarily
            # weight 0.296
            alg straw
            hash 0  # rjenkins1
            item osd.47 weight 0.171
            item osd.55 weight 0.125
    }
    rack nzoz-ssd {
            id -11          # do not change unnecessarily
            # weight 0.592
            alg straw
            hash 0  # rjenkins1
            item ceph45-ssd weight 0.296
            item ceph55-ssd weight 0.296
    }
    root ssd {
            id -8           # do not change unnecessarily
            # weight 1.184
            alg straw
            hash 0  # rjenkins1
            item skwer-ssd weight 0.592
            item nzoz-ssd weight 0.592
    }
    host ceph40-sata {
            id -15          # do not change unnecessarily
            # weight 7.272
            alg straw
            hash 0  # rjenkins1
            item osd.33 weight 1.818
            item osd.34 weight 1.818
            item osd.35 weight 1.818
            item osd.36 weight 1.818
    }
    host ceph50-sata {
            id -21          # do not change unnecessarily
            # weight 7.272
            alg straw
            hash 0  # rjenkins1
            item osd.43 weight 1.818
            item osd.44 weight 1.818
            item osd.45 weight 1.818
            item osd.46 weight 1.818
    }
    rack skwer-sata {
            id -13          # do not change unnecessarily
            # weight 14.544
            alg straw
            hash 0  # rjenkins1
            item ceph40-sata weight 7.272
            item ceph50-sata weight 7.272
    }
    host ceph45-sata {
            id -18          # do not change unnecessarily
            # weight 7.272
            alg straw
            hash 0  # rjenkins1
            item osd.38 weight 1.818
            item osd.39 weight 1.818
            item osd.40 weight 1.818
            item osd.41 weight 1.818
    }
    host ceph55-sata {
            id -23          # do not change unnecessarily
            # weight 7.272
            alg straw
            hash 0  # rjenkins1
            item osd.48 weight 1.818
            item osd.49 weight 1.818
            item osd.50 weight 1.818
            item osd.51 weight 1.818
    }
    rack nzoz-sata {
            id -14          # do not change unnecessarily
            # weight 14.544
            alg straw
            hash 0  # rjenkins1
            item ceph45-sata weight 7.272
            item ceph55-sata weight 7.272
    }
    root sata {
            id -12          # do not change unnecessarily
            # weight 29.088
            alg straw
            hash 0  # rjenkins1
            item skwer-sata weight 14.544
            item nzoz-sata weight 14.544
    }
    host ceph10-ssd {
            id -27          # do not change unnecessarily
            # weight 0.296
            alg straw
            hash 0  # rjenkins1
            item osd.56 weight 0.171
            item osd.57 weight 0.125
    }
    host ceph20-ssd {
            id -29          # do not change unnecessarily
            # weight 0.278
            alg straw
            hash 0  # rjenkins1
            item osd.60 weight 0.175
            item osd.61 weight 0.103
    }
    host ceph30-ssd {
            id -31          # do not change unnecessarily
            # weight 0.278
            alg straw
            hash 0  # rjenkins1
            item osd.64 weight 0.175
            item osd.65 weight 0.103
    }
    rack rbd-cache-skwer {
            id -25          # do not change unnecessarily
            # weight 0.852
            alg straw
            hash 0  # rjenkins1
            item ceph10-ssd weight 0.296
            item ceph20-ssd weight 0.278
            item ceph30-ssd weight 0.278
    }
    host ceph15-ssd {
            id -28          # do not change unnecessarily
            # weight 0.296
            alg straw
            hash 0  # rjenkins1
            item osd.58 weight 0.171
            item osd.59 weight 0.125
    }
    host ceph25-ssd {
            id -30          # do not change unnecessarily
            # weight 0.278
            alg straw
            hash 0  # rjenkins1
            item osd.62 weight 0.175
            item osd.63 weight 0.103
    }
    host ceph35-ssd {
            id -32          # do not change unnecessarily
            # weight 0.278
            alg straw
            hash 0  # rjenkins1
            item osd.66 weight 0.175
            item osd.67 weight 0.103
    }
    rack rbd-cache-nzoz {
            id -26          # do not change unnecessarily
            # weight 0.852
            alg straw
            hash 0  # rjenkins1
            item ceph15-ssd weight 0.296
            item ceph25-ssd weight 0.278
            item ceph35-ssd weight 0.278
    }
    root rbd-cache {
            id -24          # do not change unnecessarily
            # weight 1.704
            alg straw
            hash 0  # rjenkins1
            item rbd-cache-skwer weight 0.852
            item rbd-cache-nzoz weight 0.852
    }
    
    # rules
    rule replicated_ruleset {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step choose firstn 2 type rack
            step chooseleaf firstn 2 type host
            step emit
            step take default
            step chooseleaf firstn -2 type osd
            step emit
    }
    rule ssd {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take ssd
            step choose firstn 2 type rack
            step chooseleaf firstn 2 type host
            step emit
            step take ssd
            step chooseleaf firstn -2 type osd
            step emit
    }
    rule sata {
            ruleset 2
            type replicated
            min_size 1
            max_size 10
            step take sata
            step choose firstn 2 type rack
            step chooseleaf firstn 2 type host
            step emit
            step take sata
            step chooseleaf firstn -2 type osd
            step emit
    }
    rule rbd-cache {
            ruleset 3
            type replicated
            min_size 1
            max_size 10
            step take rbd-cache
            step choose firstn 2 type rack
            step chooseleaf firstn 2 type host
            step emit
            step take rbd-cache
            step chooseleaf firstn -2 type osd
            step emit
    }
    
    # end crush map
    
    Pool "rbd-cache" is set as cache tier for pool "rbd", pool "ssd" is set as cache tier for pool "sata".

    Code:
     ceph osd pool ls detail
    pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 81958 lfor 70295 flags hashpspool tiers 6 read_tier 6 write_tier 6 min_read_recency_for_promote 3 min_write_recency_for_promote 3 stripe_width 0
            removed_snaps [1~2,4~12,17~2e,46~121,16b~a]
    pool 4 'ssd' replicated size 3 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 81967 flags hashpspool,incomplete_clones tier_of 5 cache_mode readforward target_bytes 298195056179 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 600s x6 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
            removed_snaps [1~14d,150~1e,16f~8]
    pool 5 'sata' replicated size 3 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 512 pgp_num 512 last_change 81967 lfor 66807 flags hashpspool tiers 4 read_tier 4 write_tier 4 stripe_width 0
            removed_snaps [1~14d,150~1e,16f~8]
    pool 6 'rbd-cache' replicated size 3 min_size 1 crush_ruleset 3 object_hash rjenkins pg_num 512 pgp_num 512 last_change 81958 flags hashpspool,incomplete_clones tier_of 2 cache_mode readforward target_bytes 429496729600 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 600s x6 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
            removed_snaps [1~2,4~12,17~2e,46~121,16b~a]
    
    Servers ceph40, ceph45, ceph50 and ceph55 have better SATA disks (WDC WD2004FBYZ). The pools "ssd" and "sata" are placed on these servers. The RAID controller is an H700 with 512 MB cache; all disks are configured as single-disk RAID 0.
     
  11. mateusz

    mateusz New Member

    Joined:
    Aug 21, 2014
    Messages:
    11
    Likes Received:
    0
    Yes, I'm sure, please look at the output of ceph osd pool ls detail above.
    Did you mean avq and avio?
    Please look at the charts ceph_reads and ceph_writes; they are parsed from today's ceph -w output.
    (Attached: ceph_reads.PNG, ceph_writes.PNG)
     
  12. Q-wulf

    Q-wulf Active Member

    Joined:
    Mar 3, 2013
    Messages:
    593
    Likes Received:
    28
    Okay, let me recap this based on the information you provided:

    You have 10 servers:
    • each has a 1G link for the public network
    • each has a 1G link for the cluster network (exception: node ceph30 - shared with the public network)
    • each server acts as a MON (10 MONs total)
    • you have split spinners from SSDs using some sort of CRUSH location hook script.

    OSD spread:
    • Ceph 10
      • 2x SSD (2 separate weights)
      • 4x SATA (uniform weight)
    • Ceph 15
      • 2x SSD (2 separate weights)
      • 4x SATA (uniform weight)
    • Ceph 20
      • 2x SSD (2 separate weights)
      • 6x SATA (3 separate weights)
    • Ceph 25
      • 2x SSD (2 separate weights)
      • 6x SATA (2 separate weights)
    • Ceph 30
      • 2x SSD (2 separate weights)
      • 6x SATA (2 separate weights)
    • Ceph 35
      • 2x SSD (2 separate weights)
      • 6x SATA (2 separate weights)
    • Ceph 40
      • 2x SSD (2 separate weights)
      • 4x SATA (uniform weight)
    • Ceph 45
      • 2x SSD (2 separate weights)
      • 4x SATA (uniform weight)
    • Ceph 50
      • 2x SSD (2 separate weights)
      • 4x SATA (uniform weight)
    • Ceph 55
      • 2x SSD (2 separate weights)
      • 4x SATA (uniform weight)

    Bucket-Type spread:
    • root rbd-cache
      • Rack rbd-cache-skwer
        • host Ceph10-ssd
        • host Ceph20-ssd
        • host Ceph30-ssd
      • rack rbd-cache-nzoz
        • host Ceph15-ssd
        • host Ceph25-ssd
        • host Ceph35-ssd
    • root default
      • Skwer
        • host Ceph30
        • host Ceph20
        • host Ceph10
      • nzoz
        • host Ceph35
        • host Ceph25
        • host Ceph15
    • root ssd
      • skwer-ssd
        • host Ceph40-ssd
        • host Ceph50-ssd
      • Nzoz-SSD
        • host Ceph45-ssd
        • host Ceph55-ssd
    • root sata
      • rack Skwer-sata
        • host Ceph40-sata
        • host Ceph50-sata
      • rack Nzoz-sata
        • host Ceph45-sata
        • host Ceph55-sata

    Crush rule / pool config :

    • pool rbd
        • Crush-rule: default
        • root: default
      • Cache-pool rbd-cache
        • Crush-rule: rbd-cache
        • root: rbd-cache
    • pool sata
        • Crush-rule: sata
        • root: sata
      • Cache-pool ssd
        • Crush-rule: ssd
        • root: ssd



    Questions I still have:

    Q1: SSDs: Your crush map has them added at different weights. This leads me to believe that different amounts of SSD space have been allocated to these SSD OSDs. Can you shed some light on the exact config of these SSDs?

    Q2: You seem to be using differently weighted HDD-based OSDs on the same node and cluster. Any chance these have different performance characteristics? It looks like you use at least 4 different types of HDDs in the cluster.

    Q3: You mentioned RAID devices... where and how do you use RAID?


    Things I can already say (in no particular order):

    A1. You effectively have 2 logical clusters under the one physical cluster (or management engine).

    Cluster 1: hosts pool rbd (and its caching pool). It is separated into 2 racks with 3 nodes each, 6 backing OSDs per node.
    Cluster 2: hosts pool sata (and its caching pool). It is separated into 2 racks with 2 nodes each, 4 backing OSDs per node.

    If I were running this setup, I'd have separated these into 2 physical clusters. Not that there is any need to change this now.
    You just have to be aware that, performance-wise, you basically have a 32-OSD cluster AND a 16-OSD cluster.

    A2. Too many monitors (MONs)
    compare http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/
    You basically run more MONs than you need to. Given your rack config (assuming it mirrors the physical rack configuration), I'd run an odd number of MONs, at least 1 per logical rack. In your situation I'd run 5 MONs. Better yet, set up 3 dedicated MONs with more network links for client/cluster network communication.

    compare: http://docs.ceph.com/docs/jewel/start/hardware-recommendations/
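    Dropping a surplus MON is roughly this (a sketch only; <id> stands for the monitor name, e.g. ceph30 in your naming):
    Code:
    # on the node being demoted (sysvinit or systemd, whichever applies)
    service ceph stop mon.<id>        # or: systemctl stop ceph-mon@<id>
    ceph mon remove <id>
    # then delete the matching [mon.<id>] section from ceph.conf on all nodes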

    A3. You are most likely (severely) network-bottlenecked.
    You can test this by running your benchmarks and simultaneously monitoring your network links with something like e.g. nload.
    Given your already deployed resources, I'd use 2x1G for the public and 2x1G for the cluster network.
    Compare: http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/
    If that is still not enough to alleviate your network bottleneck (which I suspect it won't be), I'd put the constrained network on a 10G link and use 4x1G for the other, or even go 10G and 10G.
    But you basically have to benchmark this.
    edit: given that you use 2 logical clusters, you might not need to upgrade all nodes to 10G links, seeing how one is a 4-node cluster and the other a 6-node cluster with 3 times the OSDs, and you probably have different performance requirements for the two clusters.
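    A minimal sketch of what a bonded 2x1G cluster link could look like in /etc/network/interfaces (address and interface names are only examples based on ceph40; assumes the ifenslave package and LACP (802.3ad) configured on the switch ports):
    Code:
    auto bond0
    iface bond0 inet static
            address 10.20.4.40
            netmask 255.255.252.0
            bond-slaves em3 em4
            bond-mode 802.3ad
            bond-miimon 100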

    A4. That Ceph30 node with a single 1G network link can't really be helping matters.
    I'd do the network constraint test on that one first.

    edit: Typos & added Question 3
    edit2: expanded on A3
     
    #12 Q-wulf, Mar 17, 2017
    Last edited: Mar 17, 2017
    Alwin likes this.
  13. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,822
    Likes Received:
    158
    Hi,
    to me it looks like the Ceph cluster is still quite busy - especially with writes. So you can't expect good performance during tests.

    Udo
     
  14. mateusz

    mateusz New Member

    Joined:
    Aug 21, 2014
    Messages:
    11
    Likes Received:
    0
    Ceph30 also has a separate 1G interface for the cluster network.

    As I wrote before, all SSD drives are identical. On each server there are 2 SSDs; please look at the partition tables for these drives:
    • The first, with system, journals and an OSD:
    Code:
     parted /dev/sda
    Disk /dev/sda: 199GB
    Sector size (logical/physical): 512B/512B
    Partition Table: msdos
    
    Number  Start   End     Size    Type      File system  Flags
     1      1049kB  256MB   255MB   primary   ext2         boot
     2      256MB   12.3GB  12.0GB  primary                lvm
     3      12.3GB  199GB   187GB   extended               lba
     5      12.6GB  24.7GB  12.1GB  logical
     6      24.7GB  36.8GB  12.1GB  logical
     7      36.8GB  48.9GB  12.1GB  logical
     8      48.9GB  60.9GB  12.1GB  logical
     9      60.9GB  199GB   139GB   logical   xfs
    
    • And the second, with only an OSD:
    Code:
     parted /dev/sdf
    Disk /dev/sdf: 199GB
    Sector size (logical/physical): 512B/512B
    Partition Table: gpt
    
    Number  Start   End     Size    File system  Name          Flags
     2      1049kB  10.7GB  10.7GB               ceph journal
     1      10.7GB  199GB   189GB   xfs          ceph data
    I think it isn't a recommended configuration to have the system, journals and an OSD on the same physical drive, but I have no idea how to improve this config while still utilizing the maximum of the SSD.
    Yes, the HDD drives on ceph10-ceph35 are a mix of 5.2k and 7.2k; unfortunately it's an old cluster and drives were replaced without unification.
    Servers ceph40-ceph55 have uniform drives.
    Because the Dell H700i doesn't have a JBOD setting, I set up RAID 0 on each drive.

    Yes, it is because of the difference in performance. The servers in the 16-OSD part have better hardware and are new; the 32-OSD part of the cluster will be removed in the future.
    Thank you for this.
    For a 1-hour test with nload, the average incoming and outgoing transfer is ~150 Mbit/s on each server. The maximum transfer is ~400 Mbit/s.
    Ceph30 has two 1G interfaces; em1 and p4p2 are on different network adapters.
    A test with iperf from ceph40 to ceph45 gives me ~930 Mbit/s.
     
  15. Q-wulf

    Q-wulf Active Member

    Joined:
    Mar 3, 2013
    Messages:
    593
    Likes Received:
    28
    That is one of your bottlenecks right there.

    If you use one SSD as journal + OSD and another one only as an OSD, and their weights are similar, then statistically they are equally likely to be used as the primary OSD.
    This "multipurpose SSD" handles the following writes:
    • 4x journal writes for the 4x HDD OSDs attached to it
    • 1x journal write for its own OSD
    • 1x write to its own OSD
    • writes done by the system
    Which most likely makes it significantly slower than the OSD-only SSD.

    You basically tried to maximize the space your cache can hold, but in the process you slowed your cache performance way down. It also leaves the other SSD underutilised in this scenario.

    You have 4 options here:

    • Option 1:
      • SSD-0: Buy a "cheap" OS-SSD
      • SSD-1: Journals
      • SSD-2: Cache
    • Option 2:
      • SSD-1: OS + Journals
      • SSD-2: Cache
    • Option 3:
      • SSD-1: OS + Cache
      • SSD-2: Journals (this one will most likely reach its end first)
    • Option 4:
      • SSD-1: OS + 2x Journals + cache
      • SSD-2: 2x Journals + cache

    Option 1: gives you the best performance of the cache, half the cache-size (will wear out the OS disk first)
    Option 2: second best performance of cache (your journal SSD will wear out first)
    Option 3: slowest performance of cache, maximum utilisation of Cache-size (will wear out at same rate)


    Did you manually set the weight of these drives? If not, the weight is based on the disk size (which then must also be different).

    You have another bottleneck right there.
    Let's take this example from ceph30:
    item osd.16 weight 0.680
    item osd.18 weight 0.910

    osd.16 is roughly 25% less likely to be selected as an OSD than osd.18. Unless osd.16 is exactly 25% less performant, you are leaving large chunks of performance on the table, because osd.18 is selected more often.

    Now you could figure out exactly how big the difference is and manually adjust the weight, but quite honestly, you'll never account for all the factors that go into this equation, nor is it worth the time.
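    If you did want to adjust it by hand, it is a one-liner per OSD (the value is whatever you decide on):
    Code:
    ceph osd crush reweight osd.16 <new-weight>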

    You could also go and remove the 5.2k drives, and it MIGHT be worth it, but my gut tells me that removing 30% of the OSDs on a node is not worth the performance jump you get from pure 4x 7.2k drives. Unless we are talking really, really old drives, at which point they should probably not be used anyway.

    In short:
    Best performance is maintained when using same-speed, same-capacity drives as OSDs.



    Because Dell H700i doesn't have JBOD setting, I setup raid0 on each drive.
    Let me rephrase that.

    On which node(s) did you RAID-0 which drives?

    Or are you saying you "RAID-0" each drive into its own volume?
    What type of test did you run?
    Did you load up your VMs and run a benchmark on such a VM?
    You should be maxing out at least one 1G network link - and by that I mean at least the 1G link your VM accesses its storage over.
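    Something along these lines would do as a quick check (device and NIC names are assumptions; it only reads, so it is non-destructive):
    Code:
    # inside the test VM: sustained large-block reads from the rbd-backed disk
    fio --name=seqread --ioengine=libaio --filename=/dev/vdb --rw=read --bs=4M \
        --iodepth=16 --direct=1 --runtime=60 --time_based
    # at the same time on the PVE host: watch the storage-facing NIC
    nload vmbr0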
     
    Alwin likes this.
  16. SilverNodashi

    SilverNodashi Member

    Joined:
    Jul 30, 2017
    Messages:
    95
    Likes Received:
    2
    @mateusz did you ever get to the bottom of your performance problem?
     
  17. dgd950712

    dgd950712 New Member

    Joined:
    Nov 8, 2018
    Messages:
    1
    Likes Received:
    0
    Hi! I'm doing a comparative study and I need to know the maximum storage capacity of Ceph. Do you know where I can find that information from a reliable source?
     
  18. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,822
    Likes Received:
    158
    Harhar,
    good joke!

    A while ago Ceph described itself as "the petabyte storage"... so I'm sure it can provide more space than you need.

    Look at ceph.com - under Ceph block devices it says: images up to 16 exabytes.

    Udo
     