Proxmox VE Ceph Benchmark 2018/02

Discussion in 'Proxmox VE: Installation and configuration' started by martin, Feb 27, 2018.

  1. tuonoazzurro

    tuonoazzurro Member

    Joined:
    Oct 28, 2017
    Messages:
    54
    Likes Received:
    1
    I've been looking for this solution for about a year, but I never figured out how to do it, so I made a RAID 0 out of every single disk.
    Can you please explain how you did it?
    Thanks
     
  2. Cha0s

    Cha0s New Member

    Joined:
    Feb 9, 2018
    Messages:
    8
    Likes Received:
    0
    To be honest I don't remember how I did it. It's been quite a while, and I've since dismantled the lab, so I can't look it up for you.

    Looking into my browser history I see I've visited these two links
    https://unix.stackexchange.com/questions/665/installing-grub-2-on-a-usb-flash-drive
    https://unix.stackexchange.com/questions/28506/how-do-you-install-grub2-on-a-usb-stick

    These should get you started.
    Obviously it won't work by just blindly following those answers; they need some customizing to work with Proxmox.

    Now that I think of it, I may have installed vanilla Debian and then Proxmox on top of it.
    I really don't remember, sorry. I tried a lot of things during that period before I finally got it working.
     
  3. Cha0s

    Cha0s New Member

    Joined:
    Feb 9, 2018
    Messages:
    8
    Likes Received:
    0
    I managed to find some notes I kept back then.
    So I did install a vanilla Debian 9. Judging from my vague notes, I had the USB stick inserted during installation, used it to mount /boot, and then selected it in the final installation steps as the device to install the bootloader onto.

    Then I continued with installing Proxmox on top of Debian.

    These final steps may be outdated; better check the official installation guides.
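    For what it's worth, a minimal post-install sketch of the GRUB-on-USB part, assuming the stick shows up as /dev/sdX with a single partition holding /boot (the device names are placeholders, double-check them before running anything):
    Code:
    # mount the USB stick's boot partition (the /dev/sdX1 name is an example)
    mount /dev/sdX1 /boot
    # put GRUB's files under /boot/grub and its boot code into the stick's MBR
    grub-install --boot-directory=/boot /dev/sdX
    # regenerate the menu so GRUB finds the root filesystem on the RAID volume
    update-grub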
     
  4. tuonoazzurro

    tuonoazzurro Member

    Joined:
    Oct 28, 2017
    Messages:
    54
    Likes Received:
    1
    Thank you very much! I'll try this method.
     
  5. fips

    fips Member

    Joined:
    May 5, 2014
    Messages:
    134
    Likes Received:
    5
    Hi,
    on my P420i I had the same issue, so I installed the OS on the ODD SATA port.
     
  6. tuonoazzurro

    tuonoazzurro Member

    Joined:
    Oct 28, 2017
    Messages:
    54
    Likes Received:
    1
    That way there is no redundancy on the install disk.

    What I'm looking for is:
    2 small disks in a ZFS mirror for Proxmox
    the other disks for ZFS/Ceph
     
  7. Runestone

    Runestone New Member

    Joined:
    Oct 12, 2018
    Messages:
    1
    Likes Received:
    0
    Greetings!

    We are looking at building a 4-node HA cluster with Ceph storage on all 4 nodes and have some questions about items in the FAQ. My idea was to install the OS on prosumer SSDs, put the OSDs on enterprise SSDs, and add extra OSDs on spinners for low-use servers and backups. I may not be understanding the context of the FAQ entries below, so if someone could help me understand whether this idea is workable, that would be great.

    This answer leads me to believe spinners would be fine if big storage is needed with the caveat that it will be slow.

    This answer leads me to believe that it is not acceptable to use spinners.

    And this answer leads me to believe that nothing less than an enterprise SSD should be used, which rules out consumer and prosumer SSDs as well as spinners.


    Thanks for the help.
     
  8. Alwin

    Alwin Proxmox Staff Member
    Staff Member

    Joined:
    Aug 1, 2017
    Messages:
    2,097
    Likes Received:
    184
    @Runestone, all of this has to be seen in the context of VM/CT hosting, where high IOPS is usually needed to run the infrastructure.
     
  9. afrugone

    afrugone Member

    Joined:
    Nov 26, 2008
    Messages:
    99
    Likes Received:
    0
    Hi, I've configured a 3-server Ceph cluster using InfiniBand/IPoIB. iperf shows about 20 Gbit/s, but the rados test performs as if it were on 1 Gbit. How can I force the Ceph traffic to use the InfiniBand network?
     
  10. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,829
    Likes Received:
    158
    Hi,
    put the public network (and the mon IPs) on the InfiniBand network (if you have two networks, use a separate cluster network for the traffic between the OSDs):
    Code:
    public_network = 192.168.2.0/24
    cluster_network = 192.168.3.0/24
    
    [mon.0]
    host = pve01
    mon_addr = 192.168.2.11:6789
    
    Udo
     
  11. afrugone

    afrugone Member

    Joined:
    Nov 26, 2008
    Messages:
    99
    Likes Received:
    0
    Many thanks for your answer. I configured Ceph from the GUI, and the ceph.conf is as shown below.
    ceph.conf:
    [global]
    auth client required = cephx
    auth cluster required = cephx
    auth service required = cephx
    cluster network = 172.27.111.0/24
    fsid = 6a128c72-3400-430e-9240-9b75b0936015
    keyring = /etc/pve/priv/$cluster.$name.keyring
    mon allow pool delete = true
    osd journal size = 5120
    osd pool default min size = 2
    osd pool default size = 3
    public network = 172.27.111.0/24

    [osd]
    keyring = /var/lib/ceph/osd/ceph-$id/keyring
    [mon.STO1001]
    host = STO1001
    mon addr = 172.27.111.141:6789
    [mon.STO1002]
    host = STO1002
    mon addr = 172.27.111.142:6789
    [mon.STO1003]
    host = STO1003
    mon addr = 172.27.111.143:6789

    The InfiniBand is on a separate network, 10.10.111.0/24, and the public network is on 172.27.111.0/24, so should I put the following?

    cluster network = 10.10.111.0/24
    public network = 172.27.111.0/24
    host = STO1001
    mon addr = 172.27.111.141:6789
    host = STO1002
    mon addr = 172.27.111.142:6789
    host = STO1003
    mon addr = 172.27.111.143:6789

    With this modification the benchmark results are as follows:

    rados bench -p SSDPool 60 write --no-cleanup
    Total time run: 60.470899
    Total writes made: 2858
    Write size: 4194304
    Object size: 4194304
    Bandwidth (MB/sec): 189.05
    Stddev Bandwidth: 24.8311
    Max bandwidth (MB/sec): 244
    Min bandwidth (MB/sec): 144
    Average IOPS: 47
    Stddev IOPS: 6
    Max IOPS: 61
    Min IOPS: 36
    Average Latency(s): 0.338518
    Stddev Latency(s): 0.418556
    Max latency(s): 2.9173
    Min latency(s): 0.0226615
     
    #91 afrugone, Nov 23, 2018
    Last edited: Nov 23, 2018
  12. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,829
    Likes Received:
    158
    Hi,
    you don't have two Ceph networks!
    Don't use a cluster network; use 10.10.111.0/24 for the public network. The mons must also be part of this network!
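    For illustration only, the relevant ceph.conf pieces might then look roughly like this (the 10.10.111.x monitor addresses are made up, use your nodes' actual InfiniBand IPs; note that changing monitor addresses on a running cluster normally means recreating the monitors one by one, so check the documentation first):
    Code:
    [global]
    public network = 10.10.111.0/24
    # no "cluster network" line at all

    [mon.STO1001]
    host = STO1001
    mon addr = 10.10.111.141:6789
    [mon.STO1002]
    host = STO1002
    mon addr = 10.10.111.142:6789
    [mon.STO1003]
    host = STO1003
    mon addr = 10.10.111.143:6789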

    Udo
     
  13. afrugone

    afrugone Member

    Joined:
    Nov 26, 2008
    Messages:
    99
    Likes Received:
    0
    Sorry, but I'm a little confused by the network configuration. My network is as shown below: bond1 is gigabit and bond0 is InfiniBand with 40 Gbit interfaces, and I'm trying to get the storage traffic to go over the InfiniBand (bond0) interfaces.

    auto lo
    iface lo inet loopback
    iface eno3 inet manual
    iface enp64s0f1 inet manual
    iface eno1 inet manual
    iface enp136s0f1 inet manual

    auto ib0
    iface ib0 inet manual

    auto ib1
    iface ib1 inet manual

    auto bond1
    iface bond1 inet manual
    slaves eno1 eno3
    bond_miimon 100
    bond_mode active-backup

    auto bond0
    iface bond0 inet static
    address 10.10.111.111
    netmask 255.255.255.0
    slaves ib0 ib1
    bond_miimon 100
    bond_mode active-backup
    pre-up modprobe ib_ipoib
    pre-up echo connected > /sys/class/net/ib0/mode
    pre-up echo connected > /sys/class/net/ib1/mode
    pre-up modprobe bond0
    mtu 65520

    auto vmbr0
    iface vmbr0 inet static
    address 172.27.111.141
    netmask 255.255.252.0
    gateway 172.27.110.252
    bridge_ports bond1
    bridge_stp off
    bridge_fd 0
     
    #93 afrugone, Nov 28, 2018
    Last edited: Nov 28, 2018
  14. udo

    udo Well-Known Member
    Proxmox Subscriber

    Joined:
    Apr 22, 2009
    Messages:
    5,829
    Likes Received:
    158
    Hi,
    you should open a new thread, because this has nothing to do with Ceph benchmarking...

    Udo
     
  15. chrone

    chrone Member

    Joined:
    Apr 15, 2015
    Messages:
    110
    Likes Received:
    14
    Will there be an fio synchronous write benchmark inside a VM running on top of Proxmox and Ceph? I would love to compare numbers.

    Is 212 IOPS for a synchronous fio 4k write test in a VM acceptable? I know the Samsung SM863a SSD can push 6k IOPS as local storage.
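    For reference, the kind of fio run meant here is roughly the following (a sketch; the file name and size are arbitrary, point it at a disposable test file inside the VM):
    Code:
    # 4k synchronous, direct writes, single job, queue depth 1, for 60 seconds
    fio --name=sync-4k-write --filename=/root/fio-testfile --size=1G \
        --rw=write --bs=4k --direct=1 --sync=1 \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting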
     
    #95 chrone, Jan 8, 2019
    Last edited: Jan 9, 2019
  16. frantek

    frantek Member

    Joined:
    May 30, 2009
    Messages:
    154
    Likes Received:
    3
    My Setup:

    Initially set up with PVE 4, Ceph Hammer and a 10 GbE mesh network; since upgraded to 5.3. OSDs are 500 GB spinning disks.

    proxmox-ve: 5.3-1 (running kernel: 4.15.18-10-pve)
    pve-manager: 5.3-8 (running version: 5.3-8/2929af8e)
    pve-kernel-4.15: 5.3-1
    pve-kernel-4.15.18-10-pve: 4.15.18-32
    ceph: 12.2.10-pve1
    corosync: 2.4.4-pve1
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.1-3
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-43
    libpve-guest-common-perl: 2.0-19
    libpve-http-server-perl: 2.0-11
    libpve-storage-perl: 5.0-36
    libqb0: 1.0.3-1~bpo9
    lvm2: 2.02.168-pve6
    lxc-pve: 3.1.0-2
    lxcfs: 3.0.2-2
    novnc-pve: 1.0.0-2
    proxmox-widget-toolkit: 1.0-22
    pve-cluster: 5.0-33
    pve-container: 2.0-33
    pve-docs: 5.3-1
    pve-edk2-firmware: 1.20181023-1
    pve-firewall: 3.0-17
    pve-firmware: 2.0-6
    pve-ha-manager: 2.0-6
    pve-i18n: 1.0-9
    pve-libspice-server1: 0.14.1-1
    pve-qemu-kvm: 2.12.1-1
    pve-xtermjs: 3.10.1-1
    qemu-server: 5.0-45
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.12-pve1~bpo1

    [global]
    auth client required = cephx
    auth cluster required = cephx
    auth service required = cephx
    cluster network = 10.15.15.0/24
    filestore xattr use omap = true
    fsid = e9a07274-cba6-4c72-9788-a7b65c93e477
    keyring = /etc/pve/priv/$cluster.$name.keyring
    osd journal size = 5120
    osd pool default min size = 1
    public network = 10.15.15.0/24
    mon allow pool delete = true
    [osd]
    keyring = /var/lib/ceph/osd/ceph-$id/keyring
    [mon.2]
    host = pve03
    mon addr = 10.15.15.7:6789
    [mon.1]
    host = pve02
    mon addr = 10.15.15.6:6789
    [mon.0]
    host = pve01
    mon addr = 10.15.15.5:6789

    # begin crush map
    tunable choose_local_tries 0
    tunable choose_local_fallback_tries 0
    tunable choose_total_tries 50
    tunable chooseleaf_descend_once 1
    tunable chooseleaf_vary_r 1
    tunable chooseleaf_stable 1
    tunable straw_calc_version 1
    tunable allowed_bucket_algs 54
    # devices
    device 0 osd.0 class hdd
    device 1 osd.1 class hdd
    device 2 osd.2 class hdd
    device 3 osd.3 class hdd
    device 4 osd.4 class hdd
    device 5 osd.5 class hdd
    device 6 osd.6 class hdd
    device 7 osd.7 class hdd
    device 8 osd.8 class hdd
    device 9 osd.9 class hdd
    device 10 osd.10 class hdd
    device 11 osd.11 class hdd
    device 12 osd.12 class hdd
    device 13 osd.13 class hdd
    device 14 osd.14 class hdd
    device 15 osd.15 class hdd
    device 16 osd.16 class hdd
    device 17 osd.17 class hdd
    # types
    type 0 osd
    type 1 host
    type 2 chassis
    type 3 rack
    type 4 row
    type 5 pdu
    type 6 pod
    type 7 room
    type 8 datacenter
    type 9 region
    type 10 root
    # buckets
    host pve01 {
    id -2 # do not change unnecessarily
    id -5 class hdd # do not change unnecessarily
    # weight 2.700
    alg straw
    hash 0 # rjenkins1
    item osd.0 weight 0.450
    item osd.1 weight 0.450
    item osd.2 weight 0.450
    item osd.3 weight 0.450
    item osd.16 weight 0.450
    item osd.17 weight 0.450
    }
    host pve03 {
    id -3 # do not change unnecessarily
    id -6 class hdd # do not change unnecessarily
    # weight 2.700
    alg straw
    hash 0 # rjenkins1
    item osd.4 weight 0.450
    item osd.5 weight 0.450
    item osd.7 weight 0.450
    item osd.14 weight 0.450
    item osd.15 weight 0.450
    item osd.6 weight 0.450
    }
    host pve02 {
    id -4 # do not change unnecessarily
    id -7 class hdd # do not change unnecessarily
    # weight 2.700
    alg straw
    hash 0 # rjenkins1
    item osd.8 weight 0.450
    item osd.9 weight 0.450
    item osd.11 weight 0.450
    item osd.12 weight 0.450
    item osd.13 weight 0.450
    item osd.10 weight 0.450
    }
    root default {
    id -1 # do not change unnecessarily
    id -8 class hdd # do not change unnecessarily
    # weight 8.100
    alg straw
    hash 0 # rjenkins1
    item pve01 weight 2.700
    item pve03 weight 2.700
    item pve02 weight 2.700
    }
    # rules
    rule replicated_ruleset {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
    }
    # end crush map

    Data:

    rados bench -p rbd 60 write -b 4M -t 16 --no-cleanup

    Code:
    Total time run:         60.752370
    Total writes made:      1659
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     109.23
    Stddev Bandwidth:       35.3805
    Max bandwidth (MB/sec): 236
    Min bandwidth (MB/sec): 28
    Average IOPS:           27
    Stddev IOPS:            8
    Max IOPS:               59
    Min IOPS:               7
    Average Latency(s):     0.585889
    Stddev Latency(s):      0.286079
    Max latency(s):         1.6641
    Min latency(s):         0.0752661
    
    rados bench 60 rand -t 16 -p rbd

    Code:
    Total time run:       60.032432
    Total reads made:     25108
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   1672.96
    Average IOPS:         418
    Stddev IOPS:          19
    Max IOPS:             465
    Min IOPS:             376
    Average Latency(s):   0.0362159
    Max latency(s):       0.239073
    Min latency(s):       0.00460308
    
    Any suggestions for getting better write performance, other than switching to SSDs?
     
  17. chrone

    chrone Member

    Joined:
    Apr 15, 2015
    Messages:
    110
    Likes Received:
    14

    Converting from FileStore to BlueStore might help reduce the double-write penalty.
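    For context, rebuilding an OSD as BlueStore is essentially a destroy-and-recreate per OSD. A rough sketch with plain Ceph commands (the OSD id 0 and /dev/sdX are placeholders; do one OSD at a time, wait for HEALTH_OK in between, and check the current Ceph/Proxmox docs before trusting any of it):
    Code:
    # take the OSD out and let the cluster rebalance off it
    ceph osd out 0
    # once the data has moved, stop the daemon and remove the old FileStore OSD
    systemctl stop ceph-osd@0
    ceph osd purge 0 --yes-i-really-mean-it
    # wipe the disk and recreate the OSD (BlueStore is the default on Luminous)
    ceph-volume lvm zap /dev/sdX --destroy
    pveceph createosd /dev/sdX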
     
  18. fips

    fips Member

    Joined:
    May 5, 2014
    Messages:
    134
    Likes Received:
    5
    Recently I benchmarked Samsung's enterprise SSD 860 DCT (960 GB) with my usual benchmark setup, and the result was just horrible:
    FIO Command:
    Code:
    fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
    Result:
    BW: 1030 KB/s, IOPS: 257

    Compared with the SM863a:
    BW: 67 MB/s, IOPS: 17.4k

    It seems not every enterprise SSD is a good choice for a ceph setup...
     
  19. sg90

    sg90 Member

    Joined:
    Sep 21, 2018
    Messages:
    50
    Likes Received:
    7
    According to their sales blurb they are read-intensive disks. To be honest, it looks like a standard 860 with some extra DC features, mainly designed for write-once/read-a-lot workloads, and it has a very low TBW as well.
     
  20. victorhooi

    victorhooi Member

    Joined:
    Apr 3, 2018
    Messages:
    111
    Likes Received:
    6
    We have:
    • A 3-node cluster running Proxmox/Ceph
    • Node 1 has 48 GB of RAM, and nodes 2 and 3 have 32 GB of RAM
    • Ceph drives are Intel Optane 900P (480 GB) NVMe
    • 4 OSDs per node (12 OSDs in total)
    • NICs are Intel X520-DA2, with 10GBASE-LR going to a Unifi US-XG-16
    • The first 10 Gbit port is for Proxmox VM traffic, the second 10 Gbit port is for Ceph traffic
    I created a new pool to store VMs with 512 PGs. When I copy from a local LVM store to RADOS, I'm seeing writes stall at around 318 MiB/s:


    I then created a second pool with 128 PGs for benchmarking.
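    For reference, creating a pool like that from the CLI looks roughly like this (a sketch, not necessarily how it was done here; replication size and rule come from your pool defaults):
    Code:
    # replicated pool named "benchmarking" with 128 placement groups
    ceph osd pool create benchmarking 128 128 replicated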

    Write results:
    Code:
    root@vwnode1:~# rados bench -p benchmarking 60 write -b 4M -t 16 --no-cleanup
    ....
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
       60      16     12258     12242   816.055       788   0.0856726   0.0783458
    Total time run:         60.069008
    Total writes made:      12258
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     816.261
    Stddev Bandwidth:       17.4584
    Max bandwidth (MB/sec): 856
    Min bandwidth (MB/sec): 780
    Average IOPS:           204
    Stddev IOPS:            4
    Max IOPS:               214
    Min IOPS:               195
    Average Latency(s):     0.0783801
    Stddev Latency(s):      0.0468404
    Max latency(s):         0.437235
    Min latency(s):         0.0177178
    
    Sequential read results (I don't know why this only ran for 32 seconds):
    Code:
    root@vwnode1:~# rados bench -p benchmarking 60 seq -t 16
    ....
    Total time run:       32.608549
    Total reads made:     12258
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   1503.65
    Average IOPS:         375
    Stddev IOPS:          22
    Max IOPS:             410
    Min IOPS:             326
    Average Latency(s):   0.0412777
    Max latency(s):       0.498116
    Min latency(s):       0.00447062
    
    Random read results:
    Code:
    root@vwnode1:~# rados bench -p benchmarking 60 rand -t 16
    ....
    Total time run:       60.066384
    Total reads made:     22819
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   1519.59
    Average IOPS:         379
    Stddev IOPS:          21
    Max IOPS:             424
    Min IOPS:             320
    Average Latency(s):   0.0408697
    Max latency(s):       0.662955
    Min latency(s):       0.00172077
    
    I then cleaned-up with:
    Code:
    root@vwnode1:~# rados -p benchmarking cleanup
    Removed 12258 objects
    
    I then tested with the normal Ceph pool, that has 512 PGs (instead of the 128 PGs in the benchmarking pool)

    Write result:
    Code:
    root@vwnode1:~# rados bench -p proxmox_vms 60 write -b 4M -t 16 --no-cleanup
    ....
    Total time run:         60.041712
    Total writes made:      12132
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     808.238
    Stddev Bandwidth:       20.7444
    Max bandwidth (MB/sec): 860
    Min bandwidth (MB/sec): 744
    Average IOPS:           202
    Stddev IOPS:            5
    Max IOPS:               215
    Min IOPS:               186
    Average Latency(s):     0.0791746
    Stddev Latency(s):      0.0432707
    Max latency(s):         0.42535
    Min latency(s):         0.0200791
    
    Sequential read result - once again, only ran for 32 seconds:
    Code:
    root@vwnode1:~# rados bench -p proxmox_vms 60 seq -t 16
    ....
    Total time run:       31.249274
    Total reads made:     12132
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   1552.93
    Average IOPS:         388
    Stddev IOPS:          30
    Max IOPS:             460
    Min IOPS:             320
    Average Latency(s):   0.0398702
    Max latency(s):       0.481106
    Min latency(s):       0.00461585
    
    Random read result:
    Code:
    root@vwnode1:~# rados bench -p proxmox_vms 60 rand -t 16
    ....
    Total time run:       60.088822
    Total reads made:     23626
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   1572.74
    Average IOPS:         393
    Stddev IOPS:          25
    Max IOPS:             432
    Min IOPS:             322
    Average Latency(s):   0.0392854
    Max latency(s):       0.693123
    Min latency(s):       0.00178545
    
    Code:
    root@vwnode1:~# rados -p proxmox_vms cleanup
    Removed 12132 objects
    root@vwnode1:~# rados df
    POOL_NAME   USED   OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD     WR_OPS WR
    proxmox_vms 169GiB   43396      0 130188                  0       0        0 909519 298GiB 619697 272GiB
    
    total_objects    43396
    total_used       564GiB
    total_avail      768GiB
    total_space      1.30TiB
    
    Any ideas on why the original transfer from LVM to Ceph stalled at 371 MiB/s?

    And are the above rados bench results in line with what you might expect with this hardware?
     