High I/O slow guests

Discussion in 'Proxmox VE: Installation and configuration' started by tmwl, Apr 23, 2019.

  1. tmwl

    tmwl New Member
    Proxmox Subscriber

    Joined:
    Mar 4, 2019
    Messages:
    5
    Likes Received:
    0
    Hi,

    we have a few problems with proxmox and high I/O. We use ZFS in the following configuration:

    Code:
    :~# zpool status -v
      pool: rpool
     state: ONLINE
      scan: scrub repaired 0B in 33h43m with 0 errors on Mon Apr 15 10:07:13 2019
    config:
    
            NAME         STATE     READ WRITE CKSUM
            rpool        ONLINE       0     0     0
              raidz2-0   ONLINE       0     0     0
                sda      ONLINE       0     0     0
                sdb      ONLINE       0     0     0
                sdc      ONLINE       0     0     0
                sdd      ONLINE       0     0     0
                sde      ONLINE       0     0     0
                sdf      ONLINE       0     0     0
                sdg      ONLINE       0     0     0
            logs
              nvme0n1p1  ONLINE       0     0     0
            cache
              nvme0n1p2  ONLINE       0     0     0
            spares
              sdh        AVAIL
    
    errors: No known data errors
    The HBA is an LSI Logic SAS 9300-8i SGL, and for the ZIL and l2arc we use an Intel Optane 900P 280GB PCIe card. The HDDs are HGST HUS724020ALS640. The Proxmox version is 5.3-9.
    When a guest (Ubuntu 18.04, IBM Domino, btrfs) reads ~50 MB/s from disk, the I/O delay (node) rises to ~30%. This high I/O affects the other guests (they are still functional, but performance is bad). Is this normal behavior? To me, 50 MB/s is not a high load.

    Maybe you have a solution for my problem.
     
  2. 6uellerbpanda

    6uellerbpanda Member
    Proxmox Subscriber

    Joined:
    Sep 15, 2015
    Messages:
    45
    Likes Received:
    5
    this may sound strange, but which best practices etc. did you follow when configuring your zpool?
    and what is your typical workload?

    please also post the arc_summary output
     
  3. mailinglists

    mailinglists Active Member

    Joined:
    Mar 14, 2012
    Messages:
    353
    Likes Received:
    33
    Let me start by saying that, in my experience, ZFS is about twice as slow as other systems by design. However, we still use it, and with the right tuning and usage it can be really nice and performant.

    So ~50 MB/s from disk can be a pretty high value if you read in small chunks, where each read/write uses one IOP.
    To keep the VM from bothering other guests, you can limit its read/write IOPS.
    Either experiment to find values that keep your system stable, or do actual tests with fio or bonnie++.

    Here are some tests I use to benchmark storage; I then set the limits and test again with a real-life workload.
    1. fio sequential test:
    fio --filename=brisi --sync=1 --rw=write --bs=10M --numjobs=1 --iodepth=1 --size=3000MB --name=test
    2. fio with smaller write blocks:
    fio --filename=brisi --sync=1 --rw=write --bs=512k --numjobs=1 --iodepth=1 --size=3000MB --name=test
    3. multiple streams at once, say 48:
    fio --filename=brisi --sync=1 --rw=write --bs=512k --numjobs=48 --iodepth=1 --size=333MB --name=test
    4. random write/read test, 25% write 75% read:
    fio --randrepeat=1 --ioengine=libaio --direct=1/0 --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
    5. random reads:
    fio --randrepeat=1 --ioengine=libaio --direct=0/1 --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=8G --readwrite=randread
    6. bonnie++ -d brisidir/ -b -s 2000MB -r 1000MB -c 1 -u root
    7. time bonnie++ -d brisidir/ -b -s 2000MB:4k -r 1000MB -c 1 -u root
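Once you have numbers from the benchmarks, the per-VM limit can be set on the virtual disk itself. A hedged sketch using Proxmox's qm tool; the VM ID (100), disk slot (scsi0), storage name and limit values are placeholders you would replace with your own:

```shell
# Cap a VM's virtual disk at 500 read / 500 write IOPS (illustrative values;
# derive real ones from your fio results). The disk spec must match the
# existing entry in /etc/pve/qemu-server/100.conf.
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iops_rd=500,iops_wr=500
```

Throughput can be capped the same way with the mbps_rd/mbps_wr options.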

    Also, striped mirrors (RAID 10) should provide more IOPS than RAIDZ2, if IOPS are what you are lacking.

    Also, in my experience l2arc (cache) is never really used in any relevant way unless you tune it for more aggressive usage. By default its max write speed is a laughable 8 MB/s! I mean, WTF... You should tune l2arc_write_max, l2arc_write_boost, l2arc_noprefetch and l2arc_headroom if you actually plan to use it. I will soon disable it on most of my systems.
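If you do plan to tune it, you can inspect the current values at runtime first; a sketch for ZFS on Linux (the module parameter paths may differ between ZFS versions):

```shell
# Print each L2ARC tunable as "path:value"
grep . /sys/module/zfs/parameters/l2arc_write_max \
       /sys/module/zfs/parameters/l2arc_write_boost \
       /sys/module/zfs/parameters/l2arc_noprefetch \
       /sys/module/zfs/parameters/l2arc_headroom
```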

    Another thing to check is that the btrfs block size matches the zvol's (default 8k).
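To compare the two sides, a sketch; the zvol name and the guest device are placeholders:

```shell
# On the Proxmox host: the zvol's block size (8k by default)
zfs get volblocksize rpool/data/vm-100-disk-0
# Inside the guest: the btrfs sector size
btrfs inspect-internal dump-super /dev/sda2 | grep sectorsize
```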

    Please paste the output of arc_summary as well as pveperf, as requested; it should give whoever helps you a bit more to work with.

    Good luck!
     
  4. tmwl

    tmwl New Member
    Proxmox Subscriber

    Joined:
    Mar 4, 2019
    Messages:
    5
    Likes Received:
    0
    Thank you for your help. Limiting IOPS is an option, but then how can you achieve good r/w speeds? All SSD? Is there an option to limit the effect of high I/O to only the guest that generates it? I only activated l2arc because there is plenty of space left on the SSD.

    The result of arc_summary is attached to this post.

    zpool settings:

    Code:
    ------------------------------------------------------------------------
    zpool get all
    NAME   PROPERTY                       VALUE                          SOURCE
    rpool  size                           12.6T                          -
    rpool  capacity                       52%                            -
    rpool  altroot                        -                              default
    rpool  health                         ONLINE                         -
    rpool  guid                           960260605231179924             -
    rpool  version                        -                              default
    rpool  bootfs                         -                              default
    rpool  delegation                     on                             default
    rpool  autoreplace                    off                            default
    rpool  cachefile                      -                              default
    rpool  failmode                       wait                           default
    rpool  listsnapshots                  off                            default
    rpool  autoexpand                     off                            default
    rpool  dedupditto                     0                              default
    rpool  dedupratio                     1.00x                          -
    rpool  free                           6.00T                          -
    rpool  allocated                      6.62T                          -
    rpool  readonly                       off                            -
    rpool  ashift                         12                             local
    rpool  comment                        -                              default
    rpool  expandsize                     -                              -
    rpool  freeing                        0                              -
    rpool  fragmentation                  18%                            -
    rpool  leaked                         0                              -
    rpool  multihost                      off                            default
    rpool  feature@async_destroy          enabled                        local
    rpool  feature@empty_bpobj            active                         local
    rpool  feature@lz4_compress           active                         local
    rpool  feature@multi_vdev_crash_dump  enabled                        local
    rpool  feature@spacemap_histogram     active                         local
    rpool  feature@enabled_txg            active                         local
    rpool  feature@hole_birth             active                         local
    rpool  feature@extensible_dataset     active                         local
    rpool  feature@embedded_data          active                         local
    rpool  feature@bookmarks              enabled                        local
    rpool  feature@filesystem_limits      enabled                        local
    rpool  feature@large_blocks           enabled                        local
    rpool  feature@large_dnode            enabled                        local
    rpool  feature@sha512                 enabled                        local
    rpool  feature@skein                  enabled                        local
    rpool  feature@edonr                  enabled                        local
    rpool  feature@userobj_accounting     active                         local
    
     

    Attached Files:

  5. Whatever

    Whatever Member

    Joined:
    Nov 19, 2012
    Messages:
    196
    Likes Received:
    5
    I would recommend:
    0. Increase the ARC size (set min=max); in your case the ARC size is the limiting factor (at least 32 GB in my experience)
    1. Use 3 stripes of mirrors + 2 hot spares instead of RAIDZ2 (or even 4 stripes of mirrors)
    2. Set atime=off
    3. Use ashift=12 for HDDs with a 4k block size and ashift=9 for 512B sectors instead
    4. If you don't use a ZVOL, set recordsize to a value less than the default 128k (8k for example; in my environment 16k performed best)
    5. Set sync=always and VM cache mode = none
    6. Tune l2arc as mentioned before and set l2arc_noprefetch=0
    7. Split the remaining space on the NVMe into 2 or 3 slices and attach them as cache separately (3x64 GB for example)
    8. Make sure swap is NOT enabled on a ZFS partition (if so, disable it in /etc/fstab) and set vm.swappiness=1 (or 0)
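A few of the items above expressed as commands, as a hedged sketch (pool/dataset names are placeholders, and the zfs/sysctl settings need root):

```shell
zfs set atime=off rpool           # item 2: skip access-time updates
zfs set sync=always rpool/data    # item 5: pair with VM cache mode "none"
sysctl -w vm.swappiness=1         # item 8: runtime value; persist it in /etc/sysctl.conf
swapoff -a                        # item 8: only if swap currently lives on ZFS
```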
     
    #5 Whatever, Apr 24, 2019
    Last edited: Apr 24, 2019
  6. Whatever

    Whatever Member

    Joined:
    Nov 19, 2012
    Messages:
    196
    Likes Received:
    5
    If #0 is impossible, try setting logbias=throughput and sync=always together with the L2ARC tuning.

    In some scenarios this configuration (with an Optane NVMe) gave me better results than a small ARC with the default logbias behavior.
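As a sketch, assuming the VM disks live under an rpool/data dataset (the name is a placeholder):

```shell
zfs set logbias=throughput rpool/data
zfs set sync=always rpool/data
```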
     
  7. Whatever

    Whatever Member

    Joined:
    Nov 19, 2012
    Messages:
    196
    Likes Received:
    5
    My zfs custom options set:

    Code:
    root@storageB:~# cat /etc/modprobe.d/zfs.conf
    #
    # Don't let ZFS use less than 256GB and more than 320GB
    
    options zfs zfs_arc_min=274877906944
    options zfs zfs_arc_max=343597383680
    
    options zfs l2arc_noprefetch=0
    
    options zfs l2arc_write_max=67108864
    options zfs l2arc_write_boost=134217728
    options zfs l2arc_headroom=4
    options zfs l2arc_feed_min_ms=200
    
    options zfs zfs_txg_timeout=10
    
    options zfs zfs_vdev_async_read_min_active=4
    options zfs zfs_vdev_async_read_max_active=8
    
    options zfs zfs_vdev_async_write_min_active=8
    options zfs zfs_vdev_async_write_max_active=16
    
    options zfs zfs_vdev_sync_read_min_active=8
    options zfs zfs_vdev_sync_read_max_active=16
    
    options zfs zfs_vdev_sync_write_min_active=8
    options zfs zfs_vdev_sync_write_max_active=16
    
    Note: after changing anything in /etc/modprobe.d/zfs.conf use "update-initramfs -u" and reboot
     
  8. guletz

    guletz Active Member

    Joined:
    Apr 19, 2017
    Messages:
    839
    Likes Received:
    114
    Wrong. 128k is the maximum block size, but for any smaller value the block is variable.
    Anyway, he uses a VM and not a CT, so he uses a zvol.

    @tmwl, you have bad I/O because your guest uses btrfs (a CoW filesystem) on top of another CoW filesystem (ZFS). Why do you need btrfs in your guest?


    I can guess that you have ashift=12 (4k sectors, the default on a fresh install) and the default 8k volblocksize for every VM. I also guess that your btrfs uses 4k?

    Now let's do some basic math (with my guesses, but you can apply the idea to other values). btrfs needs to write a 4k block to disk, so for a 7x HDD RAIDZ2, ZFS will stripe the write across 5 data disks.

    4k / 5 HDDs = 0.8 blocks, but the minimum is 1 block of 4k. In the end, for every 4k btrfs block written => 1 block of 4k × 5 HDDs = 20k. Combine this with RAIDZ2 = the IOPS of a single HDD, and you will see high I/O on your storage ;)
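The arithmetic above can be sketched as a quick shell calculation; the numbers follow the guesses in this post (4k guest blocks, 5 data disks, ashift=12), not measurements:

```shell
#!/bin/sh
# Write amplification estimate for a small guest write on RAIDZ2.
GUEST_BLOCK=4096   # 4k btrfs block
DATA_DISKS=5       # 7-disk raidz2 minus 2 parity disks
SECTOR=4096        # ashift=12
# 4096 / 5 = 0.8 sectors per disk, rounded up to 1 whole sector
PER_DISK=$(( (GUEST_BLOCK / DATA_DISKS + SECTOR - 1) / SECTOR ))
WRITTEN=$(( PER_DISK * SECTOR * DATA_DISKS ))
echo "${WRITTEN} bytes hit the disks for one ${GUEST_BLOCK}-byte guest write"
```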

    Good luck !
     
  9. Whatever

    Whatever Member

    Joined:
    Nov 19, 2012
    Messages:
    196
    Likes Received:
    5
    Wrong. The max block size (a "record" in ZFS terms) is equal to recordsize, but for any smaller value (small files, for example) the block is variable. In the case of PVE (a large raw VM image file) = N * recordsize + remainder.

    In the simplest case it looks like this:
    If the guest reads a 4k block, QEMU reads "recordsize".
    If the guest modifies a 4k block, QEMU reads "recordsize" and writes "recordsize".

    For sure ZFS aggregates I/O operations (and uses compression and other ABD buffer optimizations); in the real world, the transformation from guest read/modify operations to actual ZFS operations is much more complex.

    With this assumption, keeping recordsize at the default (128k) generates quite a large overhead in most workloads.
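Checking and changing it is straightforward; a sketch (the dataset name is a placeholder, and a new recordsize only applies to blocks written afterwards):

```shell
zfs get recordsize rpool/data
zfs set recordsize=16k rpool/data
```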
     
  10. guletz

    guletz Active Member

    Joined:
    Apr 19, 2017
    Messages:
    839
    Likes Received:
    114

    I have said the same thing. Read it again.

    That is true if the guest uses a dataset.


    Your statement is not useful to the author of this thread, because he uses a zvol and not a dataset.

    Most of the time it will depend on many, many things. But as a non-objective opinion (i.e. from reading various forums about ZFS datasets for many years), I have seen fewer performance problems (I/O overhead) with ZFS datasets.
     
  11. tmwl

    tmwl New Member
    Proxmox Subscriber

    Joined:
    Mar 4, 2019
    Messages:
    5
    Likes Received:
    0
    @Whatever I will try setting the ARC size min=max, but shouldn't ZFS size the ARC dynamically between min and max? Swap is on a separate SSD with the Proxmox OS. What is strange is that Proxmox uses swap although there is plenty of free RAM (vm.swappiness = 10).
    @guletz The CoW-on-CoW problem with btrfs was on my mind too. So is there a recommended guest filesystem, like XFS?
     
  12. guletz

    guletz Active Member

    Joined:
    Apr 19, 2017
    Messages:
    839
    Likes Received:
    114
    In your case xfs is far BETTER than btrfs. ext4 is also OK!
     
  13. 6uellerbpanda

    6uellerbpanda Member
    Proxmox Subscriber

    Joined:
    Sep 15, 2015
    Messages:
    45
    Likes Received:
    5