High I/O, slow guests

Mar 4, 2019
Hi,

we have a few problems with Proxmox and high I/O. We use ZFS in the following configuration:

Code:
:~# zpool status -v
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 33h43m with 0 errors on Mon Apr 15 10:07:13 2019
config:

        NAME         STATE     READ WRITE CKSUM
        rpool        ONLINE       0     0     0
          raidz2-0   ONLINE       0     0     0
            sda      ONLINE       0     0     0
            sdb      ONLINE       0     0     0
            sdc      ONLINE       0     0     0
            sdd      ONLINE       0     0     0
            sde      ONLINE       0     0     0
            sdf      ONLINE       0     0     0
            sdg      ONLINE       0     0     0
        logs
          nvme0n1p1  ONLINE       0     0     0
        cache
          nvme0n1p2  ONLINE       0     0     0
        spares
          sdh        AVAIL

errors: No known data errors

The HBA is an LSI Logic SAS 9300-8i SGL, and for the ZIL and L2ARC we use an Intel Optane 900P 280GB PCIe card. The HDDs are HGST HUS724020ALS640. The Proxmox version is 5.3-9.
When a guest (Ubuntu 18.04, IBM Domino, btrfs) reads ~50 MB/s from disk, the I/O delay on the node rises up to ~30%. This high I/O affects the other guests (they are still functional, but performance is bad). Is this normal behavior? To me, 50 MB/s is not a high load.

Maybe you have a solution for my problem.
 

This may sound strange, but which best practices did you follow when configuring your zpool?
And what is your typical workload?

Please also post the output of arc_summary.
 
Let me start by saying that in my experience ZFS is, by design, about twice as slow as other systems. However, we still use it, and with the right tuning and usage it can be really nice and performant.

So ~50 MB/s from disk can be a pretty high load if you read in small chunks, where each read/write costs one IOP.
To keep that VM from bothering the other guests, you can limit its read/write IOPS.
Either experiment to find values that keep your system stable, or do actual tests with fio or bonnie++.
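In Proxmox the limits can be set per virtual disk, roughly like this (the VM ID, storage and disk names below are only placeholders, and the numbers are something to experiment with, not a recommendation):

Code:
# cap scsi0 of VM 100 at ~300 read/write IOPS and ~100 MB/s
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iops_rd=300,iops_wr=300,mbps_rd=100,mbps_wr=100

The same limits are also available in the GUI when editing the disk.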

Here are some tests I use to benchmark storage; I then set the limits and test again with a real-life workload.
1. fio sequential write test
fio --filename=brisi --sync=1 --rw=write --bs=10M --numjobs=1 --iodepth=1 --size=3000MB --name=test
2. fio with smaller write blocks
fio --filename=brisi --sync=1 --rw=write --bs=512k --numjobs=1 --iodepth=1 --size=3000MB --name=test
3. multiple streams at once, e.g. 48
fio --filename=brisi --sync=1 --rw=write --bs=512k --numjobs=48 --iodepth=1 --size=333MB --name=test
4. random write/read test, 25% write / 75% read
fio --randrepeat=1 --ioengine=libaio --direct=1/0 --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
5. random reads
fio --randrepeat=1 --ioengine=libaio --direct=0/1 --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=8G --readwrite=randread
6. bonnie++ -d brisidir/ -b -s 2000MB -r 1000MB -c 1 -u root
7. time bonnie++ -d brisidir/ -b -s 2000MB:4k -r 1000MB -c 1 -u root

Also, striped mirrors (RAID 10) should provide more IOPS than RAIDZ2, if IOPS are what you are lacking.

Also, in my experience the L2ARC (cache) is practically never used in any relevant manner unless you tune it for more aggressive usage. By default its max write speed is a laughable 8 MB/s, I mean WTF. You should tune l2arc_write_max, l2arc_write_boost, l2arc_noprefetch and l2arc_headroom if you actually plan to use it. I will soon disable it on most of my systems.
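If you want to experiment before making it permanent, these are normal ZFS module parameters and can be changed at runtime; a rough sketch (the values are only examples, not a recommendation):

Code:
# check the current values
cat /sys/module/zfs/parameters/l2arc_write_max
cat /sys/module/zfs/parameters/l2arc_noprefetch
# raise the L2ARC feed rate to 64 MB/s (boost 128 MB/s) and let prefetched data into L2ARC
echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max
echo 134217728 > /sys/module/zfs/parameters/l2arc_write_boost
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch

To make it permanent, put the same settings into /etc/modprobe.d/zfs.conf and run update-initramfs -u.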

Another thing to check is that the btrfs block size matches the zvol volblocksize (default 8k).
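To compare the two, something like this should work (the zvol name is just an example, use your actual VM disk):

Code:
# on the Proxmox host: block size of the VM's zvol
zfs get volblocksize rpool/data/vm-100-disk-0
# inside the guest: block size reported by the btrfs filesystem
stat -f /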

Please paste the output of arc_summary as well as pveperf, as requested above; it should give whoever helps you a bit more to work with.

Good luck!
 
Thank you for your help. Limiting IOPS is an option, but then how can you achieve good read/write speeds? All SSD? Is there an option to limit the effect of high I/O to only the guest that generates it? I only activated the L2ARC because there is plenty of space left on the SSD.

The result of arc_summary is attached on this post.

zpool settings:

Code:
------------------------------------------------------------------------
zpool get all
NAME   PROPERTY                       VALUE                          SOURCE
rpool  size                           12.6T                          -
rpool  capacity                       52%                            -
rpool  altroot                        -                              default
rpool  health                         ONLINE                         -
rpool  guid                           960260605231179924             -
rpool  version                        -                              default
rpool  bootfs                         -                              default
rpool  delegation                     on                             default
rpool  autoreplace                    off                            default
rpool  cachefile                      -                              default
rpool  failmode                       wait                           default
rpool  listsnapshots                  off                            default
rpool  autoexpand                     off                            default
rpool  dedupditto                     0                              default
rpool  dedupratio                     1.00x                          -
rpool  free                           6.00T                          -
rpool  allocated                      6.62T                          -
rpool  readonly                       off                            -
rpool  ashift                         12                             local
rpool  comment                        -                              default
rpool  expandsize                     -                              -
rpool  freeing                        0                              -
rpool  fragmentation                  18%                            -
rpool  leaked                         0                              -
rpool  multihost                      off                            default
rpool  feature@async_destroy          enabled                        local
rpool  feature@empty_bpobj            active                         local
rpool  feature@lz4_compress           active                         local
rpool  feature@multi_vdev_crash_dump  enabled                        local
rpool  feature@spacemap_histogram     active                         local
rpool  feature@enabled_txg            active                         local
rpool  feature@hole_birth             active                         local
rpool  feature@extensible_dataset     active                         local
rpool  feature@embedded_data          active                         local
rpool  feature@bookmarks              enabled                        local
rpool  feature@filesystem_limits      enabled                        local
rpool  feature@large_blocks           enabled                        local
rpool  feature@large_dnode            enabled                        local
rpool  feature@sha512                 enabled                        local
rpool  feature@skein                  enabled                        local
rpool  feature@edonr                  enabled                        local
rpool  feature@userobj_accounting     active                         local
 

Attachments

  • arc_summary.txt
    13.5 KB
I would recommend to (see the example commands after this list):
0. Increase the ARC size (set min=max); in your case the ARC size is the limiting factor (at least 32 GB from my experience)
1. Use 3 striped mirrors + 2 hot spares instead of RAIDZ2 (or even 4 striped mirrors)
2. Set atime=off
3. Use ashift=12 for HDDs with 4k sectors and ashift=9 for 512-byte sectors instead
4. If you don't use a ZVOL, set recordsize to a value less than the default 128k (8k for example; in my environment 16k performed the best)
5. Set sync=always and VM cache mode = none
6. Tune the L2ARC as mentioned before and set l2arc_noprefetch=0
7. Split the remaining space on the NVMe into 2 or 3 slices and attach them as separate cache devices (3x64 GB for example)
8. Make sure swap is NOT enabled on a ZFS partition (if so, disable it in /etc/fstab) and set vm.swappiness=1 (or 0)
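Roughly, items 2, 5 and 8 translate into something like this; rpool/data and the VM/disk names are just the PVE defaults, adjust them to your setup before running anything:

Code:
# 2. no access-time updates
zfs set atime=off rpool
# 5. always honour sync writes (they then go through the Optane SLOG) ...
zfs set sync=always rpool/data
# ... and cache=none on the VM disk (also possible via the GUI)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none
# 8. keep the kernel from swapping unless it really has to
echo "vm.swappiness = 1" >> /etc/sysctl.conf
sysctl -p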
 
If #0 is impossible, try setting logbias=throughput and sync=always while tuning the L2ARC.

In some scenarios this configuration (with the Optane NVMe) gave me better results than using a small ARC with the default logbias behavior.
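As a rough sketch (rpool/data is just the default PVE dataset name, adjust as needed):

Code:
zfs set logbias=throughput rpool/data
zfs set sync=always rpool/data
zfs get logbias,sync rpool/data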
 
My zfs custom options set:

Code:
root@storageB:~# cat /etc/modprobe.d/zfs.conf
#
# Don't let ZFS use less than 256GB and more than 320GB

options zfs zfs_arc_min=274877906944
options zfs zfs_arc_max=343597383680

options zfs l2arc_noprefetch=0

options zfs l2arc_write_max=67108864
options zfs l2arc_write_boost=134217728
options zfs l2arc_headroom=4
options zfs l2arc_feed_min_ms=200

options zfs zfs_txg_timeout=10

options zfs zfs_vdev_async_read_min_active=4
options zfs zfs_vdev_async_read_max_active=8

options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=16

options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=16

options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=16

Note: after changing anything in /etc/modprobe.d/zfs.conf, run "update-initramfs -u" and reboot
 
If you don't use a ZVOL, set recordsize to a value less than the default 128k (8k for example; in my environment 16k performed the best)

Wrong. 128k is the maximum block size, but for any smaller value the block is variable.
Anyway, he uses a VM and not a CT, so he uses a zvol.

@tmvl, you have bad I/O because your guest uses btrfs (a CoW filesystem) on top of another CoW filesystem (ZFS). Why do you need btrfs in your guest?


I would guess that you have ashift=12 (4k sectors, the default on a fresh install) and the default 8k volblocksize for any VM. I also guess that your btrfs also uses 4k blocks?

Now let's do some basic math (with my guesses, but you can catch the idea with other values). Btrfs needs to write a 4k block to disk, so on a 7x HDD RAIDZ2 ZFS will stripe the write across 5 data disks.

4k / 5 disks = 0.8 blocks per disk, but the minimum is 1 block of 4k. So in the end, each 4k btrfs block that is written => 1 block of 4k x 5 disks = 20k. Combine this with the fact that a RAIDZ2 vdev has the IOPS of a single HDD, and you will see high I/O on your storage ;)
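If you want to see this on your own pool, watch the per-disk I/O while the Domino guest is busy, for example:

Code:
# per-vdev / per-disk operations and bandwidth, refreshed every second
zpool iostat -v rpool 1

If the individual HDDs show lots of small operations while the guest only reports ~50 MB/s, that is exactly this amplification at work.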

Good luck !
 
Wrong. 128k is the maximum block size, but for any smaller value the block is variable.
Anyway, he uses a VM and not a CT, so he uses a zvol.

Wrong. The maximum block size (a record in ZFS terms) is equal to recordsize, but for anything smaller (small files, for example) the block is variable. In the case of PVE (a large raw VM image file) the file = N * recordsize + remainder.

In the simplest case it looks like this:
If the guest reads a 4k block, QEMU reads one "recordsize".
If the guest modifies a 4k block, QEMU reads one "recordsize" and writes one "recordsize".

Of course ZFS aggregates I/O operations (and uses compression and other ABD buffer optimizations), so in the real world the transformation from guest reads/modifications to actual ZFS operations is much more complex.

With this assumption, keeping recordsize at the default (128k) generates quite a large overhead in most workloads.
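For reference, the relevant knobs can be checked like this (the dataset and zvol names are only examples, and volblocksize can only be set when a zvol is created):

Code:
# dataset (used for containers / raw files)
zfs get recordsize rpool/data
# zvol backing a VM disk
zfs get volblocksize rpool/data/vm-100-disk-0

For new VM disks the volblocksize can be set via the blocksize option of the ZFS storage in /etc/pve/storage.cfg, if I remember correctly.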
 
Max block size (record in terms of ZFS) is equal to recordsize but for any other less value (


I said the same thing. Read again.

If the guest reads a 4k block, QEMU reads one "recordsize".

That is true only if the guest uses a dataset.

With this assumption, keeping recordsize at the default (128k) generates quite a large overhead in most workloads.


Your statement is not useful for the author of this thread, because he uses a zvol and not a dataset.

Most of the time it will depend on many, many things. But as a non-objective opinion (i.e. from reading various forums about ZFS datasets for many years), I have seen fewer performance problems (I/O overhead) with ZFS datasets.
 
@Whatever I will try setting the ARC size min=max, but shouldn't ZFS size it dynamically between min and max? Swap is on an extra SSD together with the Proxmox OS. What is strange is that Proxmox uses swap although there is plenty of free RAM (vm.swappiness = 10).
@guletz The CoW-on-CoW problem with btrfs was on my mind too. So is there a recommendation for the guest filesystem, like XFS?
 
Doesn't slog "fix" this problem somehow?

No. The SLOG is only used for write operations, but in the end the data must still be written to the pool (after 5 seconds by default). For reads, if the data is not already present in the ARC/L2ARC, it must be read from the pool.

For any operation on RAIDZx you get roughly the IOPS of a single HDD, and random operations are mostly IOPS-bound.

So @6uellerbpanda is right.
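To see how many of the guest's reads are actually served from the ARC/L2ARC instead of the spinning disks, the hit ratios in arc_summary are the quickest indicator (the exact section names differ a bit between versions):

Code:
arc_summary | grep -i "hit ratio"

If the hit ratio is low while Domino is reading, the pool, and therefore the single-HDD IOPS of the raidz2 vdev, is doing all the work.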
 
