ZFS causes high load, unable to figure out why

Lunar

I'm experimenting with a mirrored ZFS pool on top of LUKS devices. I've been performing write tests with dd in a VM using VirtIO SCSI, specifically with these options:

Code:
dd if=/dev/urandom of=/root/test bs=4096 status=progress

The load shoots up very high, but I can't figure out the actual cause. It's probably disk I/O, but I'm not sure what I should be checking or which options to adjust.

I have an E3-1245 V2 with 32 GB of ECC RAM operating at 1333 MHz.

Example of load after 2 minutes of writes:
Code:
root@zfs-test ~ # uptime
 05:23:51 up  1:34,  2 users,  load average: 27.57, 16.05, 9.56

My pool:
Code:
root@zfs-test ~ # zpool status
  pool: pool0
 state: ONLINE
  scan: none requested
config:

    NAME            STATE     READ WRITE CKSUM
    pool0           ONLINE       0     0     0
      mirror-0      ONLINE       0     0     0
        sda5_crypt  ONLINE       0     0     0
        sdb5_crypt  ONLINE       0     0     0

errors: No known data errors

Pool configuration:
Code:
root@zfs-test ~ # zfs get all pool0/vmstor
NAME          PROPERTY              VALUE                  SOURCE
pool0/vmstor  type                  filesystem             -
pool0/vmstor  creation              Fri Aug 17  3:51 2018  -
pool0/vmstor  used                  264G                   -
pool0/vmstor  available             2.30T                  -
pool0/vmstor  referenced            96K                    -
pool0/vmstor  compressratio         1.01x                  -
pool0/vmstor  mounted               yes                    -
pool0/vmstor  quota                 none                   default
pool0/vmstor  reservation           none                   default
pool0/vmstor  recordsize            128K                   default
pool0/vmstor  mountpoint            /pool0/vmstor          default
pool0/vmstor  sharenfs              off                    default
pool0/vmstor  checksum              on                     default
pool0/vmstor  compression           on                     inherited from pool0
pool0/vmstor  atime                 on                     default
pool0/vmstor  devices               on                     default
pool0/vmstor  exec                  on                     default
pool0/vmstor  setuid                on                     default
pool0/vmstor  readonly              off                    default
pool0/vmstor  zoned                 off                    default
pool0/vmstor  snapdir               hidden                 default
pool0/vmstor  aclinherit            restricted             default
pool0/vmstor  createtxg             36                     -
pool0/vmstor  canmount              on                     default
pool0/vmstor  xattr                 on                     default
pool0/vmstor  copies                1                      default
pool0/vmstor  version               5                      -
pool0/vmstor  utf8only              off                    -
pool0/vmstor  normalization         none                   -
pool0/vmstor  casesensitivity       sensitive              -
pool0/vmstor  vscan                 off                    default
pool0/vmstor  nbmand                off                    default
pool0/vmstor  sharesmb              off                    default
pool0/vmstor  refquota              none                   default
pool0/vmstor  refreservation        none                   default
pool0/vmstor  guid                  5999095079813220367    -
pool0/vmstor  primarycache          all                    default
pool0/vmstor  secondarycache        all                    default
pool0/vmstor  usedbysnapshots       0B                     -
pool0/vmstor  usedbydataset         96K                    -
pool0/vmstor  usedbychildren        264G                   -
pool0/vmstor  usedbyrefreservation  0B                     -
pool0/vmstor  logbias               latency                default
pool0/vmstor  dedup                 off                    local
pool0/vmstor  mlslabel              none                   default
pool0/vmstor  sync                  standard               default
pool0/vmstor  dnodesize             legacy                 default
pool0/vmstor  refcompressratio      1.00x                  -
pool0/vmstor  written               96K                    -
pool0/vmstor  logicalused           29.0G                  -
pool0/vmstor  logicalreferenced     40K                    -
pool0/vmstor  volmode               default                default
pool0/vmstor  filesystem_limit      none                   default
pool0/vmstor  snapshot_limit        none                   default
pool0/vmstor  filesystem_count      none                   default
pool0/vmstor  snapshot_count        none                   default
pool0/vmstor  snapdev               hidden                 default
pool0/vmstor  acltype               off                    default
pool0/vmstor  context               none                   default
pool0/vmstor  fscontext             none                   default
pool0/vmstor  defcontext            none                   default
pool0/vmstor  rootcontext           none                   default
pool0/vmstor  relatime              off                    default
pool0/vmstor  redundant_metadata    all                    default
pool0/vmstor  overlay               off                    default
 
Hi,

You could disable atime with atime=off and maybe set compression=off, then check whether the load is lower. Other ideas:
- check whether your kernel has AES support enabled (if not, I guess that is the root cause of the high load)
- check your LUKS on-disk block size; it must match the ZFS pool's ashift (ashift=12 for 4K sectors, ashift=9 for 512 B sectors)
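
For example, you could check both like this (just a sketch; the mapper names are the ones from your zpool status, and cryptsetup only prints the sector size on newer versions):

Code:
# is AES-NI present and loaded?
grep -m1 -o aes /proc/cpuinfo
lsmod | grep aesni
cryptsetup benchmark | grep aes     # high numbers here usually mean AES-NI is in use

# cipher and sector size of the LUKS mappings
cryptsetup status sda5_crypt
cryptsetup status sdb5_crypt

# ashift the pool was created with
zdb -C pool0 | grep ashift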
 
Hi,

You could disable atime with atime=off and maybe set compression=off, then check whether the load is lower. Other ideas:
- check whether your kernel has AES support enabled (if not, I guess that is the root cause of the high load)
- check your LUKS on-disk block size; it must match the ZFS pool's ashift (ashift=12 for 4K sectors, ashift=9 for 512 B sectors)

Thank you for your suggestion. I just tried turning compression and atime off on the pool; unfortunately, this did not improve performance. My processor does support AES-NI and has it enabled.

I've read that I should be using ashift=12 even if my disks use 512-byte sectors rather than 4096. I'm going to try ashift=9 to see if there is any difference, although from what I've read that could cause issues if a replacement disk turns out to use 4096-byte sectors.

Code:
root@zfs-test ~ # blockdev --getbsz /dev/sda5
512
root@zfs-test ~ # blockdev --getbsz /dev/sdb5
512
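
To double-check the real sector sizes (as far as I understand, blockdev --getbsz reports the kernel block size rather than the drive's sector size), something like this should show the logical and physical values:

Code:
# logical sector size, then physical sector size
blockdev --getss --getpbsz /dev/sda
blockdev --getss --getpbsz /dev/sdb
cat /sys/block/sda/queue/logical_block_size /sys/block/sda/queue/physical_block_size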
 
Thank you for your suggestion. I just tried turning compression and atime off on the pool; unfortunately, this did not improve performance. My processor does support AES-NI and has it enabled.

I've read that I should be using ashift=12 even if my disks use 512-byte sectors rather than 4096. I'm going to try ashift=9 to see if there is any difference, although from what I've read that could cause issues if a replacement disk turns out to use 4096-byte sectors.

Code:
root@zfs-test ~ # blockdev --getbsz /dev/sda5
512
root@zfs-test ~ # blockdev --getbsz /dev/sdb5
512

It doesn't seem like ashift 9 helps at all. I'm at a loss for ideas.
 
So it seems that only dd from /dev/urandom kills the server; /dev/zero doesn't make the load shoot up nearly as much. The question is: why do writes of random data hurt my zpool so much?
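
For reference, these are the two variants I'm comparing (same flags as the original test):

Code:
dd if=/dev/zero    of=/root/test bs=4096 status=progress    # load stays reasonable
dd if=/dev/urandom of=/root/test bs=4096 status=progress    # load climbs well past 20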
 
Maybe you do not have enough "randomness"; in other words, your system has low entropy? Check it with:

cat /proc/sys/kernel/random/entropy_avail

Try to observe this value while writing to disk with if=/dev/urandom.

It is difficult to give optimum values (except for "the more, the better"), but above 1000 you should be safe, and below 500 you can already expect problems. If your entropy is low, there are still ways to increase it (e.g. haveged).
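
For example, something like this lets you watch it while the dd is running:

watch -n1 cat /proc/sys/kernel/random/entropy_avail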
 
Hi,

Yes, it is better/safer to use ashift=12 with future HDD replacements in mind. I suggest trying to copy, say, an ISO file from a USB HDD to your pool. If the load stays OK while doing that, then the problem is very probably what @Rhinox says.
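
Something like this, for example (the source path is just a placeholder):

Code:
# copy a large pre-existing file to the pool and watch the load while it runs
rsync --progress /mnt/usb/debian.iso /pool0/vmstor/
uptime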
 
Maybe you do not have enough "randomness"; in other words, your system has low entropy? Check it with:

cat /proc/sys/kernel/random/entropy_avail

Try to observe this value while writing to disk with if=/dev/urandom.

It is difficult to give optimum values (except for "the more, the better"), but above 1000 you should be safe, and below 500 you can already expect problems. If your entropy is low, there are still ways to increase it (e.g. haveged).

I have haveged running on both the host and the VM, so entropy is pretty high: currently 3599 on the host and 1841 in the VM. I'm running the dd tests from the VM. Unfortunately, the load still shoots up regardless of the available entropy.
 
I tried setting sync=disabled on the pool, but it didn't really improve performance. On a server with ext4 on LUKS and mdadm RAID 1 I get about 104 MB/s writing from urandom, which is disappointing, since I expected better performance from ZFS. I'm really not sure why ZFS performs so poorly with my setup. Are there any additional details I can provide? Maybe @fabian can offer some input.
 
I've tried these options on my pool; still no luck in decreasing the load, sadly.

Recent additions:
Code:
zfs set sync=disabled pool0
zfs set checksum=off pool0
zfs set atime=off pool0
zfs set redundant_metadata=most pool0
zfs set xattr=sa pool0

ZVOL options at the moment:
Code:
root@zfs-test ~ # zfs get all pool0/vmstor/vm-100-disk-1
NAME                        PROPERTY              VALUE                  SOURCE
pool0/vmstor/vm-100-disk-1  type                  volume                 -
pool0/vmstor/vm-100-disk-1  creation              Sat Aug 18  6:16 2018  -
pool0/vmstor/vm-100-disk-1  used                  215G                   -
pool0/vmstor/vm-100-disk-1  available             2.35T                  -
pool0/vmstor/vm-100-disk-1  referenced            215G                   -
pool0/vmstor/vm-100-disk-1  compressratio         1.00x                  -
pool0/vmstor/vm-100-disk-1  reservation           none                   default
pool0/vmstor/vm-100-disk-1  volsize               256G                   local
pool0/vmstor/vm-100-disk-1  volblocksize          4K                     -
pool0/vmstor/vm-100-disk-1  checksum              off                    inherited from pool0
pool0/vmstor/vm-100-disk-1  compression           lz4                    inherited from pool0
pool0/vmstor/vm-100-disk-1  readonly              off                    default
pool0/vmstor/vm-100-disk-1  createtxg             10989                  -
pool0/vmstor/vm-100-disk-1  copies                1                      default
pool0/vmstor/vm-100-disk-1  refreservation        none                   default
pool0/vmstor/vm-100-disk-1  guid                  7292985780615750418    -
pool0/vmstor/vm-100-disk-1  primarycache          all                    default
pool0/vmstor/vm-100-disk-1  secondarycache        all                    default
pool0/vmstor/vm-100-disk-1  usedbysnapshots       0B                     -
pool0/vmstor/vm-100-disk-1  usedbydataset         215G                   -
pool0/vmstor/vm-100-disk-1  usedbychildren        0B                     -
pool0/vmstor/vm-100-disk-1  usedbyrefreservation  0B                     -
pool0/vmstor/vm-100-disk-1  logbias               latency                default
pool0/vmstor/vm-100-disk-1  dedup                 off                    default
pool0/vmstor/vm-100-disk-1  mlslabel              none                   default
pool0/vmstor/vm-100-disk-1  sync                  disabled               inherited from pool0
pool0/vmstor/vm-100-disk-1  refcompressratio      1.00x                  -
pool0/vmstor/vm-100-disk-1  written               215G                   -
pool0/vmstor/vm-100-disk-1  logicalused           213G                   -
pool0/vmstor/vm-100-disk-1  logicalreferenced     213G                   -
pool0/vmstor/vm-100-disk-1  volmode               default                default
pool0/vmstor/vm-100-disk-1  snapshot_limit        none                   default
pool0/vmstor/vm-100-disk-1  snapshot_count        none                   default
pool0/vmstor/vm-100-disk-1  snapdev               hidden                 default
pool0/vmstor/vm-100-disk-1  context               none                   default
pool0/vmstor/vm-100-disk-1  fscontext             none                   default
pool0/vmstor/vm-100-disk-1  defcontext            none                   default
pool0/vmstor/vm-100-disk-1  rootcontext           none                   default
pool0/vmstor/vm-100-disk-1  redundant_metadata    most                   inherited from pool0
 
Code:
root@zfs-test ~ # blockdev --getbsz /dev/sda5
512
root@zfs-test ~ # blockdev --getbsz /dev/sdb5
512

Because your HDDs have a 512-byte block size, your LUKS devices must use the same. With future HDD replacements in mind, your LUKS devices could instead use 4K sectors, with ZFS at ashift=12.
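
If you ever rebuild the pool, roughly like this (only a sketch; --sector-size needs LUKS2 / cryptsetup 2.x, and reformatting destroys the existing data):

Code:
cryptsetup luksFormat --type luks2 --sector-size 4096 /dev/sda5
cryptsetup luksFormat --type luks2 --sector-size 4096 /dev/sdb5
cryptsetup open /dev/sda5 sda5_crypt
cryptsetup open /dev/sdb5 sdb5_crypt

# matching 4K alignment on the ZFS side
zpool create -o ashift=12 pool0 mirror /dev/mapper/sda5_crypt /dev/mapper/sdb5_crypt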
 
Because your HDDs have a 512-byte block size, your LUKS devices must use the same. With future HDD replacements in mind, your LUKS devices could instead use 4K sectors, with ZFS at ashift=12.

It looks like the block size is the same for my LUKS devices.

Code:
root@zfs-test ~ # blockdev --getbsz /dev/mapper/sda5_crypt
512
root@zfs-test ~ # blockdev --getbsz /dev/mapper/sdb5_crypt
512
 
So it seems that only dd from /dev/urandom kills the server; /dev/zero doesn't make the load shoot up nearly as much. The question is: why do writes of random data hurt my zpool so much?
Hi,
Writing zeros to ZFS does not really generate write I/O, so that isn't a write test! Random data from /dev/urandom does produce real writes on ZFS, so that is a write test.
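
You can see the difference directly, for example (file names are only examples, and this assumes compression is enabled on the dataset):

Code:
dd if=/dev/zero    of=/pool0/vmstor/zeros.bin  bs=1M count=1024
dd if=/dev/urandom of=/pool0/vmstor/random.bin bs=1M count=1024
du -h /pool0/vmstor/zeros.bin /pool0/vmstor/random.bin   # the zeros occupy almost no space on disk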

Udo
 
Hi,
Writing zeros to ZFS does not really generate write I/O, so that isn't a write test! Random data from /dev/urandom does produce real writes on ZFS, so that is a write test.

Udo

I'm sorry, but I'm confused: writing zeros is still writing, it's just not a real-world scenario. Going back to the issue: if I write from urandom on an ext4 filesystem I have no problems and get a consistent 104 MB/s. With ZFS on the same hardware, writes crawl to a stop, dropping below 20 MB/s. Both filesystems also sit on LUKS devices with the same cipher/hash.
 
If you write zeros with compression on, the actual writes to disk are drastically reduced.
Have you checked whether it is the reading from /dev/urandom that is slow? It is very slow on my host. Now that you have the /root/test file, try copying it to see if the load is different when not reading from /dev/urandom. If you have two disks, put the copy target on the other one so you are not reading from and writing to the same disk. In the /dev/urandom case, only writes hit the disk.
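
For example (a sketch; the copy-target path is just a placeholder):

Code:
# how fast can /dev/urandom itself be read? (~1 GiB)
dd if=/dev/urandom of=/dev/null bs=4096 count=262144 status=progress

# re-write the existing test data without touching /dev/urandom
dd if=/root/test of=/mnt/otherdisk/test.copy bs=4096 status=progress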
 
I found this issue posted for ZoL

Is anyone else having issues with high load when doing writes?

I had issues on a ZFS stripe set. Write performance was poor, and I had massive UI latency during writes. After switching to EXT4 things went back to normal.
 
I noticed the same issue with setups of up to 2 striped vdevs (e.g. 1x RAIDZ, 1 mirrored pair, 1 striped pair of mirrors with 4 disks).
I've tried almost every ZFS option.

These tunings had the most noticeable effect for me:
1) Increase the ARC size (at least 32GB, with zfs_arc_max == zfs_arc_min)
2) Increase the number of vdevs to at least 3 (3 mirrored pairs in my case; you have only 1 vdev)
3) Add an L2ARC
4) Set the ashift value in line with the physical disk block size (at pool creation)
5) Set options zfs zfs_txg_timeout=15
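
For 1) and 5), e.g. via a modprobe config (values only as an illustration; adjust them to your RAM and reload the zfs module or reboot afterwards):

Code:
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_min=17179869184    # 16 GiB
options zfs zfs_arc_max=17179869184    # 16 GiB; min == max pins the ARC size
options zfs zfs_txg_timeout=15
# on Proxmox, run update-initramfs -u afterwards so the options also apply at boot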
 
I noticed the same issue with setups of up to 2 striped vdevs (e.g. 1x RAIDZ, 1 mirrored pair, 1 striped pair of mirrors with 4 disks).
I've tried almost every ZFS option.

These tunings had the most noticeable effect for me:
1) Increase the ARC size (at least 32GB, with zfs_arc_max == zfs_arc_min)
2) Increase the number of vdevs to at least 3 (3 mirrored pairs in my case; you have only 1 vdev)
3) Add an L2ARC
4) Set the ashift value in line with the physical disk block size (at pool creation)
5) Set options zfs zfs_txg_timeout=15

1) I only have 32GB of RAM, sadly, so I cannot increase the ARC to that size.
2) I cannot add more disks to this system; I currently have 2x HDDs in RAID 1.
3) I have an L2ARC, but it does not help much with writes, which are my problem.
5) I will attempt this one.
 
I've tried both ZVOLs and raw disk images on a ZFS dataset, and both show the same issue. Any intensive write to the VM's disk makes the load shoot up, reaching 32 if I let the test keep running. I've never seen load that high on ext4 systems just from writes. I've tried almost every 'zfs set' tuning option I could find, and none of them has helped. I really want to deploy ZFS in production, but I might have to give up, since writes are terrible on my system.
 
