ZFS causes high load, unable to figure out why

Lunar

I'm experimenting with a mirrored ZFS pool on top of LUKS devices. I've been performing write tests with dd in a VM using VirtIO SCSI, specifically with these options:

Code:
dd if=/dev/urandom of=/root/test bs=4096 status=progress

The load shoots up very high, but I can't figure out the actual cause. It's probably disk I/O, but I'm not sure what I should be checking or which options to adjust.

I have an E3-1245 V2 with 32 GB of ECC RAM operating at 1333 MHz.

Example of load after 2 minutes of writes:
Code:
root@zfs-test ~ # uptime
 05:23:51 up  1:34,  2 users,  load average: 27.57, 16.05, 9.56

My pool:
Code:
root@zfs-test ~ # zpool status
  pool: pool0
 state: ONLINE
  scan: none requested
config:

    NAME            STATE     READ WRITE CKSUM
    pool0           ONLINE       0     0     0
      mirror-0      ONLINE       0     0     0
        sda5_crypt  ONLINE       0     0     0
        sdb5_crypt  ONLINE       0     0     0

errors: No known data errors

Pool configuration:
Code:
root@zfs-test ~ # zfs get all pool0/vmstor
NAME          PROPERTY              VALUE                  SOURCE
pool0/vmstor  type                  filesystem             -
pool0/vmstor  creation              Fri Aug 17  3:51 2018  -
pool0/vmstor  used                  264G                   -
pool0/vmstor  available             2.30T                  -
pool0/vmstor  referenced            96K                    -
pool0/vmstor  compressratio         1.01x                  -
pool0/vmstor  mounted               yes                    -
pool0/vmstor  quota                 none                   default
pool0/vmstor  reservation           none                   default
pool0/vmstor  recordsize            128K                   default
pool0/vmstor  mountpoint            /pool0/vmstor          default
pool0/vmstor  sharenfs              off                    default
pool0/vmstor  checksum              on                     default
pool0/vmstor  compression           on                     inherited from pool0
pool0/vmstor  atime                 on                     default
pool0/vmstor  devices               on                     default
pool0/vmstor  exec                  on                     default
pool0/vmstor  setuid                on                     default
pool0/vmstor  readonly              off                    default
pool0/vmstor  zoned                 off                    default
pool0/vmstor  snapdir               hidden                 default
pool0/vmstor  aclinherit            restricted             default
pool0/vmstor  createtxg             36                     -
pool0/vmstor  canmount              on                     default
pool0/vmstor  xattr                 on                     default
pool0/vmstor  copies                1                      default
pool0/vmstor  version               5                      -
pool0/vmstor  utf8only              off                    -
pool0/vmstor  normalization         none                   -
pool0/vmstor  casesensitivity       sensitive              -
pool0/vmstor  vscan                 off                    default
pool0/vmstor  nbmand                off                    default
pool0/vmstor  sharesmb              off                    default
pool0/vmstor  refquota              none                   default
pool0/vmstor  refreservation        none                   default
pool0/vmstor  guid                  5999095079813220367    -
pool0/vmstor  primarycache          all                    default
pool0/vmstor  secondarycache        all                    default
pool0/vmstor  usedbysnapshots       0B                     -
pool0/vmstor  usedbydataset         96K                    -
pool0/vmstor  usedbychildren        264G                   -
pool0/vmstor  usedbyrefreservation  0B                     -
pool0/vmstor  logbias               latency                default
pool0/vmstor  dedup                 off                    local
pool0/vmstor  mlslabel              none                   default
pool0/vmstor  sync                  standard               default
pool0/vmstor  dnodesize             legacy                 default
pool0/vmstor  refcompressratio      1.00x                  -
pool0/vmstor  written               96K                    -
pool0/vmstor  logicalused           29.0G                  -
pool0/vmstor  logicalreferenced     40K                    -
pool0/vmstor  volmode               default                default
pool0/vmstor  filesystem_limit      none                   default
pool0/vmstor  snapshot_limit        none                   default
pool0/vmstor  filesystem_count      none                   default
pool0/vmstor  snapshot_count        none                   default
pool0/vmstor  snapdev               hidden                 default
pool0/vmstor  acltype               off                    default
pool0/vmstor  context               none                   default
pool0/vmstor  fscontext             none                   default
pool0/vmstor  defcontext            none                   default
pool0/vmstor  rootcontext           none                   default
pool0/vmstor  relatime              off                    default
pool0/vmstor  redundant_metadata    all                    default
pool0/vmstor  overlay               off                    default
 
Hi,

You could disable atime with atime=off and maybe set compression=off, then check whether the load is lower. Other ideas:
- check whether your kernel has AES support enabled (if not, I guess that is the root cause of the high load)
- check your LUKS on-disk block size; it must match the ZFS pool's ashift (ashift=12 for 4K sectors, ashift=9 for 512 B sectors)
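
For example, you could check both like this (just a sketch; the mapper names are the ones from your zpool status, and cryptsetup only prints the sector size on newer versions):

Code:
# is AES-NI present and loaded?
grep -m1 -o aes /proc/cpuinfo
lsmod | grep aesni
cryptsetup benchmark | grep aes     # high numbers here usually mean AES-NI is in use

# cipher and sector size of the LUKS mappings
cryptsetup status sda5_crypt
cryptsetup status sdb5_crypt

# ashift the pool was created with
zdb -C pool0 | grep ashift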
 
Hi,

You could disable atime with atime=off and maybe set compression=off, then check whether the load is lower. Other ideas:
- check whether your kernel has AES support enabled (if not, I guess that is the root cause of the high load)
- check your LUKS on-disk block size; it must match the ZFS pool's ashift (ashift=12 for 4K sectors, ashift=9 for 512 B sectors)

Thank you for your suggestion. I just tried turning compression and atime off on the pool; unfortunately, this did not improve performance. My processor does support AES-NI and has it enabled.

I've read that I should be using ashift=12 even if my disks use 512-byte sectors rather than 4096. I'm going to try ashift=9 to see if there is any difference, although from what I've read that could cause issues if a replacement disk turns out to use 4096-byte sectors.

Code:
root@zfs-test ~ # blockdev --getbsz /dev/sda5
512
root@zfs-test ~ # blockdev --getbsz /dev/sdb5
512
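
To double-check the real sector sizes (as far as I understand, blockdev --getbsz reports the kernel block size rather than the drive's sector size), something like this should show the logical and physical values:

Code:
# logical sector size, then physical sector size
blockdev --getss --getpbsz /dev/sda
blockdev --getss --getpbsz /dev/sdb
cat /sys/block/sda/queue/logical_block_size /sys/block/sda/queue/physical_block_size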
 
Thank you for your suggestion. I just tried turning compression and atime off on the pool; unfortunately, this did not improve performance. My processor does support AES-NI and has it enabled.

I've read that I should be using ashift=12 even if my disks use 512-byte sectors rather than 4096. I'm going to try ashift=9 to see if there is any difference, although from what I've read that could cause issues if a replacement disk turns out to use 4096-byte sectors.

Code:
root@zfs-test ~ # blockdev --getbsz /dev/sda5
512
root@zfs-test ~ # blockdev --getbsz /dev/sdb5
512

It doesn't seem like ashift 9 helps at all. I'm at a loss for ideas.
 
So it seems that only dd from /dev/urandom kills the server; /dev/zero doesn't make the load shoot up nearly as much. The question is: why do writes of random data hurt my zpool so much?
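
For reference, these are the two variants I'm comparing (same flags as the original test):

Code:
dd if=/dev/zero    of=/root/test bs=4096 status=progress    # load stays reasonable
dd if=/dev/urandom of=/root/test bs=4096 status=progress    # load climbs well past 20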
 
Maybe you do not have enough "randomness"; in other words, your system has low entropy? Check it with:

cat /proc/sys/kernel/random/entropy_avail

Try to observe this value while writing to disk with if=/dev/urandom.

It is difficult to give optimum values (except for "the more, the better"), but above 1000 you should be safe, and below 500 you can already expect problems. If your entropy is low, there are still ways to increase it (e.g. haveged).
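
For example, something like this lets you watch it while the dd is running:

watch -n1 cat /proc/sys/kernel/random/entropy_avail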
 
Hi,

Yes, it is better/safer to use ashift=12 with future HDD replacements in mind. I suggest trying to copy, say, an ISO file from a USB HDD to your pool. If the load stays OK while doing that, then the problem is very probably what @Rhinox says.
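
Something like this, for example (the source path is just a placeholder):

Code:
# copy a large pre-existing file to the pool and watch the load while it runs
rsync --progress /mnt/usb/debian.iso /pool0/vmstor/
uptime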
 
Maybe you do not have enough "randomness"; in other words, your system has low entropy? Check it with:

cat /proc/sys/kernel/random/entropy_avail

Try to observe this value while writing to disk with if=/dev/urandom.

It is difficult to give optimum values (except for "the more, the better"), but above 1000 you should be safe, and below 500 you can already expect problems. If your entropy is low, there are still ways to increase it (e.g. haveged).

I have haveged running on both the host and the VM, so entropy is pretty high: currently 3599 on the host and 1841 in the VM. I'm running the dd tests from the VM. Unfortunately, the load still shoots up regardless of the available entropy.
 
I tried setting sync=disabled on the pool, but it didn't really improve performance. On a server with ext4 on LUKS and mdadm RAID 1 I get about 104 MB/s writing from urandom, which is disappointing, since I expected better performance from ZFS. I'm really not sure why ZFS performs so poorly with my setup. Are there any additional details I can provide? Maybe @fabian can offer some input.
 
I've tried these options on my pool; still no luck in decreasing the load, sadly.

Recent additions:
Code:
zfs set sync=disabled pool0
zfs set checksum=off pool0
zfs set atime=off pool0
zfs set redundant_metadata=most pool0
zfs set xattr=sa pool0

ZVOL options at the moment:
Code:
root@zfs-test ~ # zfs get all pool0/vmstor/vm-100-disk-1
NAME                        PROPERTY              VALUE                  SOURCE
pool0/vmstor/vm-100-disk-1  type                  volume                 -
pool0/vmstor/vm-100-disk-1  creation              Sat Aug 18  6:16 2018  -
pool0/vmstor/vm-100-disk-1  used                  215G                   -
pool0/vmstor/vm-100-disk-1  available             2.35T                  -
pool0/vmstor/vm-100-disk-1  referenced            215G                   -
pool0/vmstor/vm-100-disk-1  compressratio         1.00x                  -
pool0/vmstor/vm-100-disk-1  reservation           none                   default
pool0/vmstor/vm-100-disk-1  volsize               256G                   local
pool0/vmstor/vm-100-disk-1  volblocksize          4K                     -
pool0/vmstor/vm-100-disk-1  checksum              off                    inherited from pool0
pool0/vmstor/vm-100-disk-1  compression           lz4                    inherited from pool0
pool0/vmstor/vm-100-disk-1  readonly              off                    default
pool0/vmstor/vm-100-disk-1  createtxg             10989                  -
pool0/vmstor/vm-100-disk-1  copies                1                      default
pool0/vmstor/vm-100-disk-1  refreservation        none                   default
pool0/vmstor/vm-100-disk-1  guid                  7292985780615750418    -
pool0/vmstor/vm-100-disk-1  primarycache          all                    default
pool0/vmstor/vm-100-disk-1  secondarycache        all                    default
pool0/vmstor/vm-100-disk-1  usedbysnapshots       0B                     -
pool0/vmstor/vm-100-disk-1  usedbydataset         215G                   -
pool0/vmstor/vm-100-disk-1  usedbychildren        0B                     -
pool0/vmstor/vm-100-disk-1  usedbyrefreservation  0B                     -
pool0/vmstor/vm-100-disk-1  logbias               latency                default
pool0/vmstor/vm-100-disk-1  dedup                 off                    default
pool0/vmstor/vm-100-disk-1  mlslabel              none                   default
pool0/vmstor/vm-100-disk-1  sync                  disabled               inherited from pool0
pool0/vmstor/vm-100-disk-1  refcompressratio      1.00x                  -
pool0/vmstor/vm-100-disk-1  written               215G                   -
pool0/vmstor/vm-100-disk-1  logicalused           213G                   -
pool0/vmstor/vm-100-disk-1  logicalreferenced     213G                   -
pool0/vmstor/vm-100-disk-1  volmode               default                default
pool0/vmstor/vm-100-disk-1  snapshot_limit        none                   default
pool0/vmstor/vm-100-disk-1  snapshot_count        none                   default
pool0/vmstor/vm-100-disk-1  snapdev               hidden                 default
pool0/vmstor/vm-100-disk-1  context               none                   default
pool0/vmstor/vm-100-disk-1  fscontext             none                   default
pool0/vmstor/vm-100-disk-1  defcontext            none                   default
pool0/vmstor/vm-100-disk-1  rootcontext           none                   default
pool0/vmstor/vm-100-disk-1  redundant_metadata    most                   inherited from pool0
 
Code:
root@zfs-test ~ # blockdev --getbsz /dev/sda5
512
root@zfs-test ~ # blockdev --getbsz /dev/sdb5
512

Because your HDDs have a 512-byte block size, your LUKS devices must use the same. With future HDD replacements in mind, your LUKS devices could instead use 4K sectors, with ZFS at ashift=12.
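
If you ever rebuild the pool, roughly like this (only a sketch; --sector-size needs LUKS2 / cryptsetup 2.x, and reformatting destroys the existing data):

Code:
cryptsetup luksFormat --type luks2 --sector-size 4096 /dev/sda5
cryptsetup luksFormat --type luks2 --sector-size 4096 /dev/sdb5
cryptsetup open /dev/sda5 sda5_crypt
cryptsetup open /dev/sdb5 sdb5_crypt

# matching 4K alignment on the ZFS side
zpool create -o ashift=12 pool0 mirror /dev/mapper/sda5_crypt /dev/mapper/sdb5_crypt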
 
Because your HDDs have a 512-byte block size, your LUKS devices must use the same. With future HDD replacements in mind, your LUKS devices could instead use 4K sectors, with ZFS at ashift=12.

It looks like the block size is the same for my LUKS devices.

Code:
root@zfs-test ~ # blockdev --getbsz /dev/mapper/sda5_crypt
512
root@zfs-test ~ # blockdev --getbsz /dev/mapper/sdb5_crypt
512
 
So it seems that only dd from /dev/urandom kills the server; /dev/zero doesn't make the load shoot up nearly as much. The question is: why do writes of random data hurt my zpool so much?
Hi,
Writing zeros to ZFS does not really generate write I/O, so that isn't a write test! Random data from /dev/urandom does produce real writes on ZFS, so that is a write test.
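
You can see the difference directly, for example (file names are only examples, and this assumes compression is enabled on the dataset):

Code:
dd if=/dev/zero    of=/pool0/vmstor/zeros.bin  bs=1M count=1024
dd if=/dev/urandom of=/pool0/vmstor/random.bin bs=1M count=1024
du -h /pool0/vmstor/zeros.bin /pool0/vmstor/random.bin   # the zeros occupy almost no space on disk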

Udo
 
Hi,
Writing zeros to ZFS does not really generate write I/O, so that isn't a write test! Random data from /dev/urandom does produce real writes on ZFS, so that is a write test.

Udo

I'm sorry, but I'm confused: writing zeros is still writing, it's just not a real-world scenario. Going back to the issue: if I write from urandom on an ext4 filesystem I have no problems and get a consistent 104 MB/s. With ZFS on the same hardware, writes crawl to a stop, dropping below 20 MB/s. Both filesystems also sit on LUKS devices with the same cipher/hash.
 
If you write zeros with compression on, the actual writes to disk are drastically reduced.
Have you checked whether it is the reading from /dev/urandom that is slow? It is very slow on my host. Now that you have the /root/test file, try copying it to see if the load is different when not reading from /dev/urandom. If you have two disks, put the copy target on the other one so you are not reading from and writing to the same disk. In the /dev/urandom case, only writes hit the disk.
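
For example (a sketch; the copy-target path is just a placeholder):

Code:
# how fast can /dev/urandom itself be read? (~1 GiB)
dd if=/dev/urandom of=/dev/null bs=4096 count=262144 status=progress

# re-write the existing test data without touching /dev/urandom
dd if=/root/test of=/mnt/otherdisk/test.copy bs=4096 status=progress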
 
I found this issue posted for ZoL

Is anyone else having issues with high load when doing writes?

I had issues on a ZFS stripe set. Write performance was poor, and I had massive UI latency during writes. After switching to EXT4 things went back to normal.
 
I noticed the same issue with setups of up to 2 striped vdevs (e.g. 1x RAIDZ, 1 mirrored pair, 1 striped pair of mirrors with 4 disks).
I've tried almost every ZFS option.

These tunings had the most noticeable effect for me:
1) Increase the ARC size (at least 32GB, with zfs_arc_max == zfs_arc_min)
2) Increase the number of vdevs to at least 3 (3 mirrored pairs in my case; you have only 1 vdev)
3) Add an L2ARC
4) Set the ashift value in line with the physical disk block size (at pool creation)
5) Set options zfs zfs_txg_timeout=15
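
For 1) and 5), e.g. via a modprobe config (values only as an illustration; adjust them to your RAM and reload the zfs module or reboot afterwards):

Code:
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_min=17179869184    # 16 GiB
options zfs zfs_arc_max=17179869184    # 16 GiB; min == max pins the ARC size
options zfs zfs_txg_timeout=15
# on Proxmox, run update-initramfs -u afterwards so the options also apply at boot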
 
I noticed the same issue with setups of up to 2 striped vdevs (e.g. 1x RAIDZ, 1 mirrored pair, 1 striped pair of mirrors with 4 disks).
I've tried almost every ZFS option.

These tunings had the most noticeable effect for me:
1) Increase the ARC size (at least 32GB, with zfs_arc_max == zfs_arc_min)
2) Increase the number of vdevs to at least 3 (3 mirrored pairs in my case; you have only 1 vdev)
3) Add an L2ARC
4) Set the ashift value in line with the physical disk block size (at pool creation)
5) Set options zfs zfs_txg_timeout=15

1) I only have 32GB of RAM, sadly, so I cannot increase the ARC to that size.
2) I cannot add more disks to this system; I currently have 2x HDDs in RAID 1.
3) I have an L2ARC, but it does not help much with writes, which are my problem.
5) I will attempt this one.
 
I've tried both ZVOLs and raw disk images on a ZFS dataset, and both show the same issue. Any intensive write to the VM's disk makes the load shoot up, reaching 32 if I let the test keep running. I've never seen load that high on ext4 systems just from writes. I've tried almost every 'zfs set' tuning option I could find, and none of them has helped. I really want to deploy ZFS in production, but I might have to give up, since writes are terrible on my system.
 
