ZFS high io. Again...

Kot Degt · Jun 21, 2019

Hi people!
It is my second time to try use zfs. I spend a lot of time by digging google and this forum, and results is not so bad, but high io is bother me.
I try to be short. It is my testing configuration, but goal it's study.

I have 80G ECC (ARC 32G)
2xSATA (AF) 1Tb WD
nvme 120G (zil)
sata ssd 512G (cache)
.... and 500G HD for proxmox

Proxmox 5.4-3 installed by proxmox.iso (not debian).

zpool create -o ashift=12 rpool mirror /dev/sdb /dev/sdc
zpool add rpool log /dev/nvme0n1
zpool add rpool cache /dev/sdd
zfs create -V 50G -o compression=lz4 -o volblocksize=64k rpool/64k
dedup is off, checksum is on

and fio by simple sequental test:
[writetest]
blocksize=64k
filename=/dev/zvol/rpool/vm-100-disk-1
rw=write
direct=1
buffered=0
ioengine=libaio
iodepth=1

fio-2.16
Starting 1 process
Jobs: 1 (f=1): [f(1)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:00s]
writetest: (groupid=0, jobs=1): err= 0: pid=17204: Fri Jun 21 20:58:05 2019
write: io=102400MB, bw=211308KB/s, iops=3301, runt=496231msec
slat (usec): min=5, max=3670, avg=12.23, stdev=12.54
clat (usec): min=3, max=522139, avg=282.41, stdev=1785.75
lat (usec): min=28, max=522152, avg=296.41, stdev=1785.83
clat percentiles (usec):
| 1.00th=[ 24], 5.00th=[ 195], 10.00th=[ 203], 20.00th=[ 213],
| 30.00th=[ 219], 40.00th=[ 223], 50.00th=[ 231], 60.00th=[ 239],
| 70.00th=[ 255], 80.00th=[ 302], 90.00th=[ 386], 95.00th=[ 402],
| 99.00th=[ 434], 99.50th=[ 454], 99.90th=[ 1896], 99.95th=[15168],
| 99.99th=[49408]
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=2.57%, 100=0.25%
lat (usec) : 250=64.98%, 500=31.91%, 750=0.08%, 1000=0.04%
lat (msec) : 2=0.09%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.04%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
cpu : usr=4.00%, sys=5.41%, ctx=1648427, majf=0, minf=27
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=1638400/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: io=102400MB, aggrb=211308KB/s, minb=211308KB/s, maxb=211308KB/s, mint=496231msec, maxt=496231msec

I tried play with:

options zfs zfs_txg_timeout=30
options zfs zfs_vdev_sync_read_min_active=1
options zfs zfs_vdev.sync_read_max_active=1
options zfs zfs_vdev.sync_write_min_active=1
options zfs zfs_vdev.sync_write_max_active=1
options zfs zfs_vdev.async_read_min_active=1
options zfs zfs_vdev.async_read_max_active=1
options zfs zfs_vdev.async_write_min_active=1
options zfs zfs_vdev.async_write_max_active=1
options zfs zfs_vdev.scrub_min_active=1
options zfs zfs_vdev.scrub_max_active=1

and
to options zfs zfs_vdev_max_active=1

and
disable NCQ
echo 1 > /sys/block/sdc/device/queue_depth
echo 1 > /sys/block/sdb/device/queue_depth

The io wait jumped to 8%
It is normal for zfs? Why simple seq test generate so high iow?

Thanx!

Nemesiz · Jun 21, 2019

I had ZFS mirror pool of 2 disks and another pool raidz from 3 disks. Both had IO problems. After I upgraded to 2 x raidz2 of 6 disks ( 12 total ) IO problems gone.

Kot Degt · Jun 24, 2019

Thank you very much for the answer!

Look like io problem it's inevitability for simple or low cost equipment (like SATA). Did you have tangible problems like freezes guest or host vm's, when your zfs io has been high?
The equipment in cases with 12 and 2,3 drives is same (SAS & SAS)? What the mechanism of this high IO case (what it wait)?

guletz · Jun 24, 2019

Hi,

As I see you try to make a test that use sequential IO Direct write(direct=1). But what will be the user test case for this? In the majority cases IO direct is used by DataBases case. But the DBs do not use sequential IO Direct most of the time! Even for others FS sequential IO Direct write will be bad.

Kot Degt said:
what it wait

When the system wait that some IO op to be completed, so then it will be possible to send more data!

Kot Degt · Jun 24, 2019

In guest windows, during copying large file (xx Gigabytes) from ssd to zvol, system remain freeze (by fits and starts). Long of freeze depends on size of file and intensity of copying(f.e. i copy one big file repeatedly).
In real life during backup MSSQL to bak file (20GB), all guests and host(proxmox) machines begin to slow down.

Copying of big file is sequential loading of IO system, cause i chosed it. And sequential is sparing for my SATA drives. Why random direct write is slow for HDD i can understand, but big io while sequential copying by big blocks (64k) - i can't.

guletz · Jun 24, 2019

Kot Degt said:
but big io while sequential copying by big blocks (64k) - i can't.

Because your HDDs can not write/cache so much data. Also if your zfs cache is LOW for your own load, then when the date is stored in the ARC/zfs-cache it can not be efficiently agregated in sequential write for the most of the data. Also if you have some VM/CT this will also consume from your HDDs IOPs/bandwith.

Nemesiz · Jun 24, 2019

ZFS pool write works by the slowest disk. So in mirror ZFS write speed will be like single disk ( mirror pool can be 2 disks or 10 disks, its not matter ). In your situation ( like I was too ) write speed jumps happens because of ZFS cache in RAM. It can load write data to RAM from application but then pool write is at max ZFS stops accepting request while cache is fool and flush data is in progress.

Kot Degt · Jun 24, 2019

Thank you for the study me!

I just have deleted zpool and created mdadm + lvm. If i do same action, copy big file from ssd to mirror, after exhaustion of windows cache (in proxmox i chosed no cache) speed is ~ 60 MB/s and io not increase more than 1,5%
On zfs write speed around 100MB/s in my case and io 8% (with nvme ZIL)
Guest - same, host cache - same... It is ZIL or RAM(ARC) to Mirror overload?

All the time what i test ZFS it seems that it, in complex (drives + zil), work more quickly than it can. Is there any way to reduce greed of zvol force to be affordable?

Nemesiz · Jun 24, 2019

ZFS ARC cache is for read only. How to manage write cache I still don`t know.

You can try to limit IO and speed for VM.

guletz · Jun 24, 2019

Nemesiz said:
ZFS ARC cache is for read only. How to manage write cache I still don`t know.

ARC is used for both read and write. Until the flush is start to write to the disks(at 5 sec by default), the data that is present in cache can be re-write.

Nemesiz · Jun 24, 2019

Are you sure it is for read and write cache?

Code:

ARC Size:                               99.97%  12.00   GiB
        Target Size: (Adaptive)         100.00% 12.00   GiB
        Min Size (Hard Limit):          100.00% 12.00   GiB
        Max Size (High Water):          1:1     12.00   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       93.39%  10.84   GiB
        Frequently Used Cache Size:     6.61%   785.40  MiB

I can`t find where is size of write data cache.

guletz · Jun 24, 2019

Kot Degt said:
Thank you for the study me!

I just have deleted zpool and created mdadm + lvm. If i do same action, copy big file from ssd to mirror, after exhaustion of windows cache (in proxmox i chosed no cache) speed is ~ 60 MB/s and io not increase more than 1,5%
On zfs write speed around 100MB/s in my case and io 8% (with nvme ZIL)
Guest - same, host cache - same... It is ZIL or RAM(ARC) to Mirror overload?

All the time what i test ZFS it seems that it, in complex (drives + zil), work more quickly than it can. Is there any way to reduce greed of zvol force to be affordable?

- be sure that for any VM do you have use cache=none
- disable zil (logbias=throughput)

guletz · Jun 24, 2019

Nemesiz said:
I can`t find where is size of write data cache.

ARC can be write or read. Without the possibility to write the ARC, then ARC will be empty all the time!

Kot Degt · Jun 25, 2019

Hi everyone.
cache = none - set everywhere in my case. logbias=throughput - I will try, but later. Now i disable zil by remove it from pool, now my zil on hdd)
So...
I have removed zpool, create mdadm + lvm and did a few sequential tests (sequential cause vm freeze on sequential loading) and than i create zfs pool withOUT zil, and test it too.
Testing was be in identical conditions. Same vm from backup. Same all, but on the one hand mdadm + lvm and on the other hand zfs (compress off,dedup off,checksum on, 64k, NO ZIL) by crystal disk mark with default settings.

And sequential write on mdadm is 146MB/s . Ok it is speed slowest HDD.
Same test zfs - is 3884MB/s - what is this? cache=none, no zil (no nmve zil)
May be it windows cache? Why same test on mdadm did't show magic numbers? Linux cache? /dev/zvol/* caching by linux?

And general... When on zfs without zil (may be with zil too, but easy demonstrative case without)
So... when load hard sequential write, vm becomes freezing by several seconds (30-50) several times. RDP on this time are not accessible.
Yes ... VM on mdadm working slowly, but it not freeze.

What the nature of the magical write numbers? ARC linux cache?

May be it's disk write cache of zvol (https://pve.proxmox.com/wiki/Performance_Tweaks)
I have tried cache=directsync. Result it very high io wait (22%) and 10MB seq write
Look like all operatons is synchronous, and i need't it. I need reduce write cache(?) of zfs (zvol)...

Nemesiz · Jun 25, 2019

ZIL - sync write device. ONLY for sync writes.
If you don`t have external ZIL - ZFS will use same pool for ZIL ( double write )

How to disable the use of ZIL? #zfs set sync=disabled pool/name/or/sub

Why the ZFS write speed outrun disks speed? As I told before write cache. But if you try to track the disks activity you can see ZFS continue flushing data to disks after write test is done.

guletz · Jun 25, 2019

Hi @Kot Degt

You can try to favor the read operation on disk with:

Code:

echo deadline > /sys/block/sdX/queue/scheduler

and for zfs part(need reboot to activate it)

options zfs zfs_vdev_scheduler=deadline

Nemesiz · Jun 25, 2019

Why noop is bad ?

Kot Degt · Jun 25, 2019

With ZIL, without zil, with doublewrite... In any case my zvolume show fantastic high write speed, which it cannot provide. In result high io wait, and freeze of vm.
Situation:
ZIL is on, inside pool (double write)
The cache=none therefore - Host page cache = disabled and disk write cache = enabled
The cache, that you told, it is "disk write cache" of zvol? Or it is ARC? You told about 5 seconds re-write by ARC, i chose 3 minutes interval, and write speed stay be fantasic high.
If it is "disk write cache" of zvol - how to tune it?

I have tried deadline - same. May be little worse.

Nemesiz · Jun 25, 2019

You are very confused. I have posted information in this forum about how ZFS works.

Lets do it again.

Data read:
ZFS looks at ARC cache ( for metadata and data ) -> then looks at L2ARC if possible -> then to the pool.

You can configure ARC size and set the size for metadata and data in it.
Each file system/volume have primary and secondary cache setup.

Data write:
There are 2 types of data writes. Sync and not sync writes.

IF:
1) sync=always -> all data writes are handle as sync writes
2) sync=standard -> sync is sync and not sync is not sync writes.
3) sync=disabled -> all data writes are handle as not sync writes.

So if the program sends sync writes to ZFS it will write to ZIL ( external if attached or "local" to pool ) and to the write cache. ZFS reports to the program about success and flush that data later from write cache.

If ZIL is disabled ( sync=disabled ) ZFS reports "success" to the program with no data written to pool and flush data to pool later. This way works not sync writes too.

Standard files systems writes data to disks immediately ( direct ) or to the OS cache. But it is small cache so it does not accept a lot of information and don`t makes long delay.

The ZFS use flush method ( default every 5 sec ) and cache the data in bigger cache to make good/balanced data write to pool ( don`t forget its COW ).

I use the same ZFS setup now and before with old pools. I can tell you - the pool with less disks have more problem with delay.
Due to ZFS flush it cannot be managed to control like others FS can be.

In your place I will try to set the limits for VM from kvm side.

Kot Degt · Jun 26, 2019

Ok. Thank you very much!
I will read more, i will test more.
And in any case i have to set limits from kvm.

5s - it is zfs_txg_timeout?
Should i change it adjusted for SATA drive?
For previos versions zfs recommends to reduce vfs.zfs.vdev.max_pending for sata, and disable NCQ.
What recommends now for SATA? What you can advice to try?
Is iothread=1 advisable for zvol volume from kvm side?

ZFS high io. Again...

Active Member

Attachments

Renowned Member

Active Member

Famous Member

Active Member

Famous Member

Renowned Member

Active Member

Renowned Member

Famous Member

Renowned Member

Famous Member

Famous Member

Active Member

Renowned Member

Famous Member

Renowned Member

Active Member

Renowned Member

Active Member