huge IO delay ZFS

The ZIL (SLOG) is now disabled; here is a benchmark of the disk that was used as the ZIL device.

fio --size=20G --bs=4k --rw=write --direct=1 --sync=1 --runtime=60 --group_reporting --name=test --ramp_time=5s --filename=/dev/sdb
test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=5765KiB/s][w=1441 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=431: Fri Apr 10 11:06:15 2020
write: IOPS=1612, BW=6451KiB/s (6605kB/s)(378MiB/60001msec); 0 zone resets
clat (nsec): min=0, max=673397k, avg=618947.49, stdev=2888110.17
lat (nsec): min=0, max=673397k, avg=619112.57, stdev=2888109.38
clat percentiles (usec):
| 1.00th=[ 318], 5.00th=[ 326], 10.00th=[ 330], 20.00th=[ 334],
| 30.00th=[ 334], 40.00th=[ 338], 50.00th=[ 343], 60.00th=[ 347],
| 70.00th=[ 351], 80.00th=[ 375], 90.00th=[ 701], 95.00th=[ 807],
| 99.00th=[17433], 99.50th=[17957], 99.90th=[18482], 99.95th=[18482],
| 99.99th=[32900]
bw ( KiB/s): min= 0, max=10200, per=100.00%, avg=6513.69, stdev=1586.18, samples=118
iops : min= 0, max= 2550, avg=1628.42, stdev=396.54, samples=118
lat (usec) : 500=85.32%, 750=7.70%, 1000=4.90%
lat (msec) : 2=0.70%, 4=0.12%, 10=0.11%, 20=1.12%, 50=0.03%
lat (msec) : 750=0.01%
cpu : usr=0.47%, sys=1.66%, ctx=288446, majf=0, minf=21
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,96760,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=6451KiB/s (6605kB/s), 6451KiB/s-6451KiB/s (6605kB/s-6605kB/s), io=378MiB (396MB), run=60001-60001msec

Disk stats (read/write):
sdb: ios=105/215634, merge=0/0, ticks=43/64064, in_queue=19760, util=74.42%
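To see how much of that latency comes from the sync path specifically, the same job can be rerun without `--sync=1` for comparison (a sketch based on the command above; note it writes raw to /dev/sdb and destroys its contents):

```shell
# Same 4k sequential write job as above, but without O_SYNC, so each
# write does not wait for a cache flush on the device.
# WARNING: this writes directly to /dev/sdb and destroys its contents.
fio --name=nosync --filename=/dev/sdb \
    --size=20G --bs=4k --rw=write --direct=1 --sync=0 \
    --runtime=60 --ramp_time=5s --group_reporting
```

If the non-sync numbers are much higher, the bottleneck is flush latency on the SLOG device rather than its raw throughput.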

I think I have a problem with high IO delay when reading from the pool.
 
Maybe it's a problem with the ARC; I noticed that the ARC doesn't use all the memory it has been given.

Code:
root@pve0:~# uptime
 18:53:50 up 8 days,  7:20,  4 users,  load average: 23.75, 22.35, 20.31
root@pve0:~# arc_summary
------------------------------------------------------------------------
ZFS Subsystem Report                            Sun Apr 12 18:46:05 2020
Linux 5.3.18-3-pve                                            0.8.3-pve1
Machine: pve0 (x86_64)                                        0.8.3-pve1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    16.8 %   16.1 GiB
        Target size (adaptive):                        16.8 %   16.1 GiB
        Min size (hard limit):                          4.1 %    3.9 GiB
        Max size (high water):                           24:1   96.0 GiB
        Most Frequently Used (MFU) cache size:         79.9 %    4.1 GiB
        Most Recently Used (MRU) cache size:           20.1 %    1.0 GiB
        Metadata cache size (hard limit):              25.0 %   24.0 GiB
        Metadata cache size (current):                 47.8 %   11.5 GiB
        Dnode cache size (hard limit):                 10.0 %    2.4 GiB
        Dnode cache size (current):                    20.9 %  513.4 MiB

ARC hash breakdown:
        Elements max:                                             109.3M
        Elements current:                              99.4 %     108.6M
        Collisions:                                                 1.4G
        Chain max:                                                    26
        Chains:                                                    16.6M

ARC misc:
        Deleted:                                                  769.3M
        Mutex misses:                                               1.4M
        Eviction skips:                                           872.7M

ARC total accesses (hits + misses):                                 3.7G
        Cache hit ratio:                               77.7 %       2.9G
        Cache miss ratio:                              22.3 %     824.4M
        Actual hit ratio (MFU + MRU hits):             76.1 %       2.8G
        Data demand efficiency:                        69.2 %     992.7M
        Data prefetch efficiency:                      19.9 %     627.2M

Cache hits by cache type:
        Most frequently used (MFU):                    85.0 %       2.4G
        Most recently used (MRU):                      12.8 %     368.4M
        Most frequently used (MFU) ghost:               0.3 %       8.4M
        Most recently used (MRU) ghost:                 1.7 %      50.3M
        Anonymously used:                               0.1 %       3.1M

Cache hits by data type:
        Demand data:                                   23.9 %     687.4M
        Demand prefetch data:                           4.3 %     124.7M
        Demand metadata:                               71.3 %       2.1G
        Demand prefetch metadata:                       0.4 %      12.9M

Cache misses by data type:
        Demand data:                                   37.0 %     305.3M
        Demand prefetch data:                          60.9 %     502.4M
        Demand metadata:                                0.8 %       6.3M
        Demand prefetch metadata:                       1.3 %      10.3M

DMU prefetch efficiency:                                            2.2G
        Hit ratio:                                      3.4 %      76.5M
        Miss ratio:                                    96.6 %       2.2G

L2ARC status:                                                    HEALTHY
        Low memory aborts:                                          1.0k
        Free on write:                                            229.7k
        R/W clashes:                                                  14
        Bad checksums:                                                 0
        I/O errors:                                                    0

L2ARC size (adaptive):                                         875.5 GiB
        Compressed:                                    96.4 %  844.2 GiB
        Header size:                                    1.1 %    9.6 GiB

L2ARC breakdown:                                                  824.4M
        Hit ratio:                                     25.4 %     209.8M
        Miss ratio:                                    74.6 %     614.6M
        Feeds:                                                    861.6k

L2ARC writes:
        Writes sent:                                    100 %  821.3 KiB

L2ARC evicts:
        Lock retries:                                               1.4k
        Upon reading:                                                 12

ZIL committed transactions:                                       126.3M
        Commit requests:                                            3.6M
        Flushes to stable storage:                                  3.6M
        Transactions to SLOG storage pool:          457.8 GiB       4.7M
        Transactions to non-SLOG storage pool:      352.4 GiB       4.5M
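The ARC limits can be inspected and pinned via the ZFS module parameters (a sketch; the 32 GiB value is just an example, not taken from this system):

```shell
# Current ARC limits in bytes (0 means "auto", roughly half of RAM)
cat /sys/module/zfs/parameters/zfs_arc_max
cat /sys/module/zfs/parameters/zfs_arc_min

# Pin the ARC minimum to e.g. 32 GiB so adaptive shrinking cannot
# push it far below that (32 * 1024^3 = 34359738368 bytes)
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_min

# Make it persistent across reboots
echo "options zfs zfs_arc_min=34359738368" >> /etc/modprobe.d/zfs.conf
```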
 
Were you able to solve this? I'm asking because I just ordered the same board for my server and plan to run Proxmox with TrueNAS on it, which will probably result in a similar problem to yours.
 
I have one ZFS pool:
datastore1: 6x SATA HDD Toshiba HDWD130 - raidz1 (ashift=12, compression=lz4, atime=off)
+ logs: mSATA SSD 256 GB
+ cache: NVMe SSD 1024 GB
+ spare: 1x SATA HDD Toshiba HDWD130
running 11 LXC containers and 4 KVM guests

Hi,

Your load is too intensive for what you have as storage (a raidz vdev has about the same IOPS as a single disk).
(for example, when backing up a large number of small files (30-150 KB) over SFTP, ~48,000 files copied in 1.5 hours, or when starting a MySQL database), then iodelay increases very quickly and the rest of the containers and VMs slow down badly. IO delay ~20%-35%.

Yes, because you need to read a lot of metadata from the pool/disks (many seeks => high latency, especially for small files). The same goes for any database (lots of syncs + seeks).

I tried to fine-tune ZFS and MySQL, and it gave results.

What did you try for MySQL?
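For reference, commonly suggested dataset tuning for InnoDB on ZFS looks like this (a sketch; `tank/mysql` is a hypothetical dataset name, and both settings should be tested against the actual workload):

```shell
# Match the ZFS recordsize to InnoDB's 16 KiB page size to avoid
# read-modify-write amplification on the raidz vdev
zfs set recordsize=16k tank/mysql

# Bias the ZIL toward throughput for large sync streams
# (trades some sync latency; this bypasses the SLOG)
zfs set logbias=throughput tank/mysql
```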

=================================


A SLOG is not very useful for your load:

Most frequently used (MFU) ghost: 0.3 % 1.9M
Most recently used (MRU) ghost: 1.7 % 10.1M


Your L2ARC is huge .....
L2ARC size (adaptive): 590.0 GiB

... and as a result it uses a lot of RAM for headers => so you see this:


L2ARC status: HEALTHY
Low memory aborts: 192


L2ARC breakdown: 129.5M
Hit ratio: 10.7 % 13.9M
Miss ratio: 89.3 % 115.6M


.... so your L2ARC hit ratio is very low => L2ARC cannot help in your case!



Cache hits by data type:
Demand data: 25.2 % 146.7M
Demand prefetch data: 4.7 % 27.6M
Demand metadata: 69.8 % 405.8M
Demand prefetch metadata: 0.2 % 1.3M

... disable prefetch; it is not worth keeping for only ~30M hits -> you will lower your disk latency!
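Disabling prefetch is a module parameter on ZFS on Linux (a sketch):

```shell
# Disable file-level prefetch at runtime
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable

# Persist across reboots
echo "options zfs zfs_prefetch_disable=1" >> /etc/modprobe.d/zfs.conf
```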

What can you do to improve your setup:

A. Buy or add one new HDD (8 HDDs in total, without the spare disk)
- create a new zpool layout, like RAID10 (4 mirrors striped together) => IOPS will be much better ...
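Such a striped-mirror pool would be created roughly like this (a sketch; the pool name `datastore2` and device names `/dev/sd[b-i]` are placeholders, and the options mirror the existing pool's settings):

```shell
# Striped mirrors ("RAID10"): 4 mirror vdevs => roughly 4x the
# random IOPS of a single raidz vdev
zpool create -o ashift=12 datastore2 \
    mirror /dev/sdb /dev/sdc \
    mirror /dev/sdd /dev/sde \
    mirror /dev/sdf /dev/sdg \
    mirror /dev/sdh /dev/sdi
zfs set compression=lz4 datastore2
zfs set atime=off datastore2
```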

B. Destroy your pool and create a new one as "RAID50" (a stripe of 2x raidz1); IOPS will be 2x what you have now.

C. Maybe the best option would be to add a second NVMe SSD and use the pair as a ZFS special device for any block < 4K (attention here, because if you lose both of them at the same time you also lose the zfs pool).
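A mirrored special vdev could be added like this (a sketch; the NVMe device names and the `datastore1/subvol-100-disk-0` dataset are placeholders):

```shell
# Add a mirrored special vdev (holds metadata and, optionally, small
# blocks). It MUST be mirrored: losing the special vdev loses the pool.
zpool add datastore1 special mirror /dev/nvme0n1 /dev/nvme1n1

# Route blocks smaller than 4K of this dataset to the special vdev
zfs set special_small_blocks=4K datastore1/subvol-100-disk-0
```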

D. For your VMs/CTs with many small files, change the ZFS properties (of the dataset/zvol) from:

primarycache=all

to

primarycache=metadata
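That is a per-dataset property (a sketch; `datastore1/subvol-101-disk-0` is a hypothetical dataset name):

```shell
# Cache only metadata in the ARC for this dataset; the guest OS
# inside the VM/CT already caches its own file data
zfs set primarycache=metadata datastore1/subvol-101-disk-0
zfs get primarycache datastore1/subvol-101-disk-0
```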




Good luck/ Bafta!
 
@FlohEinstein
Don't worry, I've got an H12 and it's fine. If you have high IO demands, just use NVMe.

Edit:
On my previous hardware the CPU could not keep up, which caused long IO delays ...
But I would expect a 32-core Epyc to be fast enough.
Raidz causes a lot of writes to the disks; this is amplified if you move data from one pool to another.
I hope you are not using ZFS encryption, which slows things down even more. LUKS is much faster ...
 