Huge IO delay with ZFS

The SLOG is now removed; here is a benchmark of the disk that was used as the ZIL (SLOG) device.

Code:
fio --size=20G --bs=4k --rw=write --direct=1 --sync=1 --runtime=60 --group_reporting --name=test --ramp_time=5s --filename=/dev/sdb
test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=5765KiB/s][w=1441 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=431: Fri Apr 10 11:06:15 2020
write: IOPS=1612, BW=6451KiB/s (6605kB/s)(378MiB/60001msec); 0 zone resets
clat (nsec): min=0, max=673397k, avg=618947.49, stdev=2888110.17
lat (nsec): min=0, max=673397k, avg=619112.57, stdev=2888109.38
clat percentiles (usec):
| 1.00th=[ 318], 5.00th=[ 326], 10.00th=[ 330], 20.00th=[ 334],
| 30.00th=[ 334], 40.00th=[ 338], 50.00th=[ 343], 60.00th=[ 347],
| 70.00th=[ 351], 80.00th=[ 375], 90.00th=[ 701], 95.00th=[ 807],
| 99.00th=[17433], 99.50th=[17957], 99.90th=[18482], 99.95th=[18482],
| 99.99th=[32900]
bw ( KiB/s): min= 0, max=10200, per=100.00%, avg=6513.69, stdev=1586.18, samples=118
iops : min= 0, max= 2550, avg=1628.42, stdev=396.54, samples=118
lat (usec) : 500=85.32%, 750=7.70%, 1000=4.90%
lat (msec) : 2=0.70%, 4=0.12%, 10=0.11%, 20=1.12%, 50=0.03%
lat (msec) : 750=0.01%
cpu : usr=0.47%, sys=1.66%, ctx=288446, majf=0, minf=21
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,96760,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=6451KiB/s (6605kB/s), 6451KiB/s-6451KiB/s (6605kB/s-6605kB/s), io=378MiB (396MB), run=60001-60001msec

Disk stats (read/write):
sdb: ios=105/215634, merge=0/0, ticks=43/64064, in_queue=19760, util=74.42%

I think I have a problem with high IO delay when reading from the pool.
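For comparison, a similar fio run could be pointed at a file on the pool itself instead of the raw SLOG device, to see how much latency reads from the pool add (the file path below is just an example, and the ARC will of course cache part of it):

Code:
fio --size=20G --bs=4k --rw=randread --runtime=60 --group_reporting --name=poolread --ramp_time=5s --filename=/datastore1/fio-read-test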
 
Maybe the problem is with the ARC; I noticed that the ARC doesn't use all the memory allocated to it (there is a note on the ARC limits after the output below).

Code:
root@pve0:~# uptime
 18:53:50 up 8 days,  7:20,  4 users,  load average: 23.75, 22.35, 20.31
root@pve0:~# arc_summary
------------------------------------------------------------------------
ZFS Subsystem Report                            Sun Apr 12 18:46:05 2020
Linux 5.3.18-3-pve                                            0.8.3-pve1
Machine: pve0 (x86_64)                                        0.8.3-pve1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    16.8 %   16.1 GiB
        Target size (adaptive):                        16.8 %   16.1 GiB
        Min size (hard limit):                          4.1 %    3.9 GiB
        Max size (high water):                           24:1   96.0 GiB
        Most Frequently Used (MFU) cache size:         79.9 %    4.1 GiB
        Most Recently Used (MRU) cache size:           20.1 %    1.0 GiB
        Metadata cache size (hard limit):              25.0 %   24.0 GiB
        Metadata cache size (current):                 47.8 %   11.5 GiB
        Dnode cache size (hard limit):                 10.0 %    2.4 GiB
        Dnode cache size (current):                    20.9 %  513.4 MiB

ARC hash breakdown:
        Elements max:                                             109.3M
        Elements current:                              99.4 %     108.6M
        Collisions:                                                 1.4G
        Chain max:                                                    26
        Chains:                                                    16.6M

ARC misc:
        Deleted:                                                  769.3M
        Mutex misses:                                               1.4M
        Eviction skips:                                           872.7M

ARC total accesses (hits + misses):                                 3.7G
        Cache hit ratio:                               77.7 %       2.9G
        Cache miss ratio:                              22.3 %     824.4M
        Actual hit ratio (MFU + MRU hits):             76.1 %       2.8G
        Data demand efficiency:                        69.2 %     992.7M
        Data prefetch efficiency:                      19.9 %     627.2M

Cache hits by cache type:
        Most frequently used (MFU):                    85.0 %       2.4G
        Most recently used (MRU):                      12.8 %     368.4M
        Most frequently used (MFU) ghost:               0.3 %       8.4M
        Most recently used (MRU) ghost:                 1.7 %      50.3M
        Anonymously used:                               0.1 %       3.1M

Cache hits by data type:
        Demand data:                                   23.9 %     687.4M
        Demand prefetch data:                           4.3 %     124.7M
        Demand metadata:                               71.3 %       2.1G
        Demand prefetch metadata:                       0.4 %      12.9M

Cache misses by data type:
        Demand data:                                   37.0 %     305.3M
        Demand prefetch data:                          60.9 %     502.4M
        Demand metadata:                                0.8 %       6.3M
        Demand prefetch metadata:                       1.3 %      10.3M

DMU prefetch efficiency:                                            2.2G
        Hit ratio:                                      3.4 %      76.5M
        Miss ratio:                                    96.6 %       2.2G

L2ARC status:                                                    HEALTHY
        Low memory aborts:                                          1.0k
        Free on write:                                            229.7k
        R/W clashes:                                                  14
        Bad checksums:                                                 0
        I/O errors:                                                    0

L2ARC size (adaptive):                                         875.5 GiB
        Compressed:                                    96.4 %  844.2 GiB
        Header size:                                    1.1 %    9.6 GiB

L2ARC breakdown:                                                  824.4M
        Hit ratio:                                     25.4 %     209.8M
        Miss ratio:                                    74.6 %     614.6M
        Feeds:                                                    861.6k

L2ARC writes:
        Writes sent:                                    100 %  821.3 KiB

L2ARC evicts:
        Lock retries:                                               1.4k
        Upon reading:                                                 12

ZIL committed transactions:                                       126.3M
        Commit requests:                                            3.6M
        Flushes to stable storage:                                  3.6M
        Transactions to SLOG storage pool:          457.8 GiB       4.7M
        Transactions to non-SLOG storage pool:      352.4 GiB       4.5M
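
Regarding the ARC not growing: the current limits can be checked (and, if needed, pinned) through the module parameters. A rough sketch; the 64 GiB value is only an illustration, not a recommendation:

Code:
# current limits in bytes (0 = built-in default)
cat /sys/module/zfs/parameters/zfs_arc_max
cat /sys/module/zfs/parameters/zfs_arc_min
# change at runtime, example: 64 GiB
echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max
# make it persistent across reboots
echo "options zfs zfs_arc_max=68719476736" >> /etc/modprobe.d/zfs.conf
update-initramfs -u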
 
Were you able to solve this? I'm asking because I just ordered the same board for my server and plan to run Proxmox with TrueNAS on it, which will probably result in a similar problem to yours.
 
I have one ZFS pool:
datastore1: 6x SATA HDD Toshiba HDWD130 - raidz1 (ashift=12, compression=lz4, atime=off)
+ logs: mSATA SSD 256 GB
+ cache: NVMe SSD 1024 GB
+ spare: 1x SATA HDD Toshiba HDWD130
running 11 LXC containers and 4 KVM VMs

Hi,

Your load is too intensive for the storage you have (raidz IOPS are about the same as a single disk's).
(For example, backing up a large number of small files (30-150 KB) over sftp, about 48,000 files copied in 1.5 hours, or starting a MySQL database.) Then iodelay increases very quickly and the rest of the containers and VMs slow down a lot. IO delay ~20%-35%.

Yes, because you will need to read a lot of metadata from the pool/disks (a lot of seeks => high latency, especially for small files). The same goes for any database (a lot of syncs + seeks).
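
One way to see where the time actually goes during such a backup is to watch the per-vdev latency while it runs, e.g. at a 5-second interval:

Code:
zpool iostat -v -l datastore1 5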

I tried to make fine adjustments to ZFS and MySQL and it gave results

What did you try to do for MySQL?

=================================


The SLOG is not very useful for your load:

Most frequently used (MFU) ghost: 0.3 % 1.9M
Most recently used (MRU) ghost: 1.7 % 10.1M


Your L2ARC is huge .....
L2ARC size (adaptive): 590.0 GiB

... and as a result you use a lot of RAM => so you have this:


L2ARC status: HEALTHY
Low memory aborts: 192


L2ARC breakdown: 129.5M
Hit ratio: 10.7 % 13.9M
Miss ratio: 89.3 % 115.6M


.... so your L2ARC hit ratio is very low => L2ARC cannot help in your case!



Cache hits by data type:
Demand data: 25.2 % 146.7M
Demand prefetch data: 4.7 % 27.6M
Demand metadata: 69.8 % 405.8M
Demand prefetch metadata: 0.2 % 1.3M

... disable prefetch; it is not worth it for only ~30M prefetch hits -> you will lower your disk latency!
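
Disabling prefetch is done via the zfs_prefetch_disable module parameter; a minimal sketch (runtime change plus making it persistent):

Code:
# disable ZFS prefetch at runtime
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
# keep it disabled after reboot
echo "options zfs zfs_prefetch_disable=1" >> /etc/modprobe.d/zfs.conf
update-initramfs -u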

What can you do to improve your setup:

A. Buy or add one new HDD (8 HDDs in total, without a spare disk)
- create a new zpool layout like RAID10 (a stripe of 4 mirrors) => IOPS will be much better ...

B. Destroy your pool and create a new RAID50-like one (a stripe of 2x raidz); IOPS will be 2x what you have now.

C. Maybe the best option would be to add a second NVMe SSD and use it as a ZFS special device for any block < 4K (attention here: if you lose both of them at the same time, you also lose the ZFS pool). Example commands after this list.

D. For your VM/CT with many small files, change the ZFS property (on the dataset/zvol) from primarycache=all to primarycache=metadata.
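
For C and D the commands would look roughly like this (the NVMe device names and the dataset name are placeholders; the special vdev should be a mirror, since losing it loses the pool):

Code:
# C: add a mirrored special vdev and send small blocks to it
zpool add datastore1 special mirror /dev/nvme0n1 /dev/nvme1n1
zfs set special_small_blocks=4K datastore1
# D: cache only metadata in ARC for the dataset holding the many small files
zfs set primarycache=metadata datastore1/backup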




Good luck/ Bafta!
 
@FlohEinstein
Don't worry, I've got an H12 and it's fine. If you have high IO demands, just use NVMe.

Edit:
On my previous hardware the CPU could not keep up and caused long IO delays ...
But I would expect a 32-core Epyc to be fast enough.
Raidz causes a lot of writes to the disks; this is amplified if you move data from one pool to another.
I hope you do not use ZFS native encryption, which slows things down even more. LUKS is much faster ...
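
A quick way to check whether any dataset actually uses native ZFS encryption (pool name taken from earlier in the thread):

Code:
zfs get -r -t filesystem,volume encryption datastore1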
 
