ZFS RAID-1 data pool: bad random write performance

freeman1doma

I have a Hetzner AX61 server with:

2x 240 GB SATA SSD for the OS (ZFS RAID-1, UEFI boot)

2x Toshiba NVMe U.2 KXD51RUE3T84, 3.84 TB (for data)

fio test on the data pool:
ZFS RAID-1
zpool with ashift=12, atime=off, compression=off
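For reference, a mirrored pool with these properties could have been created roughly like this (a sketch only, not the exact command used on this server; the by-id paths are the ones shown in zpool status below):

Code:
# Sketch: mirrored NVMe data pool, 4K sectors (ashift=12), atime and compression off
zpool create -o ashift=12 -O atime=off -O compression=off zfsr1nvme mirror \
    /dev/disk/by-id/nvme-KXD51RUE3T84_TOSHIBA_10JS1019T7UM \
    /dev/disk/by-id/nvme-KXD51RUE3T84_TOSHIBA_10JS101AT7UM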

# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME                                                     STATE     READ WRITE CKSUM
        rpool                                                    ONLINE       0     0     0
          mirror-0                                               ONLINE       0     0     0
            ata-SAMSUNG_MZ7WD240HAFV-00003_S16LNYAF402056-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7WD240HAFV-00003_S16LNYAD905297-part3  ONLINE       0     0     0

errors: No known data errors

  pool: zfsr1nvme
 state: ONLINE
  scan: none requested
config:

        NAME                                        STATE     READ WRITE CKSUM
        zfsr1nvme                                   ONLINE       0     0     0
          mirror-0                                  ONLINE       0     0     0
            nvme-KXD51RUE3T84_TOSHIBA_10JS1019T7UM  ONLINE       0     0     0
            nvme-KXD51RUE3T84_TOSHIBA_10JS101AT7UM  ONLINE       0     0     0

errors: No known data errors

# fio --filename=/zfsr1nvme/test-fio.bin --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=30.9MiB/s][w=7910 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=46963: Fri May 29 13:35:43 2020
write: IOPS=10.3k, BW=40.4MiB/s (42.3MB/s)(2440MiB/60453msec); 0 zone resets
slat (nsec): min=341, max=185623, avg=1043.96, stdev=362.64
clat (usec): min=5, max=560, avg=94.64, stdev=49.68
lat (usec): min=6, max=562, avg=95.68, stdev=49.68
clat percentiles (usec):
| 1.00th=[ 8], 5.00th=[ 10], 10.00th=[ 12], 20.00th=[ 35],
| 30.00th=[ 47], 40.00th=[ 112], 50.00th=[ 114], 60.00th=[ 122],
| 70.00th=[ 135], 80.00th=[ 141], 90.00th=[ 143], 95.00th=[ 145],
| 99.00th=[ 149], 99.50th=[ 153], 99.90th=[ 165], 99.95th=[ 176],
| 99.99th=[ 221]
bw ( KiB/s): min=28512, max=153304, per=100.00%, avg=41693.09, stdev=15178.40, samples=119
iops : min= 7128, max=38326, avg=10423.27, stdev=3794.60, samples=119
lat (usec) : 10=5.43%, 20=8.49%, 50=16.89%, 100=2.46%, 250=66.73%
lat (usec) : 500=0.01%, 750=0.01%
cpu : usr=2.13%, sys=1.99%, ctx=624613, majf=1, minf=41
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,624555,0,1 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=40.4MiB/s (42.3MB/s), 40.4MiB/s-40.4MiB/s (42.3MB/s-42.3MB/s), io=2440MiB (2558MB), run=60453-60453msec


# fio --filename=/dev/nvme0n1 --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=62193: Fri May 29 13:57:13 2020
write: IOPS=74.7k, BW=292MiB/s (306MB/s)(17.7GiB/62150msec); 0 zone resets
slat (nsec): min=330, max=191796, avg=1217.38, stdev=326.49
clat (nsec): min=260, max=778174, avg=6823.75, stdev=2969.88
lat (usec): min=3, max=779, avg= 8.04, stdev= 3.26
clat percentiles (nsec):
| 1.00th=[ 3312], 5.00th=[ 3440], 10.00th=[ 3504], 20.00th=[ 3632],
| 30.00th=[ 3696], 40.00th=[ 4256], 50.00th=[ 8896], 60.00th=[ 9152],
| 70.00th=[ 9280], 80.00th=[ 9408], 90.00th=[ 9664], 95.00th=[10048],
| 99.00th=[11200], 99.50th=[12224], 99.90th=[14272], 99.95th=[16064],
| 99.99th=[19072]
bw ( KiB/s): min=37368, max=808168, per=100.00%, avg=459929.64, stdev=201721.42, samples=80
iops : min= 9344, max=202042, avg=114982.47, stdev=50430.27, samples=80
lat (nsec) : 500=0.01%
lat (usec) : 4=38.02%, 10=57.08%, 20=4.90%, 50=0.01%, 100=0.01%
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
cpu : usr=13.35%, sys=25.34%, ctx=4905240, majf=0, minf=48
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,4645696,0,1 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=292MiB/s (306MB/s), 292MiB/s-292MiB/s (306MB/s-306MB/s), io=17.7GiB (19.0GB), run=62150-62150msec

Disk stats (read/write):
nvme0n1: ios=95/319560, merge=0/4325957, ticks=11/7698292, in_queue=7091352, util=34.76%


Why is there such a huge difference between these results? 42 MB/s vs. 306 MB/s...

Even the old SATA SSDs perform better in a ZFS mirror:
# fio --filename=/tmp/test-fio.bin --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][w=7755KiB/s][w=1938 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=47846: Fri May 29 14:34:10 2020
write: IOPS=21.0k, BW=85.8MiB/s (90.0MB/s)(5229MiB/60910msec); 0 zone resets
slat (nsec): min=340, max=202386, avg=1385.65, stdev=581.93
clat (nsec): min=130, max=5498.6k, avg=42997.68, stdev=43586.78
lat (usec): min=6, max=5501, avg=44.38, stdev=43.59
clat percentiles (usec):
| 1.00th=[ 8], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 12],
| 30.00th=[ 13], 40.00th=[ 14], 50.00th=[ 16], 60.00th=[ 39],
| 70.00th=[ 46], 80.00th=[ 111], 90.00th=[ 116], 95.00th=[ 118],
| 99.00th=[ 125], 99.50th=[ 149], 99.90th=[ 180], 99.95th=[ 208],
| 99.99th=[ 498]
bw ( KiB/s): min=55168, max=217128, per=100.00%, avg=89147.58, stdev=30842.68, samples=119
iops : min=13792, max=54282, avg=22286.88, stdev=7710.68, samples=119
lat (nsec) : 250=0.01%
lat (usec) : 4=0.01%, 10=12.97%, 20=41.88%, 50=18.04%, 100=4.31%
lat (usec) : 250=22.76%, 500=0.02%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%
cpu : usr=4.16%, sys=4.60%, ctx=1338783, majf=0, minf=46
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,1338513,0,1 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=85.8MiB/s (90.0MB/s), 85.8MiB/s-85.8MiB/s (90.0MB/s-90.0MB/s), io=5229MiB (5483MB), run=60910-60910msec
 
This is caching; the spec sheet says 21k IOPS. Run the fio test with --direct and --sync to get a comparable result.
https://business.kioxia.com/en-us/ssd/data-center-ssd/xd5-1.html
Thanks. fio with direct and sync:
# fio --filename=/zfsr1nvme/test-fio.bin --sync=1 --direct=1 --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=13.0MiB/s][w=3578 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=40737: Fri May 29 15:20:40 2020
write: IOPS=4525, BW=17.7MiB/s (18.5MB/s)(1061MiB/60001msec); 0 zone resets
slat (nsec): min=341, max=204319, avg=1417.28, stdev=959.24
clat (usec): min=61, max=13452, avg=219.00, stdev=415.98
lat (usec): min=62, max=13453, avg=220.41, stdev=416.22
clat percentiles (usec):
| 1.00th=[ 65], 5.00th=[ 72], 10.00th=[ 82], 20.00th=[ 85],
| 30.00th=[ 88], 40.00th=[ 93], 50.00th=[ 100], 60.00th=[ 104],
| 70.00th=[ 112], 80.00th=[ 208], 90.00th=[ 289], 95.00th=[ 1139],
| 99.00th=[ 1942], 99.50th=[ 2147], 99.90th=[ 4228], 99.95th=[ 5342],
| 99.99th=[ 7767]
bw ( KiB/s): min= 4240, max=40496, per=100.00%, avg=18201.13, stdev=11745.36, samples=119
iops : min= 1060, max=10124, avg=4550.26, stdev=2936.34, samples=119
lat (usec) : 100=50.53%, 250=36.44%, 500=6.08%, 750=0.88%, 1000=0.70%
lat (msec) : 2=4.56%, 4=0.71%, 10=0.11%, 20=0.01%
cpu : usr=1.17%, sys=1.31%, ctx=271553, majf=8, minf=44
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,271513,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=17.7MiB/s (18.5MB/s), 17.7MiB/s-17.7MiB/s (18.5MB/s-18.5MB/s), io=1061MiB (1112MB), run=60001-60001msec


You are comparing the ZFS filesystem on a mirrored pool against a plain block device.
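To get closer to an apples-to-apples comparison you could, for example, point the same fio job at a zvol on the NVMe mirror instead of the raw device, so ZFS is in the path on both sides. A rough sketch (untested here; the zvol name is made up):

Code:
# Sketch: throwaway test zvol on the mirror, benchmarked like the raw device above
zfs create -V 10G zfsr1nvme/fiotest
fio --filename=/dev/zvol/zfsr1nvme/fiotest --sync=1 --direct=1 \
    --name=random-write --ioengine=posixaio --rw=randwrite \
    --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
zfs destroy zfsr1nvme/fiotest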
Compare that to the ZFS mirror on the two SATA SSDs where the OS is installed (sync+direct):
# fio --filename=/tmp/test-fio.bin --sync=1 --direct=1 --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1093KiB/s][w=273 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=39270: Fri May 29 15:24:45 2020
write: IOPS=266, BW=1064KiB/s (1090kB/s)(62.4MiB/60003msec); 0 zone resets
slat (nsec): min=1092, max=208337, avg=5671.32, stdev=2313.41
clat (usec): min=1484, max=16804, avg=3749.91, stdev=1073.52
lat (usec): min=1488, max=16808, avg=3755.58, stdev=1073.59
clat percentiles (usec):
| 1.00th=[ 1942], 5.00th=[ 3130], 10.00th=[ 3228], 20.00th=[ 3359],
| 30.00th=[ 3458], 40.00th=[ 3523], 50.00th=[ 3556], 60.00th=[ 3589],
| 70.00th=[ 3621], 80.00th=[ 3687], 90.00th=[ 4015], 95.00th=[ 5735],
| 99.00th=[ 8717], 99.50th=[ 9896], 99.90th=[12911], 99.95th=[13960],
| 99.99th=[16450]
bw ( KiB/s): min= 680, max= 1240, per=99.99%, avg=1063.87, stdev=74.24, samples=120
iops : min= 170, max= 310, avg=265.94, stdev=18.57, samples=120
lat (msec) : 2=1.55%, 4=88.33%, 10=9.71%, 20=0.41%
cpu : usr=0.44%, sys=0.33%, ctx=15969, majf=0, minf=43
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,15963,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=1064KiB/s (1090kB/s), 1064KiB/s-1064KiB/s (1090kB/s-1090kB/s), io=62.4MiB (65.4MB), run=60003-60003msec

Code:
# pveversion
pve-manager/6.2-4/9824574a (running kernel: 5.4.41-1-pve)

Code:
# pveperf
CPU BOGOMIPS:      319387.52
REGEX/SECOND:      3132034
HD SIZE:           192.77 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     356.60
DNS EXT:           28.77 ms
DNS INT:           0.53 ms (local)
Code:
# pveperf /zfsr1nvme/
CPU BOGOMIPS:      319387.52
REGEX/SECOND:      3164174
HD SIZE:           3456.48 GB (zfsr1nvme)
FSYNCS/SECOND:     9124.26
DNS EXT:           29.14 ms
DNS INT:           0.53 ms (local)
 
There (the FSYNCS/SECOND on the NVMe pool) you can see it shine. That leaves you with tweaking ZFS for your workload.
Is 4.5k IOPS a normal result for NVMe in a ZFS mirror under a 4k randwrite workload? The spec says it should do 21k IOPS. Can you point me in the right direction for tuning ZFS for better 4k randwrite?

Also, my pveperf FSYNCS on the SATA SSD ZFS mirror are very low (356). The ZFS root was installed with defaults from the official Proxmox 6.2 ISO.
 
You can also optimise the recordsize, but whether optimising for the benchmark is also good for running a real workload is debatable.
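recordsize is a per-dataset property; something like the following would change it (only newly written blocks are affected, existing files keep their old record size):

Code:
# Example: 16K records on the data pool's root dataset
zfs set recordsize=16k zfsr1nvme
zfs get recordsize zfsr1nvme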
Thanks. I changed the default recordsize to 16K; the 4k randwrite result roughly doubled, but is still far from the spec:
# fio --filename=/zfsr1nvme/test-fio.bin --sync=1 --direct=1 --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=10.6MiB/s][w=2722 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=45817: Sun May 31 23:14:49 2020
write: IOPS=10.3k, BW=40.1MiB/s (42.0MB/s)(2404MiB/60001msec); 0 zone resets
slat (nsec): min=761, max=186084, avg=1195.44, stdev=401.08
clat (usec): min=60, max=62477, avg=95.84, stdev=294.64
lat (usec): min=61, max=62478, avg=97.04, stdev=294.69
clat percentiles (usec):
| 1.00th=[ 64], 5.00th=[ 68], 10.00th=[ 70], 20.00th=[ 72],
| 30.00th=[ 74], 40.00th=[ 76], 50.00th=[ 78], 60.00th=[ 80],
| 70.00th=[ 83], 80.00th=[ 87], 90.00th=[ 94], 95.00th=[ 102],
| 99.00th=[ 594], 99.50th=[ 1106], 99.90th=[ 2245], 99.95th=[ 2474],
| 99.99th=[ 5211]
bw ( KiB/s): min= 3336, max=54696, per=100.00%, avg=41131.67, stdev=16311.61, samples=119
iops : min= 834, max=13674, avg=10282.90, stdev=4077.88, samples=119
lat (usec) : 100=93.95%, 250=4.33%, 500=0.54%, 750=0.41%, 1000=0.23%
lat (msec) : 2=0.33%, 4=0.19%, 10=0.01%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=1.91%, sys=2.12%, ctx=615523, majf=1, minf=43
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,615475,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=40.1MiB/s (42.0MB/s), 40.1MiB/s-40.1MiB/s (42.0MB/s-42.0MB/s), io=2404MiB (2521MB), run=60001-60001msec
This is very important for my VMs running databases.
One last question: can I improve the performance of my NVMe mirror pool by adding a SATA SSD as a SLOG vdev? Or would I just get intact dirty data after a power loss and nothing more?
 
Can I improve the performance of my NVMe mirror pool by adding a SATA SSD as a SLOG vdev?

You need a faster device than your "normal" devices, so you would have to add something that is faster than your NVMe. Depending on the NVMe, this is very hard. The only option I can think of is Intel Optane (maybe even Optane DC Persistent Memory), but you have to check the numbers. Normally, with NVMe there is nothing you can do on a normal budget.
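If you want to experiment anyway, adding and removing a log vdev is non-destructive; roughly like this (the device path is a placeholder):

Code:
# Sketch: attach a separate log device (SLOG) and remove it again if it doesn't help.
# Only synchronous writes go through the SLOG; asynchronous writes won't get faster.
zpool add zfsr1nvme log /dev/disk/by-id/<fast-device>
zpool status zfsr1nvme        # the device appears under "logs"
zpool remove zfsr1nvme /dev/disk/by-id/<fast-device>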

Thanks. I changed the default recordsize to 16K; the 4k randwrite result roughly doubled, but is still far from the spec.

Then check with a 16K test if the volume is e.g. for MySQL/MariaDB; the numbers should be much better. I can also recommend splitting your VM disks into a normal operation disk (e.g. 4K volblocksize) and a MySQL/MariaDB one with 16K volblocksize. Testing throughput with a mismatched block size will give you tremendous write/read amplification.
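A rough sketch of that split (zvol names and sizes are made up; in Proxmox the volblocksize of newly created disks comes from the blocksize setting of the zfspool storage, if I remember correctly):

Code:
# Hypothetical example: separate zvols per workload
zfs create -V 32G -o volblocksize=4k  zfsr1nvme/vm-101-disk-os   # volblocksize can't be smaller than 2^ashift
zfs create -V 64G -o volblocksize=16k zfsr1nvme/vm-101-disk-db   # matches InnoDB's 16K pages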
 
Hi,

Also, in addition to what @LnxBil said, take into account that some specific configuration must be done for MySQL/MariaDB (a sketch follows after this list), for example:
- any InnoDB file/DB uses 16K pages
- MySQL/MariaDB checksums can be disabled, ZFS already does the same
- the MySQL/MariaDB log/intent log can be larger than 16K
- ZFS does not have direct I/O => disable directio for MySQL/MariaDB

NVMe devices cannot deliver that many IOPS from a single thread. Most SSDs internally use 16K pages, so I would also try ashift=13, at least for a test.
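As a rough sketch of those points (example values only, not tested on your box):

Code:
# ZFS dataset tuned for InnoDB data files (16K pages)
zfs create -o recordsize=16k -o primarycache=metadata zfsr1nvme/mysql-data
# Logs are written sequentially and can keep a larger recordsize
zfs create -o recordsize=128k zfsr1nvme/mysql-log

# Example /etc/mysql/conf.d/zfs.cnf:
#   [mysqld]
#   innodb_doublewrite        = 0      # ZFS already writes atomically and checksums
#   innodb_checksum_algorithm = none   # let ZFS do the checksumming
#   innodb_flush_method       = fsync  # ZFS has no O_DIRECT, so don't request it
#   innodb_flush_neighbors    = 0      # no benefit on SSD / copy-on-write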
 
Then check with a 16K test if the volume is e.g. for MySQL/MariaDB; the numbers should be much better.
True
~# fio --filename=/zfsr1nvme/test-fio.bin --sync=1 --direct=1 --name=random-write --ioengine=posixaio --rw=randwrite --bs=16k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=54.5MiB/s][w=3485 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=56926: Mon Jun 1 13:19:38 2020
write: IOPS=7400, BW=116MiB/s (121MB/s)(6938MiB/60001msec); 0 zone resets
slat (nsec): min=661, max=192186, avg=1498.57, stdev=533.72
clat (usec): min=89, max=47128, avg=133.18, stdev=264.47
lat (usec): min=90, max=47131, avg=134.68, stdev=264.53
clat percentiles (usec):
| 1.00th=[ 97], 5.00th=[ 100], 10.00th=[ 101], 20.00th=[ 103],
| 30.00th=[ 106], 40.00th=[ 109], 50.00th=[ 112], 60.00th=[ 116],
| 70.00th=[ 122], 80.00th=[ 128], 90.00th=[ 137], 95.00th=[ 145],
| 99.00th=[ 652], 99.50th=[ 1450], 99.90th=[ 2180], 99.95th=[ 2540],
| 99.99th=[ 4948]
bw ( KiB/s): min=13189, max=149568, per=99.94%, avg=118332.88, stdev=39487.27, samples=119
iops : min= 824, max= 9348, avg=7395.73, stdev=2467.99, samples=119
lat (usec) : 100=6.63%, 250=91.23%, 500=0.77%, 750=0.56%, 1000=0.19%
lat (msec) : 2=0.45%, 4=0.16%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=1.49%, sys=1.82%, ctx=444070, majf=0, minf=46
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,444025,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=116MiB/s (121MB/s), 116MiB/s-116MiB/s (121MB/s-121MB/s), io=6938MiB (7275MB), run=60001-60001msec

I can also recommend splitting your VM disks into a normal operation disk (e.g. 4K volblocksize) and a MySQL/MariaDB one with 16K volblocksize. Testing throughput with a mismatched block size will give you tremendous write/read amplification.
Done, thanks. As I see it, the alignment must be consistent at every level of the system (block devices, vdev, zvol, VM, etc.).
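A quick way to double-check the alignment at each layer (names are examples):

Code:
lsblk -o NAME,PHY-SEC,LOG-SEC /dev/nvme0n1        # sector sizes reported by the device
zpool get ashift zfsr1nvme                        # pool sector size: 2^13 = 8K here
zfs get volblocksize zfsr1nvme/vm-101-disk-db     # block size of the VM disk (example name)
mysql -e "SHOW VARIABLES LIKE 'innodb_page_size'" # inside the guest: 16K by default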

Also, in addition to what @LnxBil said, take into account that some specific configuration must be done for MySQL/MariaDB [...] I would also try ashift=13, at least for a test.
Thank you for these important comments, and yes, it will be VMs with MariaDB (InnoDB engine).
All of the last tests were run with ashift=13 on the NVMe pool.
Code:
~# zpool get all | grep ashift
rpool      ashift                         12                             local
zfsr1nvme  ashift                         13                             local
But the boot rpool has ashift=12 (a mirror of two old Samsung SM843T 240 GB drives) and its pveperf FSYNCS are far from ideal:
Code:
# pveperf
CPU BOGOMIPS:      319387.52
REGEX/SECOND:      3132034
HD SIZE:           192.77 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     356.60
DNS EXT:           28.77 ms
DNS INT:           0.53 ms (local)

FSYNCS/SECOND: 356.60. Can these numbers somehow affect the performance of the whole system, or does the rpool need some tuning too?
 
Hi again,

Each pool has its own settings. I have some PMX nodes with an HDD boot disk and an additional ZFS pool for data on SSD. I can say that I do not see any performance problems from using an HDD for the system, so I imagine an SSD pool is even better. In your case I would try to optimise only the data part (zfsr1nvme).

Good luck/ Bafta
 
