ZFS RAID-1 data pool: bad random write performance

freeman1doma

I have a Hetzner AX61 server with:

2x 240 GB SATA SSD for the OS (ZFS RAID-1, UEFI boot)

2x Toshiba NVMe U.2 KXD51RUE3T84, 3.84 TB (for data)

fio test on the data pool:
ZFS RAID-1
zpool with ashift=12, atime=off, compression=off
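For reference, a mirrored pool with these properties could have been created roughly like this (a sketch only, not the exact command used on this server; the by-id paths are the ones shown in zpool status below):

Code:
# Sketch: mirrored NVMe data pool, 4K sectors (ashift=12), atime and compression off
zpool create -o ashift=12 -O atime=off -O compression=off zfsr1nvme mirror \
    /dev/disk/by-id/nvme-KXD51RUE3T84_TOSHIBA_10JS1019T7UM \
    /dev/disk/by-id/nvme-KXD51RUE3T84_TOSHIBA_10JS101AT7UM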

# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME                                                     STATE     READ WRITE CKSUM
        rpool                                                    ONLINE       0     0     0
          mirror-0                                               ONLINE       0     0     0
            ata-SAMSUNG_MZ7WD240HAFV-00003_S16LNYAF402056-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7WD240HAFV-00003_S16LNYAD905297-part3  ONLINE       0     0     0

errors: No known data errors

  pool: zfsr1nvme
 state: ONLINE
  scan: none requested
config:

        NAME                                        STATE     READ WRITE CKSUM
        zfsr1nvme                                   ONLINE       0     0     0
          mirror-0                                  ONLINE       0     0     0
            nvme-KXD51RUE3T84_TOSHIBA_10JS1019T7UM  ONLINE       0     0     0
            nvme-KXD51RUE3T84_TOSHIBA_10JS101AT7UM  ONLINE       0     0     0

errors: No known data errors

# fio --filename=/zfsr1nvme/test-fio.bin --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=30.9MiB/s][w=7910 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=46963: Fri May 29 13:35:43 2020
write: IOPS=10.3k, BW=40.4MiB/s (42.3MB/s)(2440MiB/60453msec); 0 zone resets
slat (nsec): min=341, max=185623, avg=1043.96, stdev=362.64
clat (usec): min=5, max=560, avg=94.64, stdev=49.68
lat (usec): min=6, max=562, avg=95.68, stdev=49.68
clat percentiles (usec):
| 1.00th=[ 8], 5.00th=[ 10], 10.00th=[ 12], 20.00th=[ 35],
| 30.00th=[ 47], 40.00th=[ 112], 50.00th=[ 114], 60.00th=[ 122],
| 70.00th=[ 135], 80.00th=[ 141], 90.00th=[ 143], 95.00th=[ 145],
| 99.00th=[ 149], 99.50th=[ 153], 99.90th=[ 165], 99.95th=[ 176],
| 99.99th=[ 221]
bw ( KiB/s): min=28512, max=153304, per=100.00%, avg=41693.09, stdev=15178.40, samples=119
iops : min= 7128, max=38326, avg=10423.27, stdev=3794.60, samples=119
lat (usec) : 10=5.43%, 20=8.49%, 50=16.89%, 100=2.46%, 250=66.73%
lat (usec) : 500=0.01%, 750=0.01%
cpu : usr=2.13%, sys=1.99%, ctx=624613, majf=1, minf=41
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,624555,0,1 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=40.4MiB/s (42.3MB/s), 40.4MiB/s-40.4MiB/s (42.3MB/s-42.3MB/s), io=2440MiB (2558MB), run=60453-60453msec


# fio --filename=/dev/nvme0n1 --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=62193: Fri May 29 13:57:13 2020
write: IOPS=74.7k, BW=292MiB/s (306MB/s)(17.7GiB/62150msec); 0 zone resets
slat (nsec): min=330, max=191796, avg=1217.38, stdev=326.49
clat (nsec): min=260, max=778174, avg=6823.75, stdev=2969.88
lat (usec): min=3, max=779, avg= 8.04, stdev= 3.26
clat percentiles (nsec):
| 1.00th=[ 3312], 5.00th=[ 3440], 10.00th=[ 3504], 20.00th=[ 3632],
| 30.00th=[ 3696], 40.00th=[ 4256], 50.00th=[ 8896], 60.00th=[ 9152],
| 70.00th=[ 9280], 80.00th=[ 9408], 90.00th=[ 9664], 95.00th=[10048],
| 99.00th=[11200], 99.50th=[12224], 99.90th=[14272], 99.95th=[16064],
| 99.99th=[19072]
bw ( KiB/s): min=37368, max=808168, per=100.00%, avg=459929.64, stdev=201721.42, samples=80
iops : min= 9344, max=202042, avg=114982.47, stdev=50430.27, samples=80
lat (nsec) : 500=0.01%
lat (usec) : 4=38.02%, 10=57.08%, 20=4.90%, 50=0.01%, 100=0.01%
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
cpu : usr=13.35%, sys=25.34%, ctx=4905240, majf=0, minf=48
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,4645696,0,1 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=292MiB/s (306MB/s), 292MiB/s-292MiB/s (306MB/s-306MB/s), io=17.7GiB (19.0GB), run=62150-62150msec

Disk stats (read/write):
nvme0n1: ios=95/319560, merge=0/4325957, ticks=11/7698292, in_queue=7091352, util=34.76%


Why is there such a huge difference between these results? 42 MB/s vs. 306 MB/s...

Even the old SATA SSDs perform better in a ZFS mirror:
# fio --filename=/tmp/test-fio.bin --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][w=7755KiB/s][w=1938 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=47846: Fri May 29 14:34:10 2020
write: IOPS=21.0k, BW=85.8MiB/s (90.0MB/s)(5229MiB/60910msec); 0 zone resets
slat (nsec): min=340, max=202386, avg=1385.65, stdev=581.93
clat (nsec): min=130, max=5498.6k, avg=42997.68, stdev=43586.78
lat (usec): min=6, max=5501, avg=44.38, stdev=43.59
clat percentiles (usec):
| 1.00th=[ 8], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 12],
| 30.00th=[ 13], 40.00th=[ 14], 50.00th=[ 16], 60.00th=[ 39],
| 70.00th=[ 46], 80.00th=[ 111], 90.00th=[ 116], 95.00th=[ 118],
| 99.00th=[ 125], 99.50th=[ 149], 99.90th=[ 180], 99.95th=[ 208],
| 99.99th=[ 498]
bw ( KiB/s): min=55168, max=217128, per=100.00%, avg=89147.58, stdev=30842.68, samples=119
iops : min=13792, max=54282, avg=22286.88, stdev=7710.68, samples=119
lat (nsec) : 250=0.01%
lat (usec) : 4=0.01%, 10=12.97%, 20=41.88%, 50=18.04%, 100=4.31%
lat (usec) : 250=22.76%, 500=0.02%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%
cpu : usr=4.16%, sys=4.60%, ctx=1338783, majf=0, minf=46
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,1338513,0,1 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=85.8MiB/s (90.0MB/s), 85.8MiB/s-85.8MiB/s (90.0MB/s-90.0MB/s), io=5229MiB (5483MB), run=60910-60910msec
 
This is caching; the spec sheet says 21k IOPS. Run the fio test with --direct and --sync to get a comparable result.
https://business.kioxia.com/en-us/ssd/data-center-ssd/xd5-1.html
Thanks. fio with direct and sync:
# fio --filename=/zfsr1nvme/test-fio.bin --sync=1 --direct=1 --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=13.0MiB/s][w=3578 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=40737: Fri May 29 15:20:40 2020
write: IOPS=4525, BW=17.7MiB/s (18.5MB/s)(1061MiB/60001msec); 0 zone resets
slat (nsec): min=341, max=204319, avg=1417.28, stdev=959.24
clat (usec): min=61, max=13452, avg=219.00, stdev=415.98
lat (usec): min=62, max=13453, avg=220.41, stdev=416.22
clat percentiles (usec):
| 1.00th=[ 65], 5.00th=[ 72], 10.00th=[ 82], 20.00th=[ 85],
| 30.00th=[ 88], 40.00th=[ 93], 50.00th=[ 100], 60.00th=[ 104],
| 70.00th=[ 112], 80.00th=[ 208], 90.00th=[ 289], 95.00th=[ 1139],
| 99.00th=[ 1942], 99.50th=[ 2147], 99.90th=[ 4228], 99.95th=[ 5342],
| 99.99th=[ 7767]
bw ( KiB/s): min= 4240, max=40496, per=100.00%, avg=18201.13, stdev=11745.36, samples=119
iops : min= 1060, max=10124, avg=4550.26, stdev=2936.34, samples=119
lat (usec) : 100=50.53%, 250=36.44%, 500=6.08%, 750=0.88%, 1000=0.70%
lat (msec) : 2=4.56%, 4=0.71%, 10=0.11%, 20=0.01%
cpu : usr=1.17%, sys=1.31%, ctx=271553, majf=8, minf=44
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,271513,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=17.7MiB/s (18.5MB/s), 17.7MiB/s-17.7MiB/s (18.5MB/s-18.5MB/s), io=1061MiB (1112MB), run=60001-60001msec


You are comparing the ZFS filesystem on a mirrored pool against a plain block device.
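To get closer to an apples-to-apples comparison you could, for example, point the same fio job at a zvol on the NVMe mirror instead of the raw device, so ZFS is in the path on both sides. A rough sketch (untested here; the zvol name is made up):

Code:
# Sketch: throwaway test zvol on the mirror, benchmarked like the raw device above
zfs create -V 10G zfsr1nvme/fiotest
fio --filename=/dev/zvol/zfsr1nvme/fiotest --sync=1 --direct=1 \
    --name=random-write --ioengine=posixaio --rw=randwrite \
    --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
zfs destroy zfsr1nvme/fiotest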
Compare that to the ZFS mirror on the two SATA SSDs where the OS is installed (sync+direct):
# fio --filename=/tmp/test-fio.bin --sync=1 --direct=1 --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1093KiB/s][w=273 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=39270: Fri May 29 15:24:45 2020
write: IOPS=266, BW=1064KiB/s (1090kB/s)(62.4MiB/60003msec); 0 zone resets
slat (nsec): min=1092, max=208337, avg=5671.32, stdev=2313.41
clat (usec): min=1484, max=16804, avg=3749.91, stdev=1073.52
lat (usec): min=1488, max=16808, avg=3755.58, stdev=1073.59
clat percentiles (usec):
| 1.00th=[ 1942], 5.00th=[ 3130], 10.00th=[ 3228], 20.00th=[ 3359],
| 30.00th=[ 3458], 40.00th=[ 3523], 50.00th=[ 3556], 60.00th=[ 3589],
| 70.00th=[ 3621], 80.00th=[ 3687], 90.00th=[ 4015], 95.00th=[ 5735],
| 99.00th=[ 8717], 99.50th=[ 9896], 99.90th=[12911], 99.95th=[13960],
| 99.99th=[16450]
bw ( KiB/s): min= 680, max= 1240, per=99.99%, avg=1063.87, stdev=74.24, samples=120
iops : min= 170, max= 310, avg=265.94, stdev=18.57, samples=120
lat (msec) : 2=1.55%, 4=88.33%, 10=9.71%, 20=0.41%
cpu : usr=0.44%, sys=0.33%, ctx=15969, majf=0, minf=43
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,15963,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=1064KiB/s (1090kB/s), 1064KiB/s-1064KiB/s (1090kB/s-1090kB/s), io=62.4MiB (65.4MB), run=60003-60003msec

Code:
# pveversion
pve-manager/6.2-4/9824574a (running kernel: 5.4.41-1-pve)

Code:
# pveperf
CPU BOGOMIPS:      319387.52
REGEX/SECOND:      3132034
HD SIZE:           192.77 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     356.60
DNS EXT:           28.77 ms
DNS INT:           0.53 ms (local)
Code:
# pveperf /zfsr1nvme/
CPU BOGOMIPS:      319387.52
REGEX/SECOND:      3164174
HD SIZE:           3456.48 GB (zfsr1nvme)
FSYNCS/SECOND:     9124.26
DNS EXT:           29.14 ms
DNS INT:           0.53 ms (local)
 
There (the FSYNCS/SECOND on the NVMe pool) you can see it shine. That leaves you with tweaking ZFS for your workload.
Is 4.5k IOPS a normal result for NVMe in a ZFS mirror under a 4k randwrite workload? The spec says it should do 21k IOPS. Can you point me in the right direction for tuning ZFS for better 4k randwrite?

Also, my pveperf FSYNCS on the SATA SSD ZFS mirror are very low (356). The ZFS root was installed with defaults from the official Proxmox 6.2 ISO.
 
You can also optimise the recordsize, but whether optimising for the benchmark is also good for running a real workload is debatable.
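recordsize is a per-dataset property; something like the following would change it (only newly written blocks are affected, existing files keep their old record size):

Code:
# Example: 16K records on the data pool's root dataset
zfs set recordsize=16k zfsr1nvme
zfs get recordsize zfsr1nvme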
Thanks. I changed the default recordsize to 16K; the 4k randwrite result roughly doubled, but is still far from the spec:
# fio --filename=/zfsr1nvme/test-fio.bin --sync=1 --direct=1 --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=10.6MiB/s][w=2722 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=45817: Sun May 31 23:14:49 2020
write: IOPS=10.3k, BW=40.1MiB/s (42.0MB/s)(2404MiB/60001msec); 0 zone resets
slat (nsec): min=761, max=186084, avg=1195.44, stdev=401.08
clat (usec): min=60, max=62477, avg=95.84, stdev=294.64
lat (usec): min=61, max=62478, avg=97.04, stdev=294.69
clat percentiles (usec):
| 1.00th=[ 64], 5.00th=[ 68], 10.00th=[ 70], 20.00th=[ 72],
| 30.00th=[ 74], 40.00th=[ 76], 50.00th=[ 78], 60.00th=[ 80],
| 70.00th=[ 83], 80.00th=[ 87], 90.00th=[ 94], 95.00th=[ 102],
| 99.00th=[ 594], 99.50th=[ 1106], 99.90th=[ 2245], 99.95th=[ 2474],
| 99.99th=[ 5211]
bw ( KiB/s): min= 3336, max=54696, per=100.00%, avg=41131.67, stdev=16311.61, samples=119
iops : min= 834, max=13674, avg=10282.90, stdev=4077.88, samples=119
lat (usec) : 100=93.95%, 250=4.33%, 500=0.54%, 750=0.41%, 1000=0.23%
lat (msec) : 2=0.33%, 4=0.19%, 10=0.01%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=1.91%, sys=2.12%, ctx=615523, majf=1, minf=43
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,615475,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=40.1MiB/s (42.0MB/s), 40.1MiB/s-40.1MiB/s (42.0MB/s-42.0MB/s), io=2404MiB (2521MB), run=60001-60001msec
This is very important for my VMs running databases.
One last question: can I improve the performance of my NVMe mirror pool by adding a SATA SSD as a SLOG vdev? Or would I just get intact dirty data after a power loss and nothing more?
 
Can I improve the performance of my NVMe mirror pool by adding a SATA SSD as a SLOG vdev?

You need a faster device than your "normal" devices, so you would have to add something that is faster than your NVMe. Depending on the NVMe, this is very hard. The only option I can think of is Intel Optane (maybe even Optane DC Persistent Memory), but you have to check the numbers. Normally, with NVMe there is nothing you can do on a normal budget.
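If you want to experiment anyway, adding and removing a log vdev is non-destructive; roughly like this (the device path is a placeholder):

Code:
# Sketch: attach a separate log device (SLOG) and remove it again if it doesn't help.
# Only synchronous writes go through the SLOG; asynchronous writes won't get faster.
zpool add zfsr1nvme log /dev/disk/by-id/<fast-device>
zpool status zfsr1nvme        # the device appears under "logs"
zpool remove zfsr1nvme /dev/disk/by-id/<fast-device>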

Thanks. I changed the default recordsize to 16K; the 4k randwrite result roughly doubled, but is still far from the spec.

Then check with a 16K test if the volume is e.g. for MySQL/MariaDB; the numbers should be much better. I can also recommend splitting your VM disks into a normal operation disk (e.g. 4K volblocksize) and a MySQL/MariaDB one with 16K volblocksize. Testing throughput with a mismatched block size will give you tremendous write/read amplification.
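A rough sketch of that split (zvol names and sizes are made up; in Proxmox the volblocksize of newly created disks comes from the blocksize setting of the zfspool storage, if I remember correctly):

Code:
# Hypothetical example: separate zvols per workload
zfs create -V 32G -o volblocksize=4k  zfsr1nvme/vm-101-disk-os   # volblocksize can't be smaller than 2^ashift
zfs create -V 64G -o volblocksize=16k zfsr1nvme/vm-101-disk-db   # matches InnoDB's 16K pages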
 
Hi,

Also, in addition to what @LnxBil said, take into account that some specific configuration must be done for MySQL/MariaDB (a sketch follows after this list), for example:
- any InnoDB file/DB uses 16K pages
- MySQL/MariaDB checksums can be disabled, ZFS already does the same
- the MySQL/MariaDB log/intent log can be larger than 16K
- ZFS does not have direct I/O => disable directio for MySQL/MariaDB

NVMe devices cannot deliver that many IOPS from a single thread. Most SSDs internally use 16K pages, so I would also try ashift=13, at least for a test.
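As a rough sketch of those points (example values only, not tested on your box):

Code:
# ZFS dataset tuned for InnoDB data files (16K pages)
zfs create -o recordsize=16k -o primarycache=metadata zfsr1nvme/mysql-data
# Logs are written sequentially and can keep a larger recordsize
zfs create -o recordsize=128k zfsr1nvme/mysql-log

# Example /etc/mysql/conf.d/zfs.cnf:
#   [mysqld]
#   innodb_doublewrite        = 0      # ZFS already writes atomically and checksums
#   innodb_checksum_algorithm = none   # let ZFS do the checksumming
#   innodb_flush_method       = fsync  # ZFS has no O_DIRECT, so don't request it
#   innodb_flush_neighbors    = 0      # no benefit on SSD / copy-on-write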
 
Then check with a 16K test if the volume is e.g. for MySQL/MariaDB; the numbers should be much better.
True
~# fio --filename=/zfsr1nvme/test-fio.bin --sync=1 --direct=1 --name=random-write --ioengine=posixaio --rw=randwrite --bs=16k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=posixaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=54.5MiB/s][w=3485 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=56926: Mon Jun 1 13:19:38 2020
write: IOPS=7400, BW=116MiB/s (121MB/s)(6938MiB/60001msec); 0 zone resets
slat (nsec): min=661, max=192186, avg=1498.57, stdev=533.72
clat (usec): min=89, max=47128, avg=133.18, stdev=264.47
lat (usec): min=90, max=47131, avg=134.68, stdev=264.53
clat percentiles (usec):
| 1.00th=[ 97], 5.00th=[ 100], 10.00th=[ 101], 20.00th=[ 103],
| 30.00th=[ 106], 40.00th=[ 109], 50.00th=[ 112], 60.00th=[ 116],
| 70.00th=[ 122], 80.00th=[ 128], 90.00th=[ 137], 95.00th=[ 145],
| 99.00th=[ 652], 99.50th=[ 1450], 99.90th=[ 2180], 99.95th=[ 2540],
| 99.99th=[ 4948]
bw ( KiB/s): min=13189, max=149568, per=99.94%, avg=118332.88, stdev=39487.27, samples=119
iops : min= 824, max= 9348, avg=7395.73, stdev=2467.99, samples=119
lat (usec) : 100=6.63%, 250=91.23%, 500=0.77%, 750=0.56%, 1000=0.19%
lat (msec) : 2=0.45%, 4=0.16%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=1.49%, sys=1.82%, ctx=444070, majf=0, minf=46
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,444025,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=116MiB/s (121MB/s), 116MiB/s-116MiB/s (121MB/s-121MB/s), io=6938MiB (7275MB), run=60001-60001msec

I can also recommend splitting your VM disks into a normal operation disk (e.g. 4K volblocksize) and a MySQL/MariaDB one with 16K volblocksize. Testing throughput with a mismatched block size will give you tremendous write/read amplification.
Done, thanks. As I see it, the alignment must be consistent at every level of the system (block devices, vdev, zvol, VM, etc.).
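A quick way to double-check the alignment at each layer (names are examples):

Code:
lsblk -o NAME,PHY-SEC,LOG-SEC /dev/nvme0n1        # sector sizes reported by the device
zpool get ashift zfsr1nvme                        # pool sector size: 2^13 = 8K here
zfs get volblocksize zfsr1nvme/vm-101-disk-db     # block size of the VM disk (example name)
mysql -e "SHOW VARIABLES LIKE 'innodb_page_size'" # inside the guest: 16K by default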

Also, in addition to what @LnxBil said, take into account that some specific configuration must be done for MySQL/MariaDB [...] I would also try ashift=13, at least for a test.
Thank you for these important comments, and yes, it will be VMs with MariaDB (InnoDB engine).
All of the last tests were run with ashift=13 on the NVMe pool.
Code:
~# zpool get all | grep ashift
rpool      ashift                         12                             local
zfsr1nvme  ashift                         13                             local
But the boot rpool has ashift=12 (a mirror of two old Samsung SM843T 240 GB drives) and its pveperf FSYNCS are far from ideal:
Code:
# pveperf
CPU BOGOMIPS:      319387.52
REGEX/SECOND:      3132034
HD SIZE:           192.77 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     356.60
DNS EXT:           28.77 ms
DNS INT:           0.53 ms (local)

FSYNCS/SECOND: 356.60. Can these numbers somehow affect the performance of the whole system, or does the rpool need some tuning too?
 
Hi again,

Each pool has its own settings. I have some PMX nodes with an HDD boot disk and an additional ZFS pool for data on SSD. I can say that I do not see any performance problems from using an HDD for the system, so I imagine an SSD pool is even better. In your case I would try to optimise only the data part (zfsr1nvme).

Good luck/ Bafta
 
