Proxmox VE ZFS Benchmark with NVMe

Hi again,


It would be very interesting for me if you could run just these two tests:


1. (ZFS 16k) fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2

2. (ext4) fio --filename=/mnt/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2

During these tests:

- enable write logging with: echo 1 > /proc/sys/vm/block_dump
- then you can see how many blocks are affected, like this:

Code:
Feb 16 16:23:47 pv2 kernel: [16430.235270] z_wr_int(1576): WRITE block 1694519248 on sdc1 (24 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235490] z_wr_iss(1556): WRITE block 1577084816 on sdc1 (8 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235500] z_wr_iss(1566): WRITE block 1510030224 on sdc1 (8 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235579] z_wr_iss(1570): WRITE block 1560817504 on sdc1 (8 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235588] z_wr_int(1574): WRITE block 1510030232 on sdc1 (24 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235589] z_wr_int(1576): WRITE block 1560817512 on sdc1 (8 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235592] z_wr_iss(1568): WRITE block 1526807432 on sdc1 (32 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235679] z_wr_int(1577): WRITE block 1526807464 on sdc1 (16 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235718] z_wr_int(1577): WRITE block 1560817496 on sdc1 (8 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235741] z_wr_int(1574): WRITE block 1560817520 on sdc1 (8 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235759] z_wr_int(1578): WRITE block 1577084808 on sdc1 (8 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235783] z_wr_int(1574): WRITE block 1577084824 on sdc1 (24 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235810] z_wr_int(1577): WRITE block 1593862080 on sdc1 (40 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235850] z_wr_int(1579): WRITE block 1680651808 on sdc1 (32 sectors)
Feb 16 16:23:47 pv2 kernel: [16430.235924] z_wr_int(1577): WRITE block 1694519248 on sdc1 (136 sectors)

It would be interesting to see one sample log after each test (1 and 2), and to check whether there are still any WRITE block IOs in the log after fio has finished the test!
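If it helps, one way to capture that around a run could look like this (just a sketch; adjust the log file name as you like):

Code:
echo 1 > /proc/sys/vm/block_dump                        # enable block write logging
dmesg -wT | grep 'WRITE block' > /root/blockdump.log &  # follow the kernel log in the background
# ... run the fio test ...
sleep 30                                                # keep capturing for a while after fio exits
kill %1
echo 0 > /proc/sys/vm/block_dump                        # turn logging off again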




Thx a lot!

I guess you have atime=off set on your ZFS test dataset?
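If not, something like this would do it (a sketch; replace nvme01 with your actual pool/dataset name):

Code:
zfs get atime nvme01
zfs set atime=off nvme01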


Good luck / Bafta!
 
@aaron I am not using the DC1000B for testing; it is advertised as a boot disk for servers. I had one sample and only used it to make sure that it is not the consumer drive that is at fault. The read/write results are even lower with the DC1000. For testing I am using a cheap SN750. I just want to test ZFS with an NVMe drive.


Here is the full benchmark:

Code:
DIRECT TO MOUNTPOINT ((( ZFS recordsize=default )))

root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=48.5k, BW=189MiB/s (199MB/s)(11.1GiB/60002msec)


root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=1M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=5169, BW=5169MiB/s (5420MB/s)(60.0GiB/11886msec)


root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=4M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=1164, BW=4658MiB/s (4884MB/s)(60.0GiB/13191msec)


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
test2: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
  write: IOPS=9624, BW=37.6MiB/s (39.4MB/s)(2256MiB/60001msec); 0 zone resets


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=1M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
  write: IOPS=1212, BW=1212MiB/s (1271MB/s)(60.0GiB/50691msec); 0 zone resets


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
  write: IOPS=302, BW=1209MiB/s (1268MB/s)(60.0GiB/50809msec); 0 zone resets



DIRECT TO MOUNTPOINT ((( ZFS recordsize=16K )))

root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=248k, BW=967MiB/s (1014MB/s)(56.7GiB/60002msec)


root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=1M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=3829, BW=3830MiB/s (4016MB/s)(60.0GiB/16042msec)


root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=4M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=785, BW=3143MiB/s (3296MB/s)(60.0GiB/19549msec)


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
  write: IOPS=49.9k, BW=195MiB/s (204MB/s)(11.4GiB/60001msec); 0 zone resets


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=1M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
  write: IOPS=1074, BW=1074MiB/s (1126MB/s)(60.0GiB/57194msec); 0 zone resets


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
  write: IOPS=270, BW=1084MiB/s (1136MB/s)(60.0GiB/56694msec); 0 zone resets



DIRECT TO MOUNTPOINT ((( ZFS recordsize=32K  )))

root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=153k, BW=598MiB/s (627MB/s)(35.0GiB/60002msec)


root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=1M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=4131, BW=4132MiB/s (4332MB/s)(60.0GiB/14871msec)


root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=4M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=837, BW=3352MiB/s (3515MB/s)(60.0GiB/18330msec)


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
  write: IOPS=34.4k, BW=134MiB/s (141MB/s)(8057MiB/60001msec); 0 zone resets


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=1M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
  write: IOPS=1156, BW=1157MiB/s (1213MB/s)(60.0GiB/53120msec); 0 zone resets


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
  write: IOPS=261, BW=1044MiB/s (1095MB/s)(60.0GiB/58837msec); 0 zone resets



DIRECT TO MOUNTPOINT ((( ZFS recordsize=64K )))

root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=99.8k, BW=390MiB/s (409MB/s)(22.8GiB/60001msec)


root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=1M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=4646, BW=4646MiB/s (4872MB/s)(60.0GiB/13223msec)


root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=4M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=1129, BW=4519MiB/s (4738MB/s)(60.0GiB/13597msec)


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
  write: IOPS=16.5k, BW=64.4MiB/s (67.5MB/s)(3862MiB/60002msec); 0 zone resets


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=1M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
  write: IOPS=1002, BW=1003MiB/s (1052MB/s)(58.8GiB/60004msec); 0 zone resets


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
  write: IOPS=248, BW=995MiB/s (1043MB/s)(58.3GiB/60018msec); 0 zone resets



DIRECT TO MOUNTPOINT ((( EXT4  )))


root@pve01:~# fio --filename=/mnt/nvme01/test1 --rw=randread --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=286k, BW=1118MiB/s (1172MB/s)(60.0GiB/54949msec)


root@pve01:~# fio --filename=/mnt/nvme01/test1 --rw=randread --ioengine=libaio --bs=1M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=7421, BW=7421MiB/s (7782MB/s)(60.0GiB/8279msec)


root@pve01:~# fio --filename=/mnt/nvme01/test1 --rw=randread --ioengine=libaio --bs=4M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
  read: IOPS=1746, BW=6985MiB/s (7324MB/s)(60.0GiB/8796msec)


root@pve01:~# fio --filename=/mnt/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
  write: IOPS=260k, BW=1014MiB/s (1063MB/s)(59.4GiB/60001msec); 0 zone resets


root@pve01:~# fio --filename=/mnt/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=1M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2


root@pve01:~# fio --filename=/mnt/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4M --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
  write: IOPS=408, BW=1634MiB/s (1714MB/s)(60.0GiB/37598msec); 0 zone resets
 
@guletz both ext4 and ZFS still show WRITE blocks after fio has finished.


Code:
ZFS


Feb 16 20:43:38 pve01 kernel: [84848.002858] z_wr_int(9835): WRITE block 33602760 on nvme0n1p1 (48 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.002911] z_wr_iss(49339): WRITE block 50340888 on nvme0n1p1 (32 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.002923] z_wr_iss(49325): WRITE block 58729496 on nvme0n1p1 (32 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.002933] z_wr_iss(49322): WRITE block 41991176 on nvme0n1p1 (240 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.002935] z_wr_iss(49332): WRITE block 318880728 on nvme0n1p1 (72 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.002944] z_wr_iss(49346): WRITE block 327165224 on nvme0n1p1 (168 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.002950] z_wr_int(9831): WRITE block 10569656 on nvme0n1p1 (72 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.003329] z_wr_iss(49331): WRITE block 16799792 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.003346] z_wr_int(9873): WRITE block 25188392 on nvme0n1p1 (16 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.003351] z_wr_int(49349): WRITE block 10569728 on nvme0n1p1 (112 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.003486] z_wr_int(9856): WRITE block 50353944 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.003489] z_wr_int(9854): WRITE block 33602808 on nvme0n1p1 (120 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.003502] z_wr_iss(49331): WRITE block 41991416 on nvme0n1p1 (120 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.003549] z_wr_int(9853): WRITE block 58742552 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.003564] z_wr_int(9863): WRITE block 318880224 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.003628] z_wr_int(9849): WRITE block 327163912 on nvme0n1p1 (16 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.003631] z_wr_int(9872): WRITE block 318880800 on nvme0n1p1 (112 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.003725] z_wr_int(9873): WRITE block 10569152 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.003761] z_wr_int(9835): WRITE block 327165400 on nvme0n1p1 (256 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.004799] z_wr_iss(49338): WRITE block 33602888 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.004811] z_wr_iss(49338): WRITE block 41991496 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.004834] z_wr_int(9863): WRITE block 327163976 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.004860] z_wr_int(9884): WRITE block 327165576 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.004878] z_wr_int(9856): WRITE block 327165656 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.004926] z_wr_int(9849): WRITE block 10569840 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.004951] z_wr_int(9894): WRITE block 16799520 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005023] z_wr_iss(49330): WRITE block 25188128 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005068] z_wr_iss(49315): WRITE block 33602320 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005071] z_wr_int(9847): WRITE block 25188136 on nvme0n1p1 (32 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005088] z_wr_int(9869): WRITE block 16799528 on nvme0n1p1 (32 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005146] z_wr_int(9871): WRITE block 16799560 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005174] z_wr_int(9875): WRITE block 16799776 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005196] z_wr_int(49353): WRITE block 25188168 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005218] z_wr_int(9882): WRITE block 25188384 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005227] z_wr_int(9854): WRITE block 33602360 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005259] z_wr_int(49354): WRITE block 33602424 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005264] z_wr_int(9880): WRITE block 33602552 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005293] z_wr_int(9855): WRITE block 33602624 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005300] z_wr_int(9845): WRITE block 33602896 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005327] z_wr_int(9888): WRITE block 41990928 on nvme0n1p1 (8 sectors)
Feb 16 20:43:38 pve01 kernel: [84848.005338] z_wr_int(9846): WRITE block 41990968 on nvme0n1p1 (8 sectors)




EXT4


Feb 16 20:28:59 pve01 kernel: [83968.237384] kworker/u114:2(54634): WRITE block 3785256 on nvme0n1p1 (16 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237389] kworker/u114:2(54634): WRITE block 3785280 on nvme0n1p1 (24 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237392] kworker/u114:2(54634): WRITE block 3785312 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237394] kworker/u114:2(54634): WRITE block 3785408 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237397] kworker/u114:2(54634): WRITE block 3785456 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237400] kworker/u114:2(54634): WRITE block 3785832 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237403] kworker/u114:2(54634): WRITE block 3785848 on nvme0n1p1 (16 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237405] kworker/u114:2(54634): WRITE block 3785896 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237408] kworker/u114:2(54634): WRITE block 3786688 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237411] kworker/u114:2(54634): WRITE block 3787032 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237414] kworker/u114:2(54634): WRITE block 3787072 on nvme0n1p1 (16 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237434] kworker/u114:2(54634): WRITE block 3787104 on nvme0n1p1 (16 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237436] kworker/u114:2(54634): WRITE block 3787128 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237439] kworker/u114:2(54634): WRITE block 3787328 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237444] kworker/u114:2(54634): WRITE block 3787352 on nvme0n1p1 (32 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237447] kworker/u114:2(54634): WRITE block 3787432 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237449] kworker/u114:2(54634): WRITE block 3787456 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237452] kworker/u114:2(54634): WRITE block 3787488 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237456] kworker/u114:2(54634): WRITE block 3787544 on nvme0n1p1 (32 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237462] kworker/u114:2(54634): WRITE block 3787592 on nvme0n1p1 (40 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237473] kworker/u114:2(54634): WRITE block 3787712 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237477] kworker/u114:2(54634): WRITE block 3787824 on nvme0n1p1 (40 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237481] kworker/u114:2(54634): WRITE block 3787872 on nvme0n1p1 (24 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237486] kworker/u114:2(54634): WRITE block 3787904 on nvme0n1p1 (24 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237491] kworker/u114:2(54634): WRITE block 3787984 on nvme0n1p1 (16 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237496] kworker/u114:2(54634): WRITE block 3788024 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237500] kworker/u114:2(54634): WRITE block 3788104 on nvme0n1p1 (24 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237511] kworker/u114:2(54634): WRITE block 3788144 on nvme0n1p1 (16 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237513] kworker/u114:2(54634): WRITE block 3788176 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237516] kworker/u114:2(54634): WRITE block 3788208 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237519] kworker/u114:2(54634): WRITE block 3788320 on nvme0n1p1 (24 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237525] kworker/u114:2(54634): WRITE block 3788360 on nvme0n1p1 (24 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237528] kworker/u114:2(54634): WRITE block 3788464 on nvme0n1p1 (16 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237533] kworker/u114:2(54634): WRITE block 3788488 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237537] kworker/u114:2(54634): WRITE block 3788984 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237540] kworker/u114:2(54634): WRITE block 3789000 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237544] kworker/u114:2(54634): WRITE block 3789032 on nvme0n1p1 (8 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237549] kworker/u114:2(54634): WRITE block 3789048 on nvme0n1p1 (40 sectors)
Feb 16 20:28:59 pve01 kernel: [83968.237555] kworker/u114:2(54634): WRITE block 3789304 on nvme0n1p1 (16 sectors)
 
I am using a cheap SN750
Okay, well, the read speeds (bandwidth) with the 1M and 4M block sizes do look good to me. Too good, to be honest, at about 5000 MB/s. It looks like the ZFS cache (ARC) is involved. You can try setting the primarycache for the dataset to metadata only with:
Code:
zfs set primarycache=metadata POOL/DATASET
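To go back to the default caching behavior after the test:
Code:
zfs set primarycache=all POOL/DATASET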

The writes with the larger block sizes also look okay for a cheaper SSD, at a bit over 1000 MB/s.

Again, just to be clear: the 4k tests will run into the IOPS limit and the 1M/4M tests will run into the bandwidth limit. Reality will be somewhere in between, depending on the use case, overall load, caching and so on.
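As a worked example from the numbers above: the recordsize=16K run reports 49.9k IOPS at bs=4k, and 49.9k x 4 KiB ≈ 195 MiB/s, which is exactly the bandwidth shown, while the 1M run reaches ~1074 MiB/s from only ~1074 IOPS.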
 
First, thanks for all the help.

@aaron sure, this is cached. I don't think a single PCIe Gen3 x4 NVMe can ever reach more than 5000 MB/s.

For sure the speeds are OK, but this does not explain why ext4 is so much faster than ZFS in random read/write. For now, NVMe plus ZFS seems like a waste of money; you could reach the same (or better) speeds with cheaper 2.5-inch SSDs in a RAID-Z. Maybe I'm missing something...?

@guletz

Yes, for longer than 5 secs ;)

root@pve01:~# nvme id-ns /dev/nvme0n1 -n 1 -H | tail -n 2
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good
LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better (in use)
 
I checked the speeds in the newest TrueNAS version, which should use OpenZFS 2.0; the results are the same.
 
Since the Micron 9300 that we use in the benchmark paper supports different block sizes that can be configured for the namespaces, we did some testing to see how they affect performance.

We tested the setup as in the benchmark paper:

IOPS tests​

A mirror pool on the NVMEs with the default 512 B block size, and a zvol with the default 8k volblocksize:
Code:
# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=181MiB/s][w=46.3k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=95469: Fri Jan 15 12:54:59 2021
  write: IOPS=66.2k, BW=258MiB/s (271MB/s)(151GiB/600002msec); 0 zone resets
    clat (usec): min=59, max=143209, avg=482.78, stdev=1112.45
     lat (usec): min=59, max=143210, avg=482.95, stdev=1112.45
    clat percentiles (usec):
     |  1.00th=[  215],  5.00th=[  245], 10.00th=[  265], 20.00th=[  293],
     | 30.00th=[  322], 40.00th=[  359], 50.00th=[  404], 60.00th=[  445],
     | 70.00th=[  494], 80.00th=[  578], 90.00th=[  742], 95.00th=[  996],
     | 99.00th=[ 1401], 99.50th=[ 1614], 99.90th=[ 5145], 99.95th=[ 7701],
     | 99.99th=[12649]
   bw (  KiB/s): min= 4856, max=11424, per=3.13%, avg=8270.95, stdev=1912.80, samples=38369
   iops        : min= 1214, max= 2856, avg=2067.72, stdev=478.20, samples=38369
  lat (usec)   : 100=0.03%, 250=5.97%, 500=64.97%, 750=19.27%, 1000=4.77%
  lat (msec)   : 2=4.76%, 4=0.11%, 10=0.09%, 20=0.01%, 100=0.01%
  lat (msec)   : 250=0.01%
  cpu          : usr=0.45%, sys=24.12%, ctx=281344506, majf=0, minf=349
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,39696863,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=258MiB/s (271MB/s), 258MiB/s-258MiB/s (271MB/s-271MB/s), io=151GiB (163GB), run=600002-600002msec

The result of 46k IOPS is in the ballpark of the result of the benchmark paper. So far no surprise.
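For reference, the pool and zvol for this test were created roughly like this (a sketch, not the exact commands; the device names and the zvol size are placeholders):

Code:
zpool create tank mirror /dev/nvme0n1 /dev/nvme1n1
zfs create -V 10G tank/test      # volblocksize defaults to 8k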

Recreating the test on the same kind of NVMEs but with the block size set to 4k:
Code:
# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=297MiB/s][w=75.9k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=121126: Fri Jan 15 13:20:53 2021
  write: IOPS=85.9k, BW=335MiB/s (352MB/s)(197GiB/600002msec); 0 zone resets
    clat (usec): min=58, max=144238, avg=371.76, stdev=1097.88
     lat (usec): min=58, max=144238, avg=371.95, stdev=1097.88
    clat percentiles (usec):
     |  1.00th=[  196],  5.00th=[  225], 10.00th=[  239], 20.00th=[  258],
     | 30.00th=[  273], 40.00th=[  289], 50.00th=[  302], 60.00th=[  322],
     | 70.00th=[  347], 80.00th=[  392], 90.00th=[  498], 95.00th=[  676],
     | 99.00th=[ 1287], 99.50th=[ 1532], 99.90th=[ 5932], 99.95th=[ 8029],
     | 99.99th=[12387]
   bw (  KiB/s): min= 7456, max=11680, per=3.12%, avg=10730.64, stdev=866.24, samples=38374
   iops        : min= 1864, max= 2920, avg=2682.64, stdev=216.56, samples=38374
  lat (usec)   : 100=0.09%, 250=15.87%, 500=74.23%, 750=5.73%, 1000=1.90%
  lat (msec)   : 2=1.98%, 4=0.06%, 10=0.11%, 20=0.02%, 50=0.01%
  lat (msec)   : 250=0.01%
  cpu          : usr=0.56%, sys=37.53%, ctx=339985859, majf=0, minf=404
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,51513479,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=335MiB/s (352MB/s), 335MiB/s-335MiB/s (352MB/s-352MB/s), io=197GiB (211GB), run=600002-600002msec

As you can see, with the larger 4k block size for the NVME namespace we get ~76k IOPS, which is close to double the IOPS performance.

Bandwidth tests:​

NVME 512b blocksize pool:
Code:
# fio --ioengine=psync --filename=/dev/zvol/tank/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=1810MiB/s][w=452 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=81902: Fri Jan 15 12:44:31 2021
  write: IOPS=431, BW=1727MiB/s (1811MB/s)(1012GiB/600016msec); 0 zone resets
    clat (msec): min=2, max=403, avg=73.90, stdev=22.84
     lat (msec): min=3, max=403, avg=74.10, stdev=22.86
    clat percentiles (msec):
     |  1.00th=[   43],  5.00th=[   49], 10.00th=[   55], 20.00th=[   59],
     | 30.00th=[   64], 40.00th=[   68], 50.00th=[   72], 60.00th=[   77],
     | 70.00th=[   82], 80.00th=[   86], 90.00th=[   89], 95.00th=[   93],
     | 99.00th=[  197], 99.50th=[  222], 99.90th=[  271], 99.95th=[  284],
     | 99.99th=[  313]
   bw (  KiB/s): min= 8192, max=98304, per=3.12%, avg=55256.56, stdev=12647.86, samples=38399
   iops        : min=    2, max=   24, avg=13.43, stdev= 3.10, samples=38399
  lat (msec)   : 4=0.01%, 10=0.01%, 20=0.06%, 50=5.71%, 100=90.53%
  lat (msec)   : 250=3.46%, 500=0.21%
  cpu          : usr=0.27%, sys=4.86%, ctx=3678933, majf=0, minf=359
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,259085,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1727MiB/s (1811MB/s), 1727MiB/s-1727MiB/s (1811MB/s-1811MB/s), io=1012GiB (1087GB), run=600016-600016msec

We get about 1700MB/s Bandwidth.

NVME 4k blocksize pool
Code:
# fio --ioengine=psync --filename=/dev/zvol/tank4k/test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=1280MiB/s][w=320 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=201124: Fri Jan 15 12:29:19 2021
  write: IOPS=454, BW=1818MiB/s (1907MB/s)(1066GiB/600049msec); 0 zone resets
    clat (msec): min=3, max=411, avg=70.14, stdev=25.82
     lat (msec): min=3, max=411, avg=70.39, stdev=25.83
    clat percentiles (msec):
     |  1.00th=[   46],  5.00th=[   52], 10.00th=[   54], 20.00th=[   58],
     | 30.00th=[   61], 40.00th=[   64], 50.00th=[   67], 60.00th=[   70],
     | 70.00th=[   75], 80.00th=[   79], 90.00th=[   83], 95.00th=[   89],
     | 99.00th=[  230], 99.50th=[  271], 99.90th=[  338], 99.95th=[  359],
     | 99.99th=[  388]
   bw (  KiB/s): min=16384, max=98304, per=3.12%, avg=58180.81, stdev=12074.67, samples=38400
   iops        : min=    4, max=   24, avg=14.17, stdev= 2.95, samples=38400
  lat (msec)   : 4=0.01%, 10=0.01%, 20=0.03%, 50=3.77%, 100=94.07%
  lat (msec)   : 250=1.37%, 500=0.75%
  cpu          : usr=0.34%, sys=6.25%, ctx=3648814, majf=0, minf=346
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,272779,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1818MiB/s (1907MB/s), 1818MiB/s-1818MiB/s (1907MB/s-1907MB/s), io=1066GiB (1144GB), run=600049-600049msec

With the 4k block size namespaces, the bandwidth is not significantly higher (~1800 MB/s).

The output of nvme list shows the NVMEs configured with the different block sizes:

Code:
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     194525xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300DN0
/dev/nvme1n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300DN0
/dev/nvme2n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB      4 KiB +  0 B   11300DN0
/dev/nvme3n1     195025xxxxxx         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB      4 KiB +  0 B   11300DN0

TL;DR:​

Changing the block size of the NVME namespace can improve performance. Tested with 512b and 4k NVME block sizes and a ZFS mirror with a zvol (8k volblocksize).

512b NVME block size: ~46k IOPS, ~1700MB/s bandwidth
4k NVME block size: ~75k IOPS, ~1800MB/s bandwidth
Hi Aaron, how can you change the block size of the NVMe namespace?
 
Hi Aaron, how can you change the block size of the NVMe namespace?
It depends on the NVMe in use. For the Micron 9300 that we have, it is possible with the msecli tool provided by Micron.
 
It depends on the NVMe in use. For the Micron 9300 that we have, it is possible with the msecli tool provided by Micron.
In the case of Micron:
Code:
 msecli -N -f 1 -m 0 -g 4096 -n /dev/nvme1

msecli requires registration on their website to get the download link.
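Depending on the drive, it may also work with plain nvme-cli if the 4k LBA format is listed as supported (a sketch, not verified on the 9300; reformatting destroys all data on the namespace):

Code:
nvme id-ns /dev/nvme1n1 -H | grep 'LBA Format'   # list the supported LBA formats
nvme format /dev/nvme1n1 --lbaf=1                # switch to the 4k format (wipes the namespace!)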
 
Hello!

I've been testing two new PM9A3 U.2 drives (SAMSUNG MZQL23T8HCLS-00A07) and I'm getting really slow ZFS performance. Any help would be appreciated!
The methodology is as close to the benchmark paper as I could reproduce.

Hardware:
CPU: Single AMD EPYC 7252
Mainboard: Supermicro H11DSi Rev. 2
Controller: AOC-SLG3-2E4R-O PCIe to 2x U.2
System disk: Samsung SSD 860 PRO 256GB
Memory: 8x 16GB DDR4 Samsung M393A2K43CB2-CTD
Disks: 2x 3.84TB Samsung PM9A3 U.2 (SAMSUNG MZQL23T8HCLS-00A07)

Software version: (July 2021)
pve-manager/6.4-13/9f411e79 (running kernel: 5.4.124-1-pve)
zfs-2.0.4-pve1
zfs-kmod-2.0.4-pve1

The storage controller is not officially supported for this mainboard. Due to the lack of a JNVi2c connector, it was not connected. The disks showed up in the BIOS after setting the corresponding PCIe x8 slot to 4x4 bifurcation.

Fio to the block device:
fio --ioengine=libaio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio
bs: min value out of range: 0 (1 min)
bs: min value out of range: 0 (1 min)
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=209MiB/s][w=53.5k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=9462: Mon Jul 12 18:23:33 2021
write: IOPS=53.3k, BW=208MiB/s (219MB/s)(12.2GiB/60001msec); 0 zone resets
slat (nsec): min=1490, max=44679, avg=1638.51, stdev=138.12
clat (nsec): min=6930, max=68989, avg=16710.90, stdev=809.97
lat (nsec): min=17869, max=79909, avg=18388.80, stdev=833.28
clat percentiles (nsec):
| 1.00th=[16512], 5.00th=[16512], 10.00th=[16512], 20.00th=[16512],
| 30.00th=[16512], 40.00th=[16512], 50.00th=[16512], 60.00th=[16512],
| 70.00th=[16512], 80.00th=[16768], 90.00th=[16768], 95.00th=[17024],
| 99.00th=[20608], 99.50th=[21120], 99.90th=[25984], 99.95th=[26496],
| 99.99th=[28288]
bw ( KiB/s): min=212399, max=214520, per=100.00%, avg=213380.78, stdev=469.43, samples=119
iops : min=53099, max=53630, avg=53345.18, stdev=117.38, samples=119
lat (usec) : 10=0.01%, 20=98.02%, 50=1.98%, 100=0.01%
cpu : usr=6.92%, sys=12.52%, ctx=3200913, majf=7, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,3200903,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=208MiB/s (219MB/s), 208MiB/s-208MiB/s (219MB/s-219MB/s), io=12.2GiB (13.1GB), run=60001-60001msec

Disk stats (read/write):
nvme0n1: ios=0/3195342, merge=0/0, ticks=0/52710, in_queue=0, util=99.89%
The second disk looks about the same and seems to be in line with the table on page 2/14 of the ZFS-Benchmark-202011 paper.


The zpool is set up with the config:

options zfs zfs_arc_max=4294967296
primarycache=metadata
compression=off
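For reference, these were applied roughly like this (a sketch; I'm assuming the dataset is named test6 as in the fio path below, and the ARC limit goes into the ZFS module options):

# /etc/modprobe.d/zfs.conf (then update-initramfs -u and reboot)
options zfs zfs_arc_max=4294967296

zfs set primarycache=metadata test6
zfs set compression=off test6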

fio --ioengine=psync --filename=/test6/fio --size=9G --time_based --name=fio --group_reporting --runtime=60 --direct=1 --sync=1 --iodepth=1 --rw=write --threads --bs=4k --numjobs=1
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=41.5MiB/s][w=10.6k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=30672: Mon Jul 19 18:12:57 2021
write: IOPS=12.7k, BW=49.6MiB/s (52.0MB/s)(2978MiB/60001msec); 0 zone resets
clat (usec): min=45, max=39908, avg=78.04, stdev=174.95
lat (usec): min=46, max=39908, avg=78.15, stdev=174.95
clat percentiles (usec):
| 1.00th=[ 47], 5.00th=[ 48], 10.00th=[ 49], 20.00th=[ 50],
| 30.00th=[ 51], 40.00th=[ 51], 50.00th=[ 53], 60.00th=[ 61],
| 70.00th=[ 63], 80.00th=[ 65], 90.00th=[ 69], 95.00th=[ 75],
| 99.00th=[ 652], 99.50th=[ 685], 99.90th=[ 840], 99.95th=[ 906],
| 99.99th=[ 1057]
bw ( KiB/s): min=25600, max=59968, per=99.86%, avg=50751.12, stdev=7985.72, samples=119
iops : min= 6400, max=14992, avg=12687.74, stdev=1996.44, samples=119
lat (usec) : 50=26.19%, 100=69.41%, 250=0.05%, 500=1.22%, 750=2.93%
lat (usec) : 1000=0.18%
lat (msec) : 2=0.02%, 4=0.01%, 20=0.01%, 50=0.01%
cpu : usr=1.32%, sys=19.86%, ctx=1572346, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,762332,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=49.6MiB/s (52.0MB/s), 49.6MiB/s-49.6MiB/s (52.0MB/s-52.0MB/s), io=2978MiB (3123MB), run=60001
-60001msec

fio --ioengine=psync --filename=/test6/fio --size=9G --time_based --name=fio --group_reporting --runtime=60 --direct=1 --sync=1 --iodepth=1 --rw=write --threads --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=445MiB/s][w=114k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=5319: Mon Jul 19 18:14:34 2021
write: IOPS=112k, BW=439MiB/s (460MB/s)(25.7GiB/60006msec); 0 zone resets
clat (usec): min=56, max=44804, avg=284.06, stdev=369.64
lat (usec): min=56, max=44804, avg=284.29, stdev=369.65
clat percentiles (usec):
| 1.00th=[ 155], 5.00th=[ 184], 10.00th=[ 198], 20.00th=[ 219],
| 30.00th=[ 233], 40.00th=[ 247], 50.00th=[ 262], 60.00th=[ 277],
| 70.00th=[ 297], 80.00th=[ 322], 90.00th=[ 375], 95.00th=[ 424],
| 99.00th=[ 529], 99.50th=[ 594], 99.90th=[ 1369], 99.95th=[ 6718],
| 99.99th=[11338]
bw ( KiB/s): min=11728, max=14912, per=3.12%, avg=14033.72, stdev=549.08, samples=3831
iops : min= 2932, max= 3728, avg=3508.41, stdev=137.27, samples=3831
lat (usec) : 100=0.01%, 250=42.02%, 500=56.51%, 750=1.19%, 1000=0.14%
lat (msec) : 2=0.04%, 4=0.01%, 10=0.07%, 20=0.01%, 50=0.01%
cpu : usr=0.62%, sys=26.57%, ctx=15606366, majf=0, minf=424
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,6736889,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=439MiB/s (460MB/s), 439MiB/s-439MiB/s (460MB/s-460MB/s), io=25.7GiB (27.6GB), run=60006-60006
msec
I also ran the 10-minute benchmarks, tested everything on a zvol, and switched between 4k and 512 on the NVMe drives, but there were no noticeable changes in the results.

I would be really happy if someone spots an error in my process! :)
 
Well, there are a few things that I noticed.
First, that controller is PCIe Gen 3 while the SSDs are PCIe Gen 4. That could potentially cost you some performance, depending on whether the SSDs themselves perform well enough to make use of PCIe Gen 4 speeds.
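(For reference: PCIe 3.0 provides roughly 985 MB/s per lane, so an x4 link tops out around 3.9 GB/s, while PCIe 4.0 doubles that to roughly 7.9 GB/s for x4.)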

The benchmarks that you showed use 4k as the block size. When you benchmark storage, the block size pretty much determines which limit you will run into. Small block sizes will be limited by the maximum IOPS possible, while large block sizes (e.g. 1M or 4M) will be limited by the maximum bandwidth possible.

With those 4k benchmarks you get around 13k IOPS, which is not too bad for numjobs=1.
write: IOPS=12.7k, BW=49.6MiB/s (52.0MB/s)(2978MiB/60001msec); 0 zone resets

Once you run the benchmark with numjobs=32, you get 112k IOPS, which seems good. In the benchmark paper we only got ~42k IOPS when doing write tests on a mirror vdev pool.
write: IOPS=112k, BW=439MiB/s (460MB/s)(25.7GiB/60006msec); 0 zone resets


So in order to check which bandwidth you can get out of it, test it with a larger block size (1M or 4M). The number of jobs and the iodepth also have an effect on the benchmark results.
Please also keep in mind that files and zvols have different performance characteristics, and VM disks are stored on zvol datasets.

If you run benchmarks for IOPS (bs=4k) and bandwidth (bs=4M) with direct, sync and an iodepth and numjobs of 1, you get an idea of the lower bounds of what the storage can accomplish regarding IOPS and bandwidth. In reality it will usually be quite a bit better, with multiple processes (VMs) accessing the storage and by far not all operations being direct and sync, but cached.
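For example, something along these lines (same style as the other commands in this thread; the file path is just the one from your test above):

Code:
fio --ioengine=psync --filename=/test6/fio --size=9G --time_based --name=fio-bw --group_reporting --runtime=60 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4M --numjobs=1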
 
Thanks for clarifying the interplay between block size and bandwidth/IOPS!

I reran the tests with blocksize=1M/4M
fio --ioengine=psync --filename=/test7/fio --size=9G --time_based --name=fio --group_reporting --runtime=60 --direct=1 --sync=1 --iodepth=1 --rw=write --threads --bs=4M --numjobs=1
...
write: IOPS=12.5k, BW=48.0MiB/s (51.4MB/s)(2939MiB/60001msec); 0 zone resets
fio --ioengine=psync --filename=/test7/fio --size=9G --time_based --name=fio --group_reporting --runtime=60 --direct=1 --sync=1 --iodepth=1 --rw=write --threads --bs=4M --numjobs=32
...
write: IOPS=113k, BW=441MiB/s (462MB/s)(25.8GiB/60002msec); 0 zone resets

The results are eerily close. If I checked correctly, PCIe 3.0 x4 has a bandwidth of 3.94 GBps, which is about 492.5 MB/s. So your remark about PCIe 3 being the bottleneck was completely on point!

On a more embarrassing note: I realized that I read most MB/s values in the benchmark paper as megabytes/s instead of megabits/s, which also explains why I expected higher results...

~ Thanks again
 
Uhm, no: PCIe 3.0 has about 1 GByte/s per lane, so roughly 4 GByte/s transfer speed for the NVMe (well, 31-something Gbit/s real world). PCIe 3.0 is not limiting you in this benchmark, only overall: 4 GByte/s bandwidth vs. a theoretical limit of 6.5 GByte/s according to Samsung,
but only on sequential reads, which won't be relevant outside of benchmarks.

However, 4k benchmarks will tank big time anyway.
Also make sure ashift is 12.

To compare, create some datasets with different record sizes: make 8k, 16k and 32k, and run fio on each accordingly.
I have a hunch that 16k will be the sweet spot.


Also make sure the scheduler for those block devices is set to none.
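Quick sketch for the above (pool and device names are just examples):

Code:
zpool get ashift tank                      # should report 12
cat /sys/block/nvme0n1/queue/scheduler     # should show [none]
zfs create -o recordsize=8k  tank/rs8k
zfs create -o recordsize=16k tank/rs16k
zfs create -o recordsize=32k tank/rs32k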
 
Good morning. I have been using Proxmox for about a year, installed on an Intel NUC5i5RYH with 16 GB of RAM and one 240 GB Crucial SATA SSD; the wearout is 33%, but the disk was bought used together with the NUC.

I now have a Lenovo P340 Tiny that can fit two M.2 NVMe 2280 drives, with a 10th-gen i5 CPU and 32 GB of DDR4 RAM. I want this kind of PC because my homelab is intended to be production, but for home-automation things like Home Assistant, Nginx, an MQTT broker, Bitwarden, Pi-hole, and maybe some other things such as a Plex server.

Some other users told me not to worry about the lifetime of consumer NVMe drives when used with Proxmox. Anyway, I'm looking for a good M.2 SSD, the only form factor the Lenovo P340 takes, and I ended up selecting a Kingston DC1000 480 GB for the boot disk. Would this last some years, let's say more than 5? I'm completely ignorant about ZFS, so maybe it is better not to use that filesystem. Could I add a 1 TB M.2 SSD in the second slot and use it only for VMs and CTs? For backups I will use an external USB disk for the moment, but I would also like to buy a NAS.

What are your tips for my situation, given that I want to use the Lenovo P340 Tiny as a Proxmox server and add the right M.2 disks? The other thing is that I don't want to spend more than 160 euros on a 480 GB disk, or 240 euros max if I find a 1 TB one. Unfortunately prices have risen; for example, I can only find a Micron 7300 Pro 960 GB above 300 euros, which is too much for my budget for a single disk, so I thought about the Kingston Data Center series. Sorry for the long post.
 
Hi all, I wonder if I could hijack this thread with a related SSD performance benchmark: are my results within expectations? I have two identical PVE 7.0-11 hosts, the only difference being the HDD/SSD arrangement. The SSDs are enterprise SATA3 Intel S4520, the HDDs are 7.2K SAS. Full post here: https://forum.proxmox.com/threads/p...1-4-x-ssd-similar-to-raid-z10-12-x-hdd.99967/

Prep:
Code:
zfs create rpool/fio
zfs set primarycache=none rpool/fio

Code:
fio --ioengine=libaio --filename=/rpool/fio/testx --size=4G --time_based --name=fio --group_reporting --runtime=10 --direct=1 --sync=1 --iodepth=1 --rw=randrw  --bs=4K --numjobs=64

SSD results:
Code:
FIO output:
read: IOPS=4022, BW=15.7MiB/s (16.5MB/s)
write: IOPS=4042, BW=15.8MiB/s (16.6MB/s)


# zpool iostat -vy rpool 5 1
                                                        capacity     operations     bandwidth
pool                                                  alloc   free   read  write   read  write
----------------------------------------------------  -----  -----  -----  -----  -----  -----
rpool                                                  216G  27.7T  28.1K  14.5K  1.17G   706M
  raidz1                                               195G  13.8T  13.9K  7.26K   595M   358M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730BAV3P8EGN-part3      -      -  3.60K  1.73K   159M  90.3M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730B9Q3P8EGN-part3      -      -  3.65K  1.82K   150M  89.0M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730B9G3P8EGN-part3      -      -  3.35K  1.83K   147M  90.0M
    ata-INTEL_SSDSC2KB038TZ_BTYI13730BAT3P8EGN-part3      -      -  3.34K  1.89K   139M  88.4M
  raidz1                                              21.3G  13.9T  14.2K  7.21K   604M   348M
    sde                                                   -      -  3.39K  1.81K   149M  87.5M
    sdf                                                   -      -  3.35K  1.90K   139M  86.3M
    sdg                                                   -      -  3.71K  1.70K   163M  87.8M
    sdh                                                   -      -  3.69K  1.81K   152M  86.4M
----------------------------------------------------  -----  -----  -----  -----  -----  -----

HDD results:
Code:
FIO output:
read: IOPS=1382, BW=5531KiB/s
write: IOPS=1385, BW=5542KiB/s

$ zpool iostat -vy rpool 5 1
                                    capacity     operations     bandwidth
pool                              alloc   free   read  write   read  write
--------------------------------  -----  -----  -----  -----  -----  -----
rpool                              160G  18.0T  3.07K  2.71K   393M   228M
  mirror                          32.2G  3.59T    624    589  78.0M  40.2M
    scsi-35000c500de5c67f7-part3      -      -    321    295  40.1M  20.4M
    scsi-35000c500de75a863-part3      -      -    303    293  37.9M  19.7M
  mirror                          31.9G  3.59T    625    551  78.2M  49.9M
    scsi-35000c500de2bd6bb-part3      -      -    313    274  39.1M  24.2M
    scsi-35000c500de5ae5a7-part3      -      -    312    277  39.0M  25.7M
  mirror                          32.2G  3.59T    648    548  81.1M  45.9M
    scsi-35000c500de5ae667-part3      -      -    320    279  40.1M  23.0M
    scsi-35000c500de2bd2d3-part3      -      -    328    268  41.0M  23.0M
  mirror                          31.6G  3.59T    612    536  76.5M  45.5M
    scsi-35000c500de5ef20f-part3      -      -    301    266  37.7M  22.7M
    scsi-35000c500de5edbfb-part3      -      -    310    269  38.9M  22.8M
  mirror                          32.0G  3.59T    629    555  78.7M  46.5M
    scsi-35000c500de5c6f7f-part3      -      -    318    283  39.8M  23.1M
    scsi-35000c500de5c6c5f-part3      -      -    311    272  38.9M  23.4M
--------------------------------  -----  -----  -----  -----  -----  -----

I'd have thought the SSDs should deliver about 10x more IOPS than the above. Are my expectations out of whack? Any insights appreciated! Thanks!
 
To optimize performance in hyper-converged deployments with Proxmox VE and ZFS storage, the appropriate hardware setup is essential. This benchmark presents a possible setup and its resulting performance, with the intention of supporting Proxmox users in making better decisions.

Download PDF
https://www.proxmox.com/en/downloads/item/proxmox-ve-zfs-benchmark-2020
__________________
Best regards,

Martin Maurer
Proxmox VE project leader
Seems like ZFS RAID10 performance is very poor...
I wonder what the results of Linux mdadm RAID10 (far2 layout) would be with this setup; are there any disadvantages?
 
