Proxmox VE ZFS Benchmark with NVMe

mfreund

New Member
Jan 29, 2021
9
0
1
49
Hi,
Here are my tests from a similar setup. My results don't differ if I change the LBA size to 4k.
Any ideas?


Many thanks!

Michael

Code:
root@pve01:~# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     19322XXXXXXX         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300B20
/dev/nvme1n1     20302XXXXXXX         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300DN0
/dev/nvme2n1     20302XXXXXXX         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300DN0
/dev/nvme3n1     20292XXXXXXX         Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB    512   B +  0 B   11300DN0

zpool status
  pool: tank0
state: ONLINE
  scan: none requested
config:

        NAME                                             STATE     READ WRITE CKSUM
        tank0                                            ONLINE       0     0     0
          mirror-0                                       ONLINE       0     0     0
            nvme-Micron_9300_MTFDHAL3T2TDR_ ONLINE       0     0     0
            nvme-Micron_9300_MTFDHAL3T2TDR_ ONLINE       0     0     0

errors: No known data errors

root@pve01:~# fio --ioengine=psync --filename=/dev/zvol/tank0/speedtest --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=397MiB/s][w=102k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=19744: Mon Feb  1 11:41:29 2021
  write: IOPS=110k, BW=429MiB/s (450MB/s)(251GiB/600002msec); 0 zone resets
    clat (usec): min=43, max=105385, avg=290.69, stdev=635.72
     lat (usec): min=43, max=105385, avg=290.85, stdev=635.72
    clat percentiles (usec):
     |  1.00th=[  188],  5.00th=[  204], 10.00th=[  215], 20.00th=[  237],
     | 30.00th=[  260], 40.00th=[  277], 50.00th=[  289], 60.00th=[  297],
     | 70.00th=[  310], 80.00th=[  322], 90.00th=[  343], 95.00th=[  359],
     | 99.00th=[  404], 99.50th=[  424], 99.90th=[  635], 99.95th=[ 1745],
     | 99.99th=[10814]
   bw (  KiB/s): min=10592, max=14720, per=3.12%, avg=13728.26, stdev=794.06, samples=38369
   iops        : min= 2648, max= 3680, avg=3432.05, stdev=198.51, samples=38369
  lat (usec)   : 50=0.01%, 100=0.01%, 250=25.98%, 500=73.86%, 750=0.07%
  lat (usec)   : 1000=0.02%
  lat (msec)   : 2=0.03%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=0.44%, sys=33.13%, ctx=238608555, majf=0, minf=381
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,65901303,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=429MiB/s (450MB/s), 429MiB/s-429MiB/s (450MB/s-450MB/s), io=251GiB (270GB), run=600002-600002msec



fio --ioengine=psync --filename=/dev/zvol/tank0/speedtest --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=1722MiB/s][w=430 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=14565: Mon Feb  1 11:24:15 2021
  write: IOPS=477, BW=1912MiB/s (2004MB/s)(1120GiB/600043msec); 0 zone resets
    clat (msec): min=4, max=267, avg=66.74, stdev=15.17
     lat (msec): min=4, max=268, avg=66.95, stdev=15.23
    clat percentiles (msec):
     |  1.00th=[   56],  5.00th=[   58], 10.00th=[   58], 20.00th=[   59],
     | 30.00th=[   61], 40.00th=[   62], 50.00th=[   62], 60.00th=[   63],
     | 70.00th=[   64], 80.00th=[   73], 90.00th=[   81], 95.00th=[   91],
     | 99.00th=[  148], 99.50th=[  165], 99.90th=[  190], 99.95th=[  203],
     | 99.99th=[  220]
   bw (  KiB/s): min=24526, max=73728, per=3.12%, avg=61161.46, stdev=10022.42, samples=38400
   iops        : min=    5, max=   18, avg=14.88, stdev= 2.46, samples=38400
  lat (msec)   : 10=0.01%, 20=0.01%, 50=0.12%, 100=97.29%, 250=2.59%
  lat (msec)   : 500=0.01%
  cpu          : usr=0.33%, sys=4.21%, ctx=2100526, majf=0, minf=345
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,286754,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1912MiB/s (2004MB/s), 1912MiB/s-1912MiB/s (2004MB/s-2004MB/s), io=1120GiB (1203GB), run=600043-600043msec
 

guletz

Famous Member
Apr 19, 2017
1,430
214
83
Brasov, Romania
mfreund said:
> Any ideas?

Hi,

It is quite likely that your NVMe disks use 4K blocks internally even if you format them as 512 B (the firmware aggregates several 512 B sectors into one 4K block). Moreover, if you read the datasheet for your NVMe, you will see that the manufacturer also tests with 4K (sequential write performance is given for 4 KB).


Good luck / Bafta!
 

aaron

Proxmox Staff Member
Staff member
Jun 3, 2019
1,945
250
83
mfreund said:
> Here are my tests from a similar setup. My results don't differ if I change the LBA size to 4k.
> Any ideas?

Are you?

The nvme list output shows the namespaces on all SSDs to be 512 B (column: Format). You will have to destroy the namespaces (thus losing all data on them) and recreate them with 4k. Micron offers the msecli tool.
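
One way to confirm which LBA format a namespace currently uses is to read the Format column of the nvme list output. The sketch below is only an illustration: the helper name lba_bytes and the regular expression are assumptions derived from the two Format strings that appear in this thread ("512   B +  0 B" and "4 KiB +  0 B"); they are not part of nvme-cli.

```python
import re

# Matches the Format column of `nvme list`, e.g. "512   B +  0 B" or "4 KiB +  0 B".
# Assumption: only the B and KiB units seen in this thread need to be handled.
FORMAT_RE = re.compile(r"(\d+)\s+(B|KiB)\s+\+\s+\d+\s+B")
UNIT_BYTES = {"B": 1, "KiB": 1024}


def lba_bytes(nvme_list_line):
    """Return the LBA data size in bytes for one `nvme list` row, or None."""
    match = FORMAT_RE.search(nvme_list_line)
    if match is None:
        return None
    return int(match.group(1)) * UNIT_BYTES[match.group(2)]


# Rows copied from the two `nvme list` outputs posted in this thread.
row_512 = ("/dev/nvme0n1     19322XXXXXXX         Micron_9300_MTFDHAL3T2TDR"
           "                1           3.20  TB /   3.20  TB    512   B +  0 B   11300B20")
row_4k = ("/dev/nvme0n1              Micron_9300_MTFDHAL3T2TDR"
          "                1           3.20  TB /   3.20  TB      4 KiB +  0 B   11300B20")

print(lba_bytes(row_512))  # 512
print(lba_bytes(row_4k))   # 4096
```

If every row still reports 512, the namespaces have not been recreated with a 4k LBA format yet.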
 

mfreund

aaron said:
> Are you?
>
> The nvme list output shows the namespaces on all SSDs to be 512 B (column: Format). You will have to destroy the namespaces (thus losing all data on them) and recreate them with 4k. Micron offers the msecli tool.

I did; I just formatted them back to 512b to show that my second result is close to your 4k bandwidth benchmark.
I will format them back to 4k, test again, and also test them with more namespaces.
I did it with Storage Executive on CentOS, because Windows 10 is not supported for creating namespaces.
 

aaron

mfreund said:
> I did; I just formatted them back to 512b to show that my second result is close to your 4k bandwidth benchmark.
> I will format them back to 4k, test again, and also test them with more namespaces.
> I did it with Storage Executive on CentOS, because Windows 10 is not supported for creating namespaces.

Okay, so those results were with 512b? May I ask what kind of hardware you are running on (besides the Micron NVMEs)? Because the IOPS in the first (bs=4k) test are quite a bit higher (110k) than in our benchmarks.
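
The 110k IOPS figure can be cross-checked against the latency that fio reported for the same run. With --ioengine=psync and --iodepth=1, each of the 32 jobs has exactly one write in flight, so by Little's law the throughput is roughly the number of jobs divided by the average latency. A rough sketch (the function name is made up for illustration; the numbers are taken from the first bs=4k run in this thread):

```python
def expected_iops(outstanding_ios, avg_latency_us):
    """Little's law: throughput = concurrency / average latency."""
    return outstanding_ios / (avg_latency_us * 1e-6)


# 32 jobs at iodepth=1, average total latency 290.85 usec (first bs=4k run).
estimate = expected_iops(32, 290.85)
print(f"{estimate / 1000:.0f}k IOPS")  # 110k IOPS
```

The estimate matches the reported IOPS=110k, which suggests the result is bounded by per-write latency at this level of concurrency rather than by a measurement error.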
 

mfreund

aaron said:
> Okay, so those results were with 512b? May I ask what kind of hardware you are running on (besides the Micron NVMEs)? Because the IOPS in the first (bs=4k) test are quite a bit higher (110k) than in our benchmarks.

Yes, those results were obtained with 512b. Please see my HW setup below.
Looks like ZFS doesn't benefit from more namespaces :(

Code:
root@pve01:~# nvme list
Node             SN       Model                                    Namespace Usage                      Format           FW Rev
---------------- -------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1              Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB      4 KiB +  0 B   11300B20
/dev/nvme1n1              Micron_9300_MTFDHAL3T2TDR                1           3.20  TB /   3.20  TB      4 KiB +  0 B   11300DN0
/dev/nvme2n1              Micron_9300_MTFDHAL3T2TDR                1         798.86  GB / 798.86  GB      4 KiB +  0 B   11300DN0
/dev/nvme2n2              Micron_9300_MTFDHAL3T2TDR                2         798.86  GB / 798.86  GB      4 KiB +  0 B   11300DN0
/dev/nvme2n3              Micron_9300_MTFDHAL3T2TDR                3         798.86  GB / 798.86  GB      4 KiB +  0 B   11300DN0
/dev/nvme2n4              Micron_9300_MTFDHAL3T2TDR                4         798.86  GB / 798.86  GB      4 KiB +  0 B   11300DN0
/dev/nvme3n1              Micron_9300_MTFDHAL3T2TDR                1         798.86  GB / 798.86  GB      4 KiB +  0 B   11300DN0
/dev/nvme3n2              Micron_9300_MTFDHAL3T2TDR                2         798.86  GB / 798.86  GB      4 KiB +  0 B   11300DN0
/dev/nvme3n3              Micron_9300_MTFDHAL3T2TDR                3         798.86  GB / 798.86  GB      4 KiB +  0 B   11300DN0
/dev/nvme3n4              Micron_9300_MTFDHAL3T2TDR                4         798.86  GB / 798.86  GB      4 KiB +  0 B   11300DN0

root@pve01:~# zpool status
  pool: tank0
state: ONLINE
  scan: none requested
config:

        NAME                                             STATE     READ WRITE CKSUM
        tank0                                            ONLINE       0     0     0
          mirror-0                                       ONLINE       0     0     0
            nvme-Micron_9300_MTFDHAL3T2TDR_  ONLINE       0     0     0
            nvme-Micron_9300_MTFDHAL3T2TDR_  ONLINE       0     0     0

errors: No known data errors

  pool: tank1
state: ONLINE
  scan: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        tank1        ONLINE       0     0     0
          nvme2n1    ONLINE       0     0     0
          nvme2n2    ONLINE       0     0     0
          nvme2n3    ONLINE       0     0     0
          nvme2n4    ONLINE       0     0     0
          mirror-4   ONLINE       0     0     0
            nvme3n1  ONLINE       0     0     0
            nvme3n2  ONLINE       0     0     0
            nvme3n3  ONLINE       0     0     0
            nvme3n4  ONLINE       0     0     0

errors: No known data errors

ASRock Rack X470D4U2-2T
Ryzen 9 3900X
128GB ECC RAM
NVME drives are connected to this controller: https://www.delock.de/produkt/90405/merkmale.html
BTW I'm still looking for a good 4x controller, will test the ASRock Ultra Quad m.2 Card soon.

Code:
Results tank0 4k:

root@pve01:~# fio --ioengine=psync --filename=/dev/zvol/tank0/speedtest --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=422MiB/s][w=108k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=10407: Mon Feb  1 15:00:21 2021
  write: IOPS=118k, BW=460MiB/s (483MB/s)(270GiB/600003msec); 0 zone resets
    clat (usec): min=40, max=109970, avg=270.82, stdev=653.90
     lat (usec): min=40, max=109997, avg=270.99, stdev=653.91
    clat percentiles (usec):
     |  1.00th=[  180],  5.00th=[  194], 10.00th=[  204], 20.00th=[  223],
     | 30.00th=[  239], 40.00th=[  251], 50.00th=[  265], 60.00th=[  273],
     | 70.00th=[  285], 80.00th=[  297], 90.00th=[  314], 95.00th=[  334],
     | 99.00th=[  379], 99.50th=[  404], 99.90th=[  644], 99.95th=[ 4686],
     | 99.99th=[11338]
   bw (  KiB/s): min=11288, max=15728, per=3.12%, avg=14731.43, stdev=883.93, samples=38376
   iops        : min= 2822, max= 3932, avg=3682.85, stdev=220.98, samples=38376
  lat (usec)   : 50=0.01%, 100=0.01%, 250=38.84%, 500=61.01%, 750=0.05%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.04%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=0.49%, sys=35.21%, ctx=262738577, majf=0, minf=354
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,70716156,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=460MiB/s (483MB/s), 460MiB/s-460MiB/s (483MB/s-483MB/s), io=270GiB (290GB), run=600003-600003msec
root@pve01:~# fio --ioengine=psync --filename=/dev/zvol/tank0/speedtest --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4m --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=1612MiB/s][w=403 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=19256: Mon Feb  1 15:11:19 2021
  write: IOPS=553, BW=2215MiB/s (2322MB/s)(1298GiB/600021msec); 0 zone resets
    clat (msec): min=3, max=447, avg=57.02, stdev=24.92
     lat (msec): min=3, max=449, avg=57.79, stdev=24.91
    clat percentiles (msec):
     |  1.00th=[   22],  5.00th=[   41], 10.00th=[   44], 20.00th=[   47],
     | 30.00th=[   50], 40.00th=[   52], 50.00th=[   54], 60.00th=[   55],
     | 70.00th=[   57], 80.00th=[   61], 90.00th=[   70], 95.00th=[   85],
     | 99.00th=[  211], 99.50th=[  239], 99.90th=[  296], 99.95th=[  326],
     | 99.99th=[  372]
   bw (  KiB/s): min=16286, max=106496, per=3.12%, avg=70852.21, stdev=15524.24, samples=38398
   iops        : min=    3, max=   26, avg=17.24, stdev= 3.81, samples=38398
  lat (msec)   : 4=0.01%, 10=0.26%, 20=0.61%, 50=32.28%, 100=64.40%
  lat (msec)   : 250=2.08%, 500=0.38%
  cpu          : usr=1.28%, sys=8.08%, ctx=3514721, majf=0, minf=320
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,332191,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2215MiB/s (2322MB/s), 2215MiB/s-2215MiB/s (2322MB/s-2322MB/s), io=1298GiB (1393GB), run=600021-600021msec

Code:
Results tank1 4k:

root@pve01:~# fio --ioengine=psync --filename=/dev/zvol/tank1/speedtest --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=362MiB/s][w=92.6k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=2985: Mon Feb  1 15:24:17 2021
  write: IOPS=98.8k, BW=386MiB/s (405MB/s)(226GiB/600002msec); 0 zone resets
    clat (usec): min=43, max=88584, avg=323.19, stdev=578.48
     lat (usec): min=43, max=88584, avg=323.35, stdev=578.48
    clat percentiles (usec):
     |  1.00th=[  178],  5.00th=[  206], 10.00th=[  239], 20.00th=[  260],
     | 30.00th=[  269], 40.00th=[  281], 50.00th=[  293], 60.00th=[  322],
     | 70.00th=[  355], 80.00th=[  388], 90.00th=[  416], 95.00th=[  437],
     | 99.00th=[  486], 99.50th=[  510], 99.90th=[  832], 99.95th=[ 5669],
     | 99.99th=[11600]
   bw (  KiB/s): min= 9944, max=13208, per=3.12%, avg=12351.10, stdev=589.23, samples=38374
   iops        : min= 2486, max= 3302, avg=3087.76, stdev=147.31, samples=38374
  lat (usec)   : 50=0.01%, 100=0.01%, 250=14.16%, 500=85.19%, 750=0.55%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.05%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=0.40%, sys=28.85%, ctx=217869579, majf=0, minf=385
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,59289460,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=386MiB/s (405MB/s), 386MiB/s-386MiB/s (405MB/s-405MB/s), io=226GiB (243GB), run=600002-600002msec



root@pve01:~# fio --ioengine=psync --filename=/dev/zvol/tank1/speedtest --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4M --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=2891MiB/s][w=722 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=23427: Mon Feb  1 15:34:31 2021
  write: IOPS=595, BW=2382MiB/s (2497MB/s)(1396GiB/600027msec); 0 zone resets
    clat (msec): min=2, max=601, avg=52.92, stdev=38.09
     lat (msec): min=2, max=601, avg=53.73, stdev=38.07
    clat percentiles (msec):
     |  1.00th=[   13],  5.00th=[   38], 10.00th=[   41], 20.00th=[   43],
     | 30.00th=[   44], 40.00th=[   45], 50.00th=[   46], 60.00th=[   47],
     | 70.00th=[   50], 80.00th=[   53], 90.00th=[   56], 95.00th=[   81],
     | 99.00th=[  268], 99.50th=[  321], 99.90th=[  422], 99.95th=[  506],
     | 99.99th=[  558]
   bw (  KiB/s): min= 8143, max=139264, per=3.13%, avg=76255.55, stdev=21664.45, samples=38369
   iops        : min=    1, max=   34, avg=18.55, stdev= 5.32, samples=38369
  lat (msec)   : 4=0.09%, 10=0.70%, 20=0.72%, 50=73.03%, 100=22.17%
  lat (msec)   : 250=2.04%, 500=1.21%, 750=0.06%
  cpu          : usr=1.40%, sys=12.70%, ctx=3106040, majf=0, minf=349
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,357261,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2382MiB/s (2497MB/s), 2382MiB/s-2382MiB/s (2497MB/s-2497MB/s), io=1396GiB (1498GB), run=600027-600027msec
 

Alwin Antreich

Active Member
Jan 15, 2021
199
27
28
37
antreich.com
mfreund said:
> ASRock Rack X470D4U2-2T
>
> BTW I'm still looking for a good 4x controller, will test the ASRock Ultra Quad m.2 Card soon.

For PCIe? The board switches the x16 slot to 8x if the other 8x slot is occupied.

mfreund said:
> Looks like ZFS doesn't benefit from more namespaces :(

You mixed a stripe and a mirror vdev in one pool. That probably slows things down a bit. Try an NVMe with all namespaces as a stripe, compared to an NVMe with only one namespace. But that may just help on ZFS's side, as more workers are available.
 

mfreund

Alwin Antreich said:
> For PCIe? The board switches the x16 slot to 8x if the other 8x slot is occupied.

Yes, I want to use the other slot for a fast NIC.

Alwin Antreich said:
> You mixed a stripe and a mirror vdev in one pool. That probably slows things down a bit. Try an NVMe with all namespaces as a stripe, compared to an NVMe with only one namespace. But that may just help on ZFS's side, as more workers are available.

I striped the namespaces and mirrored the drives. I also use these striped VDEVs on TrueNAS. ZFS loves it!
 

mfreund

testing this now:
Code:
  pool: tank1
 state: ONLINE
  scan: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        tank1        ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            nvme2n1  ONLINE       0     0     0
            nvme2n2  ONLINE       0     0     0
          mirror-1   ONLINE       0     0     0
            nvme2n3  ONLINE       0     0     0
            nvme2n4  ONLINE       0     0     0
          mirror-2   ONLINE       0     0     0
            nvme3n1  ONLINE       0     0     0
            nvme3n2  ONLINE       0     0     0
          mirror-3   ONLINE       0     0     0
            nvme3n3  ONLINE       0     0     0
            nvme3n4  ONLINE       0     0     0
 

aaron

Hmm, wouldn't you want to have your mirrors between the two disks and not with two namespaces on the same disk?
Code:
mirror-0
  nvme2n1
  nvme3n1
mirror-1
  nvme2n2
  nvme3n2
....
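
For illustration, the layout suggested above pairs namespace N of one drive with namespace N of the other, so every mirror survives the loss of a whole drive. The sketch below only builds and prints the corresponding zpool create command line and does not run it. The function name is hypothetical, and it assumes the elided pattern simply continues for namespaces 3 and 4 of the two drives listed earlier in the thread.

```python
def cross_disk_mirror_cmd(pool, disk_a, disk_b, namespaces):
    """Build `zpool create` arguments with one mirror vdev per namespace,
    each mirror spanning both physical drives."""
    args = ["zpool", "create", pool]
    for ns in namespaces:
        args += ["mirror", f"{disk_a}n{ns}", f"{disk_b}n{ns}"]
    return args


# Assumption: drives nvme2 and nvme3 with four namespaces each, as in the thread.
cmd = cross_disk_mirror_cmd("tank1", "nvme2", "nvme3", range(1, 5))
print(" ".join(cmd))
```

Each mirror keyword starts a new top-level vdev, so this yields four mirrors striped together. Creating a pool destroys existing data on the devices, so the command is printed for review only.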
 

mfreund

aaron said:
> Hmm, wouldn't you want to have your mirrors between the two disks and not with two namespaces on the same disk?
> Code:
> mirror-0
>   nvme2n1
>   nvme3n1
> mirror-1
>   nvme2n2
>   nvme3n2
> ....

Yes, that's what I already did in #26; this one is just for fun ;-)
 

mfreund

mfreund said:
> Yes, that's what I already did in #26; this one is just for fun ;-)

The results are worse, as expected; the higher IOPS in the bandwidth test are just because more vdevs are involved.

Code:
root@pve01:~# fio --ioengine=psync --filename=/dev/zvol/tank1/speedtest --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=4k --numjobs=32
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=334MiB/s][w=85.6k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=27414: Mon Feb  1 16:38:27 2021
  write: IOPS=88.8k, BW=347MiB/s (364MB/s)(203GiB/600002msec); 0 zone resets
    clat (usec): min=71, max=87095, avg=359.58, stdev=566.29
     lat (usec): min=71, max=87096, avg=359.73, stdev=566.30
    clat percentiles (usec):
     |  1.00th=[  215],  5.00th=[  239], 10.00th=[  251], 20.00th=[  273],
     | 30.00th=[  322], 40.00th=[  351], 50.00th=[  367], 60.00th=[  379],
     | 70.00th=[  392], 80.00th=[  404], 90.00th=[  424], 95.00th=[  441],
     | 99.00th=[  486], 99.50th=[  502], 99.90th=[  873], 99.95th=[ 5735],
     | 99.99th=[11600]
   bw (  KiB/s): min= 9000, max=13104, per=3.12%, avg=11103.90, stdev=475.25, samples=38396
   iops        : min= 2250, max= 3276, avg=2775.96, stdev=118.81, samples=38396
  lat (usec)   : 100=0.01%, 250=10.01%, 500=89.43%, 750=0.45%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.05%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=0.35%, sys=25.93%, ctx=192905578, majf=0, minf=442
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,53303802,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=347MiB/s (364MB/s), 347MiB/s-347MiB/s (364MB/s-364MB/s), io=203GiB (218GB), run=600002-600002msec
root@pve01:~# fio --ioengine=psync --filename=/dev/zvol/tank1/speedtest --size=9G --time_based --name=fio --group_reporting --runtime=200 --direct=1 --sync=1 --iodepth=1 --rw=write --bs=1M --numjobs=32
fio: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
...
fio-3.12
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=1859MiB/s][w=1859 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=24626: Mon Feb  1 16:41:58 2021
  write: IOPS=2133, BW=2133MiB/s (2237MB/s)(417GiB/200008msec); 0 zone resets
    clat (usec): min=870, max=258802, avg=14851.59, stdev=8797.89
     lat (usec): min=889, max=258870, avg=14998.60, stdev=8797.02
    clat percentiles (msec):
     |  1.00th=[   12],  5.00th=[   13], 10.00th=[   13], 20.00th=[   14],
     | 30.00th=[   14], 40.00th=[   14], 50.00th=[   15], 60.00th=[   15],
     | 70.00th=[   15], 80.00th=[   16], 90.00th=[   16], 95.00th=[   17],
     | 99.00th=[   29], 99.50th=[   53], 99.90th=[  165], 99.95th=[  176],
     | 99.99th=[  215]
   bw (  KiB/s): min=32768, max=81920, per=3.12%, avg=68247.89, stdev=9742.38, samples=12800
   iops        : min=   32, max=   80, avg=66.61, stdev= 9.52, samples=12800
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.02%, 10=0.05%, 20=98.64%, 50=0.74%
  lat (msec)   : 100=0.23%, 250=0.31%, 500=0.01%
  cpu          : usr=1.00%, sys=7.33%, ctx=4810796, majf=0, minf=350
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,426624,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2133MiB/s (2237MB/s), 2133MiB/s-2133MiB/s (2237MB/s-2237MB/s), io=417GiB (447GB), run=200008-200008msec
 

starlight

New Member
Feb 12, 2021
12
0
1
38
Hello,

Hmm, I have problems with Proxmox, NVMe and ZFS: the random read/write performance is just horrible.

I tested some Kingston DC1000 and WD Black drives; compared to ext4 it's just horrible.

Here are the results:

PROXMOX ZFS
Code:
Sequential Write QD=8
Jobs: 1 (f=1): [W(1)][100.0%][w=948MiB/s][w=947 IOPS][eta 00m:00s]
Sequential Read QD=8
Jobs: 1 (f=1): [R(1)][100.0%][r=2164MiB/s][r=2164 IOPS][eta 00m:00s]
Random Read QD=192
Jobs: 6 (f=6): [r(6)][100.0%][r=83.3MiB/s][r=21.3k IOPS][eta 00m:00s]
Random Write QD-192
Per Device QD-32
Jobs: 6 (f=6): [w(6)][100.0%][w=21.2MiB/s][w=5435 IOPS][eta 00m:00s]

Single Outstanding IO for latency

Random Write QD-1
Jobs: 2 (f=2): [w(2)][100.0%][w=20.8MiB/s][w=5320 IOPS][eta 00m:00s]
Random Read QD=1
Jobs: 2 (f=2): [r(2)][100.0%][r=153MiB/s][r=39.1k IOPS][eta 00m:00s]

VMWARE (INSIDE linux VM)
Code:
Sequential Write QD=8
Jobs: 1 (f=1): [W(1)][100.0%][w=1351MiB/s][w=1351 IOPS][eta 00m:00s]
Sequential Read QD=8
Jobs: 1 (f=1): [R(1)][100.0%][r=1908MiB/s][r=1908 IOPS][eta 00m:00s]
Random Read QD=192
Jobs: 6 (f=6): [r(6)][100.0%][r=861MiB/s][r=220k IOPS][eta 00m:00s]
Random Write QD-192
Per Device QD-32
Jobs: 6 (f=6): [w(6)][100.0%][w=841MiB/s][w=215k IOPS][eta 00m:00s]

Single Outstanding IO for latency

Random Write QD-1
Jobs: 2 (f=2): [w(2)][100.0%][w=74.6MiB/s][w=19.1k IOPS][eta 00m:00s]
Random Read QD=1
Jobs: 2 (f=2): [r(2)][100.0%][r=52.0MiB/s][r=13.3k IOPS][eta 00m:00s]
 

aaron

starlight said:
> Hmm, I have problems with Proxmox, NVMe and ZFS: the random read/write performance is just horrible.

Can you please also post the commands used to run the benchmarks and not just the results? Also the exact models of the disks used, as well as the test system (CPU type, motherboard).

Since this could be a bit longer, would you please consider opening a new thread? :)
 

guletz

starlight said:
> Hmm, I have problems with Proxmox, NVMe and ZFS: the random read/write performance is just horrible.

I think you used this script:

https://github.com/garyjlittle/scripts/blob/master/measure_device.sh

If so, take the following into consideration:

- consider removing --direct=1 (which does nothing for a plain file on ZFS)
- use --numjobs rather than --iodepth (which only works with Linux AIO, which requires *real* O_DIRECT, which ZFS does not currently provide)
- measure_device.sh starts by prefilling the test files with 4k blocks:

@ ZFS will aggregate multiple 4K blocks into one 128K record (the default recordsize for a ZFS dataset)
@ when the script then starts the final fio test (rewriting the prefilled file), each new 4K write requires a read-modify-write (RMW) cycle: read 128K, modify one 4K block in RAM, and write 128K

So you are comparing apples with oranges. Try setting the recordsize to 16-32K for your test dataset (also good for NVMe) instead of 128K (huge write amplification).

Good luck / Bafta!
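
To put rough numbers on the read-modify-write effect described above, the following sketch computes how many bytes ZFS has to rewrite per 4K application write for a few recordsize values. It is a deliberately simplified model, not a description of ZFS internals: it assumes aligned 4K overwrites of an already-written file, no compression, and it ignores metadata, the ZIL and ARC caching. The function name is made up for illustration.

```python
def rmw_write_amplification(recordsize, io_size=4096):
    """Bytes rewritten per byte the application writes, for a write that
    is no larger than one record and falls entirely inside it."""
    if io_size > recordsize:
        raise ValueError("model only covers writes no larger than one record")
    return recordsize / io_size


for recordsize_kib in (128, 64, 32, 16):
    factor = rmw_write_amplification(recordsize_kib * 1024)
    print(f"recordsize={recordsize_kib}K -> {factor:.0f}x")
```

Under these assumptions the default 128K recordsize rewrites 32 times the data a 4K write contains, while 16K reduces that to 4 times.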
 

starlight

The hardware of this old server is 2x Xeon 2680v4 with 64 GB RAM; nothing is running on it besides Proxmox/VMware and a Windows test VM.

1. Yes, you are right, I am using this script, and I already modified it to fit ZFS, but there is no "huge" impact on random read/write.
2. I know that this script prefills with 4k blocks, but I think that is closer to a real-life situation with existing data (for me).
3. I already changed the recordsize, but there is no "huge" impact on random read/write. Even worse, with a recordsize of 16k the CPU load goes up to 50%.

I only started all these fio tests because the "real world" performance in the VM is so awful. I am not sure whether this is a problem with ZFS, because the VM performance is also awful with ext4. I only get 50% of the random read/write performance in a VM (Windows) on Proxmox with ext4 compared to ESXi, so I am not sure whether this is a driver fault or something else. The VM performance with ZFS is slightly better than the fio tests, but it is only 1/3 of VMware.

If you have a suggestion for how to test 4k random read/write with fio on the Proxmox CLI, feel free to answer.

I will test later with only one job; maybe that is the cause.
 

starlight

Benchmarks:


Code:
DIRECT TO MOUNTPOINT ((( ZFS recordsize=default   CPU Load: read ~12% , write ~9% )))

root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
read: IOPS=46.3k, BW=181MiB/s (190MB/s)(10.6GiB/60002msec)


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60
write: IOPS=9654, BW=37.7MiB/s (39.5MB/s)(2263MiB/60002msec); 0 zone resets


DIRECT TO MOUNTPOINT ((( ZFS recordsize=16K   CPU Load: read ~14% , write ~33% )))

root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60
read: IOPS=243k, BW=948MiB/s (995MB/s)(55.6GiB/60001msec)
  

root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
write: IOPS=49.0k, BW=195MiB/s (205MB/s)(11.4GiB/60002msec); 0 zone resets


DIRECT TO MOUNTPOINT ((( ZFS recordsize=32K CPU Load: read ~12% , write ~20% )))

root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
read: IOPS=178k, BW=695MiB/s (729MB/s)(40.7GiB/60002msec)

root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
write: IOPS=34.6k, BW=135MiB/s (142MB/s)(8108MiB/60001msec); 0 zone resets



DIRECT TO MOUNTPOINT ((( ZFS recordsize=64K CPU Load: read ~12% , write ~12%)))

root@pve01:~# fio --filename=/nvme01/test1 --rw=randread --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
read: IOPS=98.6k, BW=385MiB/s (404MB/s)(22.6GiB/60001msec)


root@pve01:~# fio --filename=/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
write: IOPS=18.7k, BW=72.0MiB/s (76.5MB/s)(4380MiB/60001msec); 0 zone resets




DIRECT TO MOUNTPOINT ((( EXT4  CPU Load: read ~2% , write ~9% )))

root@pve01:~# mkfs.ext4 /dev/nvme0n1p1
root@pve01:~# mount /dev/nvme0n1p1 /mnt/nvme01

root@pve01:~# fio --filename=/mnt/nvme01/test1 --rw=randread --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test1
read: IOPS=285k, BW=1113MiB/s (1167MB/s)(60.0GiB/55198msec)

root@pve01:~# fio --filename=/mnt/nvme01/test2 --rw=randwrite --ioengine=libaio --bs=4k --iodepth=1 --numjobs=6 --size=10G --runtime=60 --group_reporting --name test2
write: IOPS=277k, BW=1084MiB/s (1136MB/s)(60.0GiB/56697msec); 0 zone resets


((( CPU Load: read/write ~2% )))

root@pve01:~# fio --filename=/mnt/nvme01/test1 --direct=1 --rw=randread --ioengine=libaio --bs=4k --iodepth=32 --numjobs=1 --size=10G --runtime=60 --group_reporting --name test1
read: IOPS=227k, BW=885MiB/s (928MB/s)(10.0GiB/11572msec)

root@pve01:~# fio --filename=/mnt/nvme01/test2 --direct=1 --rw=randwrite --ioengine=libaio --bs=4k --iodepth=32 --numjobs=1 --size=10G --runtime=60 --group_reporting --name test2
 write: IOPS=53.5k, BW=209MiB/s (219MB/s)(10.0GiB/49005msec); 0 zone resets
 

aaron

@starlight The DC1000 drives (which model exactly do you have?) have good read speeds of around 3000 MB/s, but their write speed isn't that great at about 550 MB/s.

Also be aware that the 4k tests that you did will give you an idea of the IOPS limit but not the bandwidth. For that, try to run benchmarks with a block size of 1M or 4M so that the benchmarks will be limited by bandwidth.

The IOPS do look okay, especially for reads (45k to 50k IOPS).
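
The relation between the two limits is simply bandwidth equals IOPS multiplied by block size, which is why a 4k test can show very high IOPS yet modest bandwidth. A small sketch using the issued write counts and runtimes from the two tank0 runs with 512b formatting posted earlier in this thread (the function name is an assumption for illustration):

```python
MIB = 2 ** 20


def bandwidth_mib_s(iops, block_size_bytes):
    """Bandwidth in MiB/s implied by an IOPS rate at a fixed block size."""
    return iops * block_size_bytes / MIB


# issued writes / runtime in seconds, taken from the fio output in this thread
iops_4k = 65901303 / 600.002   # bs=4k run
iops_4m = 286754 / 600.043     # bs=4m run

print(round(bandwidth_mib_s(iops_4k, 4096)))     # 429
print(round(bandwidth_mib_s(iops_4m, 4 * MIB)))  # 1912
```

Both values agree with the BW figures fio printed for those runs: about 110k IOPS at 4k gives only 429 MiB/s, while fewer than 500 IOPS at 4M gives 1912 MiB/s.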
 
