Confusion about storage setup - onboard/RAID - ZFS benchmarking

tekknus

Member
Sep 2, 2021
Hi, I bought a used host for my Windows VMs (TeamCity build, web, and MSSQL) and Debian containers (web, Postgres) and can't figure out which storage setup fits best.

HP DL380 Gen9
2x E5-2690
8x 32GB RAM
B140i onboard controller
P440ar dedicated controller (HBA mode)
8x Intel S3710 400GB

I switched the P440ar to HBA mode, started the installation of PVE 7.0, selected ZFS RAID-Z2 over all 8 disks (with default settings) as the installation target, and ran fio:

Code:
root@pve:~# fio --name=karl --size=10g --rw=randrw --direct=1
karl: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [m(1)][99.5%][r=76.6MiB/s,w=75.4MiB/s][r=19.6k,w=19.3k IOPS][eta 00m:01s]
karl: (groupid=0, jobs=1): err= 0: pid=35860: Wed Sep  1 16:59:47 2021
  read: IOPS=6452, BW=25.2MiB/s (26.4MB/s)(5123MiB/203254msec)
    clat (nsec): min=1890, max=45839k, avg=72009.83, stdev=139896.87
     lat (nsec): min=1921, max=45839k, avg=72098.07, stdev=139905.26
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    6], 10.00th=[    8], 20.00th=[   49],
     | 30.00th=[   53], 40.00th=[   56], 50.00th=[   59], 60.00th=[   62],
     | 70.00th=[   67], 80.00th=[   72], 90.00th=[   84], 95.00th=[   96],
     | 99.00th=[  570], 99.50th=[  676], 99.90th=[ 1074], 99.95th=[ 2704],
     | 99.99th=[ 4228]
   bw (  KiB/s): min= 4696, max=92064, per=99.87%, avg=25778.15, stdev=9775.75, samples=406
   iops        : min= 1174, max=23016, avg=6444.54, stdev=2443.94, samples=406
  write: IOPS=6444, BW=25.2MiB/s (26.4MB/s)(5117MiB/203254msec); 0 zone resets
    clat (usec): min=4, max=23221, avg=80.32, stdev=132.46
     lat (usec): min=4, max=23222, avg=80.46, stdev=132.48
    clat percentiles (usec):
     |  1.00th=[    8],  5.00th=[   11], 10.00th=[   19], 20.00th=[   56],
     | 30.00th=[   60], 40.00th=[   63], 50.00th=[   67], 60.00th=[   71],
     | 70.00th=[   75], 80.00th=[   82], 90.00th=[   96], 95.00th=[  112],
     | 99.00th=[  586], 99.50th=[  693], 99.90th=[ 1057], 99.95th=[ 2540],
     | 99.99th=[ 4228]
   bw (  KiB/s): min= 4432, max=91448, per=99.88%, avg=25747.87, stdev=9710.50, samples=406
   iops        : min= 1108, max=22862, avg=6436.96, stdev=2427.62, samples=406
  lat (usec)   : 2=0.01%, 4=0.86%, 10=7.01%, 20=2.81%, 50=7.38%
  lat (usec)   : 100=75.59%, 250=3.49%, 500=1.28%, 750=1.45%, 1000=0.03%
  lat (msec)   : 2=0.05%, 4=0.04%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=3.18%, sys=78.78%, ctx=74902, majf=1, minf=1650
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1311533,1309907,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=25.2MiB/s (26.4MB/s), 25.2MiB/s-25.2MiB/s (26.4MB/s-26.4MB/s), io=5123MiB (5372MB), run=203254-203254msec
  WRITE: bw=25.2MiB/s (26.4MB/s), 25.2MiB/s-25.2MiB/s (26.4MB/s-26.4MB/s), io=5117MiB (5365MB), run=203254-203254msec
Code:
root@pve:~# fio --name=karl --size=10g --rw=rw --direct=1
karl: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [M(1)][95.5%][r=219MiB/s,w=218MiB/s][r=55.0k,w=55.7k IOPS][eta 00m:01s]
karl: (groupid=0, jobs=1): err= 0: pid=4448: Wed Sep  1 16:54:24 2021
  read: IOPS=60.5k, BW=236MiB/s (248MB/s)(5123MiB/21692msec)
    clat (nsec): min=1920, max=4884.1k, avg=7882.67, stdev=38242.54
     lat (nsec): min=1937, max=4884.1k, avg=7921.05, stdev=38243.40
    clat percentiles (nsec):
     |  1.00th=[  1976],  5.00th=[  2040], 10.00th=[  2128], 20.00th=[  2224],
     | 30.00th=[  2384], 40.00th=[  2672], 50.00th=[  2736], 60.00th=[  2864],
     | 70.00th=[  2992], 80.00th=[  3376], 90.00th=[  5472], 95.00th=[  8640],
     | 99.00th=[128512], 99.50th=[156672], 99.90th=[659456], 99.95th=[733184],
     | 99.99th=[823296]
   bw (  KiB/s): min=127144, max=323584, per=99.79%, avg=241336.74, stdev=55834.50, samples=43
   iops        : min=31786, max=80896, avg=60334.23, stdev=13958.68, samples=43
  write: IOPS=60.4k, BW=236MiB/s (247MB/s)(5117MiB/21692msec); 0 zone resets
    clat (usec): min=3, max=4159, avg= 7.80, stdev=26.11
     lat (usec): min=3, max=4159, avg= 7.87, stdev=26.11
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    5], 10.00th=[    5], 20.00th=[    5],
     | 30.00th=[    5], 40.00th=[    6], 50.00th=[    6], 60.00th=[    6],
     | 70.00th=[    7], 80.00th=[    8], 90.00th=[   11], 95.00th=[   14],
     | 99.00th=[   26], 99.50th=[   37], 99.90th=[  461], 99.95th=[  494],
     | 99.99th=[ 1074]
   bw (  KiB/s): min=124168, max=323752, per=99.80%, avg=241062.70, stdev=55996.99, samples=43
   iops        : min=31042, max=80938, avg=60265.67, stdev=13999.21, samples=43
  lat (usec)   : 2=1.43%, 4=43.26%, 10=47.03%, 20=5.52%, 50=1.16%
  lat (usec)   : 100=0.27%, 250=1.11%, 500=0.11%, 750=0.09%, 1000=0.02%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=11.85%, sys=77.64%, ctx=5815, majf=7, minf=38
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1311533,1309907,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=236MiB/s (248MB/s), 236MiB/s-236MiB/s (248MB/s-248MB/s), io=5123MiB (5372MB), run=21692-21692msec
  WRITE: bw=236MiB/s (247MB/s), 236MiB/s-236MiB/s (247MB/s-247MB/s), io=5117MiB (5365MB), run=21692-21692msec

To me this looked slow, so I moved the two backplane cables from the P440ar to the onboard B140i and ran the same tests again:

Code:
root@pve:~# fio --name=karl --size=10g --rw=randrw --direct=1
karl: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [m(1)][98.7%][r=63.3MiB/s,w=62.1MiB/s][r=16.2k,w=15.9k IOPS][eta 00m:02s]
karl: (groupid=0, jobs=1): err= 0: pid=752864: Wed Sep  1 16:29:41 2021
  read: IOPS=8537, BW=33.3MiB/s (34.0MB/s)(5123MiB/153627msec)
    clat (usec): min=2, max=13960, avg=53.12, stdev=65.60
     lat (usec): min=2, max=13960, avg=53.20, stdev=65.61
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    6], 10.00th=[    7], 20.00th=[   41],
     | 30.00th=[   44], 40.00th=[   47], 50.00th=[   49], 60.00th=[   51],
     | 70.00th=[   53], 80.00th=[   57], 90.00th=[   68], 95.00th=[   80],
     | 99.00th=[  478], 99.50th=[  635], 99.90th=[  709], 99.95th=[  717],
     | 99.99th=[  750]
   bw (  KiB/s): min=20008, max=87824, per=99.99%, avg=34144.14, stdev=7543.00, samples=307
   iops        : min= 5002, max=21956, avg=8536.03, stdev=1885.76, samples=307
  write: IOPS=8526, BW=33.3MiB/s (34.9MB/s)(5117MiB/153627msec); 0 zone resets
    clat (usec): min=4, max=12757, avg=61.69, stdev=67.38
     lat (usec): min=4, max=12757, avg=61.81, stdev=67.40
    clat percentiles (usec):
     |  1.00th=[    8],  5.00th=[   10], 10.00th=[   16], 20.00th=[   47],
     | 30.00th=[   51], 40.00th=[   54], 50.00th=[   56], 60.00th=[   59],
     | 70.00th=[   61], 80.00th=[   67], 90.00th=[   81], 95.00th=[   95],
     | 99.00th=[  490], 99.50th=[  660], 99.90th=[  734], 99.95th=[  750],
     | 99.99th=[  832]
   bw (  KiB/s): min=19352, max=87096, per=99.99%, avg=34102.32, stdev=7469.56, samples=307
   iops        : min= 4838, max=21774, avg=8525.58, stdev=1867.41, samples=307
  lat (usec)   : 4=0.81%, 10=7.54%, 20=2.98%, 50=32.74%, 100=52.81%
  lat (usec)   : 250=1.88%, 500=0.29%, 750=0.92%, 1000=0.02%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=3.58%, sys=86.10%, ctx=31738, majf=0, minf=202
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1311533,1309907,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=33.3MiB/s (34.0MB/s), 33.3MiB/s-33.3MiB/s (34.0MB/s-34.0MB/s), io=5123MiB (5372MB), run=153627-153627msec
  WRITE: bw=33.3MiB/s (34.9MB/s), 33.3MiB/s-33.3MiB/s (34.9MB/s-34.9MB/s), io=5117MiB (5365MB), run=153627-153627msec
Code:
root@pve:~# fio --name=karl --size=10g --rw=rw --direct=1
karl: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [M(1)][100.0%][r=411MiB/s,w=409MiB/s][r=105k,w=105k IOPS][eta 00m:00s]
karl: (groupid=0, jobs=1): err= 0: pid=1374249: Wed Sep  1 16:32:03 2021
  read: IOPS=103k, BW=401MiB/s (421MB/s)(5123MiB/12761msec)
    clat (nsec): min=1633, max=1053.3k, avg=3967.77, stdev=19182.71
     lat (nsec): min=1647, max=1053.3k, avg=3999.20, stdev=19182.93
    clat percentiles (nsec):
     |  1.00th=[  1880],  5.00th=[  1896], 10.00th=[  1912], 20.00th=[  1944],
     | 30.00th=[  1976], 40.00th=[  2128], 50.00th=[  2192], 60.00th=[  2224],
     | 70.00th=[  2288], 80.00th=[  2544], 90.00th=[  2928], 95.00th=[  3504],
     | 99.00th=[ 43264], 99.50th=[ 46848], 99.90th=[456704], 99.95th=[489472],
     | 99.99th=[675840]
   bw (  KiB/s): min=341748, max=474840, per=99.72%, avg=409944.48, stdev=44864.64, samples=25
   iops        : min=85437, max=118710, avg=102486.20, stdev=11216.28, samples=25
  write: IOPS=103k, BW=401MiB/s (420MB/s)(5117MiB/12761msec); 0 zone resets
    clat (usec): min=3, max=676, avg= 5.07, stdev= 4.45
     lat (usec): min=3, max=676, avg= 5.14, stdev= 4.45
    clat percentiles (nsec):
     |  1.00th=[ 4048],  5.00th=[ 4080], 10.00th=[ 4128], 20.00th=[ 4128],
     | 30.00th=[ 4192], 40.00th=[ 4192], 50.00th=[ 4256], 60.00th=[ 4384],
     | 70.00th=[ 4512], 80.00th=[ 5536], 90.00th=[ 6112], 95.00th=[ 8512],
     | 99.00th=[11968], 99.50th=[17792], 99.90th=[54016], 99.95th=[57088],
     | 99.99th=[68096]
   bw (  KiB/s): min=339193, max=474312, per=99.75%, avg=409571.56, stdev=44414.88, samples=25
   iops        : min=84798, max=118578, avg=102392.88, stdev=11103.75, samples=25
  lat (usec)   : 2=16.12%, 4=31.81%, 10=49.39%, 20=1.06%, 50=1.36%
  lat (usec)   : 100=0.19%, 250=0.01%, 500=0.05%, 750=0.02%, 1000=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=16.33%, sys=78.06%, ctx=1629, majf=0, minf=17
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1311533,1309907,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=401MiB/s (421MB/s), 401MiB/s-401MiB/s (421MB/s-421MB/s), io=5123MiB (5372MB), run=12761-12761msec
  WRITE: bw=401MiB/s (420MB/s), 401MiB/s-401MiB/s (420MB/s-420MB/s), io=5117MiB (5365MB), run=12761-12761msec

...which returned better results.

1) But why? Maybe my benchmark is unsuitable?
2) Should I use the dedicated controller or the onboard one, and why?
3) Is RAID-Z2 the correct choice for me?
4) Should I touch the "advanced settings" when creating the RAID-Z2 pool?

Thank you for your advice.
 
Hi,

For VM storage you want as many IOPS as possible. With RAID-Z2 you get approximately the IOPS of a single drive. If you switch to mirrored vdevs (4 vdevs of 2 drives each) you get the IOPS of 4 disks, a factor of four more.
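For illustration, a striped mirror over 8 disks could be built roughly like this (the PVE installer creates the same layout if you pick "RAID10"; the by-id names below are only placeholders for your actual S3710s):

Code:
# sketch only - replace the placeholder names with your real /dev/disk/by-id entries
zpool create -o ashift=12 tank \
  mirror ata-INTEL_S3710_A ata-INTEL_S3710_B \
  mirror ata-INTEL_S3710_C ata-INTEL_S3710_D \
  mirror ata-INTEL_S3710_E ata-INTEL_S3710_F \
  mirror ata-INTEL_S3710_G ata-INTEL_S3710_H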

Btw: this topic has been discussed several times...

Best regards and good luck
 
Thank you for your reply udo!

I read about the pros and cons of RAID-Z2 vs. mirroring and decided to go with RAID-Z2 because it can lose any 2 disks (instead of 1 per mirror) and has better space efficiency. But I will repeat my tests with 4 mirror vdevs to see if there is a difference.

My real question was:
Is it okay to use the onboard controller, or should I go with the dedicated RAID controller in HBA mode, and why is there a performance difference?
Why do people buy dedicated HBAs? Just because they support more disks?
 
I read about the pros and cons of RAID-Z2 vs. mirroring and decided to go with RAID-Z2 because it can lose any 2 disks (instead of 1 per mirror) and has better space efficiency. But I will repeat my tests with 4 mirror vdevs to see if there is a difference.
Also keep in mind that RAID-Z2 will give you a lot of padding overhead if you don't increase your volblocksize. With the default 8K volblocksize you lose 25% of raw capacity to parity + 25% to padding, so you need to increase it to at least 16K, where you only lose 25% to parity + 8% to padding.
With an 8-disk striped mirror you might also want to increase the volblocksize to 16K for better performance, as long as your workload doesn't do a lot of I/O smaller than 16K.
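For example, the block size used for newly created VM disks on a PVE zfspool storage can be changed like this (storage, pool, and zvol names are just examples; existing zvols keep their old volblocksize and would need to be recreated or migrated):

Code:
# "local-zfs" is a placeholder for your zfspool storage name
pvesm set local-zfs --blocksize 16k
# or for a manually created zvol:
zfs create -V 32G -o volblocksize=16k rpool/data/vm-100-disk-1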
My real question was:
Is it okay to use the onboard controller, or should I go with the dedicated RAID controller in HBA mode, and why is there a performance difference?
Why do people buy dedicated HBAs? Just because they support more disks?
That depends...
Your onboard controller is probably part of the mainboard's chipset, so it shares the same few (4?) PCIe lanes with all USB ports, NICs and so on. It is very likely that the link between chipset and CPU just can't handle all that bandwidth.
And with your external RAID controller, make sure it is not just some "HBA mode"; you want it in "IT mode" (initiator target mode).
Normally I would guess that your external RAID card should be faster because it gets 8x PCIe 3.0 lanes, and you really need that, because 4x PCIe 3.0 or 8x PCIe 2.0 can't handle the combined bandwidth of 8 SATA SSDs accessed at the same time. But I'm not sure how that works on your mainboard because it is dual socket. Normally you would use the PCIe slot that is directly connected to your CPU, but you have 2 CPUs, so it might be that one CPU is fast because it connects directly to the HBA while the other is slower because it has no direct connection. I'm also not sure whether the link between the two sockets could become a bottleneck if you want to access the SSDs at 4400 MB/s.
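You can check how a controller is actually attached, for example like this (the PCI address is a placeholder, take it from the first command):

Code:
# find the B140i / P440ar and note their PCI addresses
lspci | grep -iE 'raid|sata|storage'
# negotiated PCIe link speed and width of the controller
lspci -s 03:00.0 -vv | grep -E 'LnkCap|LnkSta'
# which NUMA node (CPU socket) the controller hangs off
cat /sys/bus/pci/devices/0000:03:00.0/numa_node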

1) But why? Maybe my benchmark is unsuitable?
With ZFS you get a lot of write/read amplification. So if fio tells you it can only write at 25.2 MiB/s, the server might actually be writing at 500 MiB/s; you just have a factor-20 write amplification.
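You can watch that yourself while fio is running, for example (pool name "rpool" assumed, the installer default):

Code:
# in a second shell while fio runs - compare this with what fio reports
zpool iostat -v rpool 1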

Also, your read tests are useless because you are only benchmarking your RAM. If you really want to see read performance you need to disable ARC data caching with zfs set primarycache=metadata YourPool (and later zfs set primarycache=all YourPool to restore the default).

You also might want to use "--sync=1" to tell fio to do sync writes if you are using psync as the ioengine. And "--direct=1" + "--refill_buffers" are important to bypass the Linux page cache.

And if you don't use "--numjobs" greater than 1 you won't see any benefit from a pool with multiple drives. A single SSD should be faster than a pool of 8 SSDs for single sync writes, because using multiple SSDs as a pool won't make individual operations faster (it probably makes them slower because of the additional overhead); it just increases the total bandwidth.
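Putting those suggestions together, something along these lines would be a more meaningful test (target directory, runtime, and job count are just examples, adjust them to your pool):

Code:
zfs set primarycache=metadata rpool        # disable ARC data caching for the read part
mkdir -p /rpool/fio-test
fio --name=benchmark --directory=/rpool/fio-test --size=4g \
    --rw=randrw --bs=4k --ioengine=psync --iodepth=1 \
    --direct=1 --sync=1 --refill_buffers \
    --numjobs=4 --group_reporting --runtime=120 --time_based
zfs set primarycache=all rpool             # restore the default afterwards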
 
