How do you squeeze max performance with software raid on NVME drives?

bomzh · May 23, 2023

Hello everybody!

First of all, I do not want to run into flames on this thread, but I'm looking for your advises.
Second - I'm a big-big fan of Proxmox (using since PVE 2.x) and I'm also a big fan of ZFS (using since it became avail on FreeBSD first).

I suppose my question is very common these days as we all grow in all terms - better hardware, modern CPUs, hundreds gigabytes of RAM, fast enterprise-class NVME drives.. so I will ask this straight -

What is the best option for software-RAID model when you need to squeeze max speed/IOPS out of fast NVME drives together with PVE?

As we all know, Proxmox suggests to use ZFS in most cases and I must say this is a very superior filesystem with many-many features, it is very mature, stable, safe - I really love it from the beginning, hands down. I really pushed ZFS into every project I got my hands on where my opinion was taken in place - take a good HBA controller, use server-class SAS disks, ECC RAM, back everything with L2ARC or ZFS Special Device and you get a nice setup for almost everything.

But there's one setup that can't get me sleeping - use of ZFS on top of NVME disks. In particular, use of ZFS on server-grade hardware such as Supermicro All-NVME platforms with several NVME disks such as Intel D7-P5600. So to keep long story short today I have -

A brand-new Supermicro server, dual Xeon 4214 CPU, 512GB of ECC RAM and 4 x Intel P7-5620 3.2TB NVME U.2 disks. And honestly I must say ZFS in any setup on this server (all kinds of RAID(z) setups possible with 4 disks) really sucks in terms of speed, IOPS and CPU load. I tested many zpool/zfs options, all tests done with FIO (I do not want to go into detail, this is going to be a very long write-up), but comparing to simple Linux MDADM with or without LVM on top ZFS is far behind. I clearly understand why this happens - ZFS just ensures your data is safe all the way and there're many topics on the Internet that ZFS on top of NVME cannot get you maximum out of such storage in terms of speed and IOPS counts - this is just a trade-off if you want all of ZFS features. At the same time I realize why Proxmox does not support MDADM officially - ok, I'm fine with it, it is their decision and we have to obey it, especially when there's an easy way to overcome this limitation.

So my simple question is - when you're buying fast NVME drives and want to take advantage of them - how do you layout your filesystem and what do you use?

I'm pretty sure I'm not the only person hit this problem, so let's discuss!

shanreich · May 24, 2023

We have a quite comprehensive section about ZFS RAID levels in our documentation [1]. This might already give you the answers you need. If you want to go for IOPS I would suggest trying out mirrored vdevs (RAID 10) over RAIDZ, as IOPS with RAIDZ simply doesn't scale (always provides the IOPS of one disk).

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_zfs_raid_considerations

shanreich · May 24, 2023

Sorry, I was a bit overeager after reading the first half of your post or so. As you already mentioned, ZFS doesn't provide maximum performance compared to stuff like simple LVM with ext4/xfs/... . Since ZFS does a lot of extra stuff (compression, integrity checking, ...) it will naturally always be slower than other setups that do not include all those bells and whistles. If you want to squeeze the last bit of performance out of your disks then going with another setup would be better.

What would be interesting: Which fio commands did you use for benchmarking the disks? Did you have a look at our ZFS benchmark paper [1]? The respective fio commands used are included there, do you get comparable results? Just as a sanity check.

I would usually recommend going with ZFS regardless of the performance penalties because of the many features you are getting for it, but of course it depends on your requirements (which I do not know). Since you seem to be using very solid hardware it poses the question if you will really have issues with the storage performance in practice (again, depends on what you want to achieve with your setup).

A colleague has a nice write-up on all the upsides ZFS offers - in case you want to read a bit more about that [2].

For alternative setups, there are probably others that are more suited to chiming in, since I do not have that much experience outside of ZFS / LVM + ext4/xfs/...

[1] https://www.proxmox.com/en/downloads/item/proxmox-ve-zfs-benchmark-2020
[2] https://forum.proxmox.com/threads/performance-comparison-between-zfs-and-lvm.124295/#post-541384

bomzh · May 25, 2023

Hi @shanreich!

Thanks for pointing. I've read the mentioned documentation and many others, including Proxmox's document regarding ZFS on NVME tests.

Just for sanity I did simple identical test for both MDADM+LVM and ZFS - all on 2xIntel P7-5620 3.2TB NVME U.2 RAID-1 storage.
FIO command taken from https://www.proxmox.com/en/downloads/item/proxmox-ve-zfs-benchmark-2020

For ZFS I did tests on ZVOL.

Two servers, absolutely identical in terms of hardware. ARC to ZFS is not limited, so it is 256GB per defaults.

Code:

* CPU - 2 x:
Model name:                      Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
Stepping:                        7
CPU MHz:                         3200.000
CPU max MHz:                     3200.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        4400.00
Virtualization:                  VT-x

* NVME - 2 x INTEL SSDPF2KE032T1

* RAM - 512GB ECC

Code:

root@srv3:/nvmepool# zpool status -v
  pool: nvmepool
 state: ONLINE
config:

    NAME         STATE     READ WRITE CKSUM
    nvmepool     ONLINE       0     0     0
      mirror-0   ONLINE       0     0     0
        nvme2n1  ONLINE       0     0     0
        nvme3n1  ONLINE       0     0     0

errors: No known data errors

root@srv3:/nvmepool# zpool get all nvmepool
NAME      PROPERTY                       VALUE                          SOURCE
nvmepool  size                           2.91T                          -
nvmepool  capacity                       0%                             -
nvmepool  altroot                        -                              default
nvmepool  health                         ONLINE                         -
nvmepool  guid                           13061767153409312432           -
nvmepool  version                        -                              default
nvmepool  bootfs                         -                              default
nvmepool  delegation                     on                             default
nvmepool  autoreplace                    off                            default
nvmepool  cachefile                      -                              default
nvmepool  failmode                       wait                           default
nvmepool  listsnapshots                  off                            default
nvmepool  autoexpand                     off                            default
nvmepool  dedupratio                     1.00x                          -
nvmepool  free                           2.90T                          -
nvmepool  allocated                      9.07G                          -
nvmepool  readonly                       off                            -
nvmepool  ashift                         12                             local
nvmepool  comment                        -                              default
nvmepool  expandsize                     -                              -
nvmepool  freeing                        0                              -
nvmepool  fragmentation                  0%                             -
nvmepool  leaked                         0                              -


root@srv3:/nvmepool# zfs get all nvmepool/fio
NAME          PROPERTY              VALUE                  SOURCE
nvmepool/fio  type                  volume                 -
nvmepool/fio  creation              Thu May 25 10:32 2023  -
nvmepool/fio  used                  103G                   -
nvmepool/fio  available             2.81T                  -
nvmepool/fio  referenced            9.07G                  -
nvmepool/fio  compressratio         1.00x                  -
nvmepool/fio  reservation           none                   default
nvmepool/fio  volsize               100G                   local
nvmepool/fio  volblocksize          8K                     default
nvmepool/fio  checksum              on                     default
nvmepool/fio  compression           off                    default
nvmepool/fio  readonly              off                    default
nvmepool/fio  createtxg             23                     -
nvmepool/fio  copies                1                      default
nvmepool/fio  refreservation        103G                   local
nvmepool/fio  guid                  10221065018182282380   -
nvmepool/fio  primarycache          all                    default
nvmepool/fio  secondarycache        all                    default
nvmepool/fio  usedbysnapshots       0B                     -
nvmepool/fio  usedbydataset         9.07G                  -
nvmepool/fio  usedbychildren        0B                     -
nvmepool/fio  usedbyrefreservation  94.1G                  -
nvmepool/fio  logbias               latency                default
nvmepool/fio  objsetid              152                    -
nvmepool/fio  dedup                 off                    default
nvmepool/fio  mlslabel              none                   default
nvmepool/fio  sync                  standard               default
nvmepool/fio  refcompressratio      1.00x                  -
nvmepool/fio  written               9.07G                  -
nvmepool/fio  logicalused           9.04G                  -
nvmepool/fio  logicalreferenced     9.04G                  -
nvmepool/fio  volmode               default                default
nvmepool/fio  snapshot_limit        none                   default
nvmepool/fio  snapshot_count        none                   default
nvmepool/fio  snapdev               hidden                 default
nvmepool/fio  context               none                   default
nvmepool/fio  fscontext             none                   default
nvmepool/fio  defcontext            none                   default
nvmepool/fio  rootcontext           none                   default
nvmepool/fio  redundant_metadata    all                    default
nvmepool/fio  encryption            off                    default
nvmepool/fio  keylocation           none                   default
nvmepool/fio  keyformat             none                   default
nvmepool/fio  pbkdf2iters           0                      default

Code:

root@srv1:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md127 : active raid1 nvme2n1p1[0] nvme3n1p1[1]
      3125483840 blocks super 1.2 [2/2] [UU]
      bitmap: 1/24 pages [4KB], 65536KB chunk

unused devices: <none>

root@srv1:~# pvs
  PV           VG                                        Fmt  Attr PSize PFree
  /dev/md127   localvg                                   lvm2 a--  2.91t <2.28t

root@srv1:~# lvs|grep test
  test                                           localvg                                   -wi-ao----  50.00g

1) ZFS test on RAID-1 NVME:

Bash:

fio --ioengine=psync --filename=/dev/zvol/nvmepool/fio --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=readwrite --threads=32 --bs=128k --numjobs=32

Code:

Jobs: 32 (f=32): [M(32)][100.0%][r=1714MiB/s,w=1686MiB/s][r=13.7k,w=13.5k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=1376948: Thu May 25 10:45:22 2023
  read: IOPS=12.5k, BW=1557MiB/s (1632MB/s)(912GiB/600004msec)
    clat (usec): min=19, max=14499, avg=144.11, stdev=147.19
     lat (usec): min=20, max=14500, avg=144.26, stdev=147.27
    clat percentiles (usec):
     |  1.00th=[   48],  5.00th=[   59], 10.00th=[   67], 20.00th=[   78],
     | 30.00th=[   87], 40.00th=[   97], 50.00th=[  108], 60.00th=[  122],
     | 70.00th=[  141], 80.00th=[  172], 90.00th=[  249], 95.00th=[  355],
     | 99.00th=[  619], 99.50th=[  955], 99.90th=[ 1811], 99.95th=[ 2057],
     | 99.99th=[ 2704]
   bw (  MiB/s): min=  670, max= 2260, per=100.00%, avg=1557.07, stdev= 8.10, samples=38368
   iops        : min= 5364, max=18080, avg=12453.71, stdev=64.79, samples=38368
  write: IOPS=12.5k, BW=1557MiB/s (1633MB/s)(912GiB/600004msec); 0 zone resets
    clat (usec): min=197, max=25580, avg=2412.81, stdev=1043.91
     lat (usec): min=198, max=25592, avg=2421.61, stdev=1044.32
    clat percentiles (usec):
     |  1.00th=[  840],  5.00th=[ 1336], 10.00th=[ 1582], 20.00th=[ 1827],
     | 30.00th=[ 1958], 40.00th=[ 2089], 50.00th=[ 2212], 60.00th=[ 2343],
     | 70.00th=[ 2573], 80.00th=[ 2933], 90.00th=[ 3523], 95.00th=[ 3982],
     | 99.00th=[ 4883], 99.50th=[ 7111], 99.90th=[13698], 99.95th=[14353],
     | 99.99th=[15795]
   bw (  MiB/s): min=  902, max= 1872, per=100.00%, avg=1557.72, stdev= 6.61, samples=38368
   iops        : min= 7220, max=14982, avg=12458.78, stdev=52.81, samples=38368
  lat (usec)   : 20=0.01%, 50=0.79%, 100=20.98%, 250=23.25%, 500=4.14%
  lat (usec)   : 750=0.84%, 1000=0.63%
  lat (msec)   : 2=15.79%, 4=31.21%, 10=2.16%, 20=0.21%, 50=0.01%
  cpu          : usr=0.58%, sys=2.02%, ctx=22677836, majf=18, minf=9259
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=7471627,7474564,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=1557MiB/s (1632MB/s), 1557MiB/s-1557MiB/s (1632MB/s-1632MB/s), io=912GiB (979GB), run=600004-600004msec
  WRITE: bw=1557MiB/s (1633MB/s), 1557MiB/s-1557MiB/s (1633MB/s-1633MB/s), io=912GiB (980GB), run=600004-600004msec

2) MDADM + LVM test on RAID-1:

Bash:

fio --ioengine=psync --filename=/dev/mapper/localvg-test --size=9G --time_based --name=fio --group_reporting --runtime=600 --direct=1 --sync=1 --iodepth=1 --rw=readwrite --threads=32 --bs=128k --numjobs=32

Code:

Jobs: 32 (f=32): [M(32)][100.0%][r=2558MiB/s,w=2536MiB/s][r=20.5k,w=20.3k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=32): err= 0: pid=438320: Thu May 25 10:46:57 2023
  read: IOPS=19.8k, BW=2478MiB/s (2598MB/s)(1452GiB/600002msec)
    clat (usec): min=58, max=5257, avg=705.10, stdev=299.93
     lat (usec): min=58, max=5257, avg=705.22, stdev=299.93
    clat percentiles (usec):
     |  1.00th=[  249],  5.00th=[  326], 10.00th=[  383], 20.00th=[  465],
     | 30.00th=[  529], 40.00th=[  586], 50.00th=[  652], 60.00th=[  717],
     | 70.00th=[  799], 80.00th=[  906], 90.00th=[ 1090], 95.00th=[ 1270],
     | 99.00th=[ 1680], 99.50th=[ 1876], 99.90th=[ 2343], 99.95th=[ 2573],
     | 99.99th=[ 3130]
   bw (  MiB/s): min= 1937, max= 2922, per=100.00%, avg=2478.81, stdev= 4.25, samples=38368
   iops        : min=15496, max=23376, avg=19829.88, stdev=33.96, samples=38368
  write: IOPS=19.8k, BW=2479MiB/s (2599MB/s)(1452GiB/600002msec); 0 zone resets
    clat (usec): min=64, max=7686, avg=901.87, stdev=651.96
     lat (usec): min=66, max=7689, avg=906.79, stdev=651.83
    clat percentiles (usec):
     |  1.00th=[  109],  5.00th=[  149], 10.00th=[  192], 20.00th=[  297],
     | 30.00th=[  437], 40.00th=[  611], 50.00th=[  783], 60.00th=[  963],
     | 70.00th=[ 1156], 80.00th=[ 1401], 90.00th=[ 1778], 95.00th=[ 2147],
     | 99.00th=[ 2933], 99.50th=[ 3261], 99.90th=[ 3949], 99.95th=[ 4228],
     | 99.99th=[ 4883]
   bw (  MiB/s): min= 1980, max= 2855, per=100.00%, avg=2479.58, stdev= 3.73, samples=38368
   iops        : min=15846, max=22844, avg=19836.00, stdev=29.83, samples=38368
  lat (usec)   : 100=0.39%, 250=8.04%, 500=21.10%, 750=26.48%, 1000=18.09%
  lat (msec)   : 2=22.48%, 4=3.37%, 10=0.04%
  cpu          : usr=0.60%, sys=1.82%, ctx=35912816, majf=2, minf=8882
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=11894255,11897831,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=2478MiB/s (2598MB/s), 2478MiB/s-2478MiB/s (2598MB/s-2598MB/s), io=1452GiB (1559GB), run=600002-600002msec
  WRITE: bw=2479MiB/s (2599MB/s), 2479MiB/s-2479MiB/s (2599MB/s-2599MB/s), io=1452GiB (1559GB), run=600002-600002msec

So to sum the tests:
ZFS gives 1557 megabyte/s and 12.5k IOPS
MDADM+LVM gives 2479 megabyte/s and almost 20k IOPS.

czechsys · May 25, 2023

What you want? Block or file storage on such raid?

bomzh · May 25, 2023

czechsys said:
What you want? Block or file storage on such raid?

I want to run virtual machines using PVE on such RAID storage, so that would be block storage (either ZFS ZVOL or LVM volumes).
My original question is clear enough - which storage model to use with PVE to get maximum speed/IOPS out of modern NVME disks in RAID-type setups.

bobmc · May 25, 2023

did you run your speed tests on a raid10 setup? afaik that would be the highest performing config at the expense of capacity

bomzh · May 25, 2023

bobmc said:
did you run your speed tests on a raid10 setup? afaik that would be the highest performing config at the expense of capacity

Yes, I tried all variants of ZFS RAID and RAID-z that I could compose out of 4xNVME drives. At the present moment we decided to not use ZFS on top of NVME in our setup and use MDADM+LVM instead. We use ZFS actively on SAS drives thought and we're happy with it's performance for the tasks!

davemcl · Nov 11, 2023

@bomzh How are you finding MDADM+LVM on your NVME storage? Did you stick with it?

LnxBil · Nov 11, 2023

I'm very interessted in running ZFS after the huge improvement recently announced. It could tip the scales towards ZFS.

davemcl · Nov 11, 2023

LnxBil said:
I'm very interessted in running ZFS after the huge improvement recently announced. It could tip the scales towards ZFS.

Thanks, I hadnt seen that one.
Keeping an eye on this one too
https://github.com/openzfs/zfs/pull/10018

Deleted member 205422 · Nov 11, 2023

LnxBil said:
I'm very interessted in running ZFS after the huge improvement recently announced. It could tip the scales towards ZFS.

For the OP's case it would not do anything for him with reads though.

bomzh · Nov 12, 2023

davemcl said:
@bomzh How are you finding MDADM+LVM on your NVME storage? Did you stick with it?

Hi!

Yes, we stick with MDADM+LVM on top of NVMe for now. MDADM+LVM still outperforms ZFS raid variants a lot and not causing system load as much as ZFS does.

davemcl · Nov 15, 2023

Bit of a gap
https://openbenchmarking.org/result/2311156-TCST-231104749

Thats thin-pool - I used a 64K chunk size, not sure if thats the PVE default.
Created with...

Code:

pvcreate /dev/md0
vgcreate SSD-RAID-10 /dev/md0
lvcreate -l 97%FREE -n SSD-RAID-10 SSD-RAID-10
lvconvert --type thin-pool --poolmetadatasize 2048M --chunksize 64 SSD-RAID-10/SSD-RAID-10

bomzh · Nov 15, 2023

@davemcl, what's the NVME model you used in these tests?
Your tests just confirm what I see on our systems.

davemcl · Nov 15, 2023

bomzh said:
@davemcl, what's the NVME model you used in these tests?
Your tests just confirm what I see on our systems.

4 x Intel P5620 3.2TB
https://www.solidigm.com/products/data-center/d7/p5620.html

mikeyo · Feb 20, 2024

bomzh said:
Hi!

Yes, we stick with MDADM+LVM on top of NVMe for now. MDADM+LVM still outperforms ZFS raid variants a lot and not causing system load as much as ZFS does.

@bomzh Would you mind sharing how you configured your nvme storage please for optimum RAID nvme performance?
I am about to install Proxmox on 2 x WD BLACK PCIe4 nvme drives which will be used for VMs.

bomzh · Mar 4, 2024

mikeyo said:
@bomzh Would you mind sharing how you configured your nvme storage please for optimum RAID nvme performance?
I am about to install Proxmox on 2 x WD BLACK PCIe4 nvme drives which will be used for VMs.

Sorry for late response.
Nothing specific within our configuration: Just created plain MDADM device out of nvme drives with defaults, then added this MDADM device as LVM PV, then created VG GROUP using the PV and then use it as LVM-thin type of storage within Proxmox.

Search

Search

How do you squeeze max performance with software raid on NVME drives?

bomzh

Member

shanreich

Proxmox Staff Member

shanreich

Proxmox Staff Member

bomzh

Member

czechsys

Renowned Member

bomzh

Member

bobmc

Renowned Member

bomzh

Member

davemcl

Member

LnxBil

Distinguished Member

davemcl

Member

Deleted member 205422

Guest

bomzh

Member

davemcl

Member

bomzh

Member

davemcl

Member

mikeyo

New Member

bomzh

Member