Benchmarking ZFS for Database Performance (MSSQL)

This thread should provide some information on how you might be able to optimize I/O performance for your VMs.
Please: If you post answers with test-outputs - use the 'inline spoiler' ;)

For those that are interested in the charts and values see: docs.google.com/spreadsheets

Warning: We are no experts in benchmarking hardware. If you see an issue - please point it out in the answers.

Note: We have not yet found time to run benchmarks on an MSSQL instance inside a VM - we might add them later on.

If you have an idea on how to push RND64k-Read performance in a Windows-VM above 14.5GB/s => please give us a hint ;)

Thanks to @Dunuin for the post on how to benchmark SSDs and the additional information in other threads - we appreciate it :)

Our Goals:
  • Optimize I/O performance for MSSQL (on Windows- or Linux-VM)
  • Use ZFS to allow for Replication (and possible HA-Setup)
  • The setup should provide high 64k-Read throughput, as our monitoring (Prometheus & Grafana) of the existing server shows high read values
  • Have the disks in a RAID10-like array (2x ZFS mirrors striped in one zpool) - see the sketch below this list
  • As it is recommended to run MSSQL on a 64k FS, we mainly focused on 64k-RND-Read
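
For illustration - a minimal sketch of how such a RAID10-like pool would be created (pool name 'tank' and the device paths are placeholders, ashift=12 matches 4096-byte physical sectors):

  # two mirror vdevs striped together = RAID10-like layout
  zpool create -o ashift=12 tank \
      mirror /dev/disk/by-id/nvme-disk1 /dev/disk/by-id/nvme-disk2 \
      mirror /dev/disk/by-id/nvme-disk3 /dev/disk/by-id/nvme-disk4
  # verify the layout
  zpool status tank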

Setup:

Hardware:
  • Dell PowerEdge R7615 Server
  • 2x System disks NVMe
  • 4x 3.84 TB Enterprise NVMe Read-Optimized U.2 4th Gen (fdisk -l shows 4096-byte physical sectors - see the check after this section) directly available to ZFS
  • RAM: 386GB DDR5-4800
  • CPU: 1x AMD EPYC 9274F, 4.05 GHz, 24C/48T
PVE:
  • Version: 8.4.1
  • Debian 12, Kernel 6.8.12-11-pve
  • No other traffic/load/VMs on the tested server
VMs:
  • Windows: Windows Server 2025 Standard, 200GB RAM (DDR5), 16 Cores (4GHz), Windows Updates 2025-07, UEFI Boot, System-disk = same zfs pool as data, memory ballooning off, virtio-win 0.1.271, FS: NTFS 64k
  • Linux: Debian 12.7 netinstall, 200GB RAM (DDR5), 16 Cores (4GHz), Kernel: 6.1.0-37-amd64, System-disk = same zfs pool as data, 2GB Swap, memory ballooning off, FS: EXT4 and XFS tested
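
Side note - a quick way to double-check the sector sizes the disks report (device path is just an example); 4096-byte physical sectors would suggest ashift=12 (2^12 = 4096):

  # logical vs. physical sector size as reported by the disk
  lsblk -o NAME,LOG-SEC,PHY-SEC /dev/nvme0n1
  fdisk -l /dev/nvme0n1 | grep -i 'sector size'
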
Benchmarks:
  • The Sheet with results and charts: docs.google.com/spreadsheets
  • On Linux we used the tool 'fio' (apt install fio)
    The fio config we used: gist.github.com/NiceRath
    You might want to lower iodepth and numjobs to the point where raising them again shows no further improvement - we just eye-balled it (a rough sketch of such a job follows after this list)
  • On Windows we used the tool 'CrystalDiskMark'
  • On ZFS we disabled data caching in the ARC for all tests (unless stated otherwise): zfs set primarycache=metadata <pool>
  • The I/O delay was observed via Prometheus node-exporter and Grafana
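
The exact fio config is in the gist linked above; as a minimal sketch only (file path, size and the iodepth/numjobs values are placeholders to tune yourself, 'tank' is a placeholder pool), a 64k random-read job looks roughly like this:

  # disable ARC data caching on the pool (as noted above)
  zfs set primarycache=metadata tank
  # restore the default afterwards: zfs set primarycache=all tank

  # 64k random read against a test file on the pool
  # (direct=1 requests O_DIRECT; before ZFS 2.3 it is accepted but not 'real' Direct-IO)
  fio --name=rnd64k-read --filename=/tank/fio.testfile --size=32G \
      --rw=randread --bs=64k --ioengine=libaio --direct=1 \
      --iodepth=32 --numjobs=8 --runtime=60 --time_based --group_reporting
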
What we tested:
  • Disk-layer (1 disk, ZFS-ashift, for comparison: MD-R10)
  • ZFS-kernel-module parameter-tweaking
  • some zpool options
  • ZFS volsize for Linux-VM
  • VM-disk options for Windows (the CLI sketch after this list shows what these look like)
  • Some VM-disk options for Linux
  • Guest-FS for Linux (ext4, xfs)
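
To give an idea what the VM-disk options look like when set via the PVE CLI - a minimal sketch (VM-ID 100, storage 'local-zfs' and the disk volume are placeholders; the measured effect of each option is in the sheet):

  # VirtIO SCSI single controller (allows one I/O thread per disk)
  qm set 100 --scsihw virtio-scsi-single
  # data disk as SCSI with discard and iothread enabled, no host-side caching
  qm set 100 --scsi1 local-zfs:vm-100-disk-1,discard=on,iothread=1,cache=none
  # same disk with writeback caching for comparison
  qm set 100 --scsi1 local-zfs:vm-100-disk-1,discard=on,iothread=1,cache=writeback
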
Resources:

Interesting to us:
  • Importance of aligning block sizes: physical sector size <= 2^ashift <= ZFS volblocksize (zvols) / recordsize (datasets) <= VM FS block-size - see the check sketched at the end of this list
  • We were not yet able to get the Windows VM above 14.5GB/s RND64k-Read - not sure if we are missing something
  • VM-disk bus type SCSI (with the VirtIO SCSI controller) seems to perform best
  • VM-disk option 'discard' gives a nice boost when using ZFS under it
  • VM-disk caching could help you in edge-cases - but it seems to be a 'double-edged-sword'
  • VM-disk caching seems to behave very differently on Windows- vs Linux-VMs
  • Tweaking the ZFS-kernel-module parameters showed some speed-improvement but did also significantly increase I/O delay
  • The VM-FS can make a huge difference (vs VM raw block-device)
  • It's not that easy to get a 64k FS mounted on Linux, as the FS block size is limited by the kernel page size (getconf PAGE_SIZE).. :confused:
  • It can be hard to optimize ZFS recordsize and volblocksize for specific applications (MSSQL in our case) and general usage (basic VM I/O).
    The only workaround we see is to partition all physical disks and build multiple zpools from those partitions.
    But this might also lead to some loss in performance.
  • Just a thought: If the DB can run directly on Linux, application-layer redundancy can be configured for it and the HA-licensing is acceptable - it might be 'simpler' and more performant to use multiple bare-metal servers (fewer layers between the DB and the disks).
    But sadly that's not the case for us, as the app that uses the DB has dropped support for running it on Linux.. :confused:

  • Future: QEMU/VirtIO disk-multi-threading might be interesting
  • Future: ZFS Direct-IO might be able to boost NVMe performance (ZFS release 2.3.0 => available in PVE 9.0 :D)
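
A minimal sketch of how the block-size chain from the first point can be checked end to end (pool/zvol names are placeholders; on PVE the volblocksize of a zvol comes from the storage's 'Block size' setting and is fixed at disk creation):

  # host side: physical sectors, pool ashift and the zvol's volblocksize
  lsblk -o NAME,PHY-SEC,LOG-SEC /dev/nvme0n1
  zpool get ashift tank
  zfs get volblocksize tank/vm-100-disk-1

  # Linux guest: the FS block size is limited by the kernel page size (4k on amd64)
  getconf PAGE_SIZE

  # Windows guest (PowerShell): NTFS allocation unit size of the data volume
  # Get-Volume -DriveLetter D | Select-Object AllocationUnitSize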

Have a nice day (:
 