Poor ZFS performance on Supermicro vs random ASUS board

What model is this mobo? I'll build a small system soon using an X10SRL-F, WD RE/Gold 1 TB disks, and a pair of the older Intel DC S3500s as SLOG. I'll report some performance data here if I don't forget...
# prtdiag
System Configuration: Supermicro X10SLL-F
BIOS Configuration: American Megatrends Inc. 3.0 04/24/2015
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)

==== Processor Sockets ====================================

Version Location Tag
-------------------------------- --------------------------
Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz SOCKET 0

==== Memory Device Sockets ================================

Type Status Set Device Locator Bank Locator
----------- ------ --- ------------------- ----------------
unknown empty 0 P1-DIMMA1 P0_Node0_Channel0_Dimm0
DDR3 in use 0 P1-DIMMA2 P0_Node0_Channel0_Dimm1
unknown empty 0 P1-DIMMB1 P0_Node0_Channel1_Dimm0
DDR3 in use 0 P1-DIMMB2 P0_Node0_Channel1_Dimm1

==== On-Board Devices =====================================

==== Upgradeable Slots ====================================

ID Status Type Description
--- --------- ---------------- ----------------------------
0 available PCI Exp. Gen 2 x8 PCH SLOT 4 PCI-E 2.0 X4(IN X8)
1 in use PCI Exp. Gen 3 x8 CPU SLOT 5 PCI-E 3.0 X8, Mellanox Technologies MT25408 [ConnectX VPI - IB SDR / 10GigE] (hermon)
2 in use PCI Exp. Gen 3 x16 CPU SLOT 6 PCI-E 3.0 X8(IN X16), LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (mpt_sas)

# prtconf | grep Memory
Memory size: 16344 Megabytes
 
# zfs set sync=disabled pool
and run the test again
Code:
root@vmc3-1:/nvmepool/VMs# zfs get sync nvmepool/VMs
NAME          PROPERTY  VALUE     SOURCE
nvmepool/VMs  sync      disabled  local
Code:
root@vmc3-1:/nvmepool/VMs# smartctl -a /dev/nvme1n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-1-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       PLEXTOR PX-512M8PeGN
...
Data Units Written:                 741,709 [379 GB]
...
root@vmc3-1:/nvmepool/VMs# fio --randrepeat=1 --ioengine=libaio --direct=0 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
...
  write: io=4096.0MB, bw=28475KB/s, iops=7118, runt=147299msec
...
root@vmc3-1:/nvmepool/VMs# smartctl -a /dev/nvme1n1
...
Data Units Written:                 933,022 [477 GB]
...
98 GB were written to the disk, while only 4 GB were written to ZFS.
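(For reference: "Data Units Written" is reported in thousands of 512-byte units, so the delta between the two readings above converts roughly like this.)
Code:
# convert the smartctl "Data Units Written" delta to GB
# (NVMe counts data units as thousands of 512-byte blocks)
awk -v before=741709 -v after=933022 'BEGIN { printf "%.1f GB\n", (after - before) * 512 / 1e6 }'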
 
Could you run new tests with saner record sizes like 4k or 8k (using new datasets for test)?
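Something like this should do for throwaway test datasets (names are just an example):
Code:
# disposable datasets with smaller record sizes for the fio runs
zfs create -o recordsize=4k -o compression=off nvmepool/fiotest4k
zfs create -o recordsize=8k -o compression=off nvmepool/fiotest8k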
 
Code:
root@vmc3-1:/nvmepool/VMs# zfs get sync nvmepool/VMs
NAME          PROPERTY  VALUE     SOURCE
nvmepool/VMs  sync      disabled  local
Code:
root@vmc3-1:/nvmepool/VMs# smartctl -a /dev/nvme1n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-1-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       PLEXTOR PX-512M8PeGN
...
Data Units Written:                 741,709 [379 GB]
...
root@vmc3-1:/nvmepool/VMs# fio --randrepeat=1 --ioengine=libaio --direct=0 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
...
  write: io=4096.0MB, bw=28475KB/s, iops=7118, runt=147299msec
...
root@vmc3-1:/nvmepool/VMs# smartctl -a /dev/nvme1n1
...
Data Units Written:                 933,022 [477 GB]
...
98 GB were written to the disk, while only 4 GB were written to ZFS.
It is important for the validity of the benchmark to choose a file size that is at least twice the size of the RAM, otherwise you may end up testing your RAM.
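A rough way to size the test file (illustrative only):
Code:
# pick a fio --size of at least twice the installed RAM so the ARC
# cannot absorb the whole test file
free -g | awk '/^Mem:/ { print "RAM: " $2 " GB  ->  use fio --size >= " 2*$2 "G" }'
# (alternatively, temporarily cap the ARC, e.g. on ZFS on Linux:
#  echo <bytes> > /sys/module/zfs/parameters/zfs_arc_max)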
 
It is important for the validity of the benchmark to choose a file size that is at least twice the size of the RAM, otherwise you may end up testing your RAM.
My server has 320 GB of RAM, but the NVMe disk is only 450 GB. And how does that explain ZFS writing 25 times more data to disk?
 
My server has 320 GB of RAM, but the NVMe disk is only 450 GB. And how does that explain ZFS writing 25 times more data to disk?
You ran the test using 4k record size in fio. Try the test with datasets using 4k or 8k record sizes. The default is 128k, meaning with a size less than or equal to that all single writes will at least write out 128k, hence the "write amplification".
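Back-of-the-envelope for the worst case:
Code:
# a dirty 4k write landing in a 128k record can force the whole record
# to be read, modified and rewritten (parity and metadata not counted)
echo "$((128 / 4))x potential write amplification"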
 
My server has 320 GB of RAM, but the NVMe disk is only 450 GB. And how does that explain ZFS writing 25 times more data to disk?
I have been in the same situation at work with our new HCI SAN, where each node has 1 TB of RAM; since each node only has 25 TB of storage, the test requires a sizeable part of that storage. ;-)
 
You ran the test using 4k record size in fio. Try the test with datasets using 4k or 8k record sizes. The default is 128k, meaning with a size less than or equal to that all single writes will at least write out 128k, hence the "write amplification".
OK. I set the recordsize to 4k and turned off compression, but ZFS still writes several times more data to the disks than it needs to.
Code:
NAME          SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
nvmepool     1.86T  6.15G  1.85T         -     2%     0%  1.00x  ONLINE  -
  raidz1     1.86T  6.15G  1.85T         -     2%     0%
    nvme0n1      -      -      -         -      -      -
    nvme1n1      -      -      -         -      -      -
    nvme2n1      -      -      -         -      -      -
    nvme3n1      -      -      -         -      -      -
Code:
root@vmc3-1:/nvmepool# zfs get all nvmepool/VMs
NAME          PROPERTY              VALUE                  SOURCE
nvmepool/VMs  type                  filesystem             -
nvmepool/VMs  creation              Mon Dec 25 23:32 2017  -
nvmepool/VMs  used                  32.9K                  -
nvmepool/VMs  available             1.34T                  -
nvmepool/VMs  referenced            32.9K                  -
nvmepool/VMs  compressratio         1.00x                  -
nvmepool/VMs  mounted               yes                    -
nvmepool/VMs  quota                 none                   default
nvmepool/VMs  reservation           none                   default
nvmepool/VMs  recordsize            4K                     local
nvmepool/VMs  mountpoint            /nvmepool/VMs          default
nvmepool/VMs  sharenfs              off                    default
nvmepool/VMs  checksum              on                     default
nvmepool/VMs  compression           off                    local
nvmepool/VMs  atime                 on                     default
nvmepool/VMs  devices               on                     default
nvmepool/VMs  exec                  on                     default
nvmepool/VMs  setuid                on                     default
nvmepool/VMs  readonly              off                    default
nvmepool/VMs  zoned                 off                    default
nvmepool/VMs  snapdir               hidden                 default
nvmepool/VMs  aclinherit            restricted             default
nvmepool/VMs  createtxg             2000                   -
nvmepool/VMs  canmount              on                     default
nvmepool/VMs  xattr                 on                     default
nvmepool/VMs  copies                1                      default
nvmepool/VMs  version               5                      -
nvmepool/VMs  utf8only              off                    -
nvmepool/VMs  normalization         none                   -
nvmepool/VMs  casesensitivity       sensitive              -
nvmepool/VMs  vscan                 off                    default
nvmepool/VMs  nbmand                off                    default
nvmepool/VMs  sharesmb              off                    default
nvmepool/VMs  refquota              none                   default
nvmepool/VMs  refreservation        none                   default
nvmepool/VMs  guid                  4750642255138679942    -
nvmepool/VMs  primarycache          all                    default
nvmepool/VMs  secondarycache        all                    default
nvmepool/VMs  usedbysnapshots       0B                     -
nvmepool/VMs  usedbydataset         32.9K                  -
nvmepool/VMs  usedbychildren        0B                     -
nvmepool/VMs  usedbyrefreservation  0B                     -
nvmepool/VMs  logbias               latency                default
nvmepool/VMs  dedup                 off                    default
nvmepool/VMs  mlslabel              none                   default
nvmepool/VMs  sync                  standard               default
nvmepool/VMs  dnodesize             legacy                 default
nvmepool/VMs  refcompressratio      1.00x                  -
nvmepool/VMs  written               32.9K                  -
nvmepool/VMs  logicalused           12K                    -
nvmepool/VMs  logicalreferenced     12K                    -
nvmepool/VMs  volmode               default                default
nvmepool/VMs  filesystem_limit      none                   default
nvmepool/VMs  snapshot_limit        none                   default
nvmepool/VMs  filesystem_count      none                   default
nvmepool/VMs  snapshot_count        none                   default
nvmepool/VMs  snapdev               hidden                 default
nvmepool/VMs  acltype               off                    default
nvmepool/VMs  context               none                   default
nvmepool/VMs  fscontext             none                   default
nvmepool/VMs  defcontext            none                   default
nvmepool/VMs  rootcontext           none                   default
nvmepool/VMs  relatime              off                    default
nvmepool/VMs  redundant_metadata    all                    default
nvmepool/VMs  overlay               off                    default
Code:
root@vmc3-1:/nvmepool# smartctl -a /dev/nvme0n1 | grep Written
Data Units Written:                 770,603 [394 GB]
root@vmc3-1:/nvmepool# smartctl -a /dev/nvme1n1 | grep Written
Data Units Written:                 1,165,102 [596 GB]
root@vmc3-1:/nvmepool# smartctl -a /dev/nvme2n1 | grep Written
Data Units Written:                 586,572 [300 GB]
root@vmc3-1:/nvmepool# smartctl -a /dev/nvme3n1 | grep Written
Data Units Written:                 1,466,897 [751 GB]
root@vmc3-1:/nvmepool# fio --randrepeat=1 --ioengine=libaio --direct=0 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.16
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/112.7MB/0KB /s] [0/28.7K/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=25610: Mon Dec 25 23:36:36 2017
  write: io=4096.0MB, bw=104409KB/s, iops=26102, runt= 40172msec
  cpu          : usr=7.80%, sys=75.96%, ctx=77528, majf=0, minf=822
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=104408KB/s, minb=104408KB/s, maxb=104408KB/s, mint=40172msec, maxt=40172msec
root@vmc3-1:/nvmepool# smartctl -a /dev/nvme0n1 | grep Written
Data Units Written:                 786,397 [402 GB]
root@vmc3-1:/nvmepool# smartctl -a /dev/nvme1n1 | grep Written
Data Units Written:                 1,180,849 [604 GB]
root@vmc3-1:/nvmepool# smartctl -a /dev/nvme2n1 | grep Written
Data Units Written:                 602,366 [308 GB]
root@vmc3-1:/nvmepool# smartctl -a /dev/nvme3n1 | grep Written
Data Units Written:                 1,482,637 [759 GB]
So ((402+604+308+759) - (394+596+300+751)) * 3/4 = 24 GB instead of 4 GB.
What configuration do you recommend for VMs running a DBMS?
 
That looks better. Also see this: https://github.com/zfsonlinux/zfs/issues/6555

The rule of thumb for DBs is to match the record size of the database engine. For example, InnoDB uses 16k pages for data and 128k for logs, so it's generally recommended to use those as record sizes. But it's workload dependent; with mostly-read workloads, prefetch and larger record sizes might help.
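For example, something along these lines (dataset names and extra properties are just an illustration):
Code:
# 16k records for the InnoDB data files, larger records for the redo logs
zfs create -o recordsize=16k  -o atime=off nvmepool/mysql-data
zfs create -o recordsize=128k -o atime=off nvmepool/mysql-log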

EDIT: fixed page size info
 
You ran the test using 4k record size in fio. Try the test with datasets using 4k or 8k record sizes. The default is 128k, meaning with a size less than or equal to that all single writes will at least write out 128k, hence the "write amplification".

This is wrong. For a dataset, 128k must be understood as the maximum, not as the default; ZFS uses a variable record size for datasets.
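A quick way to see this on a dataset that still has the 128k default (the dataset name and paths are made up, and exact on-disk sizes vary with the pool layout):
Code:
# a 1k file does not occupy a full 128k record, while a 1M file is
# stored as full 128k records; compare allocated sizes after a txg sync
dd if=/dev/urandom of=/nvmepool/test128k/small bs=1k count=1
dd if=/dev/urandom of=/nvmepool/test128k/big   bs=1M count=1
sync; sleep 5
du -h /nvmepool/test128k/small /nvmepool/test128k/big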
 
So ((402+604+308+759) - (394+596+300+751)) * 3/4 = 24 GB instead of 4 GB.

Maybe it is an NVMe problem. If the minimum block size of the NVMe is, let's say, 256k, then the write amplification is done by the NVMe and not by ZFS.

The best way to test your ZFS setup is to use your real load. No benchmark tool, however cool you think it is, can generate the same load as the real one you will run in production. As an example, for a DB you can record your SQL operations and then replay them on your ZFS storage.
Can you guarantee that your fio tests will show the same performance as your real DB load?
 
@guletz: Yes, you're correct, I forgot for a moment that this is the enforced maximum record size. However, I'd like to remind you of two things: first, changing the record size normalized the write amplification for @docent; second, all tuning guides, backed by real-life experience, recommend aligning the record size with that of the actual workload. How do you explain these?

EDIT: an explanation could be that when fio first creates the 4G test file, it gets written by zfs using 128k blocks. It remains at that size for further operations and causes the observed amplification for successive random writes. This behavior partly supports the need to change the record size.
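If that is what happened, recreating the test file after changing the record size should avoid it, roughly:
Code:
# recordsize only applies to newly written blocks, so start from a fresh file
zfs set recordsize=4k nvmepool/VMs
rm -f /nvmepool/VMs/test
# then re-run the same fio job; the new file will be laid out as 4k records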
 
an explanation could be that when fio first creates the 4G test file, it gets written by zfs using 128k blocks. It remains at that size for further operations and causes the observed amplification for successive random writes. This behavior partly supports the need to change the record size.

Yes, what you say could be true. Only testing with different record sizes can reveal the correct value for THIS fio test. But in the end, if the real load shows bad performance..... you must start again from the beginning.
 
But in the end, if the real load shows bad performance
If the real load uses the same block/record/page size (e.g. a 16k test for an InnoDB workload, as discussed before), this should be an adequate indication of the expected performance.
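E.g. the same fio job as before, just with a 16k block size:
Code:
# 16k random writes to approximate InnoDB-style page writes
fio --randrepeat=1 --ioengine=libaio --direct=0 --gtod_reduce=1 \
    --name=test16k --filename=test16k --bs=16k --iodepth=64 \
    --size=4G --readwrite=randwrite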
 
If the real load uses the same block/record/page size (e.g. a 16k test for an InnoDB workload, as discussed before), this should be an adequate indication of the expected performance.

If you are lucky, yes, it could be.... but sometimes this is not enough, especially when you are using DBs. You must also take into account what other non-DB load you have on the same ZFS storage (maybe you also run a VM file server or something else). All of this load will interact with ZFS, and for these reasons you cannot emulate a real-world production scenario with fio or whatever else.

16k for InnoDB is only the first step, but it is not sufficient in most cases (especially in a virtualised environment).
 
No synthetic test can perfectly emulate real-life systems; that should be obvious. And no, it's not based on luck but on theory and backing experience. Don't forget we're using VMs on separate datasets. Naturally we can't detach those from the rest of the load, but one should be prudent and run dedicated VMs for DBs at least. There are many ways a DB server can be optimized for a certain load or in general, but that's outside the scope of this thread. We're talking about the record size now.
 
EDIT: an explanation could be that when fio first creates the 4G test file, it gets written by zfs using 128k blocks. It remains at that size for further operations and causes the observed amplification for successive random writes. This behavior partly supports the need to change the record size.

I think this is not entirely true for the @docent case. You must take into account these facts:

- he has a raidz1 (4 x NVMe), so any data written to the pool will use roughly: the size of the file + 25% of the file size for parity + 2 x metadata per disk + checksums (around 8% of the file size)
- in the case of 128k blocks he has a problem, because each 128k write has to be split by zfs across the 3 NVMe data disks = 42.66k per disk (not counting the parity disk), which is bad because an incomplete block has to be written on each disk (see the quick arithmetic after this list)
- this is very bad especially for VMs, or for small blocks (for 16k, 16k/3 is NOT an integer)
- so I think one side of this problem (write amplification) is the bad zpool design; as a dumb rule, 128k / (total number of disks - number of parity disks) should at least be an INTEGER

For much better performance, he could try a raid10 instead of the 4 x NVMe raidz1 (a decent setup for any kind of DB), but he will lose some usable storage capacity.
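Quick arithmetic for that record split (illustrative only, ignores allocation padding):
Code:
# raidz1 over 4 disks = 3 data disks + 1 parity; a full record is
# divided across the data disks
awk 'BEGIN { printf "128k record: %.2fk per data disk\n", 128/3
             printf " 16k record: %.2fk per data disk\n",  16/3 }'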
 
Unfortunately, RAID10 does not help.
Code:
root@vmc3-1:~# zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
nvmepool   952G  4.13G   948G         -     3%     0%  1.00x  ONLINE  -
  mirror   476G  1.92G   474G         -     3%     0%
    nvme0n1      -      -      -         -      -      -
    nvme1n1      -      -      -         -      -      -
  mirror   476G  2.21G   474G         -     3%     0%
    nvme2n1      -      -      -         -      -      -
    nvme3n1      -      -      -         -      -      -
Code:
NAME          PROPERTY              VALUE                  SOURCE
nvmepool/VMs  type                  filesystem             -
nvmepool/VMs  creation              Thu Dec 28 14:28 2017  -
nvmepool/VMs  used                  4.08G                  -
nvmepool/VMs  available             918G                   -
nvmepool/VMs  referenced            4.08G                  -
nvmepool/VMs  compressratio         1.00x                  -
nvmepool/VMs  mounted               yes                    -
nvmepool/VMs  quota                 none                   default
nvmepool/VMs  reservation           none                   default
nvmepool/VMs  recordsize            4K                     local
nvmepool/VMs  mountpoint            /nvmepool/VMs          default
nvmepool/VMs  sharenfs              off                    default
nvmepool/VMs  checksum              on                     default
nvmepool/VMs  compression           off                    local
nvmepool/VMs  atime                 on                     default
nvmepool/VMs  devices               on                     default
nvmepool/VMs  exec                  on                     default
nvmepool/VMs  setuid                on                     default
nvmepool/VMs  readonly              off                    default
nvmepool/VMs  zoned                 off                    default
nvmepool/VMs  snapdir               hidden                 default
nvmepool/VMs  aclinherit            restricted             default
nvmepool/VMs  createtxg             69                     -
nvmepool/VMs  canmount              on                     default
nvmepool/VMs  xattr                 on                     default
nvmepool/VMs  copies                1                      default
nvmepool/VMs  version               5                      -
nvmepool/VMs  utf8only              off                    -
nvmepool/VMs  normalization         none                   -
nvmepool/VMs  casesensitivity       sensitive              -
nvmepool/VMs  vscan                 off                    default
nvmepool/VMs  nbmand                off                    default
nvmepool/VMs  sharesmb              off                    default
nvmepool/VMs  refquota              none                   default
nvmepool/VMs  refreservation        none                   default
nvmepool/VMs  guid                  7298388468966252894    -
nvmepool/VMs  primarycache          all                    default
nvmepool/VMs  secondarycache        all                    default
nvmepool/VMs  usedbysnapshots       0B                     -
nvmepool/VMs  usedbydataset         4.08G                  -
nvmepool/VMs  usedbychildren        0B                     -
nvmepool/VMs  usedbyrefreservation  0B                     -
nvmepool/VMs  logbias               latency                default
nvmepool/VMs  dedup                 off                    default
nvmepool/VMs  mlslabel              none                   default
nvmepool/VMs  sync                  standard               default
nvmepool/VMs  dnodesize             legacy                 default
nvmepool/VMs  refcompressratio      1.00x                  -
nvmepool/VMs  written               4.08G                  -
nvmepool/VMs  logicalused           4.04G                  -
nvmepool/VMs  logicalreferenced     4.04G                  -
nvmepool/VMs  volmode               default                default
nvmepool/VMs  filesystem_limit      none                   default
nvmepool/VMs  snapshot_limit        none                   default
nvmepool/VMs  filesystem_count      none                   default
nvmepool/VMs  snapshot_count        none                   default
nvmepool/VMs  snapdev               hidden                 default
nvmepool/VMs  acltype               off                    default
nvmepool/VMs  context               none                   default
nvmepool/VMs  fscontext             none                   default
nvmepool/VMs  defcontext            none                   default
nvmepool/VMs  rootcontext           none                   default
nvmepool/VMs  relatime              off                    default
nvmepool/VMs  redundant_metadata    all                    default
nvmepool/VMs  overlay               off                    default
Code:
root@vmc3-1:~# smartctl -a /dev/nvme0n1 | grep Written
Data Units Written:                 844,445 [432 GB]
root@vmc3-1:~# smartctl -a /dev/nvme1n1 | grep Written
Data Units Written:                 1,238,874 [634 GB]
root@vmc3-1:~# smartctl -a /dev/nvme2n1 | grep Written
Data Units Written:                 665,452 [340 GB]
root@vmc3-1:~# smartctl -a /dev/nvme3n1 | grep Written
Data Units Written:                 1,545,699 [791 GB]
Code:
root@vmc3-1:/nvmepool/VMs# fio --randrepeat=1 --ioengine=libaio --direct=0 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.16
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/114.9MB/0KB /s] [0/29.4K/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=54200: Thu Dec 28 14:33:25 2017
  write: io=4096.0MB, bw=108971KB/s, iops=27242, runt= 38490msec
  cpu          : usr=7.53%, sys=78.84%, ctx=97099, majf=0, minf=701
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=108971KB/s, minb=108971KB/s, maxb=108971KB/s, mint=38490msec, maxt=38490msec
Code:
root@vmc3-1:~# smartctl -a /dev/nvme0n1 | grep Written
Data Units Written:                 871,945 [446 GB]
root@vmc3-1:~# smartctl -a /dev/nvme1n1 | grep Written
Data Units Written:                 1,266,374 [648 GB]
root@vmc3-1:~# smartctl -a /dev/nvme2n1 | grep Written
Data Units Written:                 696,220 [356 GB]
root@vmc3-1:~# smartctl -a /dev/nvme3n1 | grep Written
Data Units Written:                 1,576,467 [807 GB]
So, ((807+356+648+446) - (791+340+634+432)) = 60 GB.
While fio wrote 4 GB to ZFS, ZFS wrote 60/2 = 30 GB of data to the NVMe devices (divided by 2 because each write goes to both disks of a mirror).
 
While fio wrote 4 GB to ZFS, ZFS wrote 60/2 = 30 GB of data to the NVMe devices.

But it is much better compared with raidz1, isn't it?

Try to check the physical sector size of each NVMe disk:

Code:
cat /sys/block/nvmeXn1/queue/physical_block_size

And set atime=off on the pool !!! Then run the fio test again!
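For example (pool name as used above):
Code:
# see what ashift the pool was created with (2^ashift should correspond
# to the physical sector size reported above)
zdb | grep ashift
# and disable atime for the whole pool before re-running fio
zfs set atime=off nvmepool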
 