Poor ZFS performance On Supermicro vs random ASUS board

Code:
root@pve-klenova:~# pveperf
CPU BOGOMIPS:      38401.52
REGEX/SECOND:      456470
HD SIZE:           680.38 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     74.37
DNS EXT:           72.99 ms
DNS INT:           20.93 ms (elson.sk)

Code:
root@pve-klenova:~# fio testdisk
iometer: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K, ioengine=libaio, iodepth=64
fio-2.16
Starting 1 process
Jobs: 1 (f=1): [m(1)] [98.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:10s]
iometer: (groupid=0, jobs=1): err= 0: pid=19162: Sun Dec 10 05:38:23 2017
  Description  : [Emulation of Intel IOmeter File Server Access Pattern]
  read : io=3279.8MB, bw=5701.9KB/s, iops=925, runt=589014msec
    slat (usec): min=10, max=818339, avg=797.08, stdev=8327.33
    clat (usec): min=47, max=19429K, avg=54596.60, stdev=284850.32
     lat (usec): min=62, max=19429K, avg=55394.65, stdev=286119.28
    clat percentiles (usec):
     |  1.00th=[ 1544],  5.00th=[ 1624], 10.00th=[ 1688], 20.00th=[ 1784],
     | 30.00th=[ 1896], 40.00th=[ 2096], 50.00th=[ 2736], 60.00th=[ 4192],
     | 70.00th=[ 5472], 80.00th=[17024], 90.00th=[128512], 95.00th=[342016],
     | 99.00th=[798720], 99.50th=[1105920], 99.90th=[1597440], 99.95th=[1712128],
     | 99.99th=[13959168]
  write: io=835856KB, bw=1419.8KB/s, iops=232, runt=589014msec
    slat (usec): min=26, max=19424K, avg=1088.55, stdev=65189.27
    clat (usec): min=4, max=19429K, avg=53384.60, stdev=256005.30
     lat (usec): min=42, max=19429K, avg=54474.33, stdev=265318.93
    clat percentiles (usec):
     |  1.00th=[ 1544],  5.00th=[ 1624], 10.00th=[ 1688], 20.00th=[ 1768],
     | 30.00th=[ 1896], 40.00th=[ 2096], 50.00th=[ 2704], 60.00th=[ 4192],
     | 70.00th=[ 5472], 80.00th=[16512], 90.00th=[126464], 95.00th=[342016],
     | 99.00th=[798720], 99.50th=[1105920], 99.90th=[1581056], 99.95th=[1712128],
     | 99.99th=[13959168]
    lat (usec) : 10=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%
    lat (usec) : 750=0.01%, 1000=0.01%
    lat (msec) : 2=36.36%, 4=22.18%, 10=19.59%, 20=2.55%, 50=4.35%
    lat (msec) : 100=3.66%, 250=4.67%, 500=3.54%, 750=1.95%, 1000=0.51%
    lat (msec) : 2000=0.61%, >=2000=0.02%
  cpu          : usr=0.83%, sys=3.74%, ctx=82774, majf=7, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=544974/w=136862/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=3279.8MB, aggrb=5701KB/s, minb=5701KB/s, maxb=5701KB/s, mint=589014msec, maxt=589014msec
  WRITE: io=835855KB, aggrb=1419KB/s, minb=1419KB/s, maxb=1419KB/s, mint=589014msec, maxt=589014msec

So where is the problem?
 
Hello @chalan

I have indeed solved the problem.

1. Update ZFS to 0.7.3; this alone helped a lot (a quick way to check which version you are on is sketched below).
2. Don't use WD Red drives with ZFS and Supermicro; they don't work very well together. Use WD Gold drives if you want performance.
3. If you really want performance, buy a HW RAID card with BBU and ditch ZFS (I love the FS and still use it for cold storage, but performance-wise it can't hold up against my LSI MegaRAID 9270-4i card now).
4. OR buy a datacenter/enterprise-grade SSD. I'll post some pveperf output so you can see the difference.
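
(A quick way to check which ZFS version you are actually running; a minimal sketch, assuming a Debian-based Proxmox install:)
Code:
# version of the loaded kernel module
cat /sys/module/zfs/version
# installed userland packages
dpkg -l | grep -E 'zfsutils|zfs-initramfs'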

We are now using Samsung enterprise SSDs
http://www.samsung.com/semiconductor/products/flash-storage/enterprise-ssd/MZ7KM480HAHP?ia=832

or, when we can't get the Samsung ones, Intel Datacenter SSDs
https://www.intel.com/content/www/u...-s4600-series/dc-s4600-480gb-2-5inch-3d1.html

Consumer SSDs are really bad; I did so many tests on them and I have to agree with the people here. Enterprise-grade SSDs usually have about 10x the performance of consumer ones.
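
If you want to see that difference yourself, a short sync-write run with fio usually exposes it, because enterprise SSDs with power-loss protection can acknowledge flushes almost instantly. A rough sketch (the test file path is just an example; point it at a filesystem on the SSD you want to measure):
Code:
# 4k random writes with an fsync after every write - this is where consumer SSDs collapse
fio --name=ssdtest --filename=/mnt/ssd/fio-test --size=1G \
    --bs=4k --rw=randwrite --ioengine=psync --iodepth=1 \
    --fsync=1 --runtime=60 --time_based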

HW RAID card (optimized)
4x WD Gold, RAID-10
Code:
[root@px0001:~]# pveperf
CPU BOGOMIPS:      153607.36
REGEX/SECOND:      2808142
HD SIZE:           1475.46 GB (/dev/mapper/pve-root)
BUFFERED READS:    349.50 MB/sec
AVERAGE SEEK TIME: 10.63 ms
FSYNCS/SECOND:     6964.75
DNS EXT:           7.79 ms
DNS INT:           9.55 ms (xcroco.com)

ZFS with SSD in front, RAID-Z1
3x WD Gold
1x Samsung MZ7KM480
Code:
[root@px0006:~]# zpool status
  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 4h15m with 0 errors on Sun Dec 10 04:39:23 2017
config:

        NAME                                                   STATE     READ WRITE CKSUM
        storage                                                ONLINE       0     0     0
          raidz1-0                                             ONLINE       0     0     0
            ata-WDC_WD4002FYYZ-01B7CB1_K7G945SB                ONLINE       0     0     0
            ata-WDC_WD4002FYYZ-01B7CB1_K7G92X4B                ONLINE       0     0     0
            ata-WDC_WD4002FYYZ-01B7CB1_K7G92R7B                ONLINE       0     0     0
        logs
          ata-SAMSUNG_MZ7KM480HAHP-00005_S2HSNX0J401503-part1  ONLINE       0     0     0
        cache
          ata-SAMSUNG_MZ7KM480HAHP-00005_S2HSNX0J401503-part2  ONLINE       0     0     0

[root@px0006:~]# pveperf /storage/
CPU BOGOMIPS:      44801.36
REGEX/SECOND:      1739704
HD SIZE:           16.95 GB (storage)
FSYNCS/SECOND:     2814.27
DNS EXT:           5.77 ms
DNS INT:           9.33 ms (xcroco.com)
 
OK, I bought another 2x 4TB WD Gold drives and made a mirror pool, and performance is still totally bad.

Code:
root@pve-klenova:~# pveperf /vmdata
CPU BOGOMIPS: 38400.00
REGEX/SECOND: 429228
HD SIZE: 3596.00 GB (vmdata)
FSYNCS/SECOND: 119.18
DNS EXT: 62.68 ms
DNS INT: 22.25 ms (elson.sk)

root@pve-klenova:~# pveperf
CPU BOGOMIPS: 38400.00
REGEX/SECOND: 435348
HD SIZE: 680.01 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND: 51.85
DNS EXT: 35.34 ms
DNS INT: 28.23 ms (elson.sk)


Code:
root@pve-klenova:~# zpool status
  pool: rpool
 state: ONLINE
  scan: resilvered 29,2G in 1h16m with 0 errors on Sun Dec 10 15:01:57 2017
config:

        NAME                                                STATE     READ WRITE CKSUM
        rpool                                               ONLINE       0     0     0
          mirror-0                                          ONLINE       0     0     0
            ata-WDC_WD10EFRX-68PJCN0_WD-WCC4J2021886-part2  ONLINE       0     0     0
            ata-WDC_WD10EFRX-68JCSN0_WD-WMC1U6546808-part2  ONLINE       0     0     0
          mirror-1                                          ONLINE       0     0     0
            ata-WDC_WD10EFRX-68FYTN0_WD-WCC4J2AK75T9        ONLINE       0     0     0
            ata-WDC_WD10EFRX-68FYTN0_WD-WCC4J1JE0SFR        ONLINE       0     0     0

errors: No known data errors

  pool: vmdata
 state: ONLINE
  scan: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        vmdata                                   ONLINE       0     0     0
          mirror-0                               ONLINE       0     0     0
            ata-WDC_WD4002FYYZ-01B7CB1_K3GN7LYL  ONLINE       0     0     0
            ata-WDC_WD4002FYYZ-01B7CB1_K7GAE87L  ONLINE       0     0     0

errors: No known data errors

I'm confused and I'm starting to think ZFS is crap... or should I really buy another BIG SSD for the ZIL? :)
 
The SSD makes all the difference, but better underlying drives make a difference too.

Buy a 240GB DATACENTER SSD; the datacenter part is important.

Then you just do the following (let's assume the SSD is /dev/sdc):
Code:
# discard the whole device first
blkdiscard /dev/sdc

# create the partitions: roughly 20GB for the SLOG, the rest for the L2ARC
parted /dev/sdc mklabel gpt
parted /dev/sdc mkpart primary zfs 2 20002
parted /dev/sdc mkpart primary zfs 20002 100%

# then add them to the pool as SLOG and L2ARC
zpool add rpool log /dev/sdc1
zpool add rpool cache /dev/sdc2

But let's be real here: if you really care about performance, a HW RAID card is your only option.

Also, if you noticed, the RAID-1 configuration has double the performance of the RAID-10 configuration, and it should be the other way around. The WD Red drives are really crap IMO; they are only good for cold storage (cheap $/TB, but painfully slow).
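
One caveat to the commands above: /dev/sdX names can change between boots, so it is safer to add the SLOG and L2ARC partitions by their /dev/disk/by-id paths and then verify the result. A sketch (the device ID is only a placeholder, use your own):
Code:
# same as above, but with persistent device names
zpool add rpool log /dev/disk/by-id/ata-SAMSUNG_MZ7KM480HAHP-00005_SERIAL-part1
zpool add rpool cache /dev/disk/by-id/ata-SAMSUNG_MZ7KM480HAHP-00005_SERIAL-part2
# check that the log and cache vdevs showed up
zpool status rpool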
 
Please run your benchmarks with fio, not pveperf. You also have to differentiate between benchmark performance and real-life usage performance. A benchmark does not tell you anything if you test the wrong thing. There is not just one use case for disks; there are many, e.g. lots of small files, lots of large files, etc. Saying something is very slow is very easy, as is saying that something is very fast. You can outperform a single SATA SSD with 4 SATA disks if you only consider sequential reads.

If you only test direct-I/O, sync random writes on a disk (or a disk group), it will always be very slow. This is the worst case for disks. That is why every HBA with a BBU caches exactly this and is therefore much, much faster (until the cache is full or the throughput exceeds what the backing disks can handle). This scenario is improved by using a ZIL on ZFS, but only exactly this use case.
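
To reproduce exactly that worst case with fio, something along these lines should be enough (a sketch; the target path assumes the OP's /vmdata pool, and --sync=1 is used because older ZFS-on-Linux releases do not support O_DIRECT):
Code:
# synchronous 4k random writes at queue depth 1 - the pattern a BBU cache or a SLOG absorbs
fio --name=syncrw --filename=/vmdata/fio-sync-test --size=1G \
    --bs=4k --rw=randwrite --ioengine=psync --iodepth=1 \
    --sync=1 --runtime=60 --time_based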

For server-class hard disk performance, please use only drives of at least 10k RPM, SAS only, no SATA or other consumer-grade hardware.
 
I really need the live migration features of ZFS without shared storage.
So I wanted to use ZFS in production with 4 SATA disks, but now I have seen this thread and I'm not sure anymore.

Are you willing to test ZFS on top of HW RAID? I wonder if it's comparable to HW RAID with thin LVM.
 
How can I run the fio test on the vmdata zpool? I have:

Code:
root@pve-klenova:~# cat testdisk
# This job file tries to mimic the Intel IOMeter File Server Access Pattern
[global]
description=Emulation of Intel IOmeter File Server Access Pattern

[iometer]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randrw
rwmixread=80
direct=0
size=4g
ioengine=libaio
# IOMeter defines the server loads as the following:
# iodepth=1     Linear
# iodepth=4     Very Light
# iodepth=8     Light
# iodepth=64    Moderate
# iodepth=256   Heavy
iodepth=64
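
I assume I just need to point it at the pool, e.g. by adding a directory line to the [global] section (vmdata is mounted at /vmdata), roughly like this?
Code:
# added to the [global] section so fio lays the 4GB test file out on the pool
directory=/vmdata

# then run it as before
fio testdisk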
 
Copying from rpool to vmdata inside a VM is about 13 MB/s, max 30 MB/s.
Copying from rpool to another partition on rpool is about 30 MB/s.
Copying from vmdata to another partition on vmdata is about 57 MB/s.
 
Hi,
I have a similar problem with ZFS.
Can anybody explain to me why fio shows bw=30249KB/s, but at the same time iostat shows 797297.60 wkB/s?
Code:
proxmox-ve: 5.1-31 (running kernel: 4.13.13-1-pve)
pve-manager: 5.1-40 (running version: 5.1-40/ea05b379)
pve-kernel-4.13.13-1-pve: 4.13.13-31
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
Code:
CPU BOGOMIPS:      289389.44
REGEX/SECOND:      1206179
HD SIZE:           54.63 GB (/dev/mapper/pve-root)
BUFFERED READS:    189.16 MB/sec
AVERAGE SEEK TIME: 0.45 ms
FSYNCS/SECOND:     1150.62
DNS EXT:           56.86 ms
DNS INT:           1.11 ms (alias.ru)
Code:
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                  8:0    0 223.6G  0 disk
├─sda1               8:1    0     1M  0 part
├─sda2               8:2    0   256M  0 part /boot/efi
└─sda3               8:3    0 223.3G  0 part
  ├─pve-swap       253:0    0     8G  0 lvm  [SWAP]
  ├─pve-root       253:1    0  55.8G  0 lvm  /
  ├─pve-data_tmeta 253:2    0    72M  0 lvm
  │ └─pve-data     253:4    0 143.6G  0 lvm
  └─pve-data_tdata 253:3    0 143.6G  0 lvm
    └─pve-data     253:4    0 143.6G  0 lvm
sr0                 11:0    1  1024M  0 rom
nvme3n1            259:0    0   477G  0 disk
nvme2n1            259:1    0   477G  0 disk
nvme0n1            259:2    0   477G  0 disk
nvme1n1            259:3    0   477G  0 disk
Code:
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
nvmepool   476G  4.01G   472G         -     1%     0%  1.00x  ONLINE  -
  nvme1n1   476G  4.01G   472G         -     1%     0%
Code:
  pool: nvmepool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        nvmepool    ONLINE       0     0     0
          nvme1n1   ONLINE       0     0     0

errors: No known data errors
Code:
NAME           USED  AVAIL  REFER  MOUNTPOINT
nvmepool      4.01G   457G   192K  /nvmepool
nvmepool/VMs  4.00G   457G  4.00G  /nvmepool/VMs
Code:
root@vmc3-1:/nvmepool/VMs# fio --randrepeat=1 --ioengine=libaio --direct=0 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.16
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [w(1)] [98.6% done] [0KB/56808KB/0KB /s] [0/14.3K/0 iops] [eta 00m:02s]
test: (groupid=0, jobs=1): err= 0: pid=27291: Thu Dec 21 18:23:29 2017
  write: io=4096.0MB, bw=30249KB/s, iops=7562, runt=138657msec
  cpu          : usr=3.59%, sys=79.82%, ctx=288301, majf=0, minf=2786
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=30249KB/s, minb=30249KB/s, maxb=30249KB/s, mint=138657msec, maxt=138657msec
Code:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    2.00    0.00     8.00     0.00     8.00     0.00    0.80    0.80    0.00   0.80   0.16
nvme3n1           0.00     0.00    2.00    0.00     8.00     0.00     8.00     0.00    0.80    0.80    0.00   0.80   0.16
nvme1n1           0.00     0.00    2.40 6287.00     8.00 797297.60   253.54     9.29    1.47    1.67    1.47   0.16  99.36
nvme0n1           0.00     0.00    2.00    0.00     8.00     0.00     8.00     0.00    0.80    0.80    0.00   0.80   0.16
sda               0.00     1.40    8.40    1.00     9.80    80.00    19.11     0.00    0.43    0.38    0.80   0.43   0.40
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    2.40     0.00    80.00    66.67     0.00    0.67    0.00    0.67   0.33   0.08
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 
It's simple: a ZFS pool keeps metadata in a few different places.

So when you do a write, ZFS does:
1. Sync write (ZIL)
2. Sync metadata update
3. Data write
4. Full metadata update

That's why fio (the application's write speed) and iostat (what actually hits the disk) show different numbers.
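
If you want to watch that amplification happening, a sketch (using the pool name from above):
Code:
# per-vdev view of what ZFS actually pushes to the disks, refreshed every second
zpool iostat -v nvmepool 1
# compare the write bandwidth shown here with what fio reports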
 
Look at the numbers above.
When fio writes 1 MB to the ZFS volume, ZFS writes 26 MB to the disk.
Why?
 
# zfs set sync=disabled pool
and do the test again
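
And remember to put it back afterwards; roughly, assuming the dataset from above:
Code:
# for the test only - this trades data safety for speed
zfs set sync=disabled nvmepool/VMs
# ... rerun the fio job ...
# then restore the default behaviour
zfs set sync=standard nvmepool/VMs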
That is an effective way to test the ARC in memory. What exactly is the point?

Anyway, just for comparison, on a Dell 720xd, 3 mirrors striped (~raid10 with 6 disks), using Samsung 850 Pro partitions in a mirror as SLOG, PVE 4.4, zfs 0.6.5.9:
Code:
# zfs create rpool/temp1
# pveperf /rpool/temp1/
CPU BOGOMIPS:      115075.20
REGEX/SECOND:      2373800
HD SIZE:           3706.06 GB (rpool/temp1)
FSYNCS/SECOND:     5666.95
DNS EXT:           47.00 ms
DNS INT:           0.68 ms (xxx.local)

Spinning disks are 6x TOSHIBA MG03SCA200 (2TB 7200 NLSAS).
 
Last edited:
That is an effective way to test the ARC in memory. What exactly is the point?

If you don't have an external LOG device (ZIL), all sync writes in the pool are written twice, plus metadata changes. And it is not related to the ARC (read cache).
 
If you don't have an external LOG device (ZIL), all sync writes in the pool are written twice, plus metadata changes. And it is not related to the ARC (read cache).
The ZIL is not an external log device, but you can put it on a separate disk, hence its usual name (Separate intent LOG, SLOG). You're right on the other account, my mistake (the ARC is for reads only). However, what you are testing is still mostly RAM write speed, since ZFS will not block sync write calls waiting for the data to be committed to stable storage while sync=disabled is in effect. So in this scenario we can't get any closer to solving the OP's problem. I think metadata is written in the same TXG as the associated data, but I'm not sure about that.
 
Another comparison:
SuperMicro MB, 2 mirrors striped (4x WD Red 1TB, RAID 10), with an Intel DC S3510 as SLOG, on OmniOSce omnios-r151022
Code:
$ sudo zfs create vMotion/test
$ sudo pveperf /vMotion/test/
CPU BOGOMIPS:      Not available
REGEX/SECOND:      1148085
HD SIZE:           510.56 GB (vMotion)
FSYNCS/SECOND:     4935.08
DNS EXT:           28.29 ms
 
Another comparison:
SuperMicro MB
What model is this mobo? I'll build a small system soon using an X10SRL-F and WD RE/Gold 1TB disks and a pair of the older Intel DC S3500s as SLOG. I'll report some performance data here if I don't forget...
 
