Optimizing ZFS backend performance

jinjer

Renowned Member
Oct 4, 2010
Hello,

I think I have a problem with ZFS performance: it is far below the figures I see reported on the forum and below what I would expect from the hardware I'm using. Unfortunately I cannot pinpoint the cause, so I hope someone here can spot what I'm missing.

The problem is the IOPS I get from a ZFS pool built from six 1 TB SATA disks plus two SSDs used for ZIL and cache. I'm nowhere near the ~600 IOPS expected from the disks alone, let alone the 2000-3000 expected with the ZIL in play. I actually get ~60 with the ZIL turned off and ~160 with the ZIL on.

The pool is configured as three mirrored vdevs of two disks each, plus two SSDs that are each partitioned into a slice for a mirrored ZIL and a slice used as cache (two cache partitions in total). Here is pveperf against a test dataset:
Code:
# pveperf /rpool/t/
CPU BOGOMIPS:      57529.56
REGEX/SECOND:      2119840
HD SIZE:           1537.23 GB (rpool/t)
FSYNCS/SECOND:     161.78

If I disable sync on the test dataset (zfs set sync=disabled rpool/t) I get an astonishing ~20,000 fsyncs per second, which tells me that the ZIL device is not doing its job at all.
Code:
# pveperf /rpool/t/
CPU BOGOMIPS:      57529.56
REGEX/SECOND:      2221426
HD SIZE:           1537.22 GB (rpool/t)
FSYNCS/SECOND:     20918.49
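
For reference, the exact sequence I used to compare the two cases looks roughly like this (rpool/t is just a scratch dataset):
Code:
# create a throwaway dataset for benchmarking, if it doesn't exist yet
zfs create rpool/t

# baseline: sync writes go through the ZIL/SLOG
zfs set sync=standard rpool/t
pveperf /rpool/t/

# comparison: sync writes acknowledged from RAM only (unsafe, test only!)
zfs set sync=disabled rpool/t
pveperf /rpool/t/

# restore the default behaviour afterwards
zfs set sync=standard rpool/t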


PVE: pve-manager/3.4-6/102d4547 (running kernel: 2.6.32-39-pve)
ZFS: Loaded module v0.6.4.1-1, ZFS pool version 5000, ZFS filesystem version 5
This is the ZFS pool configuration:
Code:
# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 4h54m with 0 errors on Thu Jan 28 19:29:30 2016
config:

        NAME                              STATE     READ WRITE CKSUM
        rpool                             ONLINE       0     0     0
          mirror-0                        ONLINE       0     0     0
            scsi-35000c5007a49788d-part2  ONLINE       0     0     0
            scsi-35000c5007a496f40-part2  ONLINE       0     0     0
          mirror-1                        ONLINE       0     0     0
            scsi-35000c5007a4ddce6        ONLINE       0     0     0
            scsi-35000c5007a497529        ONLINE       0     0     0
          mirror-2                        ONLINE       0     0     0
            scsi-35000c5007a4983e0        ONLINE       0     0     0
            scsi-35000c5007a4a292a        ONLINE       0     0     0
        logs
          mirror-3                        ONLINE       0     0     0
            scsi-3500a075110b740af-part1  ONLINE       0     0     0
            scsi-35e83a9703a5a01e8-part1  ONLINE       0     0     0
        cache
          scsi-3500a075110b740af-part2    ONLINE       0     0     0
          scsi-35e83a9703a5a01e8-part2    ONLINE       0     0     0

errors: No known data errors
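
For anyone wanting to reproduce the layout, a pool like this would be built roughly as follows (the device names are placeholders, not the actual by-id paths shown above):
Code:
# three mirrored data vdevs
zpool create rpool \
  mirror /dev/disk/by-id/DISK1 /dev/disk/by-id/DISK2 \
  mirror /dev/disk/by-id/DISK3 /dev/disk/by-id/DISK4 \
  mirror /dev/disk/by-id/DISK5 /dev/disk/by-id/DISK6

# mirrored SLOG on the first partition of each SSD
zpool add rpool log mirror /dev/disk/by-id/SSD1-part1 /dev/disk/by-id/SSD2-part1

# cache (L2ARC) on the second partition of each SSD; cache vdevs are never mirrored
zpool add rpool cache /dev/disk/by-id/SSD1-part2 /dev/disk/by-id/SSD2-part2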

And here's the zpool iostat output. As you can see, the cache devices seem largely unused too:
Code:
# zpool iostat -v
                                     capacity     operations    bandwidth
pool                              alloc   free   read  write   read  write
--------------------------------  -----  -----  -----  -----  -----  -----
rpool                             1.14T  1.58T     16     47  81.6K   194K
  mirror                           390G   538G      4     14  23.9K  47.6K
    scsi-35000c5007a49788d-part2      -      -      1      4  16.9K  83.2K
    scsi-35000c5007a496f40-part2      -      -      1      4  16.3K  83.2K
  mirror                           391G   537G      5     14  29.1K  46.0K
    scsi-35000c5007a4ddce6            -      -      2      5  16.2K  46.7K
    scsi-35000c5007a497529            -      -      2      5  16.0K  46.7K
  mirror                           390G   538G      5     16  28.6K  53.3K
    scsi-35000c5007a4983e0            -      -      2      5  16.2K  54.1K
    scsi-35000c5007a4a292a            -      -      2      5  15.4K  54.1K
logs                                  -      -      -      -      -      -
  mirror                          40.4M  7.90G      0      1      0  47.4K
    scsi-3500a075110b740af-part1      -      -      0      1     25  47.4K
    scsi-35e83a9703a5a01e8-part1      -      -      0      1     25  47.4K
cache                                 -      -      -      -      -      -
  scsi-3500a075110b740af-part2     462M   194G      0      1  2.52K  11.7K
  scsi-35e83a9703a5a01e8-part2     458M  91.3G      0      0  1.41K  12.0K
--------------------------------  -----  -----  -----  -----  -----  -----
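
A useful sanity check (not shown above) is to watch the per-vdev stats with a one-second interval while the benchmark runs, so you can see whether the log mirror is actually taking the sync writes:
Code:
# terminal 1: repeat the per-vdev stats every second
zpool iostat -v rpool 1

# terminal 2: generate the fsync load
pveperf /rpool/t/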

Any hints?
 
Well, I use(d) a Samsung 850 Pro (128 GB) + a Crucial MX200.

Now I have a Crucial MX200 (256 GB) + an OCZ Vertex 150 (Indilinx). Both SSDs can handle 400 MB/s+ writes and 60,000 write IOPS (benchmarked using ATTO).

The MX200 was suggested to me on IRC (I don't remember whether it was the #zfsonlinux or the ##proxmox channel, though) because of its supercap.
 
The MX200 is bad for synchronous writes (I think your ATTO benchmark uses buffered writes). It tops out at around 950 IOPS with 4k blocks.

Almost any consumer SSD will fall short for synchronous writes; you need an enterprise-grade SSD.
 
What would be your suggestion for a (consumer) or a cheap enterprise drive?

I'm not sure about your numbers, though.

The MX200 does about 5000 IOPS with 256 jobs and 950 with 8 jobs (fio test):

Code:
--direct=1 --sync=1 --rw=write --bs=4k --numjobs=8 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

The OCZ looks better.
I can get 10,000 IOPS from it with fio (--direct=1 --sync=1 --rw=write --bs=4k --numjobs=128 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test).

I get 950 with 8 jobs, so maybe this is the problem.

However, I only get 150-160 IOPS from the zpool as a whole, which is a lot less than the ~1,000 I could possibly get from the SSDs alone.
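
For completeness, the full fio command behind these numbers looks roughly like this; --filename and --size are placeholders I'm adding here, they weren't part of the option list above:
Code:
# 4k sync write test against a file on the SSD under test
# (--filename and --size are placeholders; only the other options come from the post)
fio --filename=/mnt/ssd-test/fio-test.tmp --size=1G \
    --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=8 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test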
 
Back to the real question:

Do your VMs use synchronous writes? If not, the SSD is not used and you end up with the raw performance of the SATA drives, which is not very good.
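
You can check what the datasets are currently set to with something like:
Code:
# show the sync property for every dataset and zvol in the pool
zfs get -r sync rpool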
 
I'm using writeback and writethrough cache modes, with both Windows and Linux guests.

I also have Samba shares serving home directories to Windows domain users.

I know the server is doing a lot of things; still, I'm trying to size it properly and understand why the disks are underperforming.

I'm waiting for a couple of Intel DC S3500 drives to be delivered for a comparison.
 
The I/O wait is not bad on average. However, the server is "empty", with just two VMs right now. I'm concerned about disk I/O once the full load is applied.

Here's the first server: 6x 2.5" SATA 7.2K 1 TB "enterprise SATA" drives.

iowait-1.png

Here's a second server with only 4x 3.5" SATA 7.2K drives (WD Black 1 TB).

iowait-2.png

And here's a third server with 8x 2.5" SATA 7.2K drives (HGST 1 TB 2.5") + an 8 GB ZIL on an OCZ Vertex 2. The OCZ has been benchmarked at 10K sync 4k IOPS with fio. This third server runs a lot of things.

iowait-3.png
 
I have an update using the Intel DC S3500 SSDs. They're both capable of 5,000 IOPS (fio sync write, 4k, 4 jobs); with 16 jobs they go up to 10,000 IOPS.

Now... pveperf still sucks:

Code:
CPU BOGOMIPS:      57529.56
REGEX/SECOND:      2205679
HD SIZE:           1528.62 GB (rpool/t)
FSYNCS/SECOND:     269.22
DNS EXT:           130.79 ms
DNS INT:           0.87 ms (wz)

It's an improvement over the 160 IOPS, but nowhere near the expected rate. I guess it was not an SSD issue after all.
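
One thought (an assumption on my part): if I understand pveperf correctly, its FSYNCS/SECOND loop issues one fsync at a time from a single thread, while the fio numbers above were taken with 4-16 parallel jobs, so a closer comparison would be a single-job fsync test along these lines:
Code:
# single-threaded write+fsync loop, closer to what pveperf measures
# (file name is a placeholder; --direct is omitted because this ZFS
#  version does not support O_DIRECT on datasets, as far as I know)
fio --filename=/rpool/t/fio-test.tmp --size=1G \
    --rw=write --bs=4k --fsync=1 \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=fsync-1job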

Any other ideas?
 
