ZFS performance regression with Proxmox

mailinglists

Well-Known Member
Mar 14, 2012
613
62
48
It could be a few things.
Here are FIO benchmarks against a 12 disk SSD backplane in ZFS RAID 10. 70%/30% read/write split with 16 concurrent jobs holding an IO depth of 16 ops each.
Code:
fio --filename=test --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=posixaio --bsrange=4k-128k --rwmixread=70 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=test --size=8G

Results:
Code:
Run status group 0 (all jobs):
   READ: io=61893MB, aggrb=1031.4MB/s, minb=1031.4MB/s, maxb=1031.4MB/s, mint=60012msec, maxt=60012msec
  WRITE: io=26485MB, aggrb=451912KB/s, minb=451912KB/s, maxb=451912KB/s, mint=60012msec, maxt=60012msec

Summary: 1GB/s read, 452 MB/s write under extreme random IO load.
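
As a rough cross-check of that summary, the average throughput can be recomputed from the `io` and `maxt` fields printed above. This is only a sketch: the helper name `avg_mb_per_s` is made up for illustration, and it takes fio's MB and msec values exactly as printed.

```shell
# Hypothetical helper: average throughput from fio's "io" total (MB)
# and run time (msec), both taken as printed in the summary lines.
avg_mb_per_s() {
    awk -v io_mb="$1" -v msec="$2" 'BEGIN { printf "%.1f\n", io_mb * 1000 / msec }'
}

avg_mb_per_s 61893 60012   # READ line
avg_mb_per_s 26485 60012   # WRITE line
```

The read value lands close to fio's own aggrb figure. For writes this gives roughly 441 MB/s in fio's units, so the 452 figure appears to come from dividing 451912KB/s by 1000 rather than 1024; the order of magnitude is the same either way.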

Well, that does not seem all that fast to me for 12 SSDs, but it depends on the drives you are using and the IO delay you are getting. Please share those details with us so we can put the results into context.

Here is the same test as yours, run from inside a VM on a host with 10 x 7200 RPM disks plus a SLOG on an Intel DC S3500. I added a clean disk and created ext4 on it. The results do not look good to me compared to software MDADM RAID with LVM on top. IO wait is especially high compared to MDADM (going from memory).

Code:
Run status group 0 (all jobs):
   READ: bw=319MiB/s (335MB/s), 319MiB/s-319MiB/s (335MB/s-335MB/s), io=18.8GiB (20.1GB), run=60108-60108msec
  WRITE: bw=137MiB/s (144MB/s), 137MiB/s-137MiB/s (144MB/s-144MB/s), io=8229MiB (8628MB), run=60108-60108msec

Also, for some more real-world testing with sync / flush, here are the results after adding direct=1 to fio:
Code:
   READ: bw=225MiB/s (236MB/s), 225MiB/s-225MiB/s (236MB/s-236MB/s), io=13.2GiB (14.2GB), run=60080-60080msec
  WRITE: bw=96.5MiB/s (101MB/s), 96.5MiB/s-96.5MiB/s (101MB/s-101MB/s), io=5796MiB (6077MB), run=60080-60080msec

What do you get when you add direct=1?
 

mailinglists

I pulled two HDDs at random, and it seems all of the drives are ST31000524NS.
It seems to me that they have native 512-byte sectors, not emulated ones (512n). What do you think?
https://www.seagate.com/staticfiles...nterprise/Constellation 3_5 in/100516232d.pdf

The Intel SSDs used for SLOG and L2ARC have 4k sectors.

The default pool ashift is 12 (or 4k).
I guess I could reduce ashift to 10 or 9 and test again, if that is what you are hinting at, but I am worried that the drives that later replace these 1 TB disks would then be too slow. The 1 TB disks were added so the servers could go into production cheaply and then get upgraded once the customer needs more storage and is willing to pay for it.
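
For reference, ashift is the base-2 logarithm of the sector size ZFS assumes for a vdev, so the values discussed here map to byte sizes as in this small sketch (the function name `ashift_to_bytes` is made up for illustration):

```shell
# ashift is an exponent: sector size in bytes = 2^ashift.
ashift_to_bytes() {
    echo $((1 << $1))
}

for a in 9 10 12; do
    echo "ashift=$a -> $(ashift_to_bytes "$a") bytes"
done
```

So ashift 9 corresponds to 512-byte sectors and ashift 12 to 4k sectors.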
 

guletz

Famous Member
Apr 19, 2017
1,556
245
83
Brasov, Romania
So your HDDs use 512-byte sectors, but your pool uses 4k (ashift 12). That is not good. It also matters how the pool used for your tests is set up, and what block size the VM uses. If I remember correctly, your HDD model (Constellation) uses 512-byte emulation on 4k hardware sectors, so for each 512-byte block it will write 4k (sorry if my memory is not reliable, I am away from my keyboard).
If I am right and your HDDs are not 4k native, then your problem is not ZFS, just unsuitable HDDs. I have also had the chance to use such an HDD (Seagate Constellation), and it was very, very slow with ZFS. I replaced that disk with an HGST HDD (NAS class) and all was good (I was using a ZFS mirror, and one HDD had failed).

Tomorrow I will look into your HDDs in more detail and post my opinion (sorry for the delay).
 

mailinglists

@guletz,
there is no hurry, and thanks for caring. Please do check the default block size.

The disks came from IBM for IBM servers a few years back and are supposed to be enterprise grade.
 

mailinglists

Output from sgdisk:
Code:
GPT fdisk (gdisk) version 1.0.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdd: 1953525168 sectors, 931.5 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 0282C0E5-F6B1-6A43-B428-7DF23847F5EF
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 1953525134
Partitions will be aligned on 2048-sector boundaries
Total free space is 3437 sectors (1.7 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      1953507327   931.5 GiB   BF01  zfs-55e8af795562e7ec
   9      1953507328      1953523711   8.0 MiB     BF07

again with smartctl:
Code:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-11-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST31000524NS         45W8867 59Y1812IBM
Serial Number:    9WK258AK
LU WWN Device Id: 5 000c50 02ca4ece1
Firmware Version: BB28
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Feb 20 10:24:30 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


Please bear in mind that a disk can lie about its sector size, but the specs in the PDF also state that these drives have 512-byte sectors.
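
The 512n / 512e / 4Kn distinction used in this thread follows directly from the logical and physical sector sizes, as in the smartctl line above. A minimal sketch, assuming a Linux host where the kernel exposes both values in sysfs (the function name `classify_sector` is made up for illustration, and the kernel only repeats what the drive itself reports):

```shell
# Classify a drive from its logical and physical sector sizes (bytes).
classify_sector() {
    logical=$1
    physical=$2
    if [ "$logical" -eq 512 ] && [ "$physical" -eq 512 ]; then
        echo "512n"
    elif [ "$logical" -eq 512 ] && [ "$physical" -eq 4096 ]; then
        echo "512e"
    elif [ "$logical" -eq 4096 ] && [ "$physical" -eq 4096 ]; then
        echo "4Kn"
    else
        echo "unknown"
    fi
}

# Print the classification for each sd* disk; skipped if none are present.
for q in /sys/block/sd*/queue; do
    [ -r "$q/logical_block_size" ] || continue
    dev=${q%/queue}
    echo "${dev##*/}: $(classify_sector "$(cat "$q/logical_block_size")" "$(cat "$q/physical_block_size")")"
done
```

With the values from the smartctl output above (512 logical, 512 physical), the drive would be classified as 512n.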
 

guletz

Yes, it seems that your HDDs are 512 native. I think it is very unlikely that an enterprise HDD would lie about this.

So if you want optimal performance you will need to use ashift=9.
 

mailinglists

Will recreate zpool with ashift 9, and test again with volblocksize 8k.

But I wonder how much performance penalty I will have if I replace these 1 TB disks with newer, bigger ones that have native 4k sectors with 512 emulation.
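
For a test matrix over several volblocksize values, the zvols could be created along these lines. This is a dry-run sketch that only prints the commands; the pool name `tank`, the `test-*` volume names and the 32G size are placeholders, not values from this thread.

```shell
# Dry run: print (do not execute) one "zfs create" per volblocksize.
# Pool name, volume names and the 32G size are placeholders.
print_zvol_cmds() {
    pool=$1
    shift
    for vbs in "$@"; do
        echo "zfs create -V 32G -o volblocksize=$vbs $pool/test-$vbs"
    done
}

print_zvol_cmds tank 8k 4k 512
```

Since volblocksize is fixed when a zvol is created, each value needs its own volume.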
 

guletz

Avoid any 512 emulation, and use/buy only 4K native. When the time comes to replace the disks, create a new pool on the new disks and then copy the data from the old pool to the new one. This is the best you can do.

mailinglists said:
But I wonder how much performance penalty I will have

It will be a visible and significant degradation of performance. I have been in this situation, and after I saw how bad it was, I started over with a new pool with ashift 12, as I described above!
 

mailinglists

I created a new pool with ashift 9 and did lots of testing. Performance is even worse than with ashift 12.
Here are some of the results.

10 x 1 TB 7200 2x slog 8g s3500, cache 12gb Ashift 9, volblocksize=8k
Code:
3145728000 bytes (3.1 GB) copied, 35.2439 s, 89.3 MB/s
  WRITE: bw=76.5MiB/s (80.2MB/s), 76.5MiB/s-76.5MiB/s (80.2MB/s-80.2MB/s), io=3000MiB (3146MB), run=39206-39206msec
  WRITE: bw=53.5MiB/s (56.1MB/s), 53.5MiB/s-53.5MiB/s (56.1MB/s-56.1MB/s), io=3000MiB (3146MB), run=56096-56096msec
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
   READ: bw=53.1MiB/s (55.7MB/s), 53.1MiB/s-53.1MiB/s (55.7MB/s-55.7MB/s), io=6141MiB (6440MB), run=115681-115681msec
  WRITE: bw=17.7MiB/s (18.6MB/s), 17.7MiB/s-17.7MiB/s (18.6MB/s-18.6MB/s), io=2051MiB (2150MB), run=115681-115681msec
fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=1000M --readwrite=randrw --rwmixread=75
   READ: bw=44.3MiB/s (46.5MB/s), 44.3MiB/s-44.3MiB/s (46.5MB/s-46.5MB/s), io=750MiB (786MB), run=16913-16913msec
  WRITE: bw=14.8MiB/s (15.5MB/s), 14.8MiB/s-14.8MiB/s (15.5MB/s-15.5MB/s), io=250MiB (263MB), run=16913-16913msec
fio --filename=test --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=posixaio --bsrange=4k-128k --rwmixread=70 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=test --direct=1 --size=8G
   READ: bw=303MiB/s (318MB/s), 303MiB/s-303MiB/s (318MB/s-318MB/s), io=17.8GiB (19.1GB), run=60036-60036msec
  WRITE: bw=130MiB/s (136MB/s), 130MiB/s-130MiB/s (136MB/s-136MB/s), io=7794MiB (8172MB), run=60036-60036msec

10 x 1 TB 7200 2x slog 8g s3500, cache 12gb Ashift 9, volblocksize=4k
Code:
3145728000 bytes (3.1 GB) copied, 35.9771 s, 87.4 MB/s
  WRITE: bw=89.0MiB/s (94.4MB/s), 89.0MiB/s-89.0MiB/s (94.4MB/s-94.4MB/s), io=3000MiB (3146MB), run=33337-33337msec
  WRITE: bw=66.6MiB/s (69.8MB/s), 66.6MiB/s-66.6MiB/s (69.8MB/s-69.8MB/s), io=3000MiB (3146MB), run=45055-45055msec
fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=1000M --readwrite=randrw --rwmixread=75
   READ: bw=47.3MiB/s (49.6MB/s), 47.3MiB/s-47.3MiB/s (49.6MB/s-49.6MB/s), io=750MiB (786MB), run=15832-15832msec
  WRITE: bw=15.8MiB/s (16.6MB/s), 15.8MiB/s-15.8MiB/s (16.6MB/s-16.6MB/s), io=250MiB (263MB), run=15832-15832msec
fio --filename=test --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=posixaio --bsrange=4k-128k --rwmixread=70 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=test --direct=1 --size=8G
   READ: bw=343MiB/s (360MB/s), 343MiB/s-343MiB/s (360MB/s-360MB/s), io=20.1GiB (21.6GB), run=60108-60108msec
  WRITE: bw=147MiB/s (154MB/s), 147MiB/s-147MiB/s (154MB/s-154MB/s), io=8825MiB (9254MB), run=60108-60108msec

10 x 1 TB 7200 2x slog 8g s3500, cache 12gb Ashift 9, volblocksize=512b
Code:
3145728000 bytes (3.1 GB) copied, 61.3125 s, 51.3 MB/s
  WRITE: bw=49.9MiB/s (52.3MB/s), 49.9MiB/s-49.9MiB/s (52.3MB/s-52.3MB/s), io=3000MiB (3146MB), run=60097-60097msec
  WRITE: bw=37.1MiB/s (38.9MB/s), 37.1MiB/s-37.1MiB/s (38.9MB/s-38.9MB/s), io=3000MiB (3146MB), run=80946-80946msec
fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=1000M --readwrite=randrw --rwmixread=75
   READ: bw=1081KiB/s (1107kB/s), 1081KiB/s-1081KiB/s (1107kB/s-1107kB/s), io=364MiB (382MB), run=345135-345135msec
  WRITE: bw=360KiB/s (369kB/s), 360KiB/s-360KiB/s (369kB/s-369kB/s), io=121MiB (127MB), run=345135-345135msec ... too slow to wait
fio --filename=test --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=posixaio --bsrange=4k-128k --rwmixread=70 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=test --direct=1 --size=8G
   READ: bw=21.9MiB/s (22.9MB/s), 21.9MiB/s-21.9MiB/s (22.9MB/s-22.9MB/s), io=1324MiB (1389MB), run=60515-60515msec
  WRITE: bw=9497KiB/s (9725kB/s), 9497KiB/s-9497KiB/s (9725kB/s-9725kB/s), io=561MiB (588MB), run=60515-60515msec


It seems that the bigger ashift and volblocksize are, the better the test results in my case.
The sweet spot is at ashift 12 with volblocksize 8k, the default values. Increasing them further just wastes space and yields little extra performance.

I am done with testing. No matter what ashift or volblocksizes are, IO WAIT with ZFS is much higher (and speeds lower) than with MDADM and LVM on the same hardware.

I will pay the price for replication and differential backups and use SSD disks, where they are needed.


I know we veered off course from the original post, where he was benchmarking Debian vs. Proxmox, so I might as well install Debian and run some tests on it to compare with what I get on Proxmox.
 

mailinglists

I just destroyed the test machines, but the pool was created both manually and via the GUI, always as RAID 10:
striping over 5 mirrors of HDDs. Basically 5 mirror vdevs plus SLOG and L2ARC in striping mode.

I will reinstall them for production soon with the same layout, except that root will be on the HDDs and there will be better SSDs for log and cache.
I can paste the config then and test again, but there will be nothing different or surprising. :)
 

LnxBil

Famous Member
Feb 21, 2015
6,273
773
163
Saarland, Germany
We operate ZFS on Linux systems that sustain 1 million IOPS and have observed real world, sustained throughput of 17GB/s (bytes not bits) on one of our Oracle databases.

Impressive numbers. Have you had any problems with Oracle on a support case? ZFS is officially not supported on non-Solaris platforms.
 

denos

Active Member
Jul 27, 2015
82
39
38
LnxBil said:
Impressive numbers. Have you had any problems with Oracle on a support case? ZFS is officially not supported on non-Solaris platforms.

Although we have Oracle support, we know we're running an unsupported configuration so we focus on MetaLink articles. The biggest gotcha so far has been that even with all IO set to ASYNCH, Oracle Automatic Diagnostic Repository (ADR) still does Direct IO (which ZFS doesn't support). The resolution is unfortunately a "dirty hack" involving an interposer to remap the ADR Direct IO calls to Async.

It's definitely not for the faint of heart, but we get filesystem compression (typically 4x), great IO and crash-consistent ZFS snapshots sent to our DR hourly. Clones are also a dream.
 

LnxBil

denos said:
Although we have Oracle support, we know we're running an unsupported configuration so we focus on MetaLink articles. The biggest gotcha so far has been that even with all IO set to ASYNCH, Oracle Automatic Diagnostic Repository (ADR) still does Direct IO (which ZFS doesn't support). The resolution is unfortunately a "dirty hack" involving an interposer to remap the ADR Direct IO calls to Async.

Yes, I'm familiar with the hacks. I also tried them and spent a whole day stracing and writing interposers for Oracle binaries :-D

denos said:
It's definitely not for the faint of heart, but we get filesystem compression (typically 4x), great IO and crash-consistent ZFS snapshots sent to our DR hourly. Clones are also a dream.

Yes, that's really a dream. Have you tried running an RAC on ZFS yet?
 

mailinglists

how is your pool:

mirror0 ? (zpool status -v)

It looked like this, but the dev names were different and I have added log and cache. So nothing unusual here.
Code:
        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdb3    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
          mirror-4  ONLINE       0     0     0
            sdi     ONLINE       0     0     0
            sdj     ONLINE       0     0     0
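
A striped-mirror layout like the one in that status output could be built with a command along these lines. This is a dry-run sketch that only prints the command; the helper name `print_zpool_create` and the pool name `tank` are made up, and the disk names are taken from the listing above purely as examples.

```shell
# Dry run: build (and only print) a zpool create command that stripes
# across two-way mirrors. Arguments: ashift, pool name, then disks in
# pairs. A leftover unpaired disk is ignored.
print_zpool_create() {
    ashift=$1
    pool=$2
    shift 2
    cmd="zpool create -o ashift=$ashift $pool"
    while [ "$#" -ge 2 ]; do
        cmd="$cmd mirror $1 $2"
        shift 2
    done
    echo "$cmd"
}

print_zpool_create 12 tank sdc sdd sde sdf
```

Log and cache devices would then be attached separately with `zpool add tank log ...` and `zpool add tank cache ...`.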
 
