ZFS performance regression with Proxmox

gamebrigada

New Member
Feb 28, 2018
Hello all,
I've been struggling to get good performance out of ZFS in Proxmox. I've noticed a huge deficit in random read and write performance.

Drive Layout:
8 Samsung 970 PRO 1TB drives in a stripe of mirrors, basically RAID 10.

Pool optimizations:
sync=disabled
ashift=13
recordsize=8K

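For reference, a pool like that would typically be created along these lines; this is only a sketch, and the device names, pool name (NVME, matching the fio directory below) and dataset layout are assumptions, not the exact commands used:

Code:
# striped mirrors (RAID 10) across 8 NVMe drives -- device names are placeholders
zpool create -o ashift=13 NVME \
  mirror /dev/nvme0n1 /dev/nvme1n1 \
  mirror /dev/nvme2n1 /dev/nvme3n1 \
  mirror /dev/nvme4n1 /dev/nvme5n1 \
  mirror /dev/nvme6n1 /dev/nvme7n1
zfs set recordsize=8K NVME
zfs set sync=disabled NVME
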
Testing methodology:
FIO

[random-read-write]
randrepeat=1
ioengine=libaio
gtod_reduce=1
name=test
filename=test
bs=4k
iodepth=64
size=4G
readwrite=randrw
rwmixread=75
directory=/NVME/test

[sequential-write]
rw=write
size=4G
directory=/NVME/test
ioengine=libaio
bs=8k
numjobs=8

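The job files above can be run with fio directly; a minimal sketch, assuming they are saved as randrw.fio and seq-write.fio:

Code:
# run the two job files against the dataset mounted at /NVME
fio randrw.fio
fio seq-write.fio
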
Results on a clean Debian Stretch install

Random Read Write

READ: io=3070.4MB, aggrb=318378KB/s, minb=318378KB/s, maxb=318378KB/s, mint=9875msec, maxt=9875msec
WRITE: io=1025.8MB, aggrb=106361KB/s, minb=106361KB/s, maxb=106361KB/s, mint=9875msec, maxt=9875msec

Results on a clean Proxmox install

Random Read Write

READ: io=3070.4MB, aggrb=63124KB/s, minb=63124KB/s, maxb=63124KB/s, mint=49806msec, maxt=49806msec
WRITE: io=1025.8MB, aggrb=21088KB/s, minb=21088KB/s, maxb=21088KB/s, mint=49806msec, maxt=49806msec


For reference, sequential write is unaffected.

Sequential Write under Debian

WRITE: io=32768MB, aggrb=2653.7MB/s, minb=339592KB/s, maxb=351428KB/s, mint=11935msec, maxt=12351msec

Sequential Write under Proxmox

WRITE: io=32768MB, aggrb=2566.3MB/s, minb=328475KB/s, maxb=340474KB/s, mint=12319msec, maxt=12769msec

Installing Proxmox on top of Debian did not help either; performance still dropped.

Anything I can do to increase performance?
 
8 Samsung 970 PRO

You cannot get great performance from cheap consumer SSDs (in this case, "PRO" is misleading, as this is still a consumer/workstation SSD and not designed for server/datacenter usage).
 
Tom, where would the difference in the benchmarks come from?
In both cases it is the _same_ hardware.
 
Have you recreated the pool for each test or used the same pool over and over?

Read performance tests on ZFS are misleading if there is no underlying data on disk (e.g. if the file consists of zeros). This can produce false read numbers, because you never actually read from disk. This is true for every thin-provisioned storage.

I have also run your test on PVE and on Debian Stretch ZFS and get similar, but reversed, results. I booted off a Debian Stretch live CD, which includes ZFS from stretch-backports (0.7.12-1), in order to import the PVE pool.

I created a zvol of size 4G, filled it from /dev/urandom and tested on it. Same test, only a different kernel and ZFS version; in my case the Debian kernel performs worse:

Proxmox VE:

Code:
   READ: io=786792KB, aggrb=54045KB/s, minb=54045KB/s, maxb=54045KB/s, mint=14558msec, maxt=14558msec
  WRITE: io=261784KB, aggrb=17982KB/s, minb=17982KB/s, maxb=17982KB/s, mint=14558msec, maxt=14558msec

Debian:

Code:
   READ: io=786792KB, aggrb=7222KB/s, minb=7222KB/s, maxb=7222KB/s, mint=108929msec, maxt=108929msec
  WRITE: io=261784KB, aggrb=2403KB/s, minb=2403KB/s, maxb=2403KB/s, mint=108929msec, maxt=108929msec
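
For clarity, the preparation described above (create a 4G zvol, fill it with random data, run the same fio job against it) would look roughly like this; pool and zvol names are placeholders, not my exact commands:

Code:
# placeholder pool/zvol names -- create a 4G zvol and fill it with random data
zfs create -V 4G rpool/fio-test
dd if=/dev/urandom of=/dev/zvol/rpool/fio-test bs=1M count=4096
# then point the fio job at the zvol device instead of a file
fio --name=test --filename=/dev/zvol/rpool/fio-test --ioengine=libaio --bs=4k \
    --iodepth=64 --rw=randrw --rwmixread=75 --randrepeat=1 --gtod_reduce=1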
 
LnxBil,
That's weird. That's got to be a problem with the live CD; it must be performance-limited somehow, because those numbers are terrible.

I'm starting to think this is a much deeper problem. ZFS is such a mess.

Tom,
For the record, the same configuration in a Solaris zpool gives MUCH higher performance. I had to bump the random-read-write test to 8 jobs, and it pushed 1 GB/s read and 300 MB/s write. Bumping the sequential write test to 512k blocks to match the recordsize (which I set to 512k because it didn't change the random-read-write performance) pushed 8 GB/s in writes.

I am NOWHERE near the limit of these drives in Proxmox. I think I'm hitting some kind of performance limitation that Oracle fixed after the fork.
 
Yes, ZFS was created on Solaris and is therefore at its best on Solaris. We still have the Solaris Porting Layer (SPL) that translates or proxies the ZFS internals into the Linux world, and soon the FreeBSD world.

Did you run your read test on real data? Keep in mind that ZFS does not write zeros to disk, so filling a file with zeros does not actually write anything, and if you then read from that file you will not test your actual throughput, only the ZFS software itself. I have no idea what fio writes into its files, but if it is a compressible pattern, your numbers will be too high. Best is to either use actual data or data read from /dev/urandom.
 
Fio writes incompressible random data. Compression was also disabled in ZFS. I specifically chose FIO because dd would give inconsistent numbers.
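
One way to double-check that is to look at the compression properties and the on-disk size of the test file; a sketch, assuming the pool/dataset is called NVME as above:

Code:
# confirm compression is off and see whether the test file actually occupies space on disk
zfs get compression,compressratio NVME
du -h /NVME/test/test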
 
:-(
Yeah, ZFS is horribly slow for us too, but it gives us differential (network) backups, replication, and officially supported software RAID, among other nice things, and these make up for its slowness.

Anecdote: I used to think vendors like Intel (for the DC SSD series) lie about their IOPS and throughput, because at first I had only used SSDs with ZFS. After using SSDs elsewhere, without ZFS, I noticed it's just ZFS that is slow.

Currently ZFS is so slow for us that we HAVE to use SSDs where we previously used HDDs.
I just bought a 12-bay Supermicro server, which I will populate with 10 HDDs and 2 SSDs, to check if this makes it fast enough.
However, that is another story in itself, because Proxmox won't boot if I install to all the disks on this LSI controller.
 
Ashift=13
Recordsize=8K

If you choose a bad value like 8K for a RAID 10 (4 × mirrors of 2 SSDs each), then results like these are what you get. Any block read or write in the ZFS pool will be recordsize = 8K. This 8K block gets split across the 4 mirrors => 8K / 4 = 2K per mirror => 2K per SSD.

But the minimum block on the SSD is 8K (ashift=13), so it ends up writing 2K of data + 6K of padding = 8K. The same idea applies to reads.

You also have to take into account the internal block size of the SSD. It could be 4K or more (on some Intel DC models it is 8K).

So if you set up a good recordsize for your pool, you can get decent results ;)
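
To verify what a given pool is actually using, the relevant values can be inspected like this (a sketch; pool and zvol names are placeholders):

Code:
# ashift is a per-vdev property, visible in the cached pool config
zdb -C NVME | grep ashift
# recordsize applies to datasets, volblocksize to zvols
zfs get recordsize NVME
zfs get volblocksize NVME/vm-100-disk-0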
 
Just keep in mind that the value of recordsize is the maximum that ZFS cares about. In other words, the recordsize is dynamic up to this maximum value; if your workload uses less, that is fine. If you are also using compression, which is recommended in general, you may get completely different results.
In short, don't treat it as a static setting. Try higher values, also use compression, and test with your specific workload.
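
As a rough illustration of that advice (the values are only starting points to experiment with, not recommendations, and the dataset name is a placeholder):

Code:
# try lz4 compression and a larger recordsize, then re-run the fio jobs
zfs set compression=lz4 NVME
zfs set recordsize=64K NVME
# note: both only affect newly written blocks, so recreate the test files first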

Cheers Knuuut
 
So when using ZFS RAID 10 with 10 (HDD) disks, which means 5 mirrors, each write is split across 5, so one could get better performance if recordsize / 5 matched the block size of the HDDs?
 
Yes, for a dataset the recordsize is variable. But for a zvol, the volblocksize is not.
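
For example, the volblocksize has to be chosen when the zvol is created and cannot be changed afterwards; a sketch with placeholder names and sizes:

Code:
# volblocksize is fixed at creation time
zfs create -V 32G -o volblocksize=16k rpool/data/vm-101-disk-0
zfs get volblocksize rpool/data/vm-101-disk-0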
 
I did a few tests.

Conclusion:
Increasing ashift or volblocksize does pretty much nothing for performance in our case of hosted VMs; it just eats a lot more disk space. I observed high I/O wait when I put some VMs on the pool, and the tests below reflect the low numbers.
ZFS is really slow, even with 10 HDDs in 5 mirror vdevs, with or without SSD cache, compared to software MDADM RAID + LVM on worse hardware.

The tests pasted here were run inside a single VM, running alone on a PVE host with the LSI controller in IT mode.

The tests I am sharing are listed below. I used 4k blocks for random read/write because almost all of our VMs' filesystems default to 4k.
Code:
 dd if=/dev/zero of=brisi bs=10M count=300 oflag=dsync
 fio --filename=brisi --sync=1 --rw=write --bs=10M --numjobs=1 --iodepth=1 --size=3000MB --name=test
 fio --filename=brisi --sync=1 --rw=write --bs=512k --numjobs=1 --iodepth=1 --size=3000MB --name=test
 fio --randrepeat=1 --ioengine=libaio --direct=1/0 --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75

10 x 1 TB 7200 2x slog 8g s3500, cache 12gb volblocksize=8k
Code:
3145728000 bytes (3.1 GB) copied, 18.7525 s, 168 MB/s
  WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=3000MiB (3146MB), run=24976-24976msec 10M block
  WRITE: bw=91.6MiB/s (96.1MB/s), 91.6MiB/s-91.6MiB/s (96.1MB/s-96.1MB/s), io=3000MiB (3146MB), run=32738-32738msec 512k block
   fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
   READ: bw=145MiB/s (152MB/s), 145MiB/s-145MiB/s (152MB/s-152MB/s), io=6141MiB (6440MB), run=42265-42265msec
  WRITE: bw=48.5MiB/s (50.9MB/s), 48.5MiB/s-48.5MiB/s (50.9MB/s-50.9MB/s), io=2051MiB (2150MB), run=42265-42265msec
iodelay peak 14%

10 x 1 TB 7200 2x slog 8g s3500, cache 12gb Ashift 13, volblocksize=8k
Code:
3145728000 bytes (3.1 GB) copied, 27.547 s, 114 MB/s
  WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=3000MiB (3146MB), run=27989-27989msec 10M block
  WRITE: bw=74.0MiB/s (77.6MB/s), 74.0MiB/s-74.0MiB/s (77.6MB/s-77.6MB/s), io=3000MiB (3146MB), run=40527-40527msec 512k block
   fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
   READ: bw=141MiB/s (148MB/s), 141MiB/s-141MiB/s (148MB/s-148MB/s), io=6141MiB (6440MB), run=43457-43457msec
  WRITE: bw=47.2MiB/s (49.5MB/s), 47.2MiB/s-47.2MiB/s (49.5MB/s-49.5MB/s), io=2051MiB (2150MB), run=43457-43457msec
iodelay peak 17%

10 x 1 TB 7200 2x slog 8g s3500, cache 12gb volblocksize=16k
Code:
3145728000 bytes (3.1 GB) copied, 17.7266 s, 177 MB/s
  WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=3000MiB (3146MB), run=24942-24942msec 10M block
  WRITE: bw=104MiB/s (109MB/s), 104MiB/s-104MiB/s (109MB/s-109MB/s), io=3000MiB (3146MB), run=28861-28861msec 512k block
   fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
   READ: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=6141MiB (6440MB), run=50995-50995msec
  WRITE: bw=40.2MiB/s (42.2MB/s), 40.2MiB/s-40.2MiB/s (42.2MB/s-42.2MB/s), io=2051MiB (2150MB), run=50995-50995msec

10 x 1 TB 7200 2x slog 8g s3500, cache 12gb Ashift 13, volblocksize=16k
Code:
3145728000 bytes (3.1 GB) copied, 28.6073 s, 110 MB/s
  WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=3000MiB (3146MB), run=27291-27291msec 10M block
  WRITE: bw=76.6MiB/s (80.4MB/s), 76.6MiB/s-76.6MiB/s (80.4MB/s-80.4MB/s), io=3000MiB (3146MB), run=39146-39146msec 512k block
   fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=brisi --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
   READ: bw=124MiB/s (130MB/s), 124MiB/s-124MiB/s (130MB/s-130MB/s), io=6141MiB (6440MB), run=49668-49668msec
  WRITE: bw=41.3MiB/s (43.3MB/s), 41.3MiB/s-41.3MiB/s (43.3MB/s-43.3MB/s), io=2051MiB (2150MB), run=49668-49668msec
 
It could be a few things.
1) The original Debian install likely wasn't running ZFS 0.7.12 (the latest). After cutting over to the new PVE kernel, a ZFS upgrade process can run for an hour or more in the background, impacting performance. If you want to watch for it, hit 'K' (capital) in htop to show kernel threads. Also, depending on the original kernel, you may now be getting a significant performance loss from the Spectre / Meltdown / branch-prediction mitigations.

2) ZFS defaults may have changed as part of a ZFS upgrade. Capture and compare the differences in the files under /sys/module/zfs/parameters (see the sketch after this list).

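A rough way to capture both of the above on each system and compare them (the file names are arbitrary):

Code:
# loaded ZFS module version
cat /sys/module/zfs/version
# dump all module parameters into a file, one "path:value" per line
grep . /sys/module/zfs/parameters/* > zfs-params-$(hostname).txt
# after collecting one file per system, diff them
diff zfs-params-debian.txt zfs-params-pve.txt
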
For those commenting on performance: ZFS performance isn't great out of the box, but there are hundreds of parameters that can be adjusted to tune it for your use case. We operate ZFS on Linux systems that sustain 1 million IOPS and have observed real-world, sustained throughput of 17 GB/s (bytes, not bits) on one of our Oracle databases.

Here are fio benchmarks against a 12-disk SSD backplane in ZFS RAID 10: a 70%/30% read/write split with 16 concurrent jobs holding an IO depth of 16 ops each.
Code:
fio --filename=test --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=posixaio --bsrange=4k-128k --rwmixread=70 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=test --size=8G

Results:
Code:
Run status group 0 (all jobs):
   READ: io=61893MB, aggrb=1031.4MB/s, minb=1031.4MB/s, maxb=1031.4MB/s, mint=60012msec, maxt=60012msec
  WRITE: io=26485MB, aggrb=451912KB/s, minb=451912KB/s, maxb=451912KB/s, mint=60012msec, maxt=60012msec

Summary: 1GB/s read, 452 MB/s write under extreme random IO load.
 
