ZFS poor performance

Hi everyone,

Here are some test results. Maybe someone can explain how to make ZFS faster.

Test server configuration:

  • CPU: 32 x Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz (2 sockets)
  • RAM: 128 GB
  • 2 x 1 TB enterprise SSD: root FS + cache + ZIL partitions (sda3 and sdb2) for ZFS
  • 2 x 2 TB hybrid SSHD
  • P710H Mini embedded RAID controller (2 GB cache)

Configuration: Debian 9 was installed via netinstall and Proxmox VE 5 on top of it - that allows me to partition the drives the way I want.
The hybrid drives were put into a HW RAID 1, as the embedded controller has no JBOD mode; two single-drive RAID 0 volumes combined in md or a ZFS mirror were much slower, so those results are not shown.
HDD configurations tested:
1. ZFS created directly on /dev/sdc (partition type "Solaris /usr & Apple ZFS")
2. ZFS created on /dev/sdc1 (primary partition created with gparted)
3. ext4
4. xfs

Both ZFS setups had cache and log on separate SSD partitions; sync=disabled, compression left at its default (it does not affect IO, only CPU load), dedup=off.

The fio tool is used for testing. Configuration file:
# cat vm-data.rand-read-write.ini
[readtest]
blocksize=4k
rw=randread
ioengine=libaio
iodepth=32

[writetest]
blocksize=4k
rw=randwrite
ioengine=libaio
iodepth=32

It was run as:
fio vm-data.rand-read-write.ini --size=3G --filename /<mount point>/test

A file size of 3 GB was chosen after experimenting: ZFS showed huge performance degradation once the test file was larger than 2 GB.

To be completely sure nothing would affect the results, a reboot was done after each run. Each test had 3 iterations.
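A lighter-weight alternative to a full reboot between runs (a sketch, assuming the pool is named VMDATA as shown later in this thread) would be to drop the Linux page cache and cycle the pool to empty the ARC:

sync && echo 3 > /proc/sys/vm/drop_caches    # flush and drop the page cache (relevant for the ext4/xfs runs)
zpool export VMDATA && zpool import VMDATA   # a re-imported pool starts with an empty ARC (relevant for the zfs runs)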

So, the results (values in KB/s):

HW RAID + ZFS(sdc) + cache + log + sync (no directio)
         Read     Write
Run 1:   331164   11433
Run 2:   447599   8350
Run 3:   316630   10258

HW RAID + ZFS(sdc1) + cache + log + sync (no directio)
         Read     Write
Run 1:   273232   13052
Run 2:   356658   14071
Run 3:   596572   14037

ext4 (mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/sdc1)
         Read     Write
Run 1:   438245   720835
Run 2:   458293   640547
Run 3:   397840   554997

XFS
         Read     Write
Run 1:   348      349
Run 2:   -        -
Run 3:   -        -
[attached charts: PM1.png, PM2.png]

Conclusions from the results:
1. ZFS from the default Proxmox installation is about 30% slower than ZFS created manually on a native Linux partition.
2. XFS is so slow that it can't be used in production - that's a result I can't explain; the 3 GB test took several hours to run, so I didn't put it on the chart.
3. ext4 is much faster on writes, which makes heavily loaded servers run much smoother than on ZFS, especially while backups run.
4. If you still want ZFS, build it manually rather than with the PVE installer (see the sketch below).
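For reference, a minimal sketch of such a manual build, assuming the same device layout as above (sdc1 for data, sdb2 for the log, sda3 for the cache) and ashift=12 for 4K sectors:

zpool create -o ashift=12 VMDATA /dev/sdc1   # pool directly on the GPT primary partition
zpool add VMDATA log /dev/sdb2               # separate ZIL/log device on SSD
zpool add VMDATA cache /dev/sda3             # L2ARC cache device on SSD
zfs set sync=disabled VMDATA                 # as used in the tests above (trades safety for write speed)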

So I want to ask for help from everyone who uses ZFS as storage: what drive/RAID configurations do you use to make ZFS run without problems? Especially now that PVE 5 is out and proposes ZFS...
 
First, thanks for using fio - it is the right tool for benchmarking disks.

Concerning ZFS vs ext4 performance: at a high level it depends on where you put the focus for your setup. ZFS is focused on data safety first, and checksumming and the other ZFS features do not come for free.
It is exactly the same with the VM cache modes: there is always a tradeoff between safety (nocache, writethrough) and performance (writeback).

I cannot comment on the ZFS numbers themselves because you would need to explain what your zpool setup is.
 
Here is what I think:

Let's say a single drive can do 100 Mbps of sequential writes. Usually a filesystem can use all of that speed, and the write rate stays more or less at that limit without interruptions. What does ZFS do with writes? It puts the data into its write cache (not the ARC) and flushes it later, so ZFS can accept more than 100 Mbps for a while. When the write cache is full, ZFS stops accepting write requests (holds them) until the cache has been flushed. So under heavy writes you can see the write speed wave up and down (from the application's point of view).
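On ZFS on Linux, the knobs behind that batching behaviour are the dirty-data and transaction-group module parameters - a sketch of how to inspect them (the value written below is only an example, not a recommendation):

cat /sys/module/zfs/parameters/zfs_dirty_data_max     # max amount of dirty (not yet flushed) write data, in bytes
cat /sys/module/zfs/parameters/zfs_txg_timeout        # seconds between forced transaction-group flushes (default 5)
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout  # example: allow up to 10 s of buffering between flushes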

My 2-disk mirror can do 1 Gb/s - at least the Proxmox GUI statistics show that.
[screenshot: 1gbps_mirror_zfs.png]
 
A few thoughts on this benchmark:

1) You have no units on your graphs/results - are you measuring kb/s, KB/s, or IOPS?
2) As far as I can see you do not use sync with the ext4 test? That does not seem right, because then the whole thing gets cached in RAM (which would explain why the performance is so much higher) - see the sketch after this list.
3) You use a HW controller for ZFS. This is not recommended: ZFS depends on disk cache flushing etc., so it likes to control the disks directly, which the HW RAID controller masks completely. This would also explain why you get such miserable performance on ZFS - your HW RAID controller has 2 GB of cache, and performance dwindles with sizes > 2 GB.
4) You use SSHDs, which also have an internal cache that is not really transparent to ZFS.
5) You mention enterprise SSDs and hybrid SSHDs but no concrete models, and there are "good" SSDs and "bad" SSDs for ZFS.
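For point 2, a sketch of how the write job could be made to bypass or flush the page cache so the filesystems are compared on equal terms (either option changes what is measured, so the same flags should be used for every filesystem; note the results above already say "no directio" for ZFS, since O_DIRECT may not work there):

[writetest]
blocksize=4k
rw=randwrite
ioengine=libaio
iodepth=32
direct=1     # open the file with O_DIRECT, bypassing the page cache
# alternative: fsync=1 would instead force an fsync after every write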
 
ZFS on top of HW RAID 1 might have a serious impact on the results. But what I see as the bigger problem are those hybrid SSHDs:

It might be difficult to achieve comparable results, as one can never be sure which part of the drive (SSD or HDD) is actually being used. Moreover, I'm not sure whether common utilities (i.e. Parted Magic, HDDErase, etc.) can revert the SSD part of those hybrid drives to a factory-default state.

I'll try to find some time to do similar tests with a pure 2x SSD setup...
 
I cannot comment on the ZFS numbers themselves because you would need to explain what your zpool setup is.
The zpool is set up as recommended at https://pve.proxmox.com/wiki/ZFS:_Tips_and_Tricks except that I've set sync to disabled.
Performance is in KB/s.
As I said, I can't use JBOD mode, and two drives set up as single-drive RAID 0 and then mirrored show even worse results.

So ZFS gets about ~13 MB of actual write data but zpool shows ~100 MB written, which is about right if you remember that ZFS makes 6 write operations for each write.
As for ext4: monitoring IO during the test showed that the writes went to the HDD directly, as the IO load stopped right after the test finished.
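For anyone wanting to reproduce that observation, a sketch of how to watch it during a run:

zpool iostat -v VMDATA 1   # per-vdev bandwidth and IOPS, refreshed every second while fio runs
iostat -xm 1               # what the underlying block devices actually see (from the sysstat package)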

The SSDs were used only for cache and log; disabling them caused a 30% speed degradation.
 
As I said, I can't use JBOD mode, and two drives set up as single-drive RAID 0 and then mirrored show even worse results.
Both options do not give ZFS proper access to the disks and will lead to suboptimal results;
only direct disk access is recommended for ZFS.

Also, I noticed you use a separate ZIL/log device, but if you disable sync writes for ZFS, the ZIL will never be used - why have it in the first place?

Also, with a 4k fio test you should really measure IOPS, not KB/s;
if you want to measure sequential write speed, you should do so with a larger blocksize.
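A sketch of what such a sequential job could look like, reusing the config style from above (job name and blocksize are arbitrary):

[seqwritetest]
blocksize=1M      # large blocks: measures sequential throughput rather than 4k random IOPS
rw=write
ioengine=libaio
iodepth=32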
 
Both options do not give ZFS proper access to the disks and will lead to suboptimal results;
only direct disk access is recommended for ZFS.

Also, I noticed you use a separate ZIL/log device, but if you disable sync writes for ZFS, the ZIL will never be used - why have it in the first place?

Just built zfs again:
# zpool status
  pool: VMDATA
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        VMDATA      ONLINE       0     0     0
          sdc       ONLINE       0     0     0
        logs
          sdb2      ONLINE       0     0     0
        cache
          sda3      ONLINE       0     0     0

errors: No known data errors
Best results I could get:
# fio vm-data.rand-read-write.ini --size=3G --filename /VMDATA/test
readtest: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
writetest: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.16
Starting 2 processes
readtest: Laying out IO file(s) (1 file(s) / 3072MB)
Jobs: 1 (f=1): [_(1),w(1)] [98.6% done] [0KB/161.4MB/0KB /s] [0/41.3K/0 iops] [eta 00m:03s]
readtest: (groupid=0, jobs=1): err= 0: pid=42022: Wed Sep 6 22:57:55 2017
read : io=3072.0MB, bw=509512KB/s, iops=127378, runt= 6174msec
slat (usec): min=2, max=4127, avg= 6.47, stdev=20.12
clat (usec): min=3, max=6280, avg=243.86, stdev=201.23
lat (usec): min=9, max=6329, avg=250.33, stdev=206.26
clat percentiles (usec):
| 1.00th=[ 141], 5.00th=[ 143], 10.00th=[ 145], 20.00th=[ 149],
| 30.00th=[ 151], 40.00th=[ 153], 50.00th=[ 155], 60.00th=[ 169],
| 70.00th=[ 189], 80.00th=[ 330], 90.00th=[ 454], 95.00th=[ 636],
| 99.00th=[ 1048], 99.50th=[ 1208], 99.90th=[ 1832], 99.95th=[ 1960],
| 99.99th=[ 3696]
lat (usec) : 4=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=74.47%
lat (usec) : 500=17.66%, 750=4.38%, 1000=2.25%
lat (msec) : 2=1.20%, 4=0.03%, 10=0.01%
cpu : usr=16.98%, sys=74.58%, ctx=3536, majf=0, minf=78
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=786432/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32
writetest: (groupid=0, jobs=1): err= 0: pid=42023: Wed Sep 6 22:57:55 2017
write: io=3072.0MB, bw=15108KB/s, iops=3777, runt=208215msec
slat (usec): min=8, max=111427, avg=259.34, stdev=1570.47
clat (usec): min=2, max=118110, avg=8209.76, stdev=9960.66
lat (usec): min=14, max=118320, avg=8469.09, stdev=10157.32
clat percentiles (usec):
| 1.00th=[ 454], 5.00th=[ 588], 10.00th=[ 780], 20.00th=[ 1128],
| 30.00th=[ 4192], 40.00th=[ 5728], 50.00th=[ 7200], 60.00th=[ 8096],
| 70.00th=[ 9536], 80.00th=[11200], 90.00th=[14144], 95.00th=[16512],
| 99.00th=[64256], 99.50th=[73216], 99.90th=[90624], 99.95th=[97792],
| 99.99th=[113152]
lat (usec) : 4=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%
lat (usec) : 500=3.60%, 750=6.00%, 1000=8.90%
lat (msec) : 2=6.15%, 4=4.96%, 10=43.38%, 20=23.97%, 50=1.30%
lat (msec) : 100=1.70%, 250=0.03%
cpu : usr=2.35%, sys=30.50%, ctx=561824, majf=0, minf=390
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=786432/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
READ: io=3072.0MB, aggrb=509512KB/s, minb=509512KB/s, maxb=509512KB/s, mint=6174msec, maxt=6174msec
WRITE: io=3072.0MB, aggrb=15108KB/s, minb=15108KB/s, maxb=15108KB/s, mint=208215msec, maxt=208215msec

The IO scheduler is noop by default.
 
I'm in a similar situation with a few Dell servers with the H710, where JBOD or directly attached disks are unavailable. The closest thing to that is a single-drive RAID 0 logical disk on the controller for each individual disk. Do not use arrays with multiple disks; let ZFS do all the RAID functions - two different RAID implementations will interfere and seriously degrade performance. A single-disk RAID 0 is not really RAID.

In my experience, using the adapter's write cache with a BBU and no read cache or prefetch (that's what the ARC is for in ZFS) in setups like that eliminates the usual iowait I otherwise see wherever I use ZFS with directly attached or JBOD disks, and I see no associated performance degradation. Naturally ZFS loses full control over the disks but can still use all its other features. Make sure you have a working BBU and use sync=standard. This won't hurt the consistency of IO flushes or transactions, and you get IOPS throughput similar to that of the adapter.
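For reference, switching the pool from this thread back to the default sync behaviour would be (pool name as above):

zfs set sync=standard VMDATA   # honour sync requests again; with a BBU-backed write cache the latency cost stays small
zfs get sync VMDATA            # verify the setting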

IIRC the hybrid drives contain two separate logical devices, one being the spinning platter and the other the SSD. I've never used them in servers, so I have no idea how the OS or the adapter sees them, but my guess is that the SSD parts are not used at all.
 
This is really a moot point if you can't do real JBOD anyway... You will destroy your data at some point with a hardware RAID (RAID 0)...
 
This is really a moot point if you can't do real JBOD anyway...
Hi,
this is LSI crapware... so you can use RAID 0 for single disks only.
And the internal controller of a Dell R620 is not flashable to IT mode!
You will destroy your data at some point with a hardware RAID (RAID 0)...
Ähhh, why??
The data protection should come from the ZFS RAID level on top - or am I missing something?

Udo
 
You can find posts in the FreeBSD forums about RAID controllers without IT mode and ZFS corruption. It is DANGEROUS.
 
I found a very interesting situation; maybe someone with this kind of problem can try it.
1. When I had just completed the PVE installation and moved the data over, I found that ZFS was unstable and very slow, and CPU usage was also high.
2. I tried other filesystems, and they did not have this problem.
3. After testing various combinations, I found that both ZFS and btrfs in PVE show similar problems.

The result of the final test was: "no swap was created during the PVE installation?". When I created swap myself, these problems disappeared. It is currently working well, even though the swap space is never actually used.

But ZFS is still very slow when using apt/unpacking, only about 1/3 the speed of ext4. This seems to be an old question - does anyone know the answer?
 
PVE comes with no swap when using ZFS for the system disks, as you shouldn't put swap on top of ZFS.

What hardware are you using (HDD/SSD models, disk controller model)? QLC SSDs and SMR HDDs, for example, are terrible, and no HW RAID cards should be used. What pool layout? How much raw capacity and how much ARC? Does the ashift match the physical sector size? Is the volblocksize reasonable for the pool layout?
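A sketch of commands that would answer most of those questions (pool, device, and zvol names are placeholders):

zpool status -v                                  # pool layout
zpool list                                       # raw capacity and usage
zdb -C rpool | grep ashift                       # ashift actually used by the vdevs
cat /sys/block/sda/queue/physical_block_size     # physical sector size, to compare against the ashift
zfs get volblocksize rpool/data/vm-100-disk-0    # volblocksize of a zvol (example name)
grep -w size /proc/spl/kstat/zfs/arcstats        # current ARC size in bytes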
 
PVE comes with no swap when using ZFS for the system disks, as you shouldn't put swap on top of ZFS.

What hardware are you using (HDD/SSD models, disk controller model)? QLC SSDs and SMR HDDs, for example, are terrible, and no HW RAID cards should be used. What pool layout? How much raw capacity and how much ARC? Does the ashift match the physical sector size? Is the volblocksize reasonable for the pool layout?
My computer is very simple: just one SATA SSD for the ZFS root system, plus I tried adding another HDD over USB for ZFS.
My settings are just the ZFS defaults; of course I have tried many parameters.
And yes, we know we don't need swap if there is enough RAM. But the fact is that the PVE system needs swap if you use ZFS or btrfs, even if the swap is never used.
 
And yes, we know we don't need swap if there is enough RAM. But the fact is that the PVE system needs swap if you use ZFS or btrfs, even if the swap is never used.
Then you should change the "hdsize" when installing PVE so it keeps some disk space unallocated. You can then use that unallocated space later to manually create a swap partition outside of ZFS. A swap file on a dataset or a zvol used as a swap partition might crash your whole PVE server as soon as RAM gets full: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199189
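A sketch of what that could look like afterwards on a GPT system disk (device, partition number, and size are placeholders for whatever space was left unallocated):

sgdisk -n 5:0:+8G -t 5:8200 /dev/sda             # new 8 GiB partition of type "Linux swap" in the free space
mkswap /dev/sda5
swapon /dev/sda5
echo '/dev/sda5 none swap sw 0 0' >> /etc/fstab  # make it persistent across reboots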
 
Then you should change the "hdsize" when installing PVE so it keeps some disk space unallocated. You can then use that unallocated space later to manually create a swap partition outside of ZFS. A swap file on a dataset or a zvol used as a swap partition might crash your whole PVE server as soon as RAM gets full: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199189
Good idea. But currently my swap usage stays at "0" (used). As for the other one, I created swap by copying what other people did, as below - any ideas?
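(For reference, that command creates a 16 GiB zvol with page-sized blocks, cheap zle compression, forced synchronous writes, metadata-only ARC caching, no L2ARC, and auto-snapshots disabled - i.e. a zvol tuned to behave as predictably as possible when used as swap, which is the setup the linked FreeBSD bug report warns about.)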

zfs create -V 16G -b $(getconf PAGESIZE) -o compression=zle \
-o logbias=throughput -o sync=always \
-o primarycache=metadata -o secondarycache=none \
-o com.sun:auto-snapshot=false rpool/swap
 
