Hello everyone,
we've finally decided to internally test a 2-node PVE cluster installation with ZFS and replication, and VM performance is considerably slower than what we were used to with hardware RAID and LVM.
Our test installation is meant to replace an old single-node installation running on a Dell PowerEdge R730 with a PERC H730 Mini controller and a RAID-5 of 3 Samsung 1TB SSDs.
The new cluster nodes are two Dell PowerEdge R440s with 128GB of RAM and a mirror zpool running on these Kingston 2TB SSDs.
On the nodes we run a bunch of services, but the most critical of all is an Oracle database running on Oracle Linux with a 4K-blocksize ext4 disk.
We noticed a substantial performance decrease during high-IOPS tasks after we moved the VM to the ZFS nodes: the same VM takes 15-30% more time to complete the same tasks on ZFS compared to when it was running on LVM.
After running some tests with fio, and directly running tasks on the DB, on the same VM on both ZFS and RAID+LVM, I saw that the issue is mainly in random read/write workloads; sequential read and write performance is mostly the same.
The fio commands I used for the tests are the same ones used in
this post.
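The 4K random read/write runs were along these lines (illustrative invocation; path, size and job count here may differ from the ones in the linked post):
Code:
# 70/30 random read/write, 4K blocks, direct I/O on the dataset mountpoint
fio --name=randrw --filename=/zfs-2tb/fio-test.dat --size=8G \
    --rw=randrw --rwmixread=70 --bs=4k --ioengine=libaio \
    --iodepth=32 --numjobs=4 --direct=1 \
    --runtime=60 --time_based --group_reporting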
After reading various posts on this forum and other websites, these are the parameters that can have an effect on performance; so far I have tried changing the following (a sketch of how I applied them is after the list):
recordsize=8K, set to 8K because that's the block size OracleDB uses.
atime=off, most sources say that turning access times off can lead to a slight improvement in I/O.
sync=disabled, I tried this parameter both disabled and standard and did not notice any meaningful improvement; also, since I do not have a SLOG device, I'd rather stay on the safe side and leave it at standard.
zfs_arc_max=51539607552, i.e. the ARC is currently capped at 48GiB.
ashift=12, I left the default because I read that most SSDs have a 4K page size; however, this website says that the page size of the Kingston DC600M is 16KB.
Do you think creating the zpool with ashift=14 could lead to any sort of improvement?
volblocksize=16K, the
OpenZFS documentation states that tuning this can help with random workloads.
I did not change it from the default 16KB, so right now it matches the supposed page size of my SSDs.
Would setting it to the block size used by the DB (8K) or by the guest filesystem (4K) make any difference?
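For clarity, this is a sketch of how I applied the settings above on the node (exact commands may differ slightly; the ashift and volblocksize checks are just included for reference):
Code:
# dataset properties on the pool's root dataset
zfs set recordsize=8K zfs-2tb
zfs set atime=off zfs-2tb
zfs set sync=standard zfs-2tb        # left at standard, no SLOG present

# ARC cap (51539607552 bytes = 48GiB), persisted as a module option
echo "options zfs zfs_arc_max=51539607552" > /etc/modprobe.d/zfs.conf
update-initramfs -u                  # refresh the initramfs so the cap applies at boot

# ashift actually in use (zpool get may show 0 if it was left at default;
# zdb reads the per-vdev value from the pool config)
zpool get ashift zfs-2tb
zdb -C zfs-2tb | grep ashift

# volblocksize is fixed at zvol creation time; in Proxmox it comes from the
# storage's "blocksize" option in /etc/pve/storage.cfg, so testing 8K would
# mean changing that and then recreating/moving the VM disk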
Also, one thing I noticed is that there is a big difference in performance between running the same fio command inside the VM and running it directly on the host, which makes me think the issue may not be ZFS itself but something between the guest's I/O stack and the VM config, which is this one:
scsi2: zfs-2tb:vm-218-disk-0,cache=writeback,discard=on,iothread=1,size=128G,ssd=1
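If the gap really is in the virtualization layer, these are the variants of that disk line I'm thinking of testing one at a time (candidate settings only, nothing I've validated yet; suggestions welcome), e.g. via qm set:
Code:
# cache=none avoids double caching (ZFS already caches in the ARC);
# the aio mode is another knob worth comparing
qm set 218 --scsi2 zfs-2tb:vm-218-disk-0,cache=none,discard=on,iothread=1,aio=io_uring,size=128G,ssd=1
qm set 218 --scsi2 zfs-2tb:vm-218-disk-0,cache=none,discard=on,iothread=1,aio=native,size=128G,ssd=1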
These are the properties of the zpool on which the vm is running.
Code:
root@pve02:~# zfs get all zfs-2tb
NAME PROPERTY VALUE SOURCE
zfs-2tb type filesystem -
zfs-2tb creation Wed Jan 22 11:41 2025 -
zfs-2tb used 1.35T -
zfs-2tb available 336G -
zfs-2tb referenced 4.04G -
zfs-2tb compressratio 1.92x -
zfs-2tb mounted yes -
zfs-2tb quota none default
zfs-2tb reservation none default
zfs-2tb recordsize 8K local
zfs-2tb mountpoint /zfs-2tb default
zfs-2tb sharenfs off default
zfs-2tb checksum on default
zfs-2tb compression on local
zfs-2tb atime off local
zfs-2tb devices on default
zfs-2tb exec on default
zfs-2tb setuid on default
zfs-2tb readonly off default
zfs-2tb zoned off default
zfs-2tb snapdir hidden default
zfs-2tb aclmode discard default
zfs-2tb aclinherit restricted default
zfs-2tb createtxg 1 -
zfs-2tb canmount on default
zfs-2tb xattr on default
zfs-2tb copies 1 default
zfs-2tb version 5 -
zfs-2tb utf8only off -
zfs-2tb normalization none -
zfs-2tb casesensitivity sensitive -
zfs-2tb vscan off default
zfs-2tb nbmand off default
zfs-2tb sharesmb off default
zfs-2tb refquota none default
zfs-2tb refreservation none default
zfs-2tb guid 7036852244761846984 -
zfs-2tb primarycache all default
zfs-2tb secondarycache all default
zfs-2tb usedbysnapshots 0B -
zfs-2tb usedbydataset 4.04G -
zfs-2tb usedbychildren 1.35T -
zfs-2tb usedbyrefreservation 0B -
zfs-2tb logbias latency default
zfs-2tb objsetid 54 -
zfs-2tb dedup off default
zfs-2tb mlslabel none default
zfs-2tb sync standard local
zfs-2tb dnodesize legacy default
zfs-2tb refcompressratio 1.00x -
zfs-2tb written 4.04G -
zfs-2tb logicalused 821G -
zfs-2tb logicalreferenced 4.02G -
zfs-2tb volmode default default
zfs-2tb filesystem_limit none default
zfs-2tb snapshot_limit none default
zfs-2tb filesystem_count none default
zfs-2tb snapshot_count none default
zfs-2tb snapdev hidden default
zfs-2tb acltype off default
zfs-2tb context none default
zfs-2tb fscontext none default
zfs-2tb defcontext none default
zfs-2tb rootcontext none default
zfs-2tb relatime on default
zfs-2tb redundant_metadata all default
zfs-2tb overlay on default
zfs-2tb encryption off default
zfs-2tb keylocation none default
zfs-2tb keyformat none default
zfs-2tb pbkdf2iters 0 default
zfs-2tb special_small_blocks 0 default
zfs-2tb prefetch all default
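Note that the output above is for the pool's root dataset; the actual VM disk is a zvol, so the properties that matter for it (volblocksize in particular, since recordsize only applies to datasets) can be checked directly on the zvol named in the config line above:
Code:
zfs get volblocksize,compression,sync zfs-2tb/vm-218-disk-0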
I'll attach the output of the fio tests below, as pasting them in code blocks here would probably make the post too long.
Maybe some of you can either point me in the right direction, or at least confirm that these results are somewhat expected with this setup.