[SOLVED] Working with ZFS - Caching Strategies ?

Q-wulf

TL;DR: Questions at the bottom

Sidenote: I typically use Ceph for all my professional and personal Proxmox needs (other than the occasional ZFS RAID1 for the OS disks).

I have a small personal project going, and it's not going as expected at all.

Some Specs:
32 GB Ram
2x 3 TB HDD
Proxmox installed via the installer as a ZFS-based RAID0 to utilize the space fully.
Latest updates with openvswitch

Code:
zfs list
NAME                        USED  AVAIL  REFER  MOUNTPOINT
rpool                      2.53T  2.74T    96K  /rpool
rpool/ROOT                 3.35G  2.74T    96K  /rpool/ROOT
rpool/ROOT/pve-1           3.35G  2.74T  3.35G  /
rpool/data                 2.52T  2.74T    96K  /rpool/data
rpool/data/vm-100-disk-1     64K  2.74T    64K  -
rpool/data/vm-1001-disk-1  2.52T  2.74T  2.52T  -
rpool/swap                 8.50G  2.75T   255M  -

Currently all required functionality is sitting in a single Debian VM and is only semi-working. It's bad for multiple reasons, specifically because I am unable to set limits on a specific service's access to server resources, and because of the extreme IO delay (95% at times) I am encountering.



I'm looking to split this into 5 VMs:
  • VM-1 stores the media data, anywhere from 3-50 GB per file.
  • VM-2 handles file uploads (upload server).
  • VM-3 handles media data cutting/compression (compression/cutting server).
  • VM-4 displays and okays said compression (Plex server).
  • VM-5 distributes said media data to multiple services and then deletes it from VM-2.
I was thinking of achieving this by giving every single VM a private IP on the internal OVS-based vmbr1, then setting up NFS access to VM-2 for VMs 1, 3, 4 and 5.
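Just to make the plan concrete, here is a rough sketch of the NFS part (the IPs, export path and export options are placeholders, not a final config):

Code:
# on VM-2 (the upload server, e.g. 10.0.0.2 on vmbr1): export the upload directory
apt-get install nfs-kernel-server
echo '/srv/uploads 10.0.0.0/24(rw,sync,no_subtree_check)' >> /etc/exports
exportfs -ra

# on VMs 1, 3, 4 and 5: mount the export
apt-get install nfs-common
mkdir -p /mnt/uploads
mount -t nfs 10.0.0.2:/srv/uploads /mnt/uploads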



Question 1: Is this the best way to handle this kind of setup with regards to ZFS?
Should I use a single vDisk per VM?
Should I use multiple vDisks per VM (as in an OS disk and a data disk)?
I feel like I should use separate vDisks for OS and media data. Not sure though.
Is NFS the best way to share the same dataset between multiple VMs?

Question 2: What caching mode is best to choose for the vDisks in Proxmox when using ZFS?

Direct sync, write through, write back, none?
Use IO Thread?
I kind of feel I should use something like direct sync with iothread=on for the OS vDisks and no cache for the data vDisks (given that the files stored and accessed on the data vDisks are larger than the server's total RAM).
Not sure on this.

Question 3: How do I stop ZFS from consuming all my RAM for its caching?

I don't use dedup; it seems useless to dedupe 5x 2 GB of Debian OS data at the expense of gigabytes worth of RAM.
ZFS probably sticks most of my RAM into ARC caching anyway (which I do not need for the large files, because a 50 GB file does not fit into 32 GB of RAM). At present it consumes 27 GB out of the 32.

I tried to limit it as described in
https://pve.proxmox.com/wiki/ZFS_on_Linux#_limit_zfs_memory_usage

nano /etc/modprobe.d/zfs.conf
Code:
options zfs zfs_arc_min=10737418240
options zfs zfs_arc_max=12884901888
followed by update-initramfs -u
^^ 10-12 GB: 1 GB per TB of disk space plus 4 GB for other stuff. Not sure how that is working out yet; I will need to do more tests.

Not sure if it is the best method though.
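In the meantime, this is how I am checking whether the limit actually sticks (the 12 GB value is just the zfs_arc_max from above; as far as I know the parameter can also be changed at runtime on ZFS on Linux, though the ARC only shrinks gradually):

Code:
# current ARC size vs. configured maximum (c_max should reflect zfs_arc_max after reboot)
grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats

# change the limit on the fly, without rebuilding the initramfs (12 GB)
echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_max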


Question 4: How do I stop a Debian VM from also caching a data vDisk?
Right now it looks as if Debian is using the 10 GB I assigned to its VM to also cache the media data (or parts of it).
That seems redundant and counterproductive:
the ZFS ARC should already handle the small files, which means they are kept in RAM twice (ZFS ARC + Debian VM RAM),
and it probably makes zero sense to cache the large files, since a single file is most likely larger than the RAM assigned to either ZFS or the Debian VM.
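For what it's worth, there does not seem to be a hard cap for the Linux page cache inside the guest, so for now I am only inspecting and flushing it to confirm the memory is "just" reclaimable cache rather than a real problem:

Code:
# inside the Debian guest: the buff/cache column shows how much RAM is page cache
free -h

# drop the page cache once (safe, but it will refill as files are read again)
sync && echo 1 > /proc/sys/vm/drop_caches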



Any help appreciated.
 
So, I made some progress ...

I stumbled upon this explanation regarding the caching modes:
https://access.redhat.com/documenta...uning_Optimization_Guide-BlockIO-Caching.html

IMHO it is a lot better explained than the Proxmox guide:
https://pve.proxmox.com/wiki/Performance_Tweaks#Disk_Cache

Basically, these are the options I am going to use:
cache=none --> data vDisks
cache=writeback --> OS vDisks (e.g. to cache small writes from logs and whatnot)
I also disabled swapping on my test VMs.
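For the record, this is roughly how I am setting it per vDisk from the host CLI (the VMID, bus numbers and volume names are just examples from my setup, not a recommendation):

Code:
# OS vDisk: writeback cache, with an IO thread (needs the VirtIO SCSI single controller)
qm set 101 --scsihw virtio-scsi-single
qm set 101 --scsi0 local-zfs:vm-101-disk-0,cache=writeback,iothread=1

# data vDisk: no cache
qm set 101 --scsi1 local-zfs:vm-101-disk-1,cache=none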

I noticed that these VMs (Debian and a CentOS 7-based Rockstor NAS) consumed a lot fewer resources when receiving large writes. So that's a plus.

TL;DR: Questions at the bottom


Another thing I had not considered is that all my personal and work Proxmox servers run on SSD/NVMe storage, with the exception of this Hetzner server, which runs 2 HDDs in a ZFS-based RAID0.

So I ran pveperf to check what's going on:
Code:
root@xxx:~# pveperf /rpool/
CPU BOGOMIPS:      54400.16
REGEX/SECOND:      2043020
HD SIZE:           2365.98 GB (rpool)
FSYNCS/SECOND:     209.93
DNS EXT:           25.74 ms
DNS INT:           47.14 ms (xxx.net)

That's some pretty shitty fsync numbers compared to my SSD/NVMe boxes.

These are 2x TOSHIBA DT01ACA3 / Hitachi HDS723030BLE640 (dual-branded), which should do sequential R/W of around 160 MB/s and random R/W of 70-ish. This review basically says it all: http://www.storagereview.com/hitachi_deskstar_7k3000_3tb_review_hds723030ala640 It's not exactly an IO monster, so I guess it is to be expected (next time I'll spring for the SSD/HDD auction server).

for reference's sake:
Code:
zfs get all rpool
NAME PROPERTY VALUE SOURCE
rpool type filesystem -
rpool creation Mon Jan 9 0:55 2017 -
rpool used 2.96T -
rpool available 2.31T -
rpool referenced 96K -
rpool compressratio 1.00x -
rpool mounted yes -
rpool quota none default
rpool reservation none default
rpool recordsize 128K default
rpool mountpoint /rpool default
rpool sharenfs off default
rpool checksum on default
rpool compression on local
rpool atime off local
rpool devices on default
rpool exec on default
rpool setuid on default
rpool readonly off default
rpool zoned off default
rpool snapdir hidden default
rpool aclinherit restricted default
rpool canmount on default
rpool xattr on default
rpool copies 1 default
rpool version 5 -
rpool utf8only off -
rpool normalization none -
rpool casesensitivity sensitive -
rpool vscan off default
rpool nbmand off default
rpool sharesmb off default
rpool refquota none default
rpool refreservation none default
rpool primarycache all default
rpool secondarycache all default
rpool usedbysnapshots 0 -
rpool usedbydataset 96K -
rpool usedbychildren 2.96T -
rpool usedbyrefreservation 0 -
rpool logbias latency default
rpool dedup off default
rpool mlslabel none default
rpool sync standard local
rpool refcompressratio 1.00x -
rpool written 96K -
rpool logicalused 2.94T -
rpool logicalreferenced 40K -
rpool filesystem_limit none default
rpool snapshot_limit none default
rpool filesystem_count none default
rpool snapshot_count none default
rpool snapdev hidden default
rpool acltype off default
rpool context none default
rpool fscontext none default
rpool defcontext none default
rpool rootcontext none default
rpool relatime off default
rpool redundant_metadata all default
rpool overlay off default


Then I stumbled upon this thread:
https://forum.proxmox.com/threads/poor-performance-with-zfs.21568/

There is talk about using

zfs set sync=disabled rpool
which leads to data loss on a crash.

Manual description:
sync=disabled
Synchronous requests are disabled. File system transactions
only commit to stable storage on the next DMU transaction group
commit which can be many seconds. This option gives the
highest performance. However, it is very dangerous as ZFS
is ignoring the synchronous transaction demands of
applications such as databases or NFS.
Setting sync=disabled on the currently active root or /var
file system may result in out-of-spec behavior, application data
loss and increased vulnerability to replay attacks.
This option does *NOT* affect ZFS on-disk consistency.
Administrators should only use this when these risks are understood.

I wonder if I can get away with it. If I crash, the following happens:
1) Downloads will need to be resumed/redone anyway.
2) If I was cutting a media project, I'd need to turn on autosave. If it crashes during compression/transcoding, then I'd need to restart it anyway.
3) Everything else is reads (besides logs, which AFAIK I can live with losing).
Anything else I am missing?
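If I do go down that road, I would probably scope it to the bulk data dataset instead of the whole pool (dataset name taken from my layout above; sync is inherited, so the zvols under rpool/data are covered too), and keep a one-liner handy to revert:

Code:
# disable synchronous writes only for the media/data dataset, not for / or /var
zfs set sync=disabled rpool/data
zfs get sync rpool/data

# revert to the default behaviour if it turns out to be a bad idea
zfs set sync=standard rpool/data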





Question 5: Besides sync=disabled (or hardware changes), is there any other way to tune ZFS to handle more synchronous IO?


Question 6: Is a NAS-style VM (like Rockstor, an NFS server, etc.) the best way to make the same dataset available to multiple VMs? Or is there something you can do with ZFS natively (e.g. "magic")?
 
not sure if I can give you any direct advice, except that you may be trying to squeeze blood from a stone: there is only so much that running virtualization on two 3.5" spindles can net you.

that said-
Question 5: Besides sync=disabled (or hardware changes), is there any other way to tune ZFS to handle more synchronous IO?

start here: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
not all of it is applicable to ZoL, but most of it is.
relevant tunables:
Code:
options zfs zfs_vdev_sync_write_min_active=
options zfs zfs_vdev_sync_write_max_active=
options zfs zfs_vdev_sync_read_min_active=
options zfs zfs_vdev_sync_read_max_active=
options zfs zfs_vdev_async_read_min_active=
options zfs zfs_vdev_async_read_max_active=
options zfs zfs_vdev_async_write_min_active=
options zfs zfs_vdev_async_write_max_active=

adding l2arc (4-5x arc) and ZIL SSDs could help too.
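something along these lines, if you can get an SSD or two into that box (the device paths are placeholders, use your own by-id names, and keep the SLOG small and mirrored if you care about in-flight data):

Code:
# add an L2ARC (cache) device and a separate log (SLOG) device to the pool
zpool add rpool cache /dev/disk/by-id/ata-EXAMPLE_SSD-part1
zpool add rpool log   /dev/disk/by-id/ata-EXAMPLE_SSD-part2
zpool status rpool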

Question 6: Is a NAS-Style VM (like Rockstor / NFS-Server, etc) the best way to make the same dataset available for multiple VM's ? Or is there something you can do with ZFS natively (e.g. "magic") ?

maybe. simple RAID0/1 using Btrfs can be faster than ZoL, but it's hard to imagine it would be noticeable. ZFS on BSD/OpenIndiana should be marginally faster too.
 
not sure if I can give you any direct advice, except that you may be trying to squeeze blood from a stone: there is only so much that running virtualization on two 3.5" spindles can net you.
[...]
adding l2arc (4-5x arc) and ZIL SSDs could help too.

Yeah, I figured as much. I am already watching Hetzner's server auctions for a better deal to come up. To be quite honest, I did not even think about it when renting this server; I was fixated on the maximum space available. I should have known better.


maybe. simple RAID0/1 using Btrfs can be faster than ZoL, but it's hard to imagine it would be noticeable. ZFS on BSD/OpenIndiana should be marginally faster too.

There may be a misunderstanding here.
The host's file system remained ZFS.
Only in the guest did I switch from 1 vDisk using cache=directsync and ext4 to 2 vDisks (one for / with cache=directsync and one for /data with cache=none) using Btrfs. I also disabled swap in said guest. The write speed did not really improve, but the IO delay subsided (still at 6%, as opposed to 16-27% before).





Regarding the ZFS tuning: I'll look into those tunables before pulling the trigger on sync=disabled (or hardware changes). So far the reading has been quite informative.


Thanks @alexskysilk
 
Update: I got a good deal over at Hetzner (2x 240 GB SSD + 4x 3 TB HDD).

Set up exactly like before; no IO delay anymore.

I'll consider this solved now.
 
