[SOLVED] ZFS Performance Tuning

mattlach

Well-Known Member
Mar 23, 2016
Boston, MA
Hey all,

While I've been using ZFS for years, I am completely new to it under Linux. I was trying to replicate some performance tuning I'd done successfully on BSD, where "tunables" are added to /boot/loader.conf

Problem is, I couldn't seem to figure out where these "tunables" go on ZFS on Linux.

This page suggests that ZFS "tunables" can go in /etc/modprobe.d/zfs.conf. This file does not exist on my Proxmox install. If I create it, will the ZFS implementation in Proxmox honor the settings I put in there?

I'd appreciate any feedback!

Thanks,
Matt
 
yes, just create the file.

see also:

http://pve.proxmox.com/wiki/Storage:_ZFS


Thank you, that is very helpful!

I'd be interested in what you tune and why.

I plan to set the following tunables in order to improve L2ARC caching (at the expense of SSD lifespan, but nothing lasts forever :p )

l2arc_noprefetch 0
l2arc_write_boost 134217728
l2arc_write_max 67108864

One of my containers is a media server, and I am looking to avoid content stutter during heavy pool load. These tunables worked very well in that regard when my pool was in FreeNAS.
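
For reference, my understanding is that on ZFS on Linux these become module options in /etc/modprobe.d/zfs.conf, something along these lines (same values as above, in bytes):

Code:
options zfs l2arc_noprefetch=0
options zfs l2arc_write_boost=134217728
options zfs l2arc_write_max=67108864

If the root pool is ZFS, I assume an "update-initramfs -u" is also needed so the options are baked into the initramfs and take effect when the module loads at boot.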


On a side note:

I currently have two pools on my Proxmox box: the mirror created by the installer, and a 12-disk pool (two 6-disk RAIDz2 vdevs) with mirrored SLOG/ZIL SSDs and dual L2ARC SSDs for mass storage.

It concerns me a little bit that the mirrored pool created by the Proxmox installer references the devices directly as /dev/sda and /dev/sdb.

When I first imported it, my big pool did the same, and it didn't take long for one of the devices to be offlined because its device name changed during a reboot. So I converted it to reference devices by ID, by exporting it and importing it again using the following command:

Code:
zpool import -d /dev/disk/by-id/ <poolname>

This worked beautifully, as my big pool status now looks like this:

Code:
 state: ONLINE
  scan: scrub repaired 0 in 7h25m with 0 errors on Wed Apr 27 05:55:41 2016
config:

        NAME                                             STATE     READ WRITE CKSUM
        zfshome                                          ONLINE       0     0     0
          raidz2-0                                       ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EK8ZSX37     ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EK8ZS6DS     ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EF24PKAY     ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EE3LX9RV     ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5UJ0PX5     ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EFP2610U     ONLINE       0     0     0
          raidz2-1                                       ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1636111     ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1647062     ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1646218     ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1646088     ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5PEDRTY     ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5X4NKU1     ONLINE       0     0     0
        logs
          mirror-2                                       ONLINE       0     0     0
            ata-INTEL_SSDSC2BA100G3_BTTV425101D5100FGN   ONLINE       0     0     0
            ata-INTEL_SSDSC2BA100G3_BTTV4253038A100FGN   ONLINE       0     0     0
        cache
          ata-Samsung_SSD_850_PRO_512GB_S250NXAH313648V  ONLINE       0     0     0
          ata-Samsung_SSD_850_PRO_512GB_S250NWAG822223D  ONLINE       0     0     0

errors: No known data errors


I'm looking to convert the installer-generated mirror to use device IDs as well, but the approach above is unlikely to work, as I can't export the root pool while booted from it.

I'm wondering if a command like this would work, one by one (waiting for the resilver to finish in between) on the two disks in the mirror:

Code:
zpool replace <poolname> /dev/<sdx> /dev/disk/by-id/<equivalent disk id>

or whether zpool would complain that the disk I am trying to add to the mirror is already a member of the mirror.

If I get the complaint above it becomes more problematic, because then I have to first offline the sdx disk, then somehow delete the ZFS label so that the disk appears to not be a member of the pool anymore, and then resilver it.

The command "zpool labelclear <device>" is supposed to clear ZFS labels, but it never works for me, and I see similar complaints out on various forums to suggest it isn't working for others either, so I assume it is broken. This leaves overwriting the labels with DD.

The only way I have successfully done this is by doing a "dd if=/dev/zero of=<device>" and allowing it to overwrite the entire disk. This takes a VERY long time though, and it wastes drive write cycles in the case of SSD's. The reason for this is that ZFS creates 4 labels throughout the disk. If I had their exact locations I could DD just those spots, but the best I can come up with are vague statements about two labels being towards the beginning of the drive, and two being towards the end.
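
The closest I've come to something targeted rests on the (unverified, so treat this as a rough sketch) assumption that each label is 256 KiB, with two at the very start of the device and two at the very end:

Code:
DEV=/dev/sdX                              # hypothetical device name - triple check before running!
SIZE=$(blockdev --getsize64 "$DEV")
dd if=/dev/zero of="$DEV" bs=256K count=2                                # front two labels
dd if=/dev/zero of="$DEV" bs=256K count=2 seek=$(( SIZE / 262144 - 2 ))  # back two labels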

Any suggestions here?
 
Thank you for sharing. I wasn't aware of the multiple label stuff. I just deleted the GPT via parted.

Your setup looks quite fast. Where do you house your disks? I'm housing mine in two MSA60 shelves with a SAS2008-based JBOD controller (for >2 TB sizes).
 
Thank you for sharing. I wasn't aware of the multiple label stuff. I just deleted the GPT via parted.

Your setup looks quite fast. Where do you house your disks? I'm housing mine in two MSA60 shelves with a SAS2008-based JBOD controller (for >2 TB sizes).

Mine is built into a Norco RPC-4216 (with optional 120mm fan wall)

Here is an older pic (case is still the same though)

[image: older photo of the server build]


Full specs are as follows:
  • Case: Norco RPC-4216 (with optional 120mm fan wall)
  • Motherboard: Supermicro X8DTE
  • CPUs: 2x Intel Xeon L5640 (6 cores / 12 threads each with HT, for a total of 12 cores / 24 threads at 2.2GHz, turbo to 2.8GHz)
  • RAM: 192GB Registered ECC DDR3-1600 (12x16GB)
  • Controllers: 2x LSI 9211-8i flashed to IT mode to act as HBAs for ZFS
  • Big storage pool drives: 12x WD Red 4TB in 2x 6-drive RAIDz2, with 2x Samsung 850 Pro for L2ARC and 2x Intel SSDs for SLOG/ZIL
  • Boot drives and VM/container storage: 2x Samsung SSD 850 EVO 512GB in a ZFS mirror (on-board SATA)
  • Dedicated swap drive: 128GB Samsung 850 Pro (on-board SATA)
  • MythTV container LiveTV drive: 128GB Samsung 850 Pro (on-board SATA)
  • MythTV container scheduled recordings drive: 1TB Samsung 850 EVO
  • On-board 2x Gigabit Ethernet connects to the switch using LACP
  • Additional dual-port gigabit adapter has its ports bridged and directly connects to two SiliconDust HDHomeRun Prime TV tuners
  • Brocade BR-1020 10Gig fiber adapter directly connects to my workstation (but is currently having problems; driver related? Worked fine in ESXi)

You know, just a little home hobby server :p

I have MythTV set up so that all temporary LiveTV buffers go to a dedicated 128GB Samsung 850. These are automatically erased by MythTV after 24 hours (or when space is needed), unless someone presses "record" while watching live, in which case they stay there. I have a cronjob that checks for video files older than 24 hours every night at 4 and moves them to the main 1TB scheduled-recordings SSD.

Another cronjob checks the free space on the 1TB Samsung drive every night and, if needed, moves the oldest files to the main ZFS pool to make space for the next day's scheduled recordings.

MythTV scans all of its assigned folders for its video content, so you can freely move files between drives and it will find them, which is what makes this setup possible.
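
The age-based move is conceptually nothing more than a nightly find-and-move, roughly like this (paths are placeholders, and the real job has a few more sanity checks):

Code:
# crontab entry: at 04:00, move recordings older than 24 hours off the LiveTV SSD
0 4 * * * find /mnt/livetv -type f -mmin +1440 -exec mv -n {} /mnt/recordings/ \;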

At first I was concerned about using the Samsung EVO TLC drives for some of these things, but I did the math on my previous drives. Based on my rate of write-cycle consumption, the time those drives were in the server exposed to my workload, and the total write cycles available, the 1TB EVO used as the scheduled-recordings drive should last about 7 years before its endurance is exhausted, which is more than good enough. Coincidentally, I predicted 7 years for the mirrored EVO drives I boot off of and host my VM images and containers on as well. This is why I moved the swap to a dedicated MLC SSD with more write cycles, though, as I was concerned that swapping to a TLC drive would significantly shorten its life.
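
The back-of-the-envelope version of that math is just rated endurance divided by observed write rate; with placeholder numbers rather than my actual measurements:

Code:
estimated lifetime = rated endurance / write rate
                   = 150 TB / (55 GB/day x 365 days/year)
                   ≈ 7.5 years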

For the drives with more intensive write activity (like the LiveTV buffer drive and the L2ARC drives) there was no other option but to go with full-on modern 3D NAND MLC drives for their high write endurance.
 
wow! Very nice setup and this is really at your home?

Is your TV station that good that you can record around the clock? TV in Germany is so bad that I abandoned it in 2004 and never looked back.
 
wow! Very nice setup and this is really at your home?

Yeah, it is. It sounds more impressive than it really is. I've pieced it together over the years from bits and pieces I've pounced on used on eBay when the price has been unusually favorable :p

Is your TV station that good that you can record around the clock? TV in Germany is so bad that I abandoned it in 2004 and never looked back.

Well, TV is just one of the many things I use the server for, but because it is very sensitive to disk latency, it gets more attention from me from a cache and disk performance perspective. Playback of videos isn't usually very intensive locally, but add together the latencies from networked tuners, networked storage (shared with other workloads) and networked HTPCs for playback, and it all adds up. Managing it becomes critical for playback to work properly and not skip or freeze, especially during fast forward and rewind, which trip up typical caching methods pretty badly.

Personally I don't care that much about TV. When I was single I didn't even have one, but my fiancee and future stepson appreciate it, so I subscribe to a FiOS cable package which has a hundred or so HD cable channels (most of them utter trash if you ask me, but it is what it is). Using the server as a DVR under MythTV started when I looked at my bill and got mad at how much the cable box rental fees were. The irony is that since then, I've spent WAY more money on this setup than I ever would have if I had rented the equipment from the cable provider, but on the flip side it is a fun hobby, it is WAY more capable than their junk, and I'd rather my money go towards computer hardware than the evil cable company :p

I'm still working through my migration from ESXi to Proxmox. Even when I was using ESXi, this server was admittedly a little bit on the overkill side, but I would still load it up every now and then. I've used it since June 2014. Before that I was on an AMD FX-8350 with 32GB of RAM, and was constantly running low on RAM but unable to expand, since its four RAM slots took a maximum of 8GB unregistered modules each. I only got the current server because I found an amazing deal on eBay for the motherboard, and then was able to pick up the Xeon CPUs for $60 a piece on eBay as well. The funny part is that in going from a latest-gen 8-core AMD chip to two-generations-old dual 6-core hyperthreaded Intel chips, the power consumption stayed about the same.

Now on Proxmox this thing is embarrassingly over-dimensioned for my uses. At the same time as I migrated to Proxmox I also upgraded my RAM from 96GB to 192GB, which turned out to be a waste. Under ESXi I was running FreeNAS in a VM with passed-through LSI controllers, which required giving the FreeNAS guest a TON of RAM in order to satisfy the ARC. Now that I'm running the pool natively on the host, the memory is managed MUCH more efficiently. That, combined with the fact that I have converted all but two of my old VMs to LXC containers (which use fewer resources), means I'm using even less RAM. Once I am done migrating everything over, I'm probably going to have lots of extra RAM. I might just go in and force the ARC size up to 128GB or something like that to improve disk performance, just because I have it. It also leaves room for expansion, which is good, or at least better than the opposite :p
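
If I do pin the ARC that high, my understanding is it would just be another module option in /etc/modprobe.d/zfs.conf, e.g. 128GB expressed in bytes:

Code:
options zfs zfs_arc_max=137438953472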
 
Thanks for all the info on this thread. I've been going back and forth about doing an ESXi/napp-it setup vs something like Proxmox. My setup is similar, though with slower CPUs and "only" 98GB RAM. :)

I think I'm leaning more toward Proxmox, as I've had performance issues with everything in its own VM using NFS mounts between them. I'm currently on FreeNAS and like the jail-based setups, which LXC is quite similar to. I was going to experiment with the new FreeBSD VM support, but it doesn't support Linux on my older hardware. pffttt.

And Linux based setups are not only more familiar, they are also less whiny about hardware.
 
Very nice. I enjoy hearing stories like that very much.

One fact I always stumble upon: you're all migrating from ESXi towards Proxmox, which is very good. I have been using Linux for the better part of two decades now and have been virtualizing with qemu, xen, chroot, and jails for ages, but every time I have to use ESXi it is a terrible experience. It is so slow compared to Proxmox, and the GUI is rubbish. Why are you migrating from ESXi?
 
Very nice. I enjoy hearing stories like that very much.

One fact I always stumble upon: you're all migrating from ESXi towards Proxmox, which is very good. I have been using Linux for the better part of two decades now and have been virtualizing with qemu, xen, chroot, and jails for ages, but every time I have to use ESXi it is a terrible experience. It is so slow compared to Proxmox, and the GUI is rubbish. Why are you migrating from ESXi?

Yeah, when I first started using ESXi it was really the only system that could do what I wanted to do at the time. KVM has come a long way in the last 6 years.

For me there were many reasons for switching, here are some of them:
  • A bug in ESXi 6 resulted in kernel panics and hard crashes. The bug was fixed in ESXi 6 update 1a (and 1b and later), but I could not get the updates to install with my free license.
  • Arbitrary limitations of features that are freely available in open-source implementations, based on license levels. For instance, 802.3ad link aggregation is only available in paid license tiers using vCenter. I had to use static link aggregation with what they call "route based on IP hash" instead.
  • Collaboration with hardware vendors to block features that otherwise should work, in order to try to upsell you to "professional" versions. For instance, PCIe passthrough of consumer Nvidia GPU's is blocked by a collaboration of Nvidia and VMWare, forcing you to get a Quadro 2000 or above in order to use passthrough. The Proxmox team - on the other hand - seems to be actively circumventing these types of efforts, which I greatly appreciate.
  • Unless you have an expensive paid license you can't use Vcenter, and if you can't use Vcenter you can't use the web based interface, and are forced to use the VMWare ESXi Windows Client. Since I primarily use Linux, I was actually forced to run a Virtualbox desktop VM install of Windows to use the VMWare client to manage my ESXi server (which is linux based to begin with!) which was just insanely stupid.
  • No community priced license exists. I would have had no problem paying a Windows-like price for a home license for ESXi to get some of the features limited in the free version mentioned above, but these did not exist, so it was either very expensive licenses beyond what the home hobbyist can rationalize paying, or use the free version. Now Proxmox is not perfect in this regard either, but it is leaps and bounds better. Since I have a two socket server my annual community license with Proxmox costs ~$150, which seems a little pricy for a home license for what essentially is a management interface on top of free open source software. My only other price comparison for paying for an operating system has been Windows, where I've paid a one time fee of ~$139 for the pro version, and it has lasted me between 3 and 6 years before needing an upgrade (so between ~$25 and ~$45 a year). Other similar projects that operate management interfaces over FOSS software (pfSense, FreeNAS, etc.) are free to the community with full stable updates. Now the argument can be made that these projects are much less complex, but still.
  • I was looking forward to getting back to something more closely Linux based that I could manage with the familiar command-line approach when my needs go beyond the web management interface. Proxmox works great in that regard. ESXi - by comparison - felt too much like a black box. I never really felt like I completely understood what was going on under the hood.
  • ESXi's comparatively limited hardware support and seemingly arbitrary blacklisting of hardware was a major nuisance, especially compared to the excellent hardware support in the Linux kernel.
  • The availability of LXC containers has been HUGE for me as far as resource efficiency goes, not to mention how much better performance I get out of bind mounts in my containers to a native ZFS on Linux array rather than having to do emulated network shares between VMs. Most of my full VM guests in ESXi are now LXC containers in Proxmox, and it has made a HUGE difference in disk and RAM use. The fact that Proxmox has a unified management interface for both KVM and LXC is fantastic. A VM when you absolutely need it, a container when you don't!

In addition there's also the feel of the whole thing. VMware is a big faceless corporation with profit incentives at odds with the users' interests. I am much more comfortable with the open source + enterprise support revenue model, and while Proxmox is not 100% this, it is a hell of a lot closer than ESXi.
 
  • Unless you have an expensive paid license you can't use Vcenter, and if you can't use Vcenter you can't use the web based interface, and are forced to use the VMWare ESXi Windows Client. Since I primarily use Linux, I was actually forced to run a Virtualbox desktop VM install of Windows to use the VMWare client to manage my ESXi server (which is linux based to begin with!) which was just insanely stupid
I still have this inside my Proxmox. I wanted to do a virtualized VMware, but that was too much nesting and too much time spent. I only need it for exports for some customers, so I had to buy real hardware. Unfortunately, it is so slow I cannot believe it. It has the same dual 4 Gbit Fibre Channel as the Proxmox nodes, but the throughput of my mobile phone is better. The local disk is only a little bit slower.

  • The availability of LXC containers has been HUGE for me as far as resource efficiency goes, not to mention how much better performance I get out of bind mounts in my containers to a native ZFS on Linux array rather than having to do emulated network shares between VMs. Most of my full VM guests in ESXi are now LXC containers in Proxmox, and it has made a HUGE difference in disk and RAM use. The fact that Proxmox has a unified management interface for both KVM and LXC is fantastic. A VM when you absolutely need it, a container when you don't!
I really like this, too. I would really love to have ZFS as main storage, so that LXC could be used very intelligently, but I can't. I have a big SAN (everything redundant) and I do not want to use storage virtualization through ZFS to achieve this.
 
Think about the block size of the ZFS pool versus the filesystem inside the VM. For example: FreeBSD's default block size is 32KB and ext4's is 4KB, but a ZFS pool defaults to 128KB.
 
Think about the block size of the ZFS pool versus the filesystem inside the VM. For example: FreeBSD's default block size is 32KB and ext4's is 4KB, but a ZFS pool defaults to 128KB.

I honestly never paid much attention to the block sizes of the pools.

My big pool was originally created on FreeNAS a long time ago, but has been exported, moved, imported, upgraded, etc. over the years, so I decided to check:

Blocksize of rpool (boot mirror set up by Proxmox installer):
Code:
# zfs get recordsize rpool
NAME   PROPERTY    VALUE    SOURCE
rpool  recordsize  128K     default

Blocksize of my big pool:
Code:
# zfs get recordsize zfshome
NAME     PROPERTY    VALUE    SOURCE
zfshome  recordsize  128K     default

So, it would seem the blocksizes in both are the same.
 
Could you run this command?
Code:
fio --name fio_test_file --direct=0 --rw=randwrite --bs=4k --size=20G --numjobs=16 --time_based --runtime=180 --group_reporting

Matching the block size of ZFS and the filesystem inside the virtual machine is very important. If the block sizes differ, you will get a lot of overcommit and your storage performance will degrade.
 
I plan to set the following tunables in order to improve L2ARC caching (at the expense of SSD lifespan, but nothing lasts forever :p )

l2arc_noprefetch 0
l2arc_write_boost 134217728
l2arc_write_max 67108864

One of my containers is a media server, and I am looking to avoid content stutter during heavy pool load. These tunables worked very well in that regard when my pool was in FreeNAS.

I would caution anyone against blindly applying these values:
http://mirror-admin.blogspot.hu/2011/12/higher-l2arcwritemax-is-considered.html
 
If the storage in question is based on zvols then the block size of the zvol is to be used and not the block size of the pool.
Yes, that's why we see big performance degradation problems when the block sizes of the virtual machine filesystem and the zvols differ.
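
To make that concrete: a zvol's block size (volblocksize) can only be set at creation time, so it has to be matched to the guest filesystem up front, something like this (dataset name and sizes are only illustrative):

Code:
# create a 32G zvol with an 8K volblocksize (set it to whatever the guest filesystem will use)
zfs create -V 32G -o volblocksize=8K rpool/data/vm-100-disk-1
zfs get volblocksize rpool/data/vm-100-disk-1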
 

I would caution against blindly setting any values :p

I've been using these settings for some time under FreeNAS. I originally experimented with them because I was having problems streaming media content off the volume when it was busy. These settings all but made that problem go away. I think it depends on your usage scenario, but they worked for me.
 
Could you run this command?
Code:
fio --name fio_test_file --direct=0 --rw=randwrite --bs=4k --size=20G --numjobs=16 --time_based --runtime=180 --group_reporting

Matching the block size of ZFS and the filesystem inside the virtual machine is very important. If the block sizes differ, you will get a lot of overcommit and your storage performance will degrade.

To be clear, no KVM disk images reside on my big 12-disk pool. It is for file storage only. Unusually hyperactive file storage that - if poorly planned - can hurt media streaming performance, but file storage nonetheless.

My VM storage resides on the ZFS mirror created by the Proxmox installer at install time, which is a simple mirror made up of two SSD's. I've never seen the issue you talk about with over-commit and degraded performance, but maybe that is due to the random performance of the SSD's making up for it.

Funnily enough, I don't have to worry much about the performance of disk images. When I was previously on ESXi I had 12 guests. Two of them were BSD based (pfSense and FreeNAS) and the remaining 10 were Linux based (Ubuntu 14.04 LTS server). When I migrated over to Proxmox I decided it was time for pfSense to get its own little box (I was tired of my internet shutting down every time I maintained the server, and, well, it's generally a better idea to keep your edge device separate from other things, even if you plan security carefully like I did), and because of some issues I had with FreeNAS under KVM I imported my FreeNAS pool on the host instead and manually shared it.

In other words, my only two non-Linux guests disappeared.

Of the rest, all but one I reinstalled as LXC containers. The result is that I only have one VM with a disk image left. (The only reason for this one to exist is that it needs FUSE, and I couldn't get FUSE to work in a container.) All the containers just use the native host file system. It also has very little disk activity, as all it does is read a remotely mounted NFS file system (from the host), apply reverse encryption using encfs, and export it right back out over NFS again to my external backup. Local disk writes are pretty much limited to swap, log files, and package upgrades.
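
(For anyone unfamiliar with the trick, encfs's reverse mode presents an encrypted view of an existing plaintext tree, roughly like this, with placeholder paths:)

Code:
# mount an on-the-fly encrypted view of the NFS-mounted plaintext data,
# which then gets re-exported over NFS to the backup target
encfs --reverse /mnt/nfs-from-host /mnt/encrypted-view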

I could run the fio test, but I don't want to needlessly consume 20G of write cycles on my SSD ZFS mirror. I would if I were having problems, but I'm not. I'd run it on the big pool, but I have no reason to there either, as it does not contain any disk images.
 
