[SOLVED] ZFS incredibly low IOPS and Fsyncs on RAIDZ-1 4 WD Red

brickmasterj

New Member
Oct 18, 2015
For some reason, on 2 of my servers I'm getting incredibly slow IOPS and fsyncs per second running pveperf on Proxmox 4, on a ZFS RAIDZ-1 pool with 4 x 4TB WD Reds.

Code:
root@pve:/# pveperf
CPU BOGOMIPS:      15959.00
REGEX/SECOND:      743734
HD SIZE:           9921.49 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     62.85

All these drives are connected directly through SATA interfaces, sda through sdd. Now the thing I find weird is that on a third server, with 8 x 2TB WD Reds behind a RAID controller presented as 4 x 4TB volumes, the IO is fine, getting a smooth 3000+ fsyncs/sec... All servers are running Proxmox 4.1.

Any ideas how I would go about diagnosing the problem here? Where it might be located and where to find the documentation on that?
 
Well, let me try with some general hints.

The write cache is actually called SLOG; the ZIL (ZFS Intent Log, not a cache) always exists and is necessary for ZFS operation - without a dedicated device it simply lives on the pool's own disks. The SLOG supports writing ZIL contents to disk by layering a faster device - usually an SSD - between the ZIL and the slow spinning disks. So write operations first get committed to the SLOG (if present) and then to disk, every 5 seconds. If the system goes down unexpectedly, uncommitted writes stay in the SSD-backed SLOG and are committed on the next ZFS startup, so it also behaves a little like a BBWC in HWRAID cards. In short, the SLOG can be small; I'd say a few tens of GB is more than enough.
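
Adding a dedicated log device is a one-liner, if you want to try it; rpool matches the pveperf output above, while the device path below is only an example placeholder:
Code:
# attach a fast SSD (or an SSD partition) as a dedicated SLOG
zpool add rpool log /dev/disk/by-id/ata-FAST_SSD_SERIAL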

OTOH, the L2ARC can be arbitrarily large, but its required size depends on your "working set" or "read hot spots". It holds the most frequently accessed data from your spinning disks for fast read access. There's no rule of thumb for sizing it, but you can add an SSD (or a few striped ones) and watch the pool status for the used capacity online. You can add more later if needed.
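
Likewise, a cache device can be added on the fly and its fill level watched; the device path is again just a placeholder:
Code:
# add an SSD as L2ARC
zpool add rpool cache /dev/disk/by-id/ata-FAST_SSD_SERIAL
# watch how much of the cache device actually gets used (alloc column)
zpool iostat -v rpool 5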

You're better off using separate SSDs for the two, since they would otherwise compete for the single SSD's resources (lowering IOPS and bandwidth). With low to moderate loads a shared SSD might be sufficient, though - you can test and see if it improves your throughput enough for your purposes.

You don't absolutely need redundancy for the SLOG and L2ARC devices - ZFS will just take them offline if they fail - but to avoid that situation (and the associated performance degradation) it can be recommended to apply some redundancy anyway.

Finally, yes, there are plenty of discussions on the net about these, just google it :)
 
The SLOG supports writing ZIL contents to disk by layering a faster device - usually an SSD - between the ZIL and the slow spinning disks. So write operations first get committed to the SLOG (if present) and then to disk, every 5 seconds.
This is not entirely true. Only synchronous writes pass through the SLOG.

In short, the SLOG can be small, I'd say several 10 GB is more than enough.
Remember that the SLOG only needs enough capacity to hold the synchronous writes of one txg (5 seconds by default). Unless you change the txg timeout, a 5-8 GB SLOG is more than adequate.
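
As a rough sanity check (ZFS on Linux exposes the txg interval as a module parameter; the 1 GB/s figure below is just an assumed example of sync write throughput):
Code:
# current txg commit interval in seconds
cat /sys/module/zfs/parameters/zfs_txg_timeout
# rough upper bound for SLOG usage per txg:
#   max sync write throughput x txg interval
#   e.g. 1 GB/s x 5 s = ~5 GB of SLOG actually in use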

OTOH, the L2ARC can be arbitrarily large, but its required size depends on your "working set" or "read hot spots". It holds the most frequently accessed data from your spinning disks for fast read access. There's no rule of thumb for sizing it, but you can add an SSD (or a few striped ones) and watch the pool status for the used capacity online. You can add more later if needed.
There is a penalty to pay in RAM for the size of the L2ARC: each GB of L2ARC requires a certain amount of RAM for its headers, and since RAM (ARC) is more important than L2ARC, there is a point beyond which increasing the L2ARC size will only lower the performance of the pool.

You don't absolutely need redundancy for the SLOG and L2ARC devices - ZFS will just take them offline if they fail - but to avoid that situation (and the associated performance degradation) it can be recommended to apply some redundancy anyway.
The L2ARC only supports striped vdevs, so no mirror or raidz is available. For the SLOG, only striped or mirrored vdevs are supported. Rumor has it that raidz support for the SLOG is in the making.
 
@mir, thanks for the added wisdom. Just as a side note I'd like to point out that SLOG usage is influenced by certain dataset properties, too, sync writes aside.
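
For example, the sync and logbias dataset properties change how much a dataset relies on the SLOG; rpool/data below is just a placeholder dataset name:
Code:
# show the current settings
zfs get sync,logbias rpool/data
# force all writes through the ZIL/SLOG (safe, but slow without a fast SLOG)
zfs set sync=always rpool/data
# prefer writing large blocks straight to the pool instead of the SLOG
zfs set logbias=throughput rpool/data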

While we're at it, do you have a definitive source for the L2ARC/memory ratio? I haven't been able to gather very useful info about it. I'd imagine all the metadata for the L2ARC needs to sit in RAM (ARC), which is orders of magnitude less than the L2ARC itself. Some sources say an L2ARC between 10x and 20x the size of RAM can be used effectively.
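
For a back-of-the-envelope feel for that overhead (the per-record header size is an assumption here and differs between ZFS versions):
Code:
# every record cached in L2ARC keeps a small header in ARC (RAM)
# assuming ~70-200 bytes per header and an 8K average record size:
#   100 GB L2ARC / 8 KB per record ~= 12.5 million records
#   12.5M x 70-200 bytes           ~= roughly 0.9-2.5 GB of ARC
# so a very large L2ARC can eat a noticeable chunk of RAM on a small box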

And I think it's better to let the SLOG be larger than necessary - since it's not very large to start with - as the extra space helps with wear-leveling on cheaper SSDs.
 
While we're at it, do you have a definitive source for the L2ARC/memory ratio? I haven't been able to gather very useful info about it. I'd imagine all the metadata for the L2ARC needs to sit in RAM (ARC), which is orders of magnitude less than the L2ARC itself. Some sources say an L2ARC between 10x and 20x the size of RAM can be used effectively.

The thread below has a very thorough explanation and algorithm for measuring L2ARC consumption in ARC:
http://zfs-discuss.opensolaris.narkive.com/bH133S3V/summary-dedup-and-l2arc-memory-requirements
 
Well, that thread assumes one is using dedup, which I'm not interested in. It's a lot of stuff, but definitely not a definitive answer :) I think, as general guidance, simple SSD caching calculations would suit the original question by @brickmasterj better as well.
 
Thanks @mir and @kobuki for your insights, I will look into it further and play around with various settings.

For now, I've added a 120GB SSD to the servers and split the capacity in half, meaning effectively 55.9GB each for cache and log. This resulted in an already quite remarkable increase in fsyncs/sec, and so far the average RAM usage barely seems to have been hit, if at all:
Code:
root@pve:# pveperf
CPU BOGOMIPS:      15961.20
REGEX/SECOND:      730594
HD SIZE:           9921.88 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     3200.85
I will do some experimenting with various cache and log sizes, but this result is already a sufficient improvement as far as my initial question is concerned. Thanks.
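
For anyone wanting to reproduce that kind of split, it could look roughly like the following; /dev/sde and the by-id paths are example placeholders, not the exact commands used in this thread:
Code:
# partition the SSD into two halves (GPT)
sgdisk --new=1:0:+56G --new=2:0:0 /dev/sde
# first half as SLOG, second half as L2ARC
zpool add rpool log /dev/disk/by-id/ata-SSD_SERIAL-part1
zpool add rpool cache /dev/disk/by-id/ata-SSD_SERIAL-part2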
 
I've added a 120GB SSD to the servers and split the capacity in half, meaning effectively 55.9GB each for cache and log
For your pool size of 12 TB (4 x 4 TB minus one drive for parity), my recommendation would be:
Enterprise
Buy 2 x 80 GB DC S37xx and partition both with 10 GB for SLOG and 70 GB for L2ARC. Create a mirrored SLOG over the two 10 GB partitions and a striped L2ARC over the two 70 GB partitions (see the sketch after these recommendations).
Smaller business
Buy 2 x 80 GB DC S36xx and follow the partitioning for enterprise.
Home server
Buy 2 x 80 GB DC S35xx and follow the partitioning for enterprise. If losing the last 5 seconds of writes is acceptable, you could get along with only one disk.
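
A minimal sketch of that layout, assuming both SSDs are already partitioned as described (the by-id paths are placeholders):
Code:
# mirrored SLOG over the two 10 GB partitions
zpool add rpool log mirror /dev/disk/by-id/ata-SSD1_SERIAL-part1 /dev/disk/by-id/ata-SSD2_SERIAL-part1
# striped L2ARC over the two 70 GB partitions (cache devices are always independent/striped)
zpool add rpool cache /dev/disk/by-id/ata-SSD1_SERIAL-part2 /dev/disk/by-id/ata-SSD2_SERIAL-part2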

PS. For the DC series SSDs there is no need for manual under-provisioning since it is already provided by the manufacturer: the DC S36xx and DC S37xx are approx. 40% under-provisioned, while the DC S35xx is 27% under-provisioned.
 
PS. For the DC series SSDs there is no need for manual under-provisioning since it is already provided by the manufacturer: the DC S36xx and DC S37xx are approx. 40% under-provisioned, while the DC S35xx is 27% under-provisioned.

Could you please link us to the official statement from Intel regarding the factory underprovisioning of these models? Thanks.
 
