[SOLVED] Add ZFS log and cache SSDs: what do I have to buy and do?

fireon

Hello,

I would like to add cache and log SSDs to my existing pool: http://pve.proxmox.com/wiki/Storage:_ZFS#Add_Cache_and_Log_to_existing_pool

I don't use the root pool for data or VMs. I use an extra pool:

Code:
  pool: v-machines
 state: ONLINE
  scan: resilvered 1.09T in 4h45m with 0 errors on Sat May 23 02:48:52 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        v-machines                                      ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            ata-WDC_WD2001FFSX-68JNUN0_WD-WMC5C0D0KRWP  ONLINE       0     0     0
            ata-WDC_WD20EARX-00ZUDB0_WD-WCC1H0343538    ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            ata-WDC_WD2001FFSX-68JNUN0_WD-WMC5C0D688XW  ONLINE       0     0     0
            ata-WDC_WD2001FFSX-68JNUN0_WD-WMC5C0D63WM0  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            ata-WDC_WD20EARX-00ZUDB0_WD-WCC1H0381420    ONLINE       0     0     0
            ata-WDC_WD20EURS-63S48Y0_WD-WMAZA9381012    ONLINE       0     0     0

So I have to create two partitions on the SSD, 50/50. But what kind of SSD should I buy? And is it a good idea to mirror the cache and log, in case an SSD fails?
Would it be OK to buy two of the 60 GB Silicon Power S60 drives (550 MB/s read, 500 MB/s write)?

pve-manager/4.0-57/cc7c2b53 (running kernel: 4.2.3-2-pve)

Thanks for the information.

Best regards,
Fireon
 
I bought a Samsung enterprise SSD, using 1 GB for the log and the rest for the cache: SAMSUNG MZ7KM240HAGR-00005 (SSD SM863 series, 240 GB). It is amazing.
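For anyone wanting to reproduce this, here is a minimal sketch of the commands involved, assuming the SSD has already been partitioned into a small log partition and a large cache partition (the device path is a placeholder; use your own /dev/disk/by-id name):

Code:
# add the small first partition as a log (SLOG) device
zpool add v-machines log /dev/disk/by-id/ata-SAMSUNG_SM863_EXAMPLE-part1

# add the large second partition as an L2ARC cache device
zpool add v-machines cache /dev/disk/by-id/ata-SAMSUNG_SM863_EXAMPLE-part2

# the new vdevs should now show up under "logs" and "cache"
zpool status v-machines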
 
So I have to create two partitions on the SSD, 50/50.

Please explain why you are doing this. You should be using separate drives for cache and log.

But what kind of SSD should I buy? And is it a good idea to mirror the cache and log, in case an SSD fails?

It used to be very important to mirror your SLOG, because losing it could be a real problem; these days it is a lot more forgiving. There are still some corner cases where, if you are paranoid, you might want to mirror your SLOG, but it really is not as important as it once was.

To understand how the ZIL (which resides on the SLOG) works, we first have to understand synchronous and asynchronous writes. With synchronous writes, when you save data to the pool, ZFS first commits it to the physical drive and only then tells the client that it has been written.

Asynchronous writes speed things up by telling the client the data has been committed to disk while it is still being written. This lets the client move on and send more data before the write operation is entirely complete, but it introduces risk: the last second or so of data may exist only in RAM, and if you have a sudden crash or power loss, that data may be lost.

The SLOG aims to speed up synchronous writes by keeping the intent log (ZIL) separate from the pool on a fast device (battery-backed RAM or SSD). Under normal operation nothing is ever read from the SLOG: the system writes the data to the SLOG, then tells the client it is written, while the data is still being written from RAM to the physical disks. Once it has been committed to the physical disks, the data in the ZIL on the SLOG is discarded without ever having been read.

The only time your pool will ever read from the SLOG is if something has gone wrong (a crash or power loss) and the data in RAM was never fully committed to disk. In that case, the SLOG is read when the pool is next mounted, and the data is reconciled into the pool so that there is no data loss.

So, based on this, what would be the benefit of mirroring your SLOG? Well, you COULD lose the last second of synchronous write data IF your SLOG drive fails AND your system hangs/loses power at the same time. This is not all that likely, but it could happen if you - for instance - are struck by lightning. If it does happen, you only wind up losing the last second or so of synchronous write data.

I happen to have mirrored SLOG drives, but only because I got them back when SLOG failure was more dangerous than it is today. If I were shopping today, I probably wouldn't bother.

You definitely want your SLOG (ZIL) and your L2ARC (cache) on separate drives: firstly, they may be busy at the same time and slow each other down, and secondly they benefit from different drive types. Your L2ARC will benefit from a fast sequential drive with good write endurance; I like the Samsung 850 Pro for this. Your SLOG, on the other hand, will benefit from low-latency writes, and you'll want capacitor-backed drives that can flush their cache in the case of power loss. SLOGs don't need to be large, only enough to hold about a second's worth of writes. They will see almost constant writes, though, so you'll want decent write endurance.
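For reference, adding a dedicated log drive and a separate cache drive looks roughly like this (device paths are placeholders; a mirrored log just adds the mirror keyword):

Code:
# dedicated SLOG on its own low-latency, power-loss-protected SSD
zpool add v-machines log /dev/disk/by-id/ata-INTEL_SSDSC2BA100G3_EXAMPLE

# ...or mirrored, if you want the extra safety margin:
# zpool add v-machines log mirror /dev/disk/by-id/ata-SLOG_A_EXAMPLE /dev/disk/by-id/ata-SLOG_B_EXAMPLE

# L2ARC cache on a separate, fast sequential SSD
zpool add v-machines cache /dev/disk/by-id/ata-Samsung_SSD_850_PRO_EXAMPLE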

Back when I bought mine, everyone was suggesting the Intel S3700 drives for this, as they were both battery backed and had low write latency. I have no idea what the best current drives are for this purpose. I will say, though, that you don't want to pick a consumer SSD for this purpose, as it may actually be slower than going without a SLOG at all. Before I knew this, I originally tried consumer Samsung SSDs for this purpose and was horribly disappointed. The S3700s (I got the smallest 100GB ones, since I only needed a few hundred MB) made a huge improvement, but they are a few years old now, so there must be something better on the market.
 
Yes, my first idea was four SSDs: two for the log (mirrored) and two for the cache (striped). Proxmox support said that this is optimal but not really required; they said one disk for everything is OK...
But let me ask you: would you really spend, for example, two Samsung 850 Pros on a log that needs at most 100 MB of disk space?
What do you say to using the SSD only for the cache and leaving out the log on this drive? Could that be a problem? I think not.

Best Regards
 
100 MB could already be too much, given how slow these drives are for synchronous writes, but yes. Only about 5 seconds of synchronous writes are ever stored on the disk (according to the sebastien-list, 10 MB would be sufficient). The rest of the drive is not used at all, so caching on this device makes total sense with respect to space usage.
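As a rough sizing sketch (the 5-second figure matches ZFS's default transaction group commit interval, so the SLOG only ever has to hold about that much incoming synchronous data):

Code:
# rough SLOG size ≈ max sync write ingest rate x ~5 s txg interval
#  1 GbE:  ~125 MB/s x 5 s ≈ 0.6 GB
# 10 GbE: ~1250 MB/s x 5 s ≈ 6.3 GB
# anything beyond that on the log device is simply never used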

You have to keep in mind that during heavy SLOG usage combined with L2ARC writes, your SLOG becomes even slower, because the bandwidth has to be shared between SLOG and L2ARC. Also, the L2ARC itself needs ARC, so you end up with less ARC than before (which means slower best-case response times).
 
Yes, my first idea was four SSDs: two for the log (mirrored) and two for the cache (striped).

I'd argue you can go with one log drive and one cache drive. These days you only need to mirror your logs if you are REALLY paranoid.

You have to have the system crash/lose power at the same time as your log SSD fails (within a 1-5 second window) in order to have any resulting data loss, and if you do, it will only be the last 1-5 seconds' worth of writes. Chances are those would be corrupt anyway, because the write was probably interrupted when your system crashed/lost power.

Two striped drives for your cache device really isn't necessary. I did this because I got a good deal on the drives, but having one drive is more than enough for cache, IMHO.

Proxmox support said that this is optimal but not really required; they said one disk for everything is OK...

Really, are you sure you didn't misunderstand them? While I'd argue one of each is perfectly fine, putting cache and log on different partitions of the same drive is generally considered pretty poor practice, and may actually reduce performance rather than improve it.

But let me ask you: would you really spend, for example, two Samsung 850 Pros on a log that needs at most 100 MB of disk space?

Well, first off, let me be clear: I am NOT using the 850s for log devices. The 850s are fast for sequential-type consumer loads and work great as cache devices, but they are not very well suited as log drives. Their write latencies are too high.

But that being said, yes, this is the dilemma. Log drives will waste a lot of space, because all the good ones out there are WAAAAAY larger than you need them to be for a ZFS log device. It's probably not a bad thing, though, as this helps with write cycles and wear leveling, since they are going to be doing a lot of writes.

What do you say to using the SSD only for the cache and leaving out the log on this drive? Could that be a problem? I think not.

Well, that's really what it all comes down to. In your usage scenario, will you even see a benefit from a log device and a cache device? Some implementations don't. How large is your working data set? If it fits entirely inside your cache device, you will see good performance. If it doesn't, the benefit of a cache device will be marginal.

Do you do lots of sync writes? If yes, a good log device will help; if not, you'd probably not even notice it was there.
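If you're not sure whether your workload actually does many sync writes, a couple of quick checks can settle it (pool and dataset names are placeholders):

Code:
# see whether a dataset honours, forces or disables sync writes
zfs get sync v-machines/vmdata

# watch per-vdev activity; a busy "logs" row means you really are
# issuing sync writes that a SLOG could absorb
zpool iostat -v v-machines 5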

Which begs the question: does anyone make a small (like 30-60 GB) enterprise-class SSD?

I've been looking around a lot, and I have not found any. You really want an enterprise-class drive for a log, and they are all HUGE for the task. My S3700s are OK, but occasionally I have looked around for good replacements, and haven't found any.

Also, I don't know if this has been linked in this thread before, but here is some performance testing that is relevant to selecting a log device:

This one is a little old, from back when I bought my S3700s.

This one is more comprehensive and newer. It was written from the perspective of a journal device, but the same performance characteristics matter for a ZFS SLOG, and it clearly illustrates that most consumer SSDs are really very poor at SLOG duty, even to the point where it might be faster not to add one at all.
 
As far as I know, with one log drive, if it dies the pool may be broken.

We use two high-end, recommended Intel drives: the write log is mirrored across their first partitions, and the cache uses the second partitions.
 
As far as I know, with one log drive, if it dies the pool may be broken.

We use two high-end, recommended Intel drives: the write log is mirrored across their first partitions, and the cache uses the second partitions.


This USED to be the case several pool revisions ago. I can't find the exact pool revision it changed in, but it hasn't been a problem since probably 2012 or so. These days, when a ZIL fails you just lose any data that needs to be read from the ZIL, and that only happens if the system goes down prematurely, because under normal operation any data in the ZIL is written directly to the pool from RAM, then discarded from the ZIL once committed to stable disks.
 
Yeah, so "Log Device Removal" was added as a feature in ZFS Pool Version 19 in September 2010.

Ever since then, removing a log device no longer kills the pool. You just lose any uncommitted pending data in the log, which you would only have if your system crashed or lost power at the same time (or within seconds of) your log device failing.
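In practice this also means a log device can simply be detached again, for example to swap it out for a better one. A minimal sketch (the device path is a placeholder):

Code:
# remove an existing log device; any pending ZIL records are
# committed to the main pool before the device is detached
zpool remove v-machines /dev/disk/by-id/ata-OLD_SLOG_EXAMPLE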


Proper Solaris ZFS only went up to pool version 33 or so (I think) before Oracle made it proprietary, so the open source efforts forked off of it.

Once that happened, they all went to pool version 5000 instead, which incorporates all previous pool versions' features, plus so-called "feature flags" for any features added after the switch to 5000.

You can verify you are on 5000 by doing "zpool get version <pool>" from the command line. If you see a dash instead of a pool version number, you are on 5000.

If you want to see which post-switch features are implemented, you can run "zpool get all <pool> | grep feature@".

On my Proxmox 4.2 box it looks like this:

Code:
root@proxmox:~# zpool get version rpool
NAME   PROPERTY  VALUE    SOURCE
rpool  version   -        default

and

Code:
root@proxmox:~# zpool get all rpool |grep feature@
rpool  feature@async_destroy       enabled                     local
rpool  feature@empty_bpobj         active                      local
rpool  feature@lz4_compress        active                      local
rpool  feature@spacemap_histogram  active                      local
rpool  feature@enabled_txg         active                      local
rpool  feature@hole_birth          active                      local
rpool  feature@extensible_dataset  enabled                     local
rpool  feature@embedded_data       active                      local
rpool  feature@bookmarks           enabled                     local
rpool  feature@filesystem_limits   enabled                     local
rpool  feature@large_blocks        enabled                     local

I do not know what the difference between "active" and "enabled" is though.
 
Yeah, so "Log Device Removal" was added as a feature in ZFS Pool Version 19 in September 2010.

Ever since then, removing a log device no longer kills the pool. You just lose any uncommitted pending data in the log, which you would only have if your system crashed or lost power at the same time (or within seconds of) your log device failing.


Let me add one caution to this.

If you created your pool a long time ago on a different system and imported it to your Proxmox box, it MAY still be running an old pool version, despite running on a more modern implementation of ZFS.

You can correct this by upgrading your pool using the "zpool upgrade" command.
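A minimal sketch of checking and upgrading; keep in mind the upgrade is one-way, so older ZFS implementations will no longer be able to import the pool afterwards:

Code:
# list pools that are not yet on the current on-disk format
zpool upgrade

# show the versions and feature flags this ZFS build supports
zpool upgrade -v

# upgrade a specific pool (one-way operation)
zpool upgrade v-machines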
 
Yeah, so "Log Device Removal" was added as a feature in ZFS Pool Version 19 in September 2010.

Ever since then, removing a log device no longer kills the pool. You just lose any uncommitted pending data in the log, which you would only have if your system crashed or lost power at the same time (or within seconds of) your log device failing.

Thanks for the updated information. I started using ZFS before 2010 and have not kept up with the release notes.
 
This is an old but interesting thread.

In Proxmox's own "ZFS tips and tricks" (and here), it is indeed mentioned that if you have only one SSD, you should split it for caching and logs:
Use flash for caching/logs. If you have only one SSD, use parted or gdisk to create a small partition for the ZIL (ZFS intent log) and a larger one for the L2ARC (ZFS read cache on disk). Make sure that the ZIL is on the first partition. In our case we have an Express Flash PCIe SSD with 175 GB capacity and set up a ZIL of 25 GB and an L2ARC cache partition of 150 GB.
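For concreteness, the split described in that tip would look roughly like this; the pool name and device path are placeholders and the partition sizes just follow the wiki example (sgdisk is the non-interactive counterpart of gdisk):

Code:
# small first partition for the ZIL/SLOG, larger second one for L2ARC
sgdisk -n 1:0:+25G  /dev/disk/by-id/nvme-EXAMPLE_SSD
sgdisk -n 2:0:+150G /dev/disk/by-id/nvme-EXAMPLE_SSD

# attach both partitions to the pool
zpool add v-machines log   /dev/disk/by-id/nvme-EXAMPLE_SSD-part1
zpool add v-machines cache /dev/disk/by-id/nvme-EXAMPLE_SSD-part2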
My specific question now is:
  • what is the recommendation if you have Proxmox sitting on a simple ZFS 2x SSD mirror (RAID 1) and you don't/can't have additional SSDs for caching & logs?
  • Does it make sense from a performance perspective to split up the SSD and create separate partitions for the log and cache?
 
A SLOG only makes sense if you get a lot of sync writes (async writes won't be write-cached at all) and if your SLOG disk is much faster than the disks you want the write cache for. So using an additional partition for the SLOG on the same SSDs won't make any sense, because without a SLOG the ZIL is written to those SSDs anyway.
And the L2ARC won't help in most situations and can even make things worse. With L2ARC you are basically sacrificing fast ARC in RAM for more, but slower, L2ARC on SSDs. The bigger your L2ARC is, the more RAM it will consume. In general, an L2ARC should only be used if your ARC isn't big enough and you already have the maximum amount of RAM your mainboard supports. If your RAM isn't maxed out, it would make more sense to buy more RAM.
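To check whether the ARC really is the limiting factor before buying anything, a quick look at the standard ZFS-on-Linux kstats helps (the paths are the stock module/kstat locations):

Code:
# current ARC size and configured maximum, in bytes
awk '/^(size|c_max) / {print $1, $3}' /proc/spl/kstat/zfs/arcstats

# ARC hit/miss counters give a feel for how well it already works
awk '/^(hits|misses) / {print $1, $3}' /proc/spl/kstat/zfs/arcstats

# ARC limit module parameter (0 = default, roughly half of RAM)
cat /sys/module/zfs/parameters/zfs_arc_max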
 
This is an old but interesting thread.

In Proxmox's own "ZFS tips and tricks" (and here), it is indeed mentioned that if you have only one SSD, you should split it for caching and logs:

My specific question now is:
  • what is the recommendation if you have Proxmox sitting on a simple ZFS 2x SSD mirror (RAID 1) and you don't/can't have additional SSDs for caching & logs?
  • Does it make sense from a performance perspective to split up the SSD and create separate partitions for the log and cache?

Short Answer:

No.

Long Answer:

The different vdevs used to speed things up (cache aka L2ARC, SLOG aka a dedicated ZIL drive, and more recently the special allocation class) all take advantage of the underlying drives to do their thing.

A cache vdev - for instance - only makes sense if it is faster than the main pool drives, probably by a good margin.

If your pool consists of spinning metal drives, one or more SSDs might make sense as a cache device. If your pool consists of SATA SSDs, one or more NVMe SSDs might make sense as a cache vdev.

In reality, cache vdevs don't seem to help much even when they are on separate, faster drives, outside of very special workloads that operate within a known working set (smaller than the cache device) and access that working set frequently.

A SLOG/dedicated ZIL is a very special-purpose type of drive that only helps accelerate synchronous writes, nothing else, AND relies on very particular attributes of an SSD, meaning most SSDs would actually make for a pretty bad SLOG. If you want one, make sure you actually do sync writes, and make sure you find a special SSD that has very low latency as well as a capacitor- or battery-backed RAM cache.

A special allocation class drive moves the metadata off the main pool to a faster drive, and can also optionally be used to store small blocks that might otherwise slow down the main pool. Much like the cache device, these also rely on being faster than the underlying drives in the main pool vdevs.
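For completeness, adding a special allocation class looks roughly like this; unlike cache and log vdevs it holds pool-critical data, so it should always be redundant (pool and device names are placeholders):

Code:
# mirrored special vdev for metadata (pool-critical, hence the mirror)
zpool add v-machines special mirror /dev/disk/by-id/ata-SSD_A_EXAMPLE /dev/disk/by-id/ata-SSD_B_EXAMPLE

# optionally also keep blocks up to 16K on the special vdev
zfs set special_small_blocks=16K v-machines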

If you can't add more drives and would just be playing with partitions, there is little point in bothering with any of these three. It is highly unlikely to help performance, and may even introduce instability.

I wouldn't recommend it.
 
Very elaborate answer, crystal clear to me now; thanks @mattlach and @Dunuin!

I have two ZFS pools/mirrors: one with two consumer-grade SSDs (Proxmox OS) and one with two SATA HDDs (VMs and data).

I considered assigning part of the SSD space to two dedicated SLOG/cache (L2ARC) partitions for the SATA HDD pool. But I now understand that doesn't make sense, a) without enterprise-grade SSDs and b) because my workload use case doesn't justify it.
 
A special allocation class drive moves the metadata off the main pool to a faster drive, and can also optionally be used to store small blocks that might otherwise slow down the main pool. Much like the cache device, these also rely on being faster than the underlying drives in the main pool vdevs.
Just to be clear for others reaching this page: it is driveS, plural. You don't want a SPOF in your ZFS setup (otherwise this is perfectly explained, and also recommended for smaller datasets like PVE itself).

Metadata drives (enterprise SSDs) are a very, very good speed improvement for spinning rust. The best bang-for-the-buck solution for the ZIL is a small Optane (e.g. 16 GB) PCIe drive, which also increases synchronous write performance tremendously and is very cheap, even with a PCIe adapter if you don't have a dedicated slot for it.
 
