ZFS with no log and cache

jena

Member
Jul 9, 2020
Hi all,

I created two ZFS pools with the simplest command, zpool create <pool> <device>, at the beginning.
They seem to have no log (aka ZIL or SLOG) and no cache device. I believe I turned on lz4 compression for all three pools.
My question is: in my case, do I need a log or cache device for each pool?

zpool status returns:
Code:
root@mars:~# zpool status
  pool: hddpool
state: ONLINE
  scan: scrub repaired 0B in 0 days 00:30:04 with 0 errors on Sun Nov  8 00:54:05 2020
config:

        NAME                                   STATE     READ WRITE CKSUM
        hddpool                                ONLINE       0     0     0
          mirror-0                             ONLINE       0     0     0
            ata-HGST_HUH721212ALE604_XXXXXXXX  ONLINE       0     0     0
            ata-HGST_HUH721212ALE604_XXXXXXXX  ONLINE       0     0     0

errors: No known data errors

  pool: nvmepool
state: ONLINE
  scan: scrub repaired 0B in 0 days 00:10:42 with 0 errors on Sun Nov  8 00:34:45 2020
config:

        NAME           STATE     READ WRITE CKSUM
        nvmepool       ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            nvme-eui.  ONLINE       0     0     0
            nvme-eui.  ONLINE       0     0     0
            nvme-eui.  ONLINE       0     0     0
            nvme-eui.  ONLINE       0     0     0
            nvme-eui.  ONLINE       0     0     0
            nvme-eui.  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
state: ONLINE
  scan: scrub repaired 0B in 0 days 00:00:07 with 0 errors on Sun Nov  8 00:24:11 2020
config:

        NAME                                   STATE     READ WRITE CKSUM
        rpool                                  ONLINE       0     0     0
          mirror-0                             ONLINE       0     0     0
            ata-Samsung_SSD_860_EVO_1TB-part3  ONLINE       0     0     0
            ata-Samsung_SSD_860_EVO_1TB-part3  ONLINE       0     0     0

errors: No known data errors

rpool was created by the proxmox install
- two mirrored 1TB Samsung 860EVO

nvmepool was created by simple zpool create
- raid-z2 of six Western Digital SN750 2TB

hddpool was created by simple zpool create
- mirrored WD DC530 12TB x 2

Other specs are:
Threadripper 3970x
256GB DDR4 3200MHz Corsair LPX

This machine is for scientific research computation (not super mission critical). It is meant as an improvement over a single Dell workstation: multiple lab members can use their own VMs with GPU passthrough for machine learning and other data processing.
 
My question is: in my case, do I need log or cache for each pool?
L2ARC is only useful if you have already maxed out your RAM and still need more read caching.
A SLOG is only useful if your workload does a lot of sync writes and the SLOG device has lower latency than the normal drives in the pool. It isn't used for async writes.
An SSD mirror as a special device might be an option for your HDD pool if you need more performance.
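One rough way to judge whether RAM is already covering your reads, before thinking about L2ARC, is the ARC hit ratio. The sketch below assumes OpenZFS on Linux, where the counters are exposed as "name type data" rows in /proc/spl/kstat/zfs/arcstats. The function name and the sample numbers are made up for illustration and are not measurements from this server.

```python
def arc_hit_ratio(arcstats_text: str) -> float:
    """Return hits / (hits + misses) parsed from arcstats-style text."""
    counters = {}
    for line in arcstats_text.splitlines():
        parts = line.split()
        # Data rows look like "<name> <type> <value>"; header rows are skipped.
        if len(parts) == 3 and parts[2].isdigit():
            counters[parts[0]] = int(parts[2])
    hits = counters["hits"]
    misses = counters["misses"]
    total = hits + misses
    return hits / total if total else 0.0


# Illustrative sample only, not real data.
SAMPLE = """\
name                            type data
hits                            4    980000
misses                          4    20000
"""

if __name__ == "__main__":
    print(f"ARC hit ratio: {arc_hit_ratio(SAMPLE):.1%}")
    # On a real host (assumption: OpenZFS on Linux) one could read the file instead:
    # with open("/proc/spl/kstat/zfs/arcstats") as f:
    #     print(f"{arc_hit_ratio(f.read()):.1%}")
```

If the ratio is already very high with 256GB of RAM, an L2ARC is unlikely to change much.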
nvmepool was created by simple zpool create
- raid-z2 of six Western Digital SN750 2TB
Keep in mind that these drives are consumer SSDs without power-loss protection and with only TLC NAND. Using them with ZFS is not recommended because of the high wear caused by write amplification and CoW, especially if your workload uses sync writes like DBs do. You really should monitor the wear using smartctl and calculate how long it will take until the rated TBW is reached. Sometimes consumer SSDs only last a few months if you write a lot or the write amplification is too high.
You should also calculate the best volblocksize for that pool, or you will lose capacity and increase wear.
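To show why volblocksize matters on a raidz2 of six drives, here is a minimal sketch of the raidz allocation rounding as I understand it from OpenZFS: data sectors, plus parity sectors per stripe row, padded up to a multiple of (parity + 1). It assumes ashift=12 (4K sectors), which the thread does not state, and it ignores compression and metadata. The function names are hypothetical.

```python
def raidz_allocated_sectors(block_bytes: int, disks: int, parity: int, ashift: int = 12) -> int:
    """Sectors a single block occupies on a raidz vdev, parity and padding included."""
    sector = 1 << ashift
    data = -(-block_bytes // sector)          # ceil: data sectors
    rows = -(-data // (disks - parity))       # ceil: stripe rows needed
    total = data + parity * rows              # each row carries `parity` parity sectors
    pad_unit = parity + 1
    return -(-total // pad_unit) * pad_unit   # round up to a multiple of parity + 1


def storage_efficiency(block_bytes: int, disks: int, parity: int, ashift: int = 12) -> float:
    """Fraction of the allocated raw space that holds real data."""
    allocated = raidz_allocated_sectors(block_bytes, disks, parity, ashift) << ashift
    return block_bytes / allocated


if __name__ == "__main__":
    # 6-disk raidz2 as in nvmepool; ashift=12 is an assumption.
    for kib in (4, 8, 16, 32, 64, 128):
        eff = storage_efficiency(kib * 1024, disks=6, parity=2)
        print(f"volblocksize {kib:>3}K -> {eff:.1%} of raw space is data")
```

Small block sizes pay for parity and padding on every block, so both usable capacity and wear suffer compared with a size that fills the stripe.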
 
Thanks for the very detailed reply.

This server is something of a proof of concept, and a relatively affordable one, to show my boss how virtualization can make use of the high-core-count Threadripper's performance. With the lessons learned on this one, we might consider a Lenovo P620 with a Zen 3 Threadripper as the next one in maybe two years.

For the nvmepool, it is used to store VMs only.
I just checked the SMART data (from the Proxmox GUI, Disk -> show SMART values): total writes on each SSD are about 1.4TB (running for about 6 months), and Available Spare is at 100%.
Total utilization of the nvmepool is currently at 10%.
The WD SN750 has 1200 TBW endurance for the 2TB model and 600 TBW for the 1TB model.
Am I safe using those SSDs for the next 5 years without reaching the TBW limit?

Data is on the hddpool (two mirrored 12TB HDD). The majority of the data and computation are images (about 10-50MB each).

I am new to the concept of sync write. For example, what type of workload uses sync write?

Thanks a lot.
 
For the nvmepool, it is used to store VMs only.
I just checked the SMART data (from the Proxmox GUI, Disk -> show SMART values): total writes on each SSD are about 1.4TB (running for about 6 months), and Available Spare is at 100%.
Total utilization of the nvmepool is currently at 10%.
The WD SN750 has 1200 TBW endurance for the 2TB model and 600 TBW for the 1TB model.
Am I safe using those SSDs for the next 5 years without reaching the TBW limit?
That sounds fine. Check the wear-out indicators from time to time, but at that rate the drives should last for many years. I really thought your wear would be much higher. My raidz1 pool of 5x 200GB SSDs, used for VMs only, writes 0.6TB per day, and that is just a home server without much write activity. It wouldn't be unusual for a pool like yours to write several TB per day.
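The estimate behind "many years" can be sketched as simple arithmetic on the numbers given in the thread (1.4TB written in about 6 months, 1200 TBW rating). It assumes the write rate stays constant, which is an assumption and not a prediction. The function name is hypothetical.

```python
def years_until_tbw(written_tb: float, elapsed_years: float, rated_tbw: float) -> float:
    """Years left until the rated TBW is reached at a constant write rate."""
    if written_tb <= 0 or elapsed_years <= 0:
        raise ValueError("written_tb and elapsed_years must be positive")
    rate_tb_per_year = written_tb / elapsed_years
    return (rated_tbw - written_tb) / rate_tb_per_year


if __name__ == "__main__":
    # Numbers from the thread: 1.4 TB in about 6 months, 1200 TBW rating.
    print(f"current rate: {years_until_tbw(1.4, 0.5, 1200.0):.0f} years left")
    # Hypothetical heavy load of 2 TB per day per drive (365 TB in half a year).
    print(f"heavy load:   {years_until_tbw(365.0, 0.5, 1200.0):.1f} years left")
```

Re-running this with fresh SMART numbers every few months will show whether the write rate has changed.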
Data is on the hddpool (two mirrored 12TB HDD). The majority of the data and computation are images (about 10-50MB each).
I am new to the concept of sync write. For example, what type of workload uses sync write?
That depends on the software you are using. Async writes can be cached in the RAM on your mainboard and in the drive's own RAM. RAM is volatile, so as soon as you hit a power outage or something similar, all cached data is lost. Normally writes are asynchronous: the program tells Linux/the drive to store something without waiting for an answer confirming that everything was stored successfully.
But a program can be coded to use synchronous writes if the data is really important, as in mission-critical or scientific applications where the result must always be valid. In that case the writes are done one by one, and the next write only starts once the program has received confirmation that the previous one succeeded. Because of that, the OS can't cache them in RAM, and a drive can only cache them if it has some kind of power-loss protection that lets it save the data in its RAM even if the power supply fails.
SSDs can only write blocks of several hundred KB or maybe a few MB at once. Let's say 2MB for example. If a program wants to do 1000 async writes of 8KB each, these will be cached and the SSD writes them as, for example, 4x 2MB blocks. That way the write amplification isn't that bad and only 8MB in total is written to the NAND flash of the SSD. But if a program does 1000 sync writes to an SSD without power-loss protection, the SSD can't cache them: it receives 8KB of data, writes a full 2MB block to the NAND flash, sends a confirmation, and then repeats that another 999 times. In total 2GB (1000x 2MB) is written to the NAND flash to store these 8MB (1000x 8KB) of real data, and you get an extreme write amplification of factor 250.
A database like MySQL, for example, uses sync writes to store tiny blocks of 8KB or 16KB, and does so several times per second. So keep that in mind when you install a new program in a VM. That program might use a DB as storage in the background without you noticing.
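The arithmetic in the 2MB example can be checked with a few lines. This is only a model of that simplified example (decimal units, every uncached write costing one full flash block); real SSD controllers behave in more complicated ways. The function name is hypothetical.

```python
import math


def write_amplification(n_writes: int, write_bytes: int, flash_block_bytes: int, cached: bool) -> float:
    """Physical bytes written to flash divided by logical bytes the program wrote."""
    logical = n_writes * write_bytes
    if cached:
        # Async case: writes are merged in cache and flushed as full blocks.
        blocks = math.ceil(logical / flash_block_bytes)
    else:
        # Sync case without power-loss protection: every write costs whole block(s).
        blocks = n_writes * math.ceil(write_bytes / flash_block_bytes)
    return (blocks * flash_block_bytes) / logical


KB = 1000
MB = 1000 * KB

if __name__ == "__main__":
    print("async:", write_amplification(1000, 8 * KB, 2 * MB, cached=True))
    print("sync: ", write_amplification(1000, 8 * KB, 2 * MB, cached=False))
```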

And an example why a program might want to do a sync write:
Let's say a program depends on a file that must always be valid. If you overwrite that file with an async write, you will never know whether something went wrong and the file got corrupted. So you want to create a backup of the file first and only change the file once the backup is safely stored. If you do that with async writes, your write operations aren't always handled in the order you issued them. If the power supply fails, it is possible that the file is only half overwritten while the backup is still in the cache and lost. You end up with a corrupted file and no backup. But if you do it with sync writes, you only start overwriting the file after you have received the acknowledgement that the backup is stored and safe. That way you might lose the changes that should have been saved to the file, but at least you still have the file in its valid state from before you tried to change it, and not everything is lost.
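The backup-first pattern described above could look roughly like this in a program. It is a minimal sketch and not how any particular application does it: os.fsync asks the OS to push the file's data to the device before the call returns, and whether the drive itself really has the data on stable storage still depends on the hardware. The function names and file names are hypothetical.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data and wait until the OS reports it flushed to the device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's own buffer to the OS
        os.fsync(f.fileno())   # ask the OS to flush its cache for this file


def safe_overwrite(path: str, new_data: bytes) -> str:
    """Store a backup durably first, and only then overwrite the original."""
    backup = path + ".bak"
    with open(path, "rb") as f:
        old_data = f.read()
    write_durably(backup, old_data)   # step 1: backup must be confirmed first
    write_durably(path, new_data)     # step 2: now it is safe to touch the file
    return backup


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "result.dat")
    write_durably(target, b"valid old state")
    backup_path = safe_overwrite(target, b"new state")
    print("file:  ", open(target, "rb").read())
    print("backup:", open(backup_path, "rb").read())
```

Each fsync is a sync write from the storage point of view, which is why this kind of code is slower and wears consumer SSDs more than plain buffered writes.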

That caching is also why you shouldn't remove a USB stick from the computer without telling the OS to eject it first. You copy some files to the stick and your OS tells you that the copy operation is finished. But in reality maybe only half of the files are on the stick and the other half is still in the cache in your computer's RAM. If you just unplug the stick now, only half of the files are stored on it. If you tell your OS to eject it, it makes sure that everything in the cache is written to the stick before you get the message that you may remove it. That's because async writes are used. You never know what has been written and what hasn't unless you use sync writes.
You can also force every async write to be handled as a sync write (the ZFS option for that is "sync=always"). This makes everything slower but also safer. If you force every write to be a sync write and copy files to your USB stick, you can be sure that the copy operation is really finished as soon as the progress bar reaches 100%. It's also possible to force every sync write to be handled as an async write to increase speed and lower the write amplification, but in most cases that isn't a good idea.
 