ZFS Tests and Optimization - ZIL/SLOG, L2ARC, Special Device

Since many members here have quite a lot of experience and knowledge of ZFS, I am not only trying to find the best/optimal configuration for my ZFS setup, but also want to collect some tests and information in one place for Proxmox and ZFS, rather than having it scattered around different threads, posts and websites on the internet.

Apologies for the many paste links, but the forum does not currently allow posting such long content ("Please enter a message with no more than 10000 characters", even though my post was around 9k).

The system:
  • CPU: AMD EPYC 7371
  • RAM: 256 GB running at 2666 MHz (M393A4K40CB2-CVF)
  • HDD: 2× 6TB HDD SATA Soft RAID (HGST_HUS726T6TALE6L1)
  • NVME: 2× 960GB SSD NVMe (SAMSUNG MZQLB960HAJR-00007)
  • Network: 10 Gbps
The system is as is, hardware cannot be added or removed.

ARC settings (as per the docs): https://ghostbin.co/paste/8pkku
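
In case that paste expires: the ARC limit from the docs is set via a module option, roughly like this (the value below is only an illustration, not necessarily what I used):

Code:
# /etc/modprobe.d/zfs.conf -- example value: cap ARC at 32 GiB
options zfs zfs_arc_max=34359738368

# if the root filesystem is on ZFS, refresh the initramfs and reboot
# update-initramfs -u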

The goal:
  • Run as many KVM/WIN machines as possible, without lag inside them
  • Keep the server load as low as possible, without dedicating all the RAM to ARC and L2ARC
  • Prevent the server from crashing when running many VMs (obviously)
What I know about ZFS so far:
  • ZFS (Zettabyte File System) is an amazing and reliable file system.
  • ZIL stands for ZFS Intent Log. Its purpose is to log synchronous operations to disk before they are written to your array.
  • A ZIL SLOG is essentially a fast persistent (or essentially persistent) write cache for ZFS storage.
  • ARC is the ZFS main memory cache (in DRAM), which can be accessed with sub-microsecond latency.
  • L2ARC sits in between, extending the main memory cache using fast storage devices, such as flash-based SSDs.
  • The Special Device is what the Special Allocation Class is called: essentially you add a fast SSD and it speeds up the slow disks?! Not much info out there. It cannot be removed.
  • KVM machines use synchronous writes instead of asynchronous ones.
  • The block size should match the hardware; for normal HDDs (and in my case) that is 4k.
  • A ZIL SLOG should be added "if needed", yet some posts/websites say it's a must. It is also mentioned that the SLOG must be mirrored (RAID1), but some say that's not required.
  • L2ARC should be added "if needed", and it eats into ARC (RAM) for its headers. It is not clear what ratio is used; for example, if you add a 400 GB SSD, how much ARC will it consume?
What I don't know:
  • How to measure/find the optimal block size for the pool, running only KVM/WIN machines
  • Do I really need a SLOG and L2ARC?
  • How to set the sizes/limits for SLOG and L2ARC so they are optimal
  • Do I need to add a special device?
  • Can you mix SLOG/L2ARC with a special device?

Current pool status: https://ghostbin.co/paste/ups72

Current ARC summary: https://pastebin.com/y4WRCvbb

ZDB report: https://pastebin.com/eW9JRy1N

PVEPERF: https://ghostbin.co/paste/gpmos

Small tests using FIO, commands explained:
  • --direct=0 # O_DIRECT is not supported in synchronous mode
  • --name=seqread # Job name (fio also uses it as the default file name)
  • --rw=read # Type of test
  • --ioengine=sync # Defines how the job issues I/O. We will be using SYNC since our pool has sync set to "standard"
  • --bs=4k # Matches the block size of these HDDs
  • --numjobs=1 # Run a single job
  • --size=1G # File size
  • --runtime=600 # Terminate after the specified number of seconds
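
Put together, the sequential read job looked roughly like this (fio uses the job name as the default file name in the working directory):

Bash:
# fio --direct=0 --name=seqread --rw=read --ioengine=sync --bs=4k --numjobs=1 --size=1G --runtime=600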

Results:

Now let's check ARC summary after these small tests: https://pastebin.com/Gq9jkJS6

As you can see, the Cache Hit Ratio is 99.46%. But this is (I'm guessing) because our tests used small files, small enough to fit in the ARC. So let's run a bigger test, something that cannot fit in the ARC.

Random reads/writes – SYNC mode with a 100 GB file and 4k block size. During the layout (loading) of fio the ARC was growing, with about 5% IO delay and 4-5% CPU usage; however, once the ARC was full, the IO delay jumped to 10-19% and even 28% (I'm guessing because it was now writing directly to the disks).
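
The command was along these lines (same flags as the small test, just with random mixed I/O and a 100 GB file; I am not listing the exact read/write mix here):

Bash:
# fio --direct=0 --name=randrw --rw=randrw --ioengine=sync --bs=4k --numjobs=1 --size=100G --runtime=600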

  • First run: https://pastebin.com/N0PqXtFA As we know, the average for a spinning HDD is around 75-100 iops, and we see an increase to about 135 iops on average. This is not the same as with the small 1 GB files, and my guess is that this is because the ARC is full.
  • Second run: https://pastebin.com/GtiQYHC8 The layout was instant this time, because the data is cached in the ARC. We see an increase to about 150 iops on average.
  • Third run: https://pastebin.com/nAGFrGyj This time we see even fewer iops, and I don't understand exactly why. In my opinion it should be the same as run #2 or even better? The average iops is now like run #1, or even lower.

Now let's check ARC summary after these bigger tests: https://ghostbin.co/paste/xcsz5 (sorry, I reached my limit on pastebin and I will not create an account there)

I do not understand everything in the stats and what they represent, but I will leave that to someone willing to explain. All I can tell is that the cache hit ratio is a bit higher now.

Not quite happy with the results, so I will now try adding a ZIL SLOG. The docs (https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#_limit_zfs_memory_usage) say to create a partition for the ZIL and one for L2ARC.

The partition for the ZIL should be half the size of the system memory (as the docs say), so in my case I made it 125 GB, and the remaining 769 GB I will dedicate to L2ARC.
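
Roughly, the partitioning looked like this (sgdisk shown as an example; adjust the device name and sizes, the BF01 type code is optional):

Bash:
# sgdisk -n 1:0:+125G -t 1:BF01 /dev/nvme0n1   # SLOG partition
# sgdisk -n 2:0:0     -t 2:BF01 /dev/nvme0n1   # remainder for L2ARC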

Adding the SLOG, without mirror just for testing:

Code:
# zpool add rpool log nvme0n1p1
# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        rpool        ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            sda2     ONLINE       0     0     0
            sdb2     ONLINE       0     0     0
        logs
          nvme0n1p1  ONLINE       0     0     0

errors: No known data errors

Now let's get back to testing, again with Random Read/Writes – SYNC mode with 100 GB file and 4k block size. The ARC was cleared, the file was removed and we start fresh with fio.

Now let's check ARC summary after these bigger tests: https://ghostbin.co/paste/94s8b

Overall, after adding the ZIL/SLOG NVMe SSD, the performance has not increased at all from what I can see. The next step is to add L2ARC and test:

Bash:
# zpool add rpool cache nvme0n1p2
# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        rpool        ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            sda2     ONLINE       0     0     0
            sdb2     ONLINE       0     0     0
        logs
          nvme0n1p1  ONLINE       0     0     0
        cache
          nvme0n1p2  ONLINE       0     0     0

errors: No known data errors

Now let's get back to testing, again with Random Read/Writes – SYNC mode with 100 GB file and 4k block size. The ARC was cleared, the file was removed and we start fresh with fio.

  • First run: https://ghostbin.co/paste/7png2 I can finally see some improvement! On average 160 iops, and what was more impressive is that the CPU load stayed under 1% after the fio layout. IO delay was constantly around 3%. Good (?). Here are some arcstats towards the end: https://ghostbin.co/paste/b4c9o
  • Second run: https://ghostbin.co/paste/c7x8x Things are starting to look good! The average is now 270 iops, almost double. At the beginning it even jumped to 4k iops, but I guess that's just the ARC.
  • Third run: https://ghostbin.co/paste/zpkuq Even more of an increase this time, with an average of 335 iops, which shows a good improvement.

Now let's check ARC summary after these bigger tests: https://ghostbin.co/paste/ktd8h

Interestingly, the Cache Hit Ratio is now 99.52%, which is an increase compared to running without L2ARC and SLOG. The L2ARC Hit Ratio is 65.01%; I'm not sure whether this can or will increase.
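
To keep an eye on this over time, arcstat can print ARC and L2ARC hit rates periodically (field names may differ slightly between versions):

Bash:
# arcstat -f time,read,hit%,l2read,l2hit%,l2size 5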

Conclusion:
  • ZIL/SLOG without L2ARC does not improve anything
  • Server load (mostly CPU) is much lower and the system behaves better with ZIL/SLOG + L2ARC.

More questions:
  • Is this setup optimal? Is there anything that needs to be or can be tweaked, without adding hardware of course?
  • Would adding a special device on top of this be a bonus? The second NVMe just sits there doing nothing right now.
  • Does the NVMe need to be mirrored (RAID1) or can it stay as is? It is a datacenter-class drive, so lifetime should not be an issue for a good while.

Thank you for those taking the time to read my (long) post with questions.
 
VM testing, important observations:

  • If you set the disk cache to "writeback", it will not use L2ARC, the IO delay on PVE jumps like crazy (even to 40%) when booting the VM, and working inside the VM is horrible.
  • Setting the disk cache to "none" makes use of L2ARC, the VM boots quite fast, and operations inside the VM are OK as well.
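
For reference, the cache mode can be changed per disk in the GUI or with qm set; note that qm set replaces the whole option string for that disk, so the other options have to be repeated (sketch only, most options omitted):

Bash:
# qm set 100 --scsi0 vmpool:vm-100-disk-0,cache=none,discard=on,iothread=1,size=300G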

VM config:

Bash:
qm config 100
agent: 1
bios: seabios
bootdisk: scsi0
cores: 2
cpulimit: 2
ide2: none,media=cdrom
ipconfig0: ip=dhcp,ip6=dhcp
memory: 2048
name: 4545
net0: virtio=02:00:00:a2:19:54,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: win10
scsi0: vmpool:vm-100-disk-0,discard=on,iops_rd=1024,iops_rd_max=2048,iops_wr=1024,iops_wr_max=2048,iothread=1,mbps_rd=10,mbps_rd_max=12,mbps_wr=10,mbps_wr_max=12,size=300G
scsi1: vmpool:vm-100-cloudinit,media=cdrom,size=4M
scsihw: virtio-scsi-single
smbios1: uuid=ed7053a0-9211-4ab1-a8d1-78115c0d598e
sockets: 1
vcpus: 2
vmgenid: c18feb25-8486-4131-8ba3-6fd06f97992d
vmstatestorage: vmpool
 
How to measure/find the optimal block size for the pool, running only KVM/WIN machines

In most cases it will be 4k (especially for HDDs with 4Kn sectors). But the better question to ask is: what should the volblocksize be for my KVM/WIN machines? The best test is to find a way to emulate the load of your VMs. If you find such a way, please tell me ;) In my case I try to find the optimum using a real KVM guest and observing via SNMP how it performs.
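
For what it's worth, the volblocksize of an existing zvol can only be read, not changed; new disks pick up the block size configured on the Proxmox zfspool storage (dataset and storage names below are just examples):

Bash:
# zfs get volblocksize rpool/data/vm-100-disk-0

# default volblocksize for newly created disks, in /etc/pve/storage.cfg:
#   zfspool: vmpool
#           pool rpool/data
#           blocksize 16k
#           content images,rootdir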

Do I really need SLOG and L2ARC

You need a SLOG if you have applications such as DBs that do sync operations. If you do not have that situation, you do not need a SLOG.

L2ARC is useful if you read the same data many times. As a hint, the arc_summary command will tell you whether an L2ARC would help or not. In my case, it will not:

Most Recently Used Ghost: 0.00% 7.68k
Most Frequently Used Ghost: 0.30% 483.66k


Do I need to add special device ?

If you will use "special device"(SSD, NVME, and so on) you can for example to cache zfs metadata that will help a lot. But this "special device" will be part of the zpool as any other device.

Good luck / Bafta!
 

Uhm, I know what arc_summary is... please read my post properly. Bafta.
 
Try to change your volblocksize from the default 8K to something like 16K or 32K. The same for your fio test (--bs=16k/32k).

Bafta!

But the HDD block size is 4k, so I set the pool to 4k as well. Why and how would doubling it work better? Everywhere I read about this, it says to match the hardware block size.
 
Hi,

These are two different things. Let's try an example. You have a VM, and the guest needs to write one 16k block (volblocksize=16k). At the zfs pool level this becomes 4 x 4k writes (since you have ashift=12), and zfs will try (most of the time) to write them sequentially.
Now suppose you use volblocksize=4k: you would have to write 4 separate blocks that most of the time will not be sequential. For reads it is even worse if you need to read 4 blocks that are not in sequential positions (like for DBs).


Now, as for your post title, take into account that the zfs cache (ARC/L2ARC) knows nothing about your data, so it has no idea what should be cached and what shouldn't. But your VM knows better what to cache for best performance. In this case it is better to cache only metadata in ARC/L2ARC, and you can use less RAM for it. Then you can increase the RAM allocated to the VMs (better efficiency).
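
A minimal sketch of that idea (dataset name assumed; both properties accept all, metadata or none):

Bash:
# zfs set primarycache=metadata rpool/data
# zfs set secondarycache=metadata rpool/data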

Bafta.



 
I can only speak for ZFS on FreeBSD, but I guess it's the same for Linux...

How to measure/find the optimal block size for the pool, running only KVM/WIN machines
that will be difficult to almost impossible, because you have the txg groups and compression in between (if enabled)... but it's also not necessary from a performance point of view, because there won't be much to gain. The zfs block size is variable and you only define the maximum; for VM images 128k is the "standard".
https://www.joyent.com/blog/bruning-questions-zfs-record-size

Do I really need SLOG and L2ARC
https://forum.proxmox.com/threads/zfs-worth-it-tuning-tips.45262/page-2#post-217209

Can you mix SLOG/L2ARC with special device ?
you can, but it's not recommended. I would advise against it; it's difficult to debug in the end.
 

Thank you. However, I still don't understand how ZFS can work well if you set a 128k block size and your HDD handles 4k.

Regarding the L2ARC, towards the end of my post and my reply you can see that it is a must, and performance really increases.
 
However, I still don't understand how ZFS can work well if you set a 128k block size and your HDD handles 4k.

That works because of compression; without it, it doesn't. If you compress a logical 4K block, it will still be written to a physical 4K block on disk, so nothing is gained there. But if you compress a logical block of 128K, it will be smaller than 128K on disk, e.g. 90K, so you do save space and time writing the block. You will, however, waste more space on snapshots, because with a 128K logical block size one changed bit means another 128K logical block has to be written.
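
For completeness, enabling and checking compression looks like this (pool name assumed; compressratio only reflects data written after compression was enabled):

Bash:
# zfs set compression=lz4 rpool
# zfs get compressratio rpool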

If you will use "special device"(SSD, NVME, and so on) you can for example to cache zfs metadata that will help a lot. But this "special device" will be part of the zpool as any other device.

Yes, I recently built my first pool with the metadata cache on SSD, and every zfs operation you have to run is really, really fast now. I can recommend adding this, but only on a small partition of your 960 GB SSD. You will need to mirror it so that it is not a single point of failure.

Another note:
L2ARC is only valid while your system is up and will be empty after a reboot. So if you reboot a lot, you will have to warm it up again before you can benefit from it. That can be a show stopper in the long run.
 

Thanks for your input. I enabled LZ4 compression and the difference is indeed noticeable.

Regarding the special device, is it better or faster than L2ARC? Or is it used in conjunction with it?

As for L2ARC, yes, I noticed the part about the reboot.
 
Regarding the special device, is it better or faster than L2ARC? Or is it used in conjunction with it?

That is a very good question. I did not find anything about this in the manpage. In my use case, however, a workstation that is only turned on while it is being used, the difference is noticeable immediately. I also have my ROOT dataset sitting on a dataset with the special_small_blocks option set to 128K, so that all such files actually end up on the SSD instead of the disks. Before upgrading to allocation classes I used to run zfs get all and find / on each boot to speed up further work on the machine, but that is completely gone now.

With regard to special_small_blocks, I think it can be used in conjunction with L2ARC.

To be able to say whether your L2ARC is useful or not, you need a filled L2ARC disk, at least 30 days or so, and monitoring of the hit rate. We have a ZFS-based backup server with L2ARC and its cache hit rate is at most 3.5%. Metadata is always cached.
 

I am noticing that L2ARC is not really... what I expected. The cache hit ratio is around 30% (and IO and CPU still jump like crazy when you create a machine), but when I monitor ZFS with netdata, I can see that "metadata" is the most active part. So when you create a new VM, for example, L2ARC is not being used as much as I expected, but metadata activity jumps like crazy.

L2ARC, on the other hand, grows in size with each VM created, which means it is indeed caching data, but... how does that help?

Which leads me to the thought: remove L2ARC and set up this "special device" instead?

Please give me your thoughts on this.
 
L2ARC, on the other hand, grows in size with each VM created, which means it is indeed caching data, but... how does that help?

Yes, my L2ARC is also full, but very low cache hit rate.

Which leads me to the thought: remove L2ARC and set up this "special device" instead?

I did it (I don't need L2ARC), and it is really fast. I have two mirrored enterprise SSDs as special device.
 

Thank you, I will remove the L2ARC and set up this special device.

Could you share your zpool status, so I can see how your setup currently looks, please?
 
Could you share your zpool status, so I can see how your setup currently looks, please?

This wiki article describes how to set up the special devices.

My setup currently looks like this:

Code:
root@proxmox ~ > zpool status
  pool: zpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 03:00:26 with 0 errors on Mon Mar 23 12:18:53 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        zpool                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            sdc                                         ONLINE       0     0     0
            sdd                                         ONLINE       0     0     0
            sde                                         ONLINE       0     0     0
          raidz1-1                                      ONLINE       0     0     0
            sdf                                         ONLINE       0     0     0
            sdg                                         ONLINE       0     0     0
            sdh                                         ONLINE       0     0     0
        special
          mirror-2                                      ONLINE       0     0     0
            sda2                                        ONLINE       0     0     0
            sdb2                                        ONLINE       0     0     0

errors: No known data errors

I'd also use an additional small partition for the ZIL/SLOG, which I forgot to set up...
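
If I were to add it, it would be something along these lines (partition names are hypothetical):

Code:
# zpool add zpool log mirror sda3 sdb3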
 
