Does a zvol benefit from a Metadata Special device?

JonathonFS

New Member
Mar 8, 2022
I've been reading up on ZFS performance enhancements. We currently use ZFS on PVE to store the VM disks. It's my understanding that each VM is stored in a zvol.

Looking at ways to improve VM performance, it seems an SLOG cache will help with writes. Our Read speed is good enough, so I'm not concerned with a L2ARC cache just yet. But I haven't been able to figure out if we'll benefit from a metadata cache (aka Special device). From what I can tell, the special device works by storing and speeding up responses to the ZFS file allocation table. But since we're using zvols exclusively (not datasets), does that mean we won't benefit from a metadata cache?

I see this question was asked twice on the TrueNAS forum, but no one seems to have an answer.
 
Looking at ways to improve VM performance, it seems an SLOG cache will help with writes.
Only sync writes. It won't help at all with async writes, so first check your sync/async write ratio to see whether a SLOG would make sense at all.
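
For reference, a quick way to get a rough feel for that ratio on a recent OpenZFS (the pool name "tank" below is just a placeholder):

```bash
# Per-vdev request size histograms, split into sync and async read/write
# columns; compare the sync_write vs. async_write totals to judge whether
# a SLOG would see much traffic at all.
zpool iostat -r tank 5
# Latency histograms are also handy for spotting slow sync writes:
zpool iostat -w tank 5
```
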
Our Read speed is good enough, so I'm not concerned with a L2ARC cache just yet.
That also really depends on the workload. Using L2ARC consumes RAM (for the L2ARC headers), so your ARC will be smaller. And RAM is way faster than an SSD, so adding a L2ARC can even lower your read performance, as data ends up being read from the slower SSD instead of from RAM. And a L2ARC will only be used when your ARC can't grow anymore. So it is usually faster to just buy more RAM instead of buying a L2ARC SSD. L2ARC makes sense if you need even more cache but you already populated all your RAM slots with the biggest possible DIMMs.
But I haven't been able to figure out if we'll benefit from a metadata cache (aka Special device). From what I can tell, the special device works by storing and speeding up responses to the ZFS file allocation table. But since we're using zvols exclusively (not datasets), does that mean we won't benefit from a metadata cache?
Special devices aren't a metadata cache. They are metadata storage. If you lose your special devices, all data of the pool is lost. If you just want a metadata cache SSD, use an L2ARC SSD and set it to "secondarycache=metadata".
Without a special device, all data + metadata will be stored on your pool's SSDs. If you use a pair of dedicated SSDs as special devices, your pool should be faster, as then only data will be stored on the data SSDs and all metadata will be stored on the special device SSDs. So you should get better performance because each SSD is hit by less IO. But I guess with only SSDs that's not that useful, as you could also just add more normal SSDs to your pool to increase throughput and IOPS. It's more useful if you've got slow HDDs and want some SSDs for the metadata, so the IOPS-limited HDDs aren't hit by so many small random reads/writes.
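
For the metadata-only L2ARC idea mentioned above, a minimal sketch (pool and device names are placeholders, so double-check against your own layout):

```bash
# Add an SSD as an L2ARC (cache) device -- losing it never loses pool data
zpool add tank cache /dev/disk/by-id/nvme-EXAMPLE-SSD
# Tell ZFS to use the L2ARC for metadata only
zfs set secondarycache=metadata tank
# ARC itself stays at its default (cache data + metadata)
zfs get primarycache,secondarycache tank
```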
 
metadata doesn't (only) mean file metadata in this case. zvols are also made up of metadata nodes and data nodes, and the metadata nodes do get stored on a special vdev as well. you'll likely see less speedup than with regular datasets, since for filesystems accessing just the metadata is a frequent occurrence (things like directory listings, stat-ing paths, reading/writing small files embedded completely in the metadata without any data nodes, ...), whereas for zvols all of that happens one layer below, inside the VM. if your RAM is adequately sized, most of the metadata should be cached in ARC anyway for zvols, but you can verify that with arcstat/arc_summary, and if that is the bottleneck, maxing out the RAM first is probably a good idea. special vdevs really shine for things like maildirs, PBS datastores, and the like - lots and lots of not too big files stored directly on ZFS, with frequent listings, renames, stats, but too much data to keep all the metadata in ARC.
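
to check how much of the metadata is already served from ARC, something along these lines works on a stock PVE install (the exact section/column names vary a bit between OpenZFS versions):

```bash
# overall ARC report, including metadata sizes and (demand) metadata hit rates
arc_summary | less
# live view with the default columns; "arcstat -v" lists all available fields
arcstat 5
```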

as an example, for a 1G zvol completely filled with random data, I get about 10M worth of metadata nodes. half full, 6M. zdb -Lbbbs POOLNAME and looking at the ASIZE column for "LX" will give you an estimate - L0 is the actual data, L1 and above are metadata nodes/objects. for a 1G zvol you can see 1 very small L2 object that holds the references to 128 small L1 objects which hold the references to 128K L0 data objects. the L0 data objects each represent one block (8k default volblocksize, so that will affect the ratio of metadata:data as well). for a more concrete real world example, one of my pools has 656G of zvol data objects, and roughly 9G of associated metadata objects which would be stored on a special vdev.
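
to make that concrete, this is roughly the invocation (it only reads, but it walks the whole pool, so expect it to take a while and generate read load):

```bash
# block statistics for the pool; in the per-object-type breakdown, the ASIZE
# of the L1-and-up rows for zvol objects is approximately what would live on
# a special vdev, while L0 rows are the actual data blocks.
zdb -Lbbbs rpool
```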

IIRC what doesn't work for zvols is the special_small_blocks mechanism (that would store data blocks under a certain size on the special vdevs as well), so you can't use it to move complete zvols including data nodes to the special vdevs by setting volblocksize to less than special_small_blocks.
 
Hello to all,
I've just finished a PBS setup that I'm pretty satisfied with.
The server is a Supermicro 721 case, 4 x Intel(R) Atom(TM) CPU C3558 @ 2.20GHz (1 socket), with 6 SATA ports available on the motherboard.
PBS is installed on raidZ10 (4 x 4T SATA / slow 5400rpm), the datastore is a local dataset on rpool -> /rpool/databck, and on the two remaining SATA ports I put two 64G SATA-DOM SSDs as special metadata storage -> raidz1 vdev ...
Backup performance is almost as fast as the 1G link, and restore performance is also very good, about 450-500 Mbps when restoring one VM at a time (compared to ~150 Mbps without the special device). BUT, when restoring more VMs concurrently, restore speed saturates at 1 Gbps. I launched restores of 8 VMs at the same time and the read speed stayed at 1 Gbps the whole time. So PBS in this configuration is very capable of serving out VMs to be restored; in this case performance is actually limited by network speed.
But why does restoring just one VM not take advantage of more bandwidth than mentioned above? Do we have some kind of per-VM restore speed limitation? Not a big deal of course, this situation is also very acceptable.

So it seems that adding an SSD mirror vdev to a "slow" pool like this one adds significant performance value, but the next important step would be protecting the special device and making sure it does not get lost somehow ...
From that point of view the question would be: how can we monitor the special device's condition/performance/etc.? Wear-out is visible in the PBS GUI, but is this information enough for us to react in time and replace a failed mirror member?

Thank you in advance

BR
Tonci


 
So it seems that adding an SSD mirror vdev to a "slow" pool like this one adds significant performance value, but the next important step would be protecting the special device and making sure it does not get lost somehow ...
From that point of view the question would be: how can we monitor the special device's condition/performance/etc.? Wear-out is visible in the PBS GUI, but is this information enough for us to react in time and replace a failed mirror member?
I would use smartctl to regularly check the wear. Creating a weekly cron job that runs SMART self-tests can be useful too, and of course a monthly scrub of your ZFS pools. You could also set up postfix and zfs-zed to get alert emails in case a ZFS pool degrades.
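
A rough sketch of what that could look like (device names, schedules and the mail address are just examples, adjust to your setup):

```bash
# Check wear/health attributes manually
smartctl -a /dev/sda

# Weekly long self-test via cron, e.g. in /etc/cron.d/smart-selftest:
# 0 3 * * 0 root /usr/sbin/smartctl -t long /dev/sda

# Monthly scrub (Debian/PVE already ship a zfsutils scrub cron job by default,
# so check before adding another one)
zpool scrub rpool

# Mail on pool events: configure postfix, then enable ZED notifications in
# /etc/zfs/zed.d/zed.rc (e.g. ZED_EMAIL_ADDR="root") and make sure zed runs
systemctl status zfs-zed
```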

I'm personally running zabbix as my monitoring server and set it up to use a smartctl and ZFS template. So it will keep an eye on the pool and disk health and warn me on its dashboard if something looks suspicious.
 
(EDIT: Changed "primarycache" to "secondarycache", based on input from following post by Dunuin)

L2ARC makes sense if you need even more cache but you already populated all your RAM slots with the biggest possible DIMMs.
Great way of putting it, thanks!

Special devices aren't a metadata cache. They are metadata storage. If you lose your special devices, all data of the pool is lost. If you just want a metadata cache SSD, use an L2ARC SSD and set it to "secondarycache=metadata".
I didn't even know about the "secondarycache=metadata" option. Thanks! It seems like a good intermediate solution. Since it's just a cache, we don't need to worry about data loss in case of hardware failure, so you only need a single drive (no mirror).

It's more useful if you've got slow HDDs and want some SSDs for the metadata, so the IOPS-limited HDDs aren't hit by so many small random reads/writes.
This is my exact scenario. All my ZFS volumes are currently running on spinning HDDs. Benchmarking with fio shows good read performance, but horrible performance on random writes. This is why I was leaning towards setting up a SLOG mirror, but as you said before, it will only help with synchronous writes.
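
For anyone curious, a random-write fio test along these lines reproduces what I'm describing (the parameters and target path below are illustrative examples, not the exact run I used; for judging a SLOG specifically, a sync-write test would be the relevant one):

```bash
# 4k random writes, direct I/O, moderate queue depth
fio --name=randwrite --filename=/tank/fio-testfile --rw=randwrite \
    --bs=4k --size=4G --runtime=60 --time_based --direct=1 \
    --ioengine=libaio --iodepth=16 --numjobs=1 --group_reporting
```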

if your RAM is adequately sized, most of the metadata should be cached in ARC anyway for zvols, but you can verify that with arcstat/arc_summary, and if that is the bottleneck, maxing out the RAM first is probably a good idea.
This is a great way of looking at it. If I've maxed out RAM, but metadata is falling out of the ARC cache, I can use Dunuin's idea of an L2ARC SSD set to secondarycache=metadata.

for a 1G zvol you can see 1 very small L2 object that holds the references to 128 small L1 objects which hold the references to 128K L0 data objects. the L0 data objects each represent one block (8k default volblocksize, so that will affect the ratio of metadata:data as well). for a more concrete real world example, one of my pools has 656G of zvol data objects, and roughly 9G of associated metadata objects which would be stored on a special vdev.
Very interesting. So when using a zvol, the metadata stores information about each block (8K by default). In your example, 656 GiB / 8 KiB volblocksize = ~86 million metadata entries. Since you had ~9 GiB of metadata objects, we get an object size of ~112 bytes per volblock. This yields the following formula for estimating the metadata size of a zvol:

zvol size (GiB) / volblocksize (KiB) * 112 = metadata size (MiB)
656 / 8 * 112 = 9,184 MiB
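
As a quick sanity check, here is the same rule of thumb as a one-liner (the ~112 bytes per block is the estimate derived above, not an exact on-disk figure):

```bash
# metadata (MiB) ~= zvol size (GiB) / volblocksize (KiB) * 112
awk -v gib=656 -v vbs=8 'BEGIN { printf "~%.0f MiB of metadata\n", gib / vbs * 112 }'
# -> ~9184 MiB of metadata
```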

Here's a table I put together to help me visualize how Datasets and Zvols compare. I'm using the term "Chunk" here for lack of a better word.
                                        Dataset                                     ZVol
ZFS "Chunk" Size Setting                recordsize (default = 128KiB)               volblocksize (default = 8KiB)
Metadata stores "Chunk" checksum        Yes                                         Yes
Metadata can store small data objects   Yes (see "special_small_blocks" setting)    No

Assuming a "chunk's" 256-bit checksum is stored in its metadata object, then a write to the block (async or sync) would incur a checksum calculation and a write to the metadata storage. Therefore, offloading the metadata storage to a special device has the potential to enhance both async and sync write performance of a zvol.

Unanswered Questions
  • Would an L2ARC SSD set to secondarycache=metadata have the same write performance benefits as a special device? Intuition says no, because the L2ARC cache only accelerates read performance.
  • Are there any write performance benefits of implementing a metadata special device, if a fast SLOG is already configured? Foundational questions:
    • Would a write to the ZFS metadata be handled by an SLOG device?
    • Does ZFS write to metadata synchronously?
Performance Testing
I'm going to be working on this next week, so I should have some feedback after that. The plan is to configure a PVE node to boot several VMs on a ZFS pool. Each VM will run a Windows workload simulator. Here are the 6 configurations I want to test:

                     Metadata on HDDs    L2ARC SSD (secondarycache=metadata)    Metadata Special Device Mirror
No SLOG / ZIL        Config 1            Config 2                               Config 3
SLOG / ZIL Mirror    Config 4            Config 5                               Config 6
 
But why does restoring just one VM not take advantage of more bandwidth than mentioned above? Do we have some kind of per-VM restore speed limitation?
When you're restoring multiple VMs at once, are you restoring to the same PVE node and to the same zpool? Also, are you using any kind of link aggregation or NIC bonding?
 
I didn't even know about the "primarycache=metadata" option. Thanks! It seems like a good intermediate solution. Since it's just a cache, we don't need to worry about data loss in case of hardware failure, so you only need a single drive (no mirror).
Sorry, it should be "secondarycache=metadata", not "primarycache=metadata". Primarycache is for the ARC, secondarycache for the L2ARC.
With "secondarycache=metadata" you tell it to only use the L2ARC for caching metadata, so it's similar to a "special device", especially when using persistent L2ARC (so it doesn't need to read the metadata from the HDDs again after a reboot), but losing the L2ARC SSD wouldn't result in data loss.
Unanswered Questions
  • Would an L2ARC SSD set to primarycache=metadata have the same write performance benefits as a special device? Intuition says no, because the L2ARC cache only accelerates read performance.
It should not, as the metadata still needs to be written to the slow HDDs, so the HDDs are busy writing metadata and can't write data at that time.
  • Are there any write performance benefits of implementing a metadata special device, if a fast SLOG is already configured?
SLOG will only help with sync writes. A "special device" should also speed up async writes, as the HDDs don't need to store the metadata, so the HDDs are hit by fewer IOPS and can use the saved IOPS to write more data.
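
For completeness, adding a mirrored special vdev looks roughly like this (device names are placeholders; keep in mind that losing the special vdev loses the pool, so always mirror it, and that on a raidz pool it generally can't be removed again):

```bash
# Add two SSDs as a mirrored special (metadata) vdev to an existing pool
zpool add tank special mirror /dev/disk/by-id/ata-SSD1 /dev/disk/by-id/ata-SSD2
# Only newly written metadata lands on the special vdev; existing metadata
# stays where it is until it gets rewritten
zpool status tank
```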
 
When you're restoring multiple VMs at once, are you restoring to the same PVE node and to the same zpool? Also, are you using any kind of link aggregation or NIC bonding?
No, there is no aggregation or bond, just one NIC towards PBS. So the main question would be: is there any way to use, say, 80% of this 1G wire bandwidth when restoring just one VM at a time? When restoring 2-3 VMs concurrently to the same PVE node, the link gets fully saturated at 1 Gbps. That is understandable and wanted, so we can say the 1G NIC is the bottleneck. But what would the bottleneck be when restoring just one VM at a time, with the bandwidth only ~40% used? Do we have some kind of restore speed limitation? Today I had to restore a ~800GB VM and it took 6 hours instead of maybe 3?!
 
@tonci
If there's no bonding, then this shouldn't be a network path limitation issue. If you're restoring multiple VMs to the same ZFS pool and getting full link speed, then the bottleneck isn't with the destination storage volume (it can clearly take it!).

Here's some ideas, but you may have better luck asking this question in the PBS Install & Config forum.
  • This post indicates PBS may be using a single thread per VM being restored. Seeing as you're running lower-powered Intel Atom CPUs, you may be running into a CPU bottleneck. I would check your CPU utilization during a restore operation and see if any cores are pegged. This problem could also be exacerbated if you're running NICs that don't have good hardware offload support (or hardware offload is disabled).
  • Is a restore bandwidth limit set on the destination PVE storage volume? https://pve.proxmox.com/pve-docs/chapter-vzdump.html#_bandwidth_limit (see the sketch after this list for where I'd check)
  • Maybe the PBS restore process writes the VM to the destination node synchronously (as opposed to async)? I doubt this is the case, but it could explain why you're getting more performance with multiple restore jobs.
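
For the bandwidth limit point, these are the places I'd check (to the best of my knowledge; the linked vzdump documentation has the authoritative syntax):

```bash
# Datacenter-wide defaults -- a "bwlimit" line here can contain a
# restore=<KiB/s> entry
cat /etc/pve/datacenter.cfg
# Per-storage override, if one was ever set
grep -i -A1 bwlimit /etc/pve/storage.cfg
# Raising/clearing a per-storage restore limit should look something like:
# pvesm set <storage-id> --bwlimit restore=0
```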
 
@tonci
If there's no bonding, then this shouldn't be a network path limitation issue. If you're restoring multiple VMs to the same ZFS pool and getting full link speed, then the bottleneck isn't with the destination storage volume (it can clearly take it!).

Here's some ideas, but you may have better luck asking this question in the PBS Install & Config forum.
  • This post indicates PBS may be using a single thread per VM being restored. Seeing as you're running lower-powered Intel Atom CPUs, you may be running into a CPU bottleneck. I would check your CPU utilization during a restore operation and see if any cores are pegged. This problem could also be exacerbated if you're running NICs that don't have good hardware offload support (or hardware offload is disabled).
  • Is a restore bandwidth limit set on the destination PVE storage volume? https://pve.proxmox.com/pve-docs/chapter-vzdump.html#_bandwidth_limit (see the sketch after this list for where I'd check)
  • Maybe the PBS restore process writes the VM to the destination node synchronously (as opposed to async)? I doubt this is the case, but it could explain why you're getting more performance with multiple restore jobs.
JonathonFS, thank you for your hints. You're right, I'll post the same question to the PBS Install & Config forum.
BTW, my setup is 'default' and does not have any bandwidth limits. I've just tried a vzdump backup and restore to (and from) the same PBS server (made an NFS export in the rpool/nfs directory on the same PBS) -> backup and restore run at full speed, which also proves that there is no bandwidth limit (?)
 
