Special device for caching

Elleni

Well-Known Member
Jul 6, 2020
Hi all,

on a server which still contains HDDs instead of SSDs, our developers noticed that build times are not as good as expected, and we think the bottleneck is IOPS. The operating system is set up on two small SSDs, while the ZFS pool is set up on those HDDs.

The pool looks like this:
Code:
        NAME                        STATE     READ WRITE CKSUM
        pvedata                ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            scsi-35000cca2940d2830  ONLINE       0     0     0
            scsi-35000cca2940f49b0  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            scsi-35000cca294129990  ONLINE       0     0     0
            scsi-35000cca294151e00  ONLINE       0     0     0
          mirror-2                  ONLINE       0     0     0
            scsi-35000cca294153eb8  ONLINE       0     0     0
            scsi-35000cca294154c74  ONLINE       0     0     0
          mirror-3                  ONLINE       0     0     0
            scsi-35000cca294154edc  ONLINE       0     0     0
            scsi-35000cca29415539c  ONLINE       0     0     0

I read that a special device for this pool could improve IO performance significantly. Could it make sense to add an SSD special device?

What happens if the special device fails? Is all pool data lost and in need of being restored from backups, or will the pool continue to function with normal HDD performance?

Finally, if there is no other option, I thought I could maybe buy two SSDs, replace the two 250 GB SSDs in rpool with bigger 1.x TB SSDs and create two partitions on each: a small, say 50 GB, partition and a big second partition. Then create rpool on the first partitions (mirrored) while using the second partitions as a special device mirror. I know it's recommended to use whole disks instead of partitions where possible, so I would like to know the downsides of a setup like this with partitions for the system and special device mirrors.
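Roughly what I have in mind - just a sketch, with placeholder device names and sizes:
Code:
# sketch only - device names and sizes are placeholders, adjust to the real hardware
# partition each new SSD: a small partition for rpool, the rest for the special vdev
sgdisk -n1:0:+50G -t1:BF01 /dev/disk/by-id/ata-NEWSSD1
sgdisk -n2:0:0    -t2:BF01 /dev/disk/by-id/ata-NEWSSD1
# (repeat for the second SSD)

# then add the two big partitions as a mirrored special vdev to the hdd pool
zpool add pvedata special mirror \
  /dev/disk/by-id/ata-NEWSSD1-part2 \
  /dev/disk/by-id/ata-NEWSSD2-part2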
 
Last edited:
1. A special device increases performance for prune and GC, but not for consistency checks.
2. You can use separate partitions; the needed space for a special device is approx. 0.03% of the pool space.
3. If the special device fails, your pool will fail.

An alternative for an HDD pool is to use BTRFS.
 
  • Like
Reactions: Elleni
Thanks for your quick reply.
1. I would need it for IOPS in the guests which do compiling/building - so many small read and write operations. Would a special device help in that regard?
2. If I understand correctly, I could replace the system SSDs with bigger SSDs and use their second partitions as a mirror for the special device.
3. So then recreating the pool from scratch with a restore from backups would be necessary - hence better to have a mirrored special device?

Do you think the performance gained for the building VMs will be noticeable - is it worth a try?

Well, I don't want to move away from ZFS, as I got used to it and love its functionality.
 
Last edited:
Sorry, but I thought this was a PBS question.
Nevertheless, my answers 2 and 3 still apply to your situation.
 
  • Like
Reactions: Elleni
Do I understand correctly that you use VMs on local ZFS for build/compile jobs, with a read-intensive use case?
 
Yeah, for read and write operations with many small files; that's why I think IOPS become the bottleneck on a local HDD ZFS pool. So I want to find out if a special device with two SSDs (mirrored) could help.
 
Then a special device won't help, because it works on datasets, not on zvols. If you can install the compile environment in a container (LXC), it can help.
If your compile process often uses the same data, a read cache (L2ARC) can help. Otherwise a write cache (SLOG) can help if many new files are being stored, but once the buffer fills up the performance slows down again (see https://forum.proxmox.com/threads/verständnisfrage-über-zfs-zil-slog-writeback-an-die-zfs-gurus.147704/).
Both caches can be placed on partitions, for example as in the sketch below. The read cache doesn't have to be mirrored; one can use two partitions in a ZFS stripe.
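Just as a rough sketch (the partition paths are placeholders, not a recommendation for your exact layout):
Code:
# L2ARC read cache - no mirror needed, losing it does not endanger the pool;
# two partitions are simply striped
zpool add pvedata cache /dev/disk/by-id/ata-SSD1-part3 /dev/disk/by-id/ata-SSD2-part3

# SLOG write cache - only helps sync writes and should be mirrored,
# because it holds not-yet-flushed data
zpool add pvedata log mirror /dev/disk/by-id/ata-SSD1-part4 /dev/disk/by-id/ata-SSD2-part4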
 
Last edited:
  • Like
Reactions: Elleni
I guess I have to learn a bit more about ARC and SLOG, as I falsely thought they were some sort of cache which increases the IOPS capability of the whole pool. Thanks for the linked post, which seems like a good starting point. I'll come back with questions if anything remains unclear.
 
Last edited:
You can start with an L2ARC device; those are basically disposable and won't fail your pool if they die. You can't mirror them, but you can "stack" them for more cache. Going beyond ~10% of the pool size for cache may give diminishing returns; I haven't tested extensively.

ARC, which uses system RAM, is much faster, but L2ARC will survive a reboot. You can also try various tunings of the ZFS kernel parameters.

Personally, for my homelab I use PNY 32-64 GB USB thumbdrives for L2ARC, but for production you should probably use something a bit more substantial if you have drive bays available. A 500 GB enterprise-class SSD is overkill for cache, but you probably won't have to worry about it for years, and you can partition it to limit the size.

For a ZFS special device, you want 2x 500 GB-1 TB SSDs of different makes/models. If both die at the same time, it kills the pool, so you kind of want one to die first. (Think Evo and Pro levels of projected endurance, and go with the highest TBW rating you can find.) You can start with a mirror and extend it if needed in a raid10-like fashion.

https://klarasystems.com/articles/openzfs-understanding-zfs-vdev-types/

https://search.brave.com/search?q=z...summary=1&conversation=bc5d155e79bc0d1e10d086

You may need to do an analysis of file sizes in the source code (see the sketch below); ideally everything would be moved to the special vdev (you may have to rewrite the dataset{s}) and run from SSD if you don't want to replace all the spinners.
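Something like this (just a sketch, the path is a placeholder) gives a quick feel for how many small files are in the tree and helps pick a special_small_blocks threshold:
Code:
# count files per size bucket in the source tree (GNU find + awk)
find /path/to/source -type f -printf '%s\n' | awk '
  { if ($1<=4096) a++; else if ($1<=16384) b++; else if ($1<=65536) c++; else d++; n++ }
  END { printf "<=4K: %d  <=16K: %d  <=64K: %d  >64K: %d  total: %d\n", a, b, c, d, n }'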

The alternative, of course, would be to replace the spinners with enterprise SSDs, but I have no idea what the overall pool size is or whether you have a budget.

Wild-hair case: depending on the source code corpus size, you could possibly run builds from a RAMdisk and rsync/rclone the results back to the spinners after the build finishes. Maybe also look into distcc and spread the compiling across multiple instances.
 
  • Like
Reactions: UdoB and Elleni
I have no experience with a special vdev under Proxmox, only TrueNAS. But I think this won't work well, since you are not working with datasets but zvols.
A SLOG is probably not helping either, because your workload is probably not doing any sync writes?

Are you sure that the pool is the bottleneck when compiling? Do you see IO wait in the Proxmox dashboard?

If you don't need a lot of storage I would simply replace the spinners with SSDs.
 
  • Like
Reactions: UdoB and Johannes S
Thanks guys - I also started thinking that a special device might be worth a try.
Code:
A special device can improve the speed of a pool consisting of slow spinning harddisks with a lot of metadata changes. For example workloads
that involve creating, updating or deleting a large number of files will benefit from the presence of a special device. ZFS datasets can also be
configured to store whole small files on the special device which can further improve the performance. Use fast SSDs for the special device.
But that won't help, as someone wrote earlier, because we are working with zvols. The thing is, this server is old and is due for replacement in a year anyway, so it's not worth investing too much into it.

Well, assigning more and more CPU and RAM and increasing the number of build jobs didn't make build times shorter, so it seems obvious that the bottleneck is the read/write performance of the HDDs. However, I am willing to record the IO during a build, so I am looking into which tools get the job done best.
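Roughly what I plan to run during a build (just a sketch):
Code:
# on the PVE host, while a build is running
zpool iostat -v pvedata 5      # per-vdev IOPS and throughput every 5 seconds
iostat -x 5                    # per-disk utilisation and latency (package sysstat)
# inside the guest, iostat -x / iotop to see which processes generate the IO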
 
Last edited:
  • Like
Reactions: Johannes S
A special device might still help you, since you can also configure it with the special_small_blocks parameter so that some (but not all) regular data is written to the SSDs instead of the HDDs. @LnxBil described such a setup here: https://forum.proxmox.com/threads/zfs-metadata-special-device.129031/post-699290

The idea is basically that you create a dedicated dataset for that data and then configure special_small_blocks so that everything on that dataset will be saved to the SSDs (see the sketch below). This won't work for zvols (see here: https://forum.proxmox.com/threads/d...-a-metadata-special-device.106292/post-457502 ), but their metadata will still end up on the SSDs. You could still use this approach by serving a dedicated network share on the SSDs with an LXC NFS fileserver to the VMs. Another possibility would be to add the SSD dataset as a directory storage and then put some "fast" VM discs as qcow files on it.
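A minimal sketch of that idea; the pool and dataset names are just examples, and it assumes the special vdev is already attached:
Code:
# dedicated dataset whose blocks up to 64K land on the special device
zfs create pvedata/fastshare
zfs set special_small_blocks=64K pvedata/fastshare
# if special_small_blocks is set equal to (or above) the recordsize,
# effectively all data of this dataset ends up on the SSDs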

But to be honest, for that use case it would probably be a better approach to have a dedicated SSD pool and add a "fast" virtual disc to the developer VMs.
 
  • Like
Reactions: UdoB and Elleni
You are so right @Johannes S, that's what I started thinking too. I am trying to decide if 2 x 1 TB mixed-use SSDs for this server, at around 1k bucks each, are worth it. The thing is, the guest VM that does the heavy building has a size of 5 TB, so I am trying to figure out where all the IO takes place, to either mount some folders to RAM and/or create a separate pool with the SSDs and assign an SSD-backed zvol disk to the corresponding VM.
 
Last edited:
  • Like
Reactions: Johannes S
You are so right @Johannes S, that's what I started thinking too. I am trying to decide if 2 x 1 TB mixed-use SSDs for this server, at around 1k bucks each, are worth it.

I can understand that sentiment, since you mentioned that this server will only live for another year. On the other hand, not doing anything also has costs: what is the hourly rate of your developers? Let's assume their wage costs are 50 bucks an hour (just an arbitrary number for easy calculation); then 2000 bucks would be equivalent to 40 work hours of a developer. The question is how many of those hours could be saved by having faster builds. This is something I would consider, especially if you need to convince your boss to free up the budget for SSDs ;)
With a team of, say, ten developers making 50 bucks an hour on a 40-hour work week, 2000 bucks for the SSDs are equivalent to one work week of a single developer's wages, or just four saved hours per developer. So one could argue that going SSD will save development costs in the end ;)


@Kingneutron already mentioned distcc. I now remember that at a former employer of mine we used icecc together with ccache for distributed building of C/C++ projects. ccache is a cache of compiler artifacts, so you don't need to rebuild everything; its size can be configured, so it might be worth a shot to provide your developers with small virtual discs on an SSD which they can then use as ccaches. icecc and distcc let you build a cluster out of several machines to distribute the build workload, but unlike distcc, Icecream uses a central server that schedules the compile jobs to the fastest free machine, so that would be my preference. We mainly used this on the developer workstations, so their build times could be reduced by leveraging the compute power of their coworkers' machines whenever those weren't doing builds themselves. But of course this would also work with VMs (probably even better when they are on the same host, since you don't have the network as a bottleneck). This won't help much if the HDDs are the main bottleneck, but it's still better than nothing. You can also combine ccache and icecc, so that ccache is used for everything already in the cache and the build cluster for everything else. Depending on your available hardware and network, the combination might yield better build times or not: for us adding ccache was worth it, but a friend of mine tried it at his company and for them only using icecc was actually faster. So: do some benchmarks ;)
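A minimal ccache setup inside a build VM could look like this (just a sketch; the cache size and the SSD mount point are only examples):
Code:
apt install ccache
export CCACHE_DIR=/mnt/fast-ssd/ccache   # put the cache on the ssd-backed disc
ccache -M 20G                            # limit the cache size
export CC="ccache gcc" CXX="ccache g++"  # or put /usr/lib/ccache first in PATH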

If you happen to use Rust, there is sccache, developed by Mozilla, which combines the functionality of ccache and icecc for Rust, C++, C and Nvidia CUDA: https://github.com/mozilla/sccache I never used it myself, but unlike ccache/icecc it also allows using network storage (S3, Redis, memcached and some other backends) for storing the cache, so this might help too.

Another thing you need to be aware of: a special device will only store the metadata of newly written data, so for your old data you need to rewrite it, e.g. with zfs send/receive as described here (and sketched below): https://forum.proxmox.com/threads/zfs-metadata-special-device.129031/post-564923
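Roughly like this (sketch; the dataset names are placeholders, so better test it on non-critical data first):
Code:
zfs snapshot pvedata/data@rewrite
zfs send -R pvedata/data@rewrite | zfs receive pvedata/data_new   # rewrites all blocks, so
                                                                  # metadata lands on the special vdev
# verify the copy, then swap the names
zfs rename pvedata/data pvedata/data_old
zfs rename pvedata/data_new pvedata/data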
 
Last edited:
  • Like
Reactions: UdoB and Elleni
Great post Johannes. So at the moment I am waiting to get a price for two 1 TB SSDs in order to create a separate pool for this building VM. In the meantime I will do two additional tests:

- mount the IO-heavy source folders into RAM, as the server still has a significant amount of RAM free
- since it is a two-node cluster and the other node has SSDs, I'll take something like half a TB from the second node (which has a slower CPU / less RAM), try to mount it on the building machine and run some test builds.


In any case, thank you all; it's clear to me now that if new disks are bought, they will be used to create a fast zvol for that one guest rather than to try to improve the performance of the whole HDD pool. What way would you recommend to mount the SSD disk space from the other node into a folder? I can think of NFS or sshfs.
 
  • Like
Reactions: Johannes S
NFS will probably yield the best performance but is problematic in terms of security. If the node is not part of a cluster, you could set up another virtual switch in SDN which is then used as the NFS network for the NFS server and the VMs that use it.

The idea with the RAM disc is quite smart but how do you ensure no data is lost in case of a host crash or reboot?

Btw: the ZFS ARC cache also speeds up read access, and Proxmox VE's default is to use 10% of the available RAM during install. It might be worth a try to allow it to grow larger if you have spare RAM:
https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_limit_memory_usage
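For example (sketch based on that wiki page; the 16 GiB value is just a placeholder):
Code:
# persistent: add to /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=17179869184     # 16 GiB
# then run update-initramfs -u -k all (needed with root on ZFS) and reboot

# or change it at runtime without a reboot
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max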
 
Last edited:
  • Like
Reactions: Elleni
Good point on allowing the ZFS ARC cache to grow. What I will try after the RAM test build is to expose the other server's ZFS SSDs via iSCSI and use them on the first node instead of the NFS approach, as it's a two-node cluster. The cluster network is a direct connection between the two nodes with a 10 Gb NIC - without a switch in between. Maybe I'll even be able to add two 40 Gb NICs for the cluster network.

If RAM is significantly faster - which should be the case - we'll set up an rsync job to copy the needed part of the tmpfs back before shutdown, or live with the fact that we have to copy over half a terabyte before starting to build in tmpfs (rough sketch below). I guess this would be the "easiest" workaround: while it would be appealing for me to test an iSCSI setup from one node to the other, I don't see it as a fitting scenario in production, since that would make the build VM dependent on both nodes.
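Roughly what I have in mind inside the build VM (just a sketch; size and paths are placeholders):
Code:
mount -t tmpfs -o size=600G tmpfs /build   # needs that much free RAM
rsync -a /data/source/ /build/             # seed the ramdisk before building
# ... run the builds in /build ...
rsync -a /build/ /data/source/             # persist the results before shutdown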
 
Last edited:
  • Like
Reactions: Johannes S