ZFS Disk config help - New to proxmox from vmware

bonkas

Renowned Member
Mar 31, 2015
Hey guys,

After many years of being a VMware user at home, I decided it was time to switch to Proxmox for a bit more freedom in configuration and options.

I am having some issues with disk performance. I think my issues probably come down to my limited understanding of the disk setup, so I have some questions. Before that, though, here is my basic disk setup:

AMD 3900x Processor
64GB Memory

Drive configuration:

IBM M5015/LSI 9220-8i controller in IT/HBA mode.

Proxmox installed on a 250GB Gen4 NVMe drive connected to the mainboard.

Connected to HBA:

1x ZFS RAIDZ1 pool with the following disks:
  • 2x 1TB WD NAS 7200RPM Hard Drives
  • 2x 1TB WD Enterprise 7200RPM Hard Drives
1x ZFS RAIDZ1 pool with the following disks:
  • 3x 4TB WD NAS 7200RPM Hard Drives
All these disks are non-SMR disks.

The issue I am experiencing is very high IO latency (30-80% doing simple tasks, versus 0-4% at idle), VMs locking up or outright crashing on large reads/writes, and generally poor performance. This was never an issue under VMware, where I got an average 350MB/s read/write speed in my Windows VMs (I know IOPS is a better measurement, but a quick read/write test under Windows was a quick and dirty way to gauge general performance for the work most of the VMs do).

I have performed the following after doing a bit of digging/reading:

  • Changed to SCSI disks and controller for all VMs - this helped hugely
  • Installed the QEMU guest agent in all VMs - this helped alongside the above change
  • Enabled/disabled disk cache - this didn't seem to make any difference
  • Enabled IO thread for the Windows VMs - this seemed to help sustained reads/writes without locking up the host.

What I think I should do (please note I am unable to purchase any more hardware, so I am trying to make do with what I have in a homelab):

  • Change my large ZFS pool to a 16K or larger volblocksize, as this volume mostly contains large media files.
  • Change my smaller ZFS pool to a striped and mirrored setup (RAID10); this is what I was running under VMware with a battery-backed RAID controller.
  • Maybe add an SSD for cache?
I am open to any easy configuration changes that may help. As I say, I am very new to Proxmox but would love to start using it to its full potential with the hardware I have.
 
Best would be to add 2 small mirrored SSDs as a "special metadata device". That way the HDDs would only need to store data and all metadata would be stored on the small SSDs. This would increase IOPS for reads + async writes + sync writes. I think it made my HDDs about 2-3 times faster, because the HDDs are hit by less IO. But I guess that's not an option, as you only have 1 SATA port left on that HBA. Just a single SSD as special device would be terrible, as it is not a cache: losing that single special device SSD would mean all data on all HDDs is lost.
What you could try, in case your workload contains a lot of sync writes, is to add a small enterprise SSD as a SLOG. But this SLOG SSD will only cache sync writes, not async writes, so it really depends on your workload whether that makes sense or not. ZFS metadata also involves sync writes, so it might help at least a little bit, but I'm not sure if it is worth it. A SLOG can be added and removed at any time though, so you could just test it in case you have a spare SSD laying around.
You could also try a L2ARC SSD, but that's usually also not that useful, except for special workloads. The usual recommendation is to first max out the RAM your mainboard supports and only then add a L2ARC SSD if your workload still needs more read cache, because a L2ARC will not just add more read cache - it also consumes RAM. The bigger your L2ARC is, the less RAM you have and therefore the smaller your much faster ARC will be. So you are sacrificing a bit of very fast read cache to get some more, much slower, read cache.
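For reference, all three vdev types would be added to an existing pool with "zpool add"; a rough sketch with placeholder pool and device names:

  # mirrored special vdev holding the pool's metadata
  zpool add YourPool special mirror /dev/disk/by-id/ssd1 /dev/disk/by-id/ssd2
  # SLOG for sync writes (can be removed again with zpool remove)
  zpool add YourPool log /dev/disk/by-id/ssd3
  # L2ARC read cache
  zpool add YourPool cache /dev/disk/by-id/ssd4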

What always helps is adding more RAM so your ARC read cache is bigger. But I guess that's also not an option if you don't want to buy new hardware.

Creating a striped mirror (raid10) of those 4 HDDs would of course double the IOPS performance. A raidz (raid5) is always bad for IOPS performance, as IOPS only scales with the number of vdevs, not the number of disks. So a 3-disk raidz has the same slow IOPS performance as a single disk.

Also keep in mind that a ZFS pool should not be filled more than 80%. Otherwise it will fragment faster and become slow.
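If you want to keep an eye on that, fill level and fragmentation can be checked like this (pool name is a placeholder):

  zpool list -o name,size,allocated,free,capacity,fragmentation YourPool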

And that raidz needs a volblocksize of 16K or higher. Otherwise you will lose 50% of your raw capacity (-33% parity and -17% padding overhead). And of that 50% you again lose 20%, because 20% should always be kept free. So only 40% of the raw capacity is actually usable.
For a mirror or striped mirror I would stick with the default 8K. Otherwise some workloads like PostgreSQL DBs will result in terrible performance.
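As a rough worked example of where the raidz1 numbers above come from (assuming ashift=12, i.e. 4K sectors, on a 3-disk raidz1):

  8K volblocksize:  2 data sectors + 1 parity sector = 3 sectors, padded up to 4 (allocations are rounded to multiples of parity+1)
                    -> 16K raw used for 8K of data = 50% usable
  16K volblocksize: 4 data sectors + 2 parity sectors = 6 sectors, no padding needed
                    -> 24K raw used for 16K of data = ~67% usable (the expected 2/3 for a 3-disk raidz1)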
 
Thank you for your reply. I have only tinkered in the Linux realm before, which is why I wanted to take a deep dive into Proxmox; because of this my terminology may be off, so my apologies.

So it sounds like I should definitely change my raidz1 of the 4x 1tb disks to a striped mirror configuration - Is this easy to do in the GUI or is this a command line only task? Looking this up is leading down a deep rabbit hole, but I will keep investigating.

Keep 8k volblocksize if going the stripe/mirror route or 16k and above if keeping raidz1?

What about the 3x 4tb disks? - They are 99% media storage and the only read/writes are adding/deleting large media files, should I move the vm-disks off here and re-create them with 16k or even 32k volblocksize? Leave these also in the raidz1 config or is there a better option I should be thinking of?

Also the SSD route for cache: I do have some smaller SSDs I can use for this purpose, and I have SATA ports available on the mainboard that I can use. This is the L2ARC cache, correct? I do have a decent amount of memory available for the ARC cache - I am only using 20-30GB of the available 64GB memory - so I'm not sure if the L2ARC cache will be much of a benefit? And if I go this route and data is lost on these disks, what happens to the data on the ZFS volumes/pools?

Apologies for the questions, just trying to broaden my understanding
 
So it sounds like I should definitely change my raidz1 of the 4x 1tb disks to a striped mirror configuration - Is this easy to do in the GUI or is this a command line only task? Looking this up is leading down a deep rabbit hole, but I will keep investigating.
Don't mix up ZFS terminology. "Raidz1" or "raidz" is like a raid5. "Raidz2" is like a raid6. "Raidz3" is like a raid5 or 6 but with 3 parity disks. "Mirror" (or I think it's called "z-raid1" in the PVE installer) is like a raid1. "Striped mirror" is like a raid10.
You can create that striped mirror using the webUI (see YourNode -> Disks -> ZFS -> Create: ZFS), but this requires the disks to be unpartitioned. So you first need to back everything up, then wipe those disks (YourNode -> Disks -> select the correct disk -> "Wipe disk" button).
Keep 8k volblocksize if going the stripe/mirror route or 16k and above if keeping raidz1?
8K for the mirrors or striped mirrors of up to 4 disks. 16K for a raidz1 of 3 disks.
What about the 3x 4tb disks? - They are 99% media storage and the only read/writes are adding/deleting large media files, should I move the vm-disks off here and re-create them with 16k or even 32k volblocksize? Leave these also in the raidz1 config or is there a better option I should be thinking of?
Yes, you need to delete and recreate the zvols (virtual disks) after setting the block size of the ZFS storage to 16K (you can set that at Datacenter -> Storage -> YourZfsStorage -> Edit -> Block size). The volblocksize can only be set at creation of a zvol and can't be changed later. So the easiest way is to back up and restore a VM, or migrate it back and forth between two nodes. If you have enough free space you could also work with replication (google for "zfs send | zfs receive") and create a copy of a zvol on the same pool.
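If you prefer the CLI, the same storage option should be settable with pvesm (the storage ID "YourZfsStorage" is a placeholder; this only affects zvols created afterwards):

  pvesm set YourZfsStorage --blocksize 16k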
Also the SSD route for cache: I do have some smaller SSDs I can use for this purpose, and I have SATA ports available on the mainboard that I can use. This is the L2ARC cache, correct? I do have a decent amount of memory available for the ARC cache - I am only using 20-30GB of the available 64GB memory - so I'm not sure if the L2ARC cache will be much of a benefit? And if I go this route and data is lost on these disks, what happens to the data on the ZFS volumes/pools?
L2ARC and SLOG are just cache, so redundant data. No problem if you lose them.
But if you really want to see a performance increase I would recommend using special devices. But there you really want mirroring, as the pool can't compensate for their loss.
 
Don't mix up ZFS terminology. "Raidz1" or "raidz" is like a raid5. "Raidz2" is like a raid6. "Raidz3" is like a raid5 or 6 but with 3 parity disks. "Mirror" (or I think it's called "z-raid1" in the PVE installer) is like a raid1. "Striped mirror" is like a raid10.
You can create that striped mirror using the webUI (see YourNode -> Disks -> ZFS -> Create: ZFS), but this requires the disks to be unpartitioned. So you first need to back everything up, then wipe those disks (YourNode -> Disks -> select the correct disk -> "Wipe disk" button).
Yup, I do understand I need to blow away the disk config to make any changes in this regard. I assume the config is similar to an LSI or similar RAID controller, where I create the two mirrors, then create a striped array, and the two mirrors become available to select?

Yes, you need to delete and recreate the zvols (virtual disks) after setting the block size of the ZFS storage to 16K (you can set that at Datacenter -> Storage -> YourZfsStorage -> Edit -> Block size). The volblocksize can only be set at creation of a zvol and can't be changed later. So the easiest way is to back up and restore a VM, or migrate it back and forth between two nodes. If you have enough free space you could also work with replication (google for "zfs send | zfs receive") and create a copy of a zvol on the same pool.

Cool, I have this pool set to 16k already so I will leave it be. Great tip about zfs send/receive! Up until I learned about this I have been manually creating another virtual disk, mounting to the VM and manually moving the data across and then removing the old virtual disk - this takes about 20 hours :P.

L2ARC and SLOG are just cache, so redundant data. No problem if you lose them.
But if you really want to see a performance increase I would recommend using special devices. But there you really want mirroring, as the pool can't compensate for their loss.
I have some consumer grade SSDs I can get my hands on for this. I know this is not recommended but it's what I have access to :P. Do you have any documentation around configuring this?
 
Yup, I do understand I need to blow away the disk config to make any changes in this regard. I assume the config is similar to an LSI or similar RAID controller, where I create the two mirrors, then create a striped array, and the two mirrors become available to select?
When using the webUI you just select the 4 disks and select raid10 as the raid level. When using the CLI you can directly create a striped mirror with those 4 disks, or you can create a single mirror first and add another mirror to it afterwards. For 4K sector HDDs the syntax would be:
  zpool create -f -o ashift=12 YourPoolName mirror /dev/disk/by-id/disk1 /dev/disk/by-id/disk2 mirror /dev/disk/by-id/disk3 /dev/disk/by-id/disk4
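The two-step variant mentioned above would look roughly like this (same placeholder names):

  zpool create -f -o ashift=12 YourPoolName mirror /dev/disk/by-id/disk1 /dev/disk/by-id/disk2
  zpool add -o ashift=12 YourPoolName mirror /dev/disk/by-id/disk3 /dev/disk/by-id/disk4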
Cool, I have this pool set to 16k already so I will leave it be. Great tip about zfs send/receive! Up until I learned about this I have been manually creating another virtual disk, mounting to the VM and manually moving the data across and then removing the old virtual disk - this takes about 20 hours :p.
Yes, ZFS has a lot of great features like these. But most people aren't willing to invest the time to learn how to use ZFS the right way and just want a simple software raid.^^
zfs send | zfs receive should be way faster, as it works on the block level. Another benefit is that you don't lose snapshots that way. And you can use it recursively, it can do incremental transfers, and you can even tunnel it through SSH over the internet between pools of different servers.
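A minimal sketch of copying a single zvol on the same pool that way (pool, VM ID and disk names are placeholders):

  zfs snapshot YourPool/vm-100-disk-0@copy
  zfs send YourPool/vm-100-disk-0@copy | zfs receive YourPool/vm-100-disk-1
  # then point the VM config at the new zvol and destroy the old zvol and the snapshot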
I have some consumer grade SSDs I can get my hands on for this. I know this is not recommended but it's what I have access to :P. Do you have any documentation around configuring this?
https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954
https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_special_device
 
I just wanted to update you guys.

After a bit of research and more learning about ZFS in general, I have installed two SSDs, partitioned each identically with two partitions - one small 5-10GB partition for the SLOG and ~80GB for the special device.

I decided not to worry about L2ARC as my VMs on a fresh boot are only consuming ~20GB of the 64GB total memory, so I am just letting ARC do its thing.

I have created the log and special vdevs as mirrors and attached them to the pool holding my VMs. I have yet to rebuild this pool as raid10, but I am already seeing improvements with the VMs sitting mostly idle performing their usual duties - I am sitting at 0-0.5% IO delay, whereas before I was averaging 5-10%.
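For anyone finding this later, adding the two mirrored vdevs from the SSD partitions looked roughly like this (pool and device names are placeholders):

  zpool add YourPool log mirror /dev/disk/by-id/ssd1-part1 /dev/disk/by-id/ssd2-part1
  zpool add YourPool special mirror /dev/disk/by-id/ssd1-part2 /dev/disk/by-id/ssd2-part2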

I think I am on a good path here given my hardware constraints and I am really enjoying using Proxmox - being stuck in the VMware ecosystem was getting under my skin. Big thank you to @Dunuin for the advice so far and the push to make changes I was not 100% sure about.
 
Just keep an eye on the SSD wear and monitor your pool state. A SLOG can kill SSDs quite fast, so it is not ideal to put the SLOG and special device on the same SSD, as a loss of the special device vdev also means loss of the entire pool. Not that bad when using a mirror, since at least one SSD is allowed to fail. But you really should replace a failed SSD as fast as possible, as SSDs tend to fail soon after each other.
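A quick way to check both (pool name and device are placeholders; the exact wear attribute name varies by SSD model):

  zpool status YourPool    # pool and vdev health
  smartctl -a /dev/sdX     # SMART data including wear/wearout attributes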
 
