Best disk setup for config

stevenwh

Member
Mar 16, 2024
Hello everyone,
First I want to say, I know this question gets asked a lot. I apologize for asking a similar question again. I've spent a little time trying to read through a lot of the posts that I could find, but honestly I just don't have the time to translate others' specific setups to mine. And I'm failing at finding any generalized documentation to help me decide the route to go. Just being frank, I don't have time to learn all the intricacies of storage configurations and I'm hoping someone can just help me with straightforward, simple answers.

I'm configuring a new hypervisor server for mostly homelab use, but it's kind of an elevated homelab that will be running some personal-use servers and possibly a few friends' and family's servers. So not strictly a playground. Performance is an important factor for me on some things. Up until now I've just been playing around testing Proxmox vs ESXi, deciding which way I wanted to go. I'm leaning now toward Proxmox because of the built-in support for ZFS, as well as because the server I'm using is an old R730 and ESXi has deprecated the CPUs in it. So upgrades later might become problematic, and I feel it would be better to go ahead and go with Proxmox for hopefully better / longer support for the hardware. I don't want to get invested in ESXi now and in a year or two need to upgrade for some reason and not be able to because the CPUs have been fully removed from support.


So, all that said, my hardware: an R730 with 768 GB RAM and 2 x E5-2690 Xeon processors. I also have a Tesla P40 in the server for vGPU, as well as a 4060 Ti for single passthrough that will be used for possibly some gaming, possibly some AI training. Additionally I might put in a Tesla P4 or an RTX A2000 if needed for additional passthrough or something. I also might get a couple of dual-NVMe PCIe cards if they would make a big enough difference on performance. Currently have 13 x 4TB 2.5 inch SSDs in the server, and may get 3 more to fill it out to 16. The boot device is a single 500 GB SSD.

In my testing I've definitely been hitting some kind of IOPS bottleneck when just running on the single SSD. Multiple VMs doing things simultaneously hit hangs and poor performance even when the CPU is at < 2% overall usage. I haven't figured out how to verify that it's an IOPS problem, but it's the only thing that makes sense, since if just one VM is running, or only one is actively doing stuff, it runs beautifully.

So, I've been considering the options... if I just stayed with my 13 current drives, should I run 2 x 6 vdevs in RAIDZ2? I will say I don't really want to waste 6 drives' worth of storage to do a full-on mirrored vdev setup, which I gather would give the best performance. If I understand correctly, if I did 2 vdevs striped then I should get the IOPS of 2 disks? I'm worried that won't be enough though if I'm hitting some kind of bottleneck with just a single disk already =/ Maybe I should get the 3 more 4 TB drives and run a 4 x 4 RAIDZ1? But I'm pretty sure I read somewhere that performance is lessened when running an even number of disks on RAIDZ1 =/ so would 4 x 4 RAIDZ1 or 3 x 5 RAIDZ1 have better performance there?

Then there is the possibility of the NVMe drives. With the main array being SSDs, would there be a noticeable improvement if I had mirrored NVMe drives as cache? I could put up to 4 x 2 TB NVMe drives in there. Although that seems a little excessive as far as capacity for just caching? This is the part where I'm still really unsure / inexperienced.

For the VMs I'll be running, there will probably be 10ish Linux VMs doing various tasks: web servers, email servers, game servers (Minecraft, 7 Days to Die), etc. There will also be a couple of Windows VMs, possibly a Windows Server VM. At least one Windows VM would be used as a cloud gaming server for myself when I'm traveling for work and such, so that I don't have to lug around a big gaming laptop hehe. Probably another few Windows VMs that would serve as cloud workstations. I do contract work and generally set up a new environment for each contract to keep everything separated. I may have up to 3 or 4 contracts going at once. I know this is kind of a mix of business + homelab, but I don't make enough to go full-on enterprise setup, so I'm trying to make do with a kind of hybrid system here.

All the VMs collectively might use up 8 to 10 TB of storage. And I'd like to have some space to use as network shares. Plus, knowing that SSD performance tanks the fuller the drives get, I would like to make sure I have some overhead for that. So I'm thinking I want the total usable array size to be at least in the 20s if not the 30 TB range.

Also just to note, the server has a dual 10 Gb uplink to the local network, so I don't think network bandwidth will be an issue for local traffic. Dual 1 Gb upstream for remote connections (yeah, I'm a crazy person who has 2 ISPs at home; I have a firewall appliance that handles load balancing of the two WANs).

If I missed any important details, please let me know.

I'd really appreciate it if anyone knowledgeable could just tell me the key points of configuring the storage to get the best performance with the hardware I have to work with. I think I want to go with ZFS, but I'm not completely sold on it if something else would be substantially better. I want some redundancy on the server, mostly to minimize any downtime in the event of a failure. I don't need a ton in the server itself as I do have another on-site backup solution as well as an off-site backup solution already in place (each with their own raid redundancy).

Thanks for any advice!
 
I don't have time to learn all the intricacies of storage configurations and I'm hoping someone can just help me with straightforward, simple answers.
Once you've decided what storage to use, you will have to learn how it works and how to administer it, and have a good backup strategy and disaster recovery plan. If not, you will probably lose or at least risk your data sooner or later.
There is, for example, no way to replace a failed ZFS disk via the webUI; this has to be done via the CLI, and a typo or a bad understanding might wipe your whole pool...
PVE will also not warn you if the pool degrades. You will have to set up proper monitoring for that, and do stuff like setting quotas via the CLI.
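For reference, a rough sketch of what that CLI replacement looks like — pool name and device paths here are just placeholders:

Code:
# check which disk failed and the pool state
zpool status tank
# swap the dead disk for the new one (use the stable by-id paths)
zpool replace tank /dev/disk/by-id/ata-OLD_FAILED_DISK /dev/disk/by-id/ata-NEW_DISK
# watch the resilver until it finishes
zpool status -v tank

Get the device name wrong there and you resilver onto (or wipe) the wrong disk, which is exactly the kind of mistake meant above.
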
Currently have 13 x 4TB 2.5 inch SSDs in the server, and may get 3 more to fill it out to 16. The boot device is a single 500 GB SSD.
I hope those are not QLC and are all enterprise/datacenter grade, as you talked about ZFS...


if I just stayed with my 13 current drives, should I run 2 x 6 vdevs in RAIDZ2? I will say I don't really want to waste 6 drives' worth of storage to do a full-on mirrored vdev setup, which I gather would give the best performance.
You said you care about performance. Then a raid10, aka striped mirror, is the only way to go. It will also help to keep the block size lower, and you will be way more flexible when adding/removing vdevs.
Also keep in mind that you shouldn't fill your pool too much. The usual recommendation is to only fill it to 80%. You should also search this forum for "padding overhead", as most people don't understand that any raidz might waste tons of space if the blocksize was chosen too small, and then you might end up losing that 50% of raw capacity anyway.
And in case you want to make use of snapshots, you probably want to plan in lots of space for those too.
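A striped mirror is just a pool made of several 2-disk mirror vdevs; a minimal sketch, with pool name and disk paths as placeholders:

Code:
# create a raid10-style pool from two mirror pairs
zpool create tank mirror /dev/disk/by-id/disk1 /dev/disk/by-id/disk2 mirror /dev/disk/by-id/disk3 /dev/disk/by-id/disk4
# grow it later by adding another mirror pair
zpool add tank mirror /dev/disk/by-id/disk5 /dev/disk/by-id/disk6
# keep an eye on how full it gets (stay under ~80%)
zpool list tank
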

If I understand correctly, if I did 2 vdevs striped then I should get the IOPS of 2 disks?
Correct.


Maybe I should get the 3 more 4 TB drives and run a 4 x 4 RAIDZ1? But I'm pretty sure I read somewhere that performance is lessened when running an even number of disks on RAIDZ1 =/ so would 4 x 4 RAIDZ1 or 3 x 5 RAIDZ1 have better performance there?
Not that bad with ZFS. But because of the odd number of data disks you will lose some capacity to padding overhead unless you increase the volblocksize.
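If you do go raidz, at least check what Proxmox will use as the volblocksize for new VM disks; a sketch, assuming a storage named local-zfs and an example zvol name:

Code:
# raise the default volblocksize for new VM disks on this storage
pvesm set local-zfs --blocksize 16k
# check what an existing zvol was created with (it cannot be changed afterwards)
zfs get volblocksize rpool/data/vm-100-disk-0
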


Then there is the possibility of the NVMe drives. With the main array being SSDs, would there be a noticeable improvement if I had mirrored NVMe drives as cache? I could put up to 4 x 2 TB NVMe drives in there. Although that seems a little excessive as far as capacity for just caching? This is the part where I'm still really unsure / inexperienced.
People who add L2ARC/SLOG disks usually do that because they don't understand ZFS well and think more cache is always better.
Better than L2ARC would be to buy more RAM for a bigger ARC; with L2ARC you are sacrificing some fast read cache (the L2ARC index itself lives in RAM) to get more slow read cache. It could even cripple performance.
SLOG will only help with sync writes. It won't help at all with async writes, and those are usually what you are doing most of the time. So it's only useful if you've got crappy SSDs (i.e. consumer/prosumer grade without PLP) or you are doing lots of sync writes (running busy DBs and so on).
There are some niche cases where those absolutely make sense, but I would only add them if you are sure your workload really needs them.
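Before buying anything, check how your existing ARC is doing; if the hit rate is already close to 99%, an L2ARC won't buy you much. Something like:

Code:
# summary of ARC size and hit rates
arc_summary | less
# or watch hits/misses live, refreshed every 5 seconds
arcstat 5
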

servers, game servers (Minecraft, 7 Days to Die)
Don't expect good performance. Games, especially Minecraft, need fast single-threaded performance, and your Xeons are really bad at that; they're still beaten by a $55, 6 W TDP Intel N100 mobile CPU.
 
Once you've decided what storage to use, you will have to learn how it works and how to administer it, and have a good backup strategy and disaster recovery plan. If not, you will probably lose or at least risk your data sooner or later.
There is, for example, no way to replace a failed ZFS disk via the webUI; this has to be done via the CLI, and a typo or a bad understanding might wipe your whole pool...
PVE will also not warn you if the pool degrades. You will have to set up proper monitoring for that, and do stuff like setting quotas via the CLI.

I hope those are not QLC and are all enterprise/datacenter grade, as you talked about ZFS...
I will learn enough to manage the storage. But I just don't have the time to become an expert in the various configurations and the benefits or drawbacks of each. Writing scripts for monitoring won't be a problem for me; I've worked in software engineering for over 20 years, so when it comes to any custom scripting or something like that, it's something I can easily handle.

The drives I have currently are TLC but not datacenter grade. As I said, I don't have enough budget to go full-on enterprise setup. I've got to make do with what I have, which is why I'm seeking advice on the best configuration for it.

You said you care about performance. Then a raid10, aka striped mirror, is the only way to go. It will also help to keep the block size lower, and you will be way more flexible when adding/removing vdevs.
Also keep in mind that you shouldn't fill your pool too much. The usual recommendation is to only fill it to 80%. You should also search this forum for "padding overhead", as most people don't understand that any raidz might waste tons of space if the blocksize was chosen too small, and then you might end up losing that 50% of raw capacity anyway.
And in case you want to make use of snapshots, you probably want to plan in lots of space for those too.
Yeah, I care about performance as far as getting the best I can within my constraints. If I have to sacrifice some performance to be able to have the amount of storage I want, then that is what I have to do.

Correct.



Not that bad with ZFS. But because of the odd number of data disks you will lose some capacity to padding overhead unless you increase the volblocksize.



People who add L2ARC/SLOG disks usually do that because they don't understand ZFS well and think more cache is always better.
Better than L2ARC would be to buy more RAM for a bigger ARC; with L2ARC you are sacrificing some fast read cache (the L2ARC index itself lives in RAM) to get more slow read cache. It could even cripple performance.
SLOG will only help with sync writes. It won't help at all with async writes, and those are usually what you are doing most of the time. So it's only useful if you've got crappy SSDs (i.e. consumer/prosumer grade without PLP) or you are doing lots of sync writes (running busy DBs and so on).
There are some niche cases where those absolutely make sense, but I would only add them if you are sure your workload really needs them.

If NVMe drives would be a stopgap measure to increase performance due to non-enterprise disks, or because of a less-than-optimal configuration to keep capacity where I want it, then I'm willing to do that. If I can just throw more memory at it to achieve the same effect, I'm perfectly fine with that. This server, as I said, has 768 GB, which is far more than I need. My VM load will probably consume more like 300 GB of memory on the high side. I can throw 300-400 GB of memory at cache if it benefits performance.
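From what I've read, raising the ARC limit is just a ZFS module parameter; something like this (256 GiB as an example value, in bytes):

Code:
# let the ARC grow to ~256 GiB
echo "options zfs zfs_arc_max=274877906944" > /etc/modprobe.d/zfs.conf
update-initramfs -u
# or change it on the fly without a reboot
echo 274877906944 > /sys/module/zfs/parameters/zfs_arc_max
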

Don't expect good performance. Games, especially Minecraft, need fast single-threaded performance, and your Xeons are really bad at that; they're still beaten by a $55, 6 W TDP Intel N100 mobile CPU.
Gaming performance isn't a top priority. I've run these types of servers on slower hardware than these Xeons. And if I really want to max out performance on a specific game server, then I'll just host that one on my desktop where I'm running a 13900K.

I thought (free) ESXi was EOGA. Were you thinking of purchasing it?
I have some licenses available to me through some contracts I work on if I decided to go the ESXi route. But like I said, I don't like that option since my CPUs are end of life. Plus, another big thing pushing me toward Proxmox is ESXi's lack of support for software raid. I do have a PERC H730P in the server and could run hardware raid, but I don't really want to unless it gave me a significant performance advantage. (I currently have it in HBA mode, and from everything I can find it is a true HBA, so I don't have the concerns of running ZFS on top of a hardware raid controller.) What research I have done tells me it doesn't really add a large benefit to performance though. So I'd rather not rely on an old hardware raid card that, if it goes bad, means I have to find exactly the same one to replace it and recover the array. To do ESXi I'd have to either rely on an external NAS with iSCSI block devices for the VM OS / boot storage, or a really hokey setup with a guest VM handling the arrays and then running scripts on the ESXi host to mount the datastores and start other VMs after that VM starts. Yes, it works, I've already experimented with that and wrote the scripts to do it... I just really don't like it set up like that lol. Most likely I'd go the external NAS and iSCSI blocks route if I went with ESXi. I have a small low-powered 4 x NVMe NAS board that could handle that task. (CM3588 if anyone is interested, they are neat little boards for a basic NAS.)
 
After thinking about it a little more, I've decided that it's actually fine on this server for me to do mirrored vdevs. If I max out my 16 drives with 4 TB SSDs I can still get 32 TB, minus drive and filesystem overhead and such, probably more like 26ish usable. But that is within my desired range. Even only using 80% of that, at 21ish TB, I think I'll be fine with that. For snapshots and such, I'll probably offload those to my slower HDD NAS that I have 60 TB on. Those aren't something I need fast access to under normal situations.

If I'm understanding correctly, just doing mirrored pairs gives the fastest performance, right? Not striped RAIDZ1 or RAIDZ2 vdevs? That design would also make it easier for me to just start with 12 drives right now and keep my 13th as a spare until I can get one more to add another mirrored pair, and then 2 more later on to max out the 16 slots on my server. (I'd most likely end up with at least 4 more total and always have one cold spare on hand ready to pop in if a drive fails.)

If that is right, the only question that remains is whether I should do NVMe cache drives or not, and if so, figuring out the appropriate size / mirror configuration for them. How do you determine where the break point is between them adding speed and actually hurting speed?
 
So, I went ahead and configured 12 disks in ZFS RAID10 and have been doing some testing on it using fio.

I'm seeing around 2400-2500 MB/s using a fio randwrite with a 1M block size. If I use a 4k block size it drops to more like 3 MB/s... IOPS are ~2200 for 1M and ~785 for 4k. Latency in both tests is mostly in the microsecond range (94% under 500 microseconds on the 4k test).
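(Roughly these two commands, with everything else left at fio defaults; file name and size are arbitrary:)

Code:
fio --name=write1m --rw=randwrite --bs=1m --size=16g
fio --name=write4k --rw=randwrite --bs=4k --size=16g
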

This is another place where I'm going to show my lack of storage knowledge lol. The 1M randwrite speeds seem OK, but everything else seems low? Especially when SSDs are supposedly rated in the tens of thousands if not hundreds of thousands of IOPS lol. I know their ratings are under perfect, ideal conditions and probably not practical, but I feel like I should be seeing more than 2200 striping across 6 disks?

Or are these numbers ok and I just have no idea what I'm talking about? lol I'm working on configuring a few test VMs to see how performance feels when multiple VMs are doing things.
 
For snapshots and such, I'll probably offload those to my slower HDD NAS that I have 60 TB on. Those aren't something I need fast access to under normal situations.
Snapshots are an integral part of ZFS and will always be stored on the disks of the pool you create the snapshot on. So your SSDs, if you create a snapshot on that pool. And snapshots will grow over time: if you create a snapshot and keep it for 2 months, nothing you overwrite or delete after creating that snapshot is really freed; the old blocks still consume space until you destroy that snapshot.
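You can see exactly how much space snapshots are holding with something like this (dataset names are just examples):

Code:
# space pinned by each snapshot
zfs list -t snapshot -o name,used,referenced
# total space held by all snapshots of one VM disk
zfs get usedbysnapshots rpool/data/vm-100-disk-0
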

If I'm understanding correctly, just doing mirrored pairs gives the fastest performance, right? Not striped RAIDZ1 or RAIDZ2 vdevs?
Yes, for IOPS performance you want as many striped vdevs as possible. So 8x 2-disk mirrors will be twice as fast as 4x 4-disk raidz1.

If that is right, the only question that remains is whether I should do NVMe cache drives or not, and if so, figuring out the appropriate size / mirror configuration for them.
With hundreds of unused GBs of RAM and lots of consumer SSDs, I would only add one (or two in a mirror if you don't have a UPS/redundant PSUs) small Intel Optane for the SLOG; 16 or 32 GB should do the job, but a bigger one would help with durability.
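Adding it later is a one-liner anyway, roughly (pool name and device paths are placeholders):

Code:
# add a mirrored SLOG to the existing pool
zpool add tank log mirror /dev/disk/by-id/nvme-optane1 /dev/disk/by-id/nvme-optane2
# confirm the log vdev is there and watch whether it actually gets used
zpool iostat -v tank 5
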

know their ratings are under perfect, ideal conditions and probably not practical, but I feel like I should be seeing more than 2200 striping across 6 disks?
Depends on what kind of writes you do. You've got consumer SSDs that can't cache sync writes. If you did a latency test doing 4k random sync writes, those numbers would be totally normal. An enterprise SSD with PLP would be 100 times faster at that.
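If you want to see that for yourself, a sync-write latency test looks roughly like this (job name and size are arbitrary):

Code:
# 4k random writes with an fsync after every write - consumer SSDs crawl here
fio --name=sync4k --ioengine=libaio --rw=randwrite --bs=4k --fsync=1 --iodepth=1 --numjobs=1 --size=4g --runtime=30 --time_based
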

To see good performance you should try 16k random async writes with multiple jobs in parallel and a high queue depth.
 
Snapshots are an integral part of ZFS and will always be stored on the disks of the pool you create the snapshot on. So your SSDs, if you create a snapshot on that pool. And snapshots will grow over time: if you create a snapshot and keep it for 2 months, nothing you overwrite or delete after creating that snapshot is really freed; the old blocks still consume space until you destroy that snapshot.
Hmm, I did do a test snapshot and it let me select the destination directory, as long as the directory has VZDump backup files as a content type. Is there something to prevent me from creating such a directory on a remotely mounted path? Even if it did have some other restriction, this is Linux... can't I just replace a directory somewhere with a symbolic link to a mounted directory and still offload them that way? hehe. I honestly haven't tried any of these things, I just kind of assumed it was possible.

Yes, for IOPS performance you want as many striped vdevs as possible. So 8x 2-disk mirrors will be twice as fast as 4x 4-disk raidz1.


With hundreds of unused GBs of RAM and lots of consumer SSDs, I would only add one (or two in a mirror if you don't have a UPS/redundant PSUs) small Intel Optane for the SLOG; 16 or 32 GB should do the job, but a bigger one would help with durability.


Depends on what kind of writes you do. You've got consumer SSDs that can't cache sync writes. If you did a latency test doing 4k random sync writes, those numbers would be totally normal. An enterprise SSD with PLP would be 100 times faster at that.
Maybe one day I'll be able to afford enterprise drives hehe. Hopefully I can get by for now on the consumer ones I have. Looks like the cheapest enterprise 3.84 TB drives are around $410 each right now; I definitely can't dump $5-7k into this server right now hehe. And I have no idea what features they are missing to make them that cheap, since I see other Intel 3.84 TB enterprise drives going for $600ish per drive.

To see good performance you should try 16k random async writes with multiple jobs in parallel and a high queue depth.
So, I'm not sure what a high queue depth is. Looking at fio, I see an iodepth setting, is that the same? If I run this fio command

Code:
fio --ioengine=libaio --rw=randwrite --bs=16k --direct=0 --numjobs=4 --iodepth=4 --runtime=60 --time_based --name async-write --size=16g

I do see a lot higher IOPS, around 30k, which is around 475 MB/s of writes. Playing around with the iodepth doesn't seem to make a big difference on the IOPS though; setting it to 4, 8, or 12 all stays about the same. The number of jobs makes a bigger difference but seems to have diminishing returns: 16 jobs hits around 42k IOPS, 32 jobs peaks around 46k, and I hit 700 MB/s with the 32 jobs. But I stop seeing much improvement after that. Might be starting to CPU bottleneck by that point? I have 28 cores (56 threads counting hyperthreading), so maybe more than one fio thread per core is too much?
 
Hmm, I did do a test snapshot and it let me select the destination directory, as long as the directory has VZDump backup files as a content type.
Then that was a backup, not a snapshot.

Looking at fio, I see an iodepth setting, is that the same?
Yes. And you probably want it bigger, like 32 and not just 4.

Playing around with the iodepth doesn't seem to make a big difference on the IOPS though; setting it to 4, 8, or 12 all stays about the same.
Then your disks can't handle more.

The number of jobs makes a bigger difference

Might be starting to CPU bottleneck by that point?
Correct. Like I already said, that CPU has really bad single-threaded performance.
A Ryzen 7950X has roughly 3 times the single-threaded and multi-threaded performance of a Xeon E5-2690 v4 according to Geekbench.
 
Alright, thanks. Hopefully in practice this performance is adequate for my workload. I guess I'll find out once I start loading stuff up. I have no idea how to translate these performance numbers into what my VMs' workload would actually require hehe. I don't plan on running any big busy databases or anything like that. There will probably be some databases, but they'd be measured in queries per minute, not queries per second, even at the busiest of times hehe.

I see that you can put bandwidth limits in MB/s or ops/s on disks in VMs. I suppose I'll just have to monitor it as I build out the VMs, and if I start having issues I'll implement limits on the VMs that don't need to be as fast until I can afford higher-end hardware.
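If I'm reading the docs right, those limits just get set on the disk entry in the VM config, something like this (VM ID, storage and volume names made up for the example):

Code:
# cap one VM disk at 5000 write IOPS and 200 MB/s of writes
qm set 101 --scsi0 local-zfs:vm-101-disk-0,iops_wr=5000,mbps_wr=200
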
 
Well shoot, I thought I was settled... and then I decided I needed to test the hardware raid performance for myself, since it seems as if I might be getting CPU bottlenecked. Running the same fio commands as before with 12 disks in RAID10 on the PERC H730P, I can get it up to 2.8 million IOPS for about 10-15 seconds, then it goes down and settles around 85k IOPS. Guessing the initial 10 seconds is some sort of caching (not sure what exactly it's using for caching, will have to look into that and see if I can increase the size of it or not). But still, that's twice the IOPS, and it peaks at 5400 MB/s versus the 700 MB/s ZFS RAID10 was getting =/

If I'm leaving that much performance on the table using ZFS, I might just have to do the hardware raid after all lol. If I use the hardware raid though, I'll have to order a spare one to have on hand in case it ever fails. They are pretty widely available right now, but no idea if they will be later. I'll also have to do some more research on what ZFS features I'm giving up by using the hardware raid.
 
HW raid doesn't trim SSDs, so your consumer SSDs will wear out quickly.
ZFS will wear them out quickly too.
The lowest wear is ext4/LVM.
Personally, for a homelab, I'd use ext4/LVM and PBS as a daily or twice-a-day backup to have an "async RAID"..
 
ZFS isn't about performance; it's about features and data integrity. I prefer way slower storage where I can trust that I'll read back exactly the same data I wrote to it many years before, over fast storage that might return corrupted data without me knowing it.
Besides the bit-rot protection, things like replication and transparent block-level compression are features I don't want to miss. While you could get features like encryption, snapshots and so on using other filesystems on top of your HW raid, those will also cost additional performance that you didn't include in your HW benchmark but that is already accounted for when using ZFS.
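Compression in particular costs very little CPU with lz4 and is easy to check; a quick sketch with an example pool name:

Code:
# enable lz4 compression (affects newly written data)
zfs set compression=lz4 tank
# see how much space it is actually saving
zfs get compressratio tank
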
 