Why Does ZFS Hate my Server

so you think those 12Gb SAS drives doing 200 MB/s sequential writes on zfs sounds about right?

Glancing at this thread, that does not sound right for a (striped) mirror with a high enough block size, however anyone wants to justify it. Out of curiosity, I did not see you run a test at QD 1?

RAIDZs are hit and miss, really. Also, it's not your fault ...

it let me select raid10 when creating it :)

The PVE installer somehow uses RAID 0 / 1 / etc. nomenclature, which is completely wrong for ZFS. I think it's obvious what one means, but with RAIDZs and dRAIDs it gets confusing not to call them what the ZFS docs do.
 
these drives are connected directly to the system board; the raid card does not see them. however, i see in the BIOS the sata write cache is disabled. my understanding of zfs is it SHOULD be disabled, because zfs does its own caching, right?

The SATA write cache depends... It's a tricky topic; it can be completely pointless nowadays, or necessary.

HDDs:
ZFS does its own queuing with the ZFS I/O scheduler. So yes, ZFS will usually disable any sort of on-drive caching anyway.
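For reference, here is one way to check what a drive itself reports (a minimal sketch; the device names are just examples, adjust to your system):

hdparm -W /dev/sda              # SATA: show whether the drive's volatile write cache is enabled
hdparm -W0 /dev/sda             # SATA: turn the write cache off
sdparm --get=WCE /dev/sdb       # SAS: read the Write Cache Enable bit from the caching mode page
sdparm --set=WCE=0 /dev/sdb     # SAS: clear it (disable the write cache)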

NVMe's:
Things change with modern ZFS versions for NVMe drives, because NVMe has its own queues defined in the protocol itself, so (very new versions of) ZFS will not queue anything and will push straight to the drives without any queuing. And NVMe drives handle their cache in a completely different manner than HDDs.
NVMe drives don't usually have any separate cache (yes, there were 1-2GB DDR memory caches in the beginning), but nowadays the flash itself is used to act as a cache.
However the NVMe drive does its caching, or not, ZFS will leave it to the drive itself and just push without queuing. (I'm just not sure if it's already implemented or not, but it will definitely be this way soon, or already is.)

SATA SSDs:
Not sure, I was never interested in SATA SSDs, sorry... so I don't know much here.
SATA drives are, for me, just a stepping stone on the way to proper SSDs with a proper protocol (NVMe).

SCSI/SAS: It's almost dead as well; companies try to adopt it for SSDs, but it's still basically an enhanced SATA protocol, primarily meant for HDDs, and with HDDs the cache usually gets disabled anyway.
The problem with the future of SAS is that there are already NVMe drives out there that can be driven by multiple CPUs or servers. And that's where the future is heading.

I'm talking explicitly about SSDs; for HDDs, SATA/SAS is amazing.
Cheers.
 
SCSI/SAS: It's almost dead as well; companies try to adopt it for SSDs, but it's still basically an enhanced SATA protocol, primarily meant for HDDs, and with HDDs the cache usually gets disabled anyway.

Except when it comes to large pools of really large capacities: SAS allows for dual-actuator drives that would saturate SATA, e.g. the Exos 2X18.
 
so you think those 12Gb SAS drives doing 200 MB/s sequential writes on zfs sounds about right?
Yes. I don't think you have a hardware problem. Use a UPS and enable per-drive caching. Are you running some serious high-availability production workload on this box?
 
430 MB/s per your benchmark (didn't you say they were 2.5” 10k RPM?) sounds about right for two spinning mirrors; ZFS seems to perform slightly better than I expected from my history with HW RAID on those drives (it has been over 15 years since we used that kind of spinning disk in production). With compression enabled you will likely see even better numbers, depending on the compressibility of your workload.

If you test the other ones and you see a sudden drop, then check which of those drives is bad. But yes, I don't see anything wrong with either your SSD or HDD, other than that the Patriot are not to be trusted and the HDDs are ancient tech. There is a reason people purchased the Intel X25-E and the 160GB MLC by the boatload when Intel came out with them: most DB/VM systems didn't need the extra capacity, they needed the IOPS, and you could replace a rack of 15k RPM HDDs with just 2U of SSDs.

I’m fairly sure this is a home setup, but if you are looking at production, get 4 large SAS SSD and avoid the headache completely.
 
first of all, i thank everybody that stopped by and took time to post on this thread on this issue. i'm amazed how many people, who obviously already have a busy schedule, still took time to write out detailed responses. i know i couldn't respond to each post individually, because there was just too much testing and replying going on. i do appreciate it though. and also apalrd, who bothered with this a bunch too.

second, it seems the general consensus here is these drives are doing about what can be expected, and that zfs is just doing more in the background than i realize. so let's take for granted that this is the case.

the final issue i have here on this server is how VM's perform on this server. specifically, windows VM's. it is much better and much shorter to explain in a video. so if i could get you to spend 2.5 minutes (i sped it up already for you) to look at how windows is all but freezing on this server's ssd's. i'm sure there is some kind of windows setting or ZFS caching setting that could fix this: https://youtu.be/vJve0yA74tU
 
the Patriot are not to be trusted
I would even go one step further. Patriot consumer SSDs are known for having a firmware bug with the Phison controller so they lie about sync.
Using them for anything other than L2ARC is dangerous.
i know i couldn't respond to each post individually, because there was just too much testing and replying going on.
Which made the process unnecessarily complicated. We started to debate claims and benchmarks that you made before establishing some ground truth.
We still don't know:

- What your actual use case is
- What exact drives you own
- ashift

and many other settings.
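For example, most of that could be collected with a few commands (a sketch; "rpool" and the device names are placeholders, adjust to your pool and disks):

zpool status rpool                        # vdev layout: mirror / RAIDZ / stripe
zdb -C rpool | grep ashift                # the ashift actually in use per vdev
lsblk -o NAME,MODEL,PHY-SEC,LOG-SEC       # drive models and logical/physical sector sizes
smartctl -i /dev/sda                      # exact model, firmware, rotation rate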

the final issue i have here on this server is how VM's perform on this server.
We can continue to gish gallop from topic to topic and give you crystal-ball advice.
I personally think this does not lead to a good outcome.

In my opinion, the problem with bad performance (like your Windows VMs) is not the performance itself. Bad performance in ZFS and KVM is more of a symptom of something being wrong. That can work for some time but sooner or later will come back to haunt you.
 
In my opinion, the problem with bad performance (like your Windows VMs) is not the performance itself. Bad performance in ZFS and KVM is more of a symptom of something being wrong. That can work for some time but sooner or later will come back to haunt you.

The thread started with some absurdly low benchmarks and then got to the point of accepting that 12x 10k SAS drives are old, somehow not meant for VMs, etc, etc. But years ago everyone was running just that and their VMs were definitely not crawling. I found it strange the OP just gave up too.
 
This is like trying to fly a commercial airplane by just testing all the knobs instead of learning how to fly that thing ;)
this is about accurate :D
and many other settings.
here are the other things:

boot drives: Samsung 960 Pro SATA SSD's
storage drives: Toshiba PH0HNX0WTB20021G1CDUA00 12Gb/s SAS drives, 7200 RPM
box: Dell r730xd
controller: H730 mini in HBA mode or H330 mini mono, tried both
CPUs: 2x Xeon E5-2678 v3
RAM: 256GB DDR4
NIC: Intel X550-T2

the use case of this: it will be a secondary file server (students at school, so slower performance is fine here), proxmox backup server, and other lower-power workloads like running veeam and asterisk

Bad performance in ZFS and KVM is more of a symptom of something being wrong
if there's anything else you can think of...

The thread started with some absurdly low benchmarks and then got to the point of accepting that 12x 10k SAS drives are old, somehow not meant for VMs, etc, etc. But years ago everyone was running just that and their VMs were definitely not crawling. I found it strange the OP just gave up too.
it WAS the overwhelming consensus in the chat that the benchmarks look about right. but by all means. if there's anything else you can think of..
 
Toshiba PH0HNX0WTB20021G1CDUA00
I could not find any datasheet for that drive. Just make sure you use the correct ashift.

controller: H730 mini in HBA mode or H330 mini mono, tried both
I would be too scared to use them in production, see: https://www.truenas.com/community/threads/dell-h730-mini-im-confused.89652/

CPUs: 2x Xeon E5-2678 v3
RAM: 256GB DDR4
A dual-socket system and NUMA stuff is something I would have to look into, to make sure there are no catches.

the use case of this: it will be a secondary file server (students at school, so slower performance is fine here), proxmox backup server, and other lower-power workloads like running veeam and asterisk
Proxmox is a great hypervisor and a not-so-great NAS. But that is just my personal opinion.

What really makes a difference is that blockstorage behaves differently than normal file storage.

Storage for VMs is blockstorage. Blockstorage or zvol has a volblocksize, and that is a fixed size. That can have huge performance and storage efficiency implications!

Storage for file servers uses datasets, which have a recordsize. The recordsize is a max value! You can set it to 1M or leave it at 128k, and it will work fine.
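To make that concrete, a minimal sketch (the pool and dataset names are made up):

zfs create -V 100G -o volblocksize=16k tank/vm-101-disk-0   # zvol: volblocksize is fixed at creation time
zfs get volblocksize tank/vm-101-disk-0
zfs create -o recordsize=1M tank/share                      # dataset: recordsize is only an upper limit
zfs set recordsize=128k tank/share                          # can be changed later; only affects newly written files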

If you are interested in why that is important, you can read this:
https://github.com/jameskimmel/ZFS/blob/main/The problem with RAIDZ.md
but to give you a short 5min summary:

Use a 3-way mirror SSD Proxmox system for VMs. If you really need sync write performance, add a SLOG. But even without a SLOG, this will leave your 12Gb/s SAS drives in the dust! It is not even close! And you don't have to worry about fragmentation, r/w amplification, or why a 10GB disk takes up 20GB on your pool.

Use a RAIDZ2 TrueNAS system with HDDs for data storage and a small SSD as the boot disk. If you want to speed up metadata performance, add a mirrored special vdev.

That way you get the best of both worlds and mostly working ZFS out of the box.
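Roughly, those two layouts would be created like this (a sketch; pool and device names are placeholders, and in practice you would use /dev/disk/by-id/ paths):

zpool create -o ashift=12 vmpool mirror sda sdb sdc                 # 3-way SSD mirror for VMs
zpool add vmpool log nvme0n1                                        # optional SLOG for sync writes
zpool create -o ashift=12 datapool raidz2 sdd sde sdf sdg sdh sdi   # HDD RAIDZ2 for data
zpool add datapool special mirror sdj sdk                           # mirrored special vdev for metadata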
 
I would be too scared to use them in production, see:
sorry. I meant an HBA330 mini. that should be full HBA mode from the factory, even if not the fastest thing in the world.


Blockstorage or zvol has a volblocksize, and that is a fixed size
this is the ashift value that is selected at install, correct?


Storage for file servers uses datasets, which have a recordsize.
so my knowledge on the mechanics here is a bit limited. that's why I'm here I guess :)

so the file servers are windows server VM's. those would be sitting on the host in block storage? if so, where does the record size apply here? fyi, the default NTFS sector size is 4k. will increasing that waste storage?


Use a 3-way mirror SSD Proxmox system for VMs
how would that be different from the ssd mirror I have now in regards to performance?
 
this is the ashift value that is selected at install, correct?
No. Ashift should match the sector size of your disks (it is a power of two: 2^12 = 4k, 2^9 = 512b). So most likely 12 for 4k drives, or 9 for 512b if you have old or special drives.
Since I don't know the drives you use, I don't know what ashift you should use.
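A quick way to see what the drives report, and to pin the ashift explicitly at pool creation (device and pool names are just examples):

smartctl -i /dev/sda | grep -i 'sector size'    # e.g. "512 bytes logical, 4096 bytes physical" -> ashift=12
lsblk -o NAME,PHY-SEC,LOG-SEC                   # the same information for all drives at once
zpool create -o ashift=12 tank mirror sda sdb   # force 4k alignment when creating the pool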

volblocksize and recordsize are something different.
so the file servers are windows server VM's. those would be sitting on the host in block storage?
Yes. But I would not do that to begin with, and would rather integrate TrueNAS into the AD and create SMB shares from there, so I can have datasets instead and avoid the blockstorage disadvantages.
fyi, the default NTFS sector size is 4k. will increasing that waste storage?
The sector size of NTFS is another topic again.
how would that be different from the ssd mirror I have now in regards to performance?
I don't know what SSD mirror you have right now.
A 3-way mirror will have no write benefit but a read benefit compared to a 2-way mirror.
But the main reason for 3-way is that you can lose more than one disk, just like with RAIDZ2 over RAIDZ1.
Use at least half-decent SSDs that don't lie about sync, aren't QLC, and have a decent TBW.
 
So most likely 12 for 4k drives, or 9 for 512b if you have old or special drives.
is there a situation where using ashift 12 on 512b sector drives leads to a performance degradation? they are perfectly divisible, so I can't imagine it's too bad
But I would not do that to begin with, and would rather integrate TrueNAS into the AD and create SMB shares from there,
that will become possible once we get more boxes. the reason we're trying to hyperconverge is to maximize what we get out of each box


Use at least half-decent SSDs that don't lie about sync, aren't QLC, and have a decent TBW.
check check and pretty good check on the 860 pro from samsung
 
About the ashift thing: unless you have migrated the pool, intentionally set things wrong, or had your disks passed through a RAID controller during setup, it should be correct, because ZFS figures it out. You can still extract the information, but I suspect that is not the problem. Note that a lot of the criticism of ZFS's performance ALSO applies to regular professional RAID setups; if there are performance benefits with certain RAID implementations, it is because they are doing certain things "faster" in ways that may cause data loss (e.g. the RAID write hole). For the last couple of decades, hardware RAID has generally been slower than software-based solutions (such as LVM and ZFS); the Sun Thumper kind of proved that hardware RAID was dead.

The problem with mismatching 512b vs 4k is that if you use the wrong one, e.g. if you format the disk as 512b on the VM side (the zvol) while your pool is set to 4k, then for every write that needs to be acknowledged (since you indicated earlier that you have cache=none, so you do direct writes to disk by default and ignore all cache), your disk must read the 4k, modify the 512b part of it and write the 4k back, versus just writing the 4k. That easily quarters your IOPS/throughput. That is the "simplest" explanation for how these things work.
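If you want to see that effect for yourself, a hedged fio sketch at QD1 (the zvol path is an assumption; only run this against a throwaway test zvol, since it writes to the device):

fio --name=w512 --filename=/dev/zvol/tank/testvol --rw=randwrite --bs=512 --direct=1 --iodepth=1 --ioengine=libaio --runtime=30 --time_based
fio --name=w4k --filename=/dev/zvol/tank/testvol --rw=randwrite --bs=4k --direct=1 --iodepth=1 --ioengine=libaio --runtime=30 --time_based

Sub-blocksize writes force a read-modify-write inside ZFS, so compare the IOPS between the two runs.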

If you have RAIDZx, for every write all disks must acknowledge they are done, so you will only go as fast as your slowest/busiest disk. Hence the recommendation, especially with spindles, to use 2-way or 3-way mirrors (that was the case with hardware RAID as well; nobody ran a VM cluster on a 12-wide RAID5). The other issue is WHEN a disk fails: now all disks must help rebuild the array, so 12 disks must be read to rebuild 1 disk, which is slow, so slow that there is enough time for another disk to fail, which wipes out all your data. Hence for 6-8 wide RAID the general recommendation is RAID6 (or RAIDZ2 on ZFS); for 10-12 wide you use RAIDZ3 (3 can fail, although RAID"7" is a rare beast), and the latter is generally only used for slow long-term archives like backups.

Hence why you see slow performance in your Windows VM: you are effectively going to a single spinning disk (which is what RAIDZx gives you, IOPS-wise), and for those of us who are old enough, booting off a single disk is slow in itself (by comparison). Yes, we found 5m+ boot times acceptable back then; today, when all my Linux servers can boot simultaneously in under 10s, even 30s feels long. I don't think current iterations of Windows 10 and 11 are even built to expect spinning disks, so there will be a lot of disk thrashing while it collects your data. I'm also sure Windows 11 is not supported on your CPU, so 'faking' the CPU means emulating certain instructions, which will be slow.

So the question is: what performance metrics do you get in Windows? What are your VM settings? What are you expecting?
 
The problem with mismatching 512b vs 4k is that if you use the wrong one, e.g. if you format the disk as 512b on the VM side (the zvol) while your pool is set to 4k, then for every write that needs to be acknowledged (since you indicated earlier that you have cache=none, so you do direct writes to disk by default and ignore all cache), your disk must read the 4k, modify the 512b part of it and write the 4k back, versus just writing the 4k. That easily quarters your IOPS/throughput. That is the "simplest" explanation for how these things work.
so setting an ashift 12 on a 512b disk should not lead to a performance penalty, because zfs will combine 4 into 1 and write over them? I'm quite sure the HD's are 512b, but I've tested with ashift 12 and 9 and didn't see a meaningful difference
 
Depends… it can't combine 4 writes into 1, mechanically speaking; you're just wasting some resources, as the write now has to be split at some level into 4, potentially introducing some overhead and latency, which as you noticed isn't really relevant for spinning disks. For NVMe SSDs, where you are capable of maxing out the CPU writing to disks, that may be an issue. There are hybrid disks where these things are handled at the disk level rather than the CPU, but for "modern" disks 4K is standard. Again, ZFS will look at the disk and set it to whatever the disk reports its block size to be; I have never had to set it manually, although as I said, HW RAID controllers passing through disks can misreport block sizes.
 
is there a situation where using ashift 12 on 512b sector drives leads to a performance degradation?
Yes, the situation is called always ;)
that will become possible once we get more boxes. the reason we're trying to hyperconverge is to maximize what we get out of each box
Sure. Just know that it comes with lots of downsides.
In my opinion, using a Windows Server VM, on top of Proxmox, with a RAW virtual disk, on top of ZFS, is like putting makeup on a pig.

But on the other hand, I don't know how much performance you expect, how you handle fragmentation, if you use zfs send and receive for backups, if you plan on filling your pool beyond 50% and many other things. Maybe you don't really care about performance and you are fine with that "suboptimal" setup. Nothing wrong with a suboptimal setup!
check check and pretty good check on the 860 pro from samsung
Perfectly fine, just slow compared to even a cheap WD Red SN700.
I also like to spread the risk by not using drives from the same batch / manufacturer / controller
 
Yes, the situation is called always ;)
is today always? or do these disks lie about their sector sizes for compatibility reasons?


In my opinion, using a Windows Server VM, on top of Proxmox, with a RAW virtual disk, on top of ZFS, is like putting makeup on a pig.
ok ok. lol. let's say virtualizing storage has its upsides. where else might one put windows server VMs? surely not on hardware RAID

Nothing wrong with a suboptimal setup!
this will be primarily second stage backups and lighter workloads. taking a bit of a performance hit won't be the worst thing.
 
is today always? or do these disks lie about their sector sizes for compatibility reasons?
To be honest, I don't get that model. In some datasheets they say 512.
And in marketing materials they say something that contradicts itself in my opinion.
4Kn, 512n Native Sector Technology, or 512e Advanced Format Sector Technology
Maybe this is firmware changeable? I really have no idea.
ok ok. lol. let's say virtualizing storage has its upsides. where else might one put windows server VMs?
I am not against blockstorage for VMs.
I think files from a file server are not optimal on blockstorage,
because of the fixed volblocksize,
and because a mirror is a little bit wasteful for a slow file server, while RAIDZ on the other hand can have huge implications for volblocksize. That isn't true for datasets, unless you have many small files.
 
If you don't care or don't wanna bother with volblocksize, recordsize, fragmentation, padding or r/w amplification, that is totally fine.
Just be warned that it CAN lead to poor performance and storage efficiency.

You can avoid said problems by sticking to two simple rules. And that is why these two points get recommended a lot here:
- Use mirror, not RAIDZ for VMs
- don't use blockstorage for data files (movies, music, PDF, docx....)
 
