Slow Dual ZFS Mirror Write Performance

Update: I just tried creating a new ZFS pool where I created the test LVM pool earlier. I also turned off compression as a test. Same performance. I used CrystalDiskMark for both tests with a 16GB test file. Task Manager shows the speeds going up to 120-160MB/s as expected for a second, then dropping back down to zero. It does this until the test is done, with varying peak speeds. I've attached a file with the results of "zfs get all" for this node. It should be all default.

Edit,
I think you asked this earlier as well: when I do a disk move from an SSD to the ZFS pool, it happens at around 100-160MB/s, which is more than acceptable and around what I would see through SMB as well.

Edit again,

I will be bringing up a third node eventually which will be running my and my partner's PCs in VMs. It will have one ZFS mirror for now, for the Proxmox host itself and the VM boot drives, and it will be using NVMe SSDs. I know I just complained about using an SSD cache drive, so maybe I'll give that a go once I get the write speed sorted for the spinning drives themselves. A bigger issue is that the IO wait skyrockets to 50-80% during sustained writes on the current ZFS mirror. So we'll see if that issue persists with the NVMe drives.
 

Alright, I think I figured out the issue. I manually created a new dataset on an existing zpool and it had a 128K recordsize by default. I mounted that, added it to the Shared PC Windows VM I have, and now I get ~100-120MB/s consistently. Testing the zvol that was created by Proxmox automatically with an 8K volblocksize, I get the same low 20-40MB/s write speeds.

I'll test some more block sizes later and update this post. Hopefully that's the issue, but I didn't see any other differences between the zvol and the dataset.
 
SMR is terrible, but if you had a cheap CMR drive large enough to handle your typical maximum-sized data set as cache, you would not have known they were there. Running multi-terabyte spindles with a 32-64GB SSD cache, I forget I'm dealing with rotating drives; it feels like all SSD to me.

When you say node here, do you mean that the two ZFS pools you showed me are in two different machines? If that's the case and you can tolerate the risk, making them into an 8TB stripe rather than a 4TB mirror pair should roughly double the throughput.


I have two similar machines - a workstation under my desk with an SSD, a pair of 1TB drives, and a single 4TB drive. Its twin downstairs runs Proxmox with an SSD and a single 10TB drive. Right now I rsync important stuff from the 1TB mirror to the 4TB, then in off hours I let it rsync over the (slow) wire to the 10TB drive. I have a third machine, a $200 miniPC, and with that I have the quorum needed to experiment with Ceph. The rsync solution is fine, I just want to get familiar with Ceph.

I've done the ZFS reading but focused on the storage needs of the apps I support - ArangoDB, Elasticsearch, MariaDB/MySQL, RabbitMQ, and I think Milvus will be joining the pack. The two-line zfs.conf is all I use - the first line caps how much RAM ZFS may use for ARC, the second line leaves prefetch enabled.

Code:
cat /etc/modprobe.d/zfs.conf

options zfs zfs_arc_max=34359738368
options zfs zfs_prefetch_disable=0
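One note: the modprobe.d file only takes effect once the module is reloaded (on Proxmox that usually means running update-initramfs -u and rebooting). To apply the ARC cap immediately, writing the same value to the live module parameter should work:

Code:
# 34359738368 bytes = 32 GiB
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max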


ZFS snapshots are slick:

zfs snapshot poolname/dataset@tag-to-describe-backup

zfs rollback poolname/dataset@tag-to-describe-backup

I used that method manually to checkpoint VirtualBox VMs; it's great for avoiding a reinstall of a machine because one piece of software I'm trying behaves badly. The Proxmox interface provides a slick graphical presentation of the same process.
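Listing and cleaning up old checkpoints is just as simple (same placeholder names as the commands above):

zfs list -t snapshot

zfs destroy poolname/dataset@tag-to-describe-backup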

When I first started using ZFS I set up a mirrored pair with a cheap SSD for cache. I've ejected the cache drive in the middle of operations and ZFS never misses a beat. Replacing a drive in the mirror is NOT automated like it would be with a RAID controller - you have to manually remove the old disk and add the new one - but this is also invisible to the VMs using the storage.
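For reference, the manual swap is only a couple of commands; a rough sketch, with pool and device names as placeholders:

Code:
zpool status tank                            # identify the old/failed disk
zpool replace tank old-disk-id new-disk-id   # attach the new disk and start the resilver
zpool status tank                            # watch the resilver progress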

Given what you're doing, there's another nice ZFS feature you might want to explore - it can make size-limited virtual block devices, like so:

zfs create -V 1G tentb/cephtest

ls -l /dev/zd*

brw-rw---- 1 root disk 230, 0 May 29 22:32 /dev/zd0
brw-rw---- 1 root disk 230, 1 May 29 22:32 /dev/zd0p1
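The size isn't locked in, either; the zvol can be grown later or destroyed when the experiment is done:

zfs set volsize=2G tentb/cephtest

zfs destroy tentb/cephtest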
 
Alright, I think I figured out the issue. I manually created a new dataset on an existing zpool and it had a 128K recordsize by default. I mounted that, added it to the Shared PC Windows VM I have, and now I get ~100-120MB/s consistently. Testing the zvol that was created by Proxmox automatically with an 8K volblocksize, I get the same low 20-40MB/s write speeds.

I'll test some more block sizes later and update this post. Hopefully that's the issue, but I didn't see any other differences between the zvol and the dataset.
That is REALLY interesting. I've never seen anything written that goes beyond matching the physical and logical sector size with ashift=12. I'm going to round up a cheap spindle for use in my Proxmox system and do some experiments with this.
 
Ugh, sorry for the misinfo: sync was set to standard, so I wasn't seeing the issue, as available RAM was really high and the cache never filled for long. I still ended up testing different block sizes, which had no effect with sync disabled; I consistently got about 60MB/s writes. Reads are great though, 3-4GB/s, so no complaints there; ZFS is doing its job, lol. But the issue is unfortunately not related to the block sizes, at least as far as I can tell.

Also, I just tested writes from two different guests on the same ZFS mirror pool and they averaged out to about the same speed. Transfers stopped and started quite a bit though.

As I mentioned previously, I have a third node I'm bringing online soon so I may just do a fresh install on the other two and pray...

If you end up getting a chance to test a new mirror yourself, definitely let me know the results.
 
Ugh, sorry for the misinfo: sync was set to standard, so I wasn't seeing the issue, as available RAM was really high and the cache never filled for long. I still ended up testing different block sizes, which had no effect with sync disabled; I consistently got about 60MB/s writes. Reads are great though, 3-4GB/s, so no complaints there; ZFS is doing its job, lol. But the issue is unfortunately not related to the block sizes, at least as far as I can tell.

Also, I just tested writes from two different guests on the same ZFS mirror pool and they averaged out to about the same speed. Transfers stopped and started quite a bit though.

As I mentioned previously, I have a third node I'm bringing online soon so I may just do a fresh install on the other two and pray...

If you end up getting a chance to test a new mirror yourself, definitely let me know the results.

I have a three-node cluster getting turned up; two of them are going to contribute disks to Ceph, and I'm puzzling over what role ZFS plays. There are some ZFS drives from this cluster's previous duties that have data on them, so it's a bit of a muddle. When the dust settles I'll have a couple of spare spindles; I may get them shipped to me so I can use them here for testing.
 
Still smacking my head against this, but I found some more useful information:
(screenshot: disk activity during the write test)
It seems that when doing any of these write tests the ZFS array is actually getting read while I'm doing writes. I'm going to review the SMB config, but I'm pretty sure this behavior is happening even when doing non-SMB write tests; we'll see.

Otherwise I'm really not sure what would be reading now that wasn't reading before when I was doing these writes. I tried setting logbias=throughput but didn't see a difference.
 
Otherwise I'm really not sure what would be reading now that wasn't reading before when I was doing these writes. I tried setting logbias=throughput but didn't see a difference.
Depending on the setup, you may have to read in order to write. This can be the case if you, e.g., have a 128K block and only change 4K of it: the whole block has to be read, modified, and rewritten to disk.
 
I would also highly recommend not disabling sync. With "sync=disabled", all secure sync writes are handled as insecure async writes. That basically means that on a kernel crash or power outage you might lose the whole pool, as critical writes that under no circumstances should be lost will be lost, because they are only cached in volatile RAM.

And CrystalDiskMark is pretty bad for benchmarking SSDs/HDDs, as it can't do sync writes, so you are basically benchmarking your RAM, not your storage. If you want to benchmark your actual physical disk performance, try fio with sync writes while disabling ARC caching (zfs set primarycache=metadata).
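Something along these lines should give a truer picture; a rough sketch, with pool/dataset names and the test path as placeholders:

Code:
# don't cache file data in ARC for the dataset under test
zfs set primarycache=metadata tank/test

# sequential sync writes: 4G file, 1M blocks, fsync at the end
fio --name=seqwrite --directory=/tank/test --size=4G --bs=1M \
    --rw=write --ioengine=psync --sync=1 --end_fsync=1

# restore normal caching afterwards
zfs set primarycache=all tank/test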
 
I would also highly recommend not disabling sync. With "sync=disabled", all secure sync writes are handled as insecure async writes. That basically means that on a kernel crash or power outage you might lose the whole pool, as critical writes that under no circumstances should be lost will be lost, because they are only cached in volatile RAM.
Really lose the whole ZFS pool? I would think only the last 5 seconds (if not changed), but consistency is always ensured. You may lose your VM consistency due to those critical writes, but IMHO not the ZFS pool per se.
 
Not the pool itself, but all pool contents. Yes, you lose "only" the last 5 seconds, but you are really screwed if those 5 seconds contain any data that shouldn't be lost because the guest can't recover otherwise. In my opinion, it is best to let the software developer decide when to use expensive but secure sync writes and when to use cheap but insecure async writes (so "sync=standard"). If the developer knows what he is doing, he will only use sync writes if it is really necessary. If you set "sync=disabled" you basically say: "I know better than the developer who actually wrote that software! I will lie to the software so I get some more performance and less wear, but then I'm screwed if something happens, as the software was never meant to be run that way..."
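Checking and setting this per dataset is quick (the dataset name is just an example):

Code:
zfs get sync rpool/data
zfs set sync=standard rpool/data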
 
Hi Dunuin and LnxBill! Thanks for the info. For what it's worth, sync has been set to always for a little while now. I believe when I had it off it was for testing.

Also, I've done some previous testing to confirm that the testing I do within the OS (CrystalDiskMark or a file transfer from the SSD-backed C: drive to the HDD-backed D: drive) is accurate.

I just did some testing with dd on the SMB server, Ubuntu 18.04. The same speeds are seen across other VMs and on my other node in the same cluster.
(screenshot: dd write test output)

I'm sure there are better parameters for testing writes, so please let me know if you have any suggestions. On the right is an example of "iostat 1" from the Proxmox node itself; "iostat 5" shows similar results.

(screenshot: iostat 1 output from the Proxmox node)

Values fluctuate all over the place, so writes can be anywhere from 8000kB/s to 40000kB/s for zd0. This is with primarycache=metadata.

Volblocksize is showing 1M for the SMB server's disk and the recordsize for the pool is 128K. Guessing that mismatch might be part of the problem? Should they at least both match to start? The purpose of this server is file backup and lots of larger files (>5MB, usually 25MB to 1GB), so I'm not concerned about performance for small file transfers.
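For reference, this is roughly what I'm comparing (the zvol and dataset names here are just examples from my setup):

Code:
zfs get volblocksize rpool/data/vm-100-disk-0   # 1M on the SMB server's virtual disk
zfs get recordsize rpool/data                   # 128K on the pool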

It's also worth noting that if I destroy the pool and test the disks independently, I get the expected write speeds, anywhere from 180-220MB/s. Read speeds are also as expected, a constant 240MB/s. It can be any file, not pulling from cache. The issue seems to be with my ZFS configuration somewhere, I'm just not sure where.

Edit, looks like it continues to read for a while after the write tests:

(screenshot: iostat showing continued reads after the write test)

It writes maybe 200kB/s occasionally while doing this but not much.

Thanks again everyone.
 
I would try a smaller volblocksize. 1M sounds very inefficient, as you should get massive read and write amplification when doing some small IO like updating metadata. And dd isn't really suited as a benchmark tool, especially when using "/dev/zero". Have a look at fio for proper benchmarking.
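For the small-IO side of it, a 4K random sync-write run with fio makes the amplification very visible; a sketch, with the test directory as a placeholder:

Code:
fio --name=randwrite --directory=/tank/test --size=4G --bs=4k \
    --rw=randwrite --ioengine=psync --sync=1 --runtime=60 --time_based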
 
I would try a smaller volblocksize. 1M sounds very inefficient, as you should get massive read and write amplification when doing some small IO like updating metadata. And dd isn't really suited as a benchmark tool, especially when using "/dev/zero". Have a look at fio for proper benchmarking.
Thanks for the suggestions.

I ran fio in the SMB VM and saw that speeds were at ~120-140MB/s, with frequent (every 2-3s) dips to 30-60MB/s. Writes should be in the neighborhood of 180-200MB/s without these dips, as we'll see below.

I'm going to test a 4K record/block size now, which requires migrating the VM back and forth between nodes so the volblocksize updates. These transfers work flawlessly: reads are split over the two drives in the mirror on the source node (~120MB/s per drive) and writes on the receiving node are ~220-240MB/s per drive. So ZFS itself is fine; it's just the VMs, or maybe the volume, that is misconfigured in some way, since writes from within them happen at half speed and with some level of disruption.

Will update once the VM is on 4k blocksize. If anyone knows of a quicker way to update the blocksize on a zfs volume without migrating the VM or volume to another drive, let me know.

Thanks!
 
Will update once the VM is on 4k blocksize. If anyone knows of a quicker way to update the blocksize on a zfs volume without migrating the VM or volume to another drive, let me know.
There is no other way. Volblocksize can only be set at the creation of a zvol and cannot be changed later.

I would also test something like 16K and 64K.
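If I remember right, the default block size for newly created zvols can also be set from the CLI instead of the GUI; something like this, with the storage ID as a placeholder:

Code:
pvesm set local-zfs --blocksize 16k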
 
Hey SlimTom,

Unfortunately, no change. I even added two more drives to the mirror (so I think it's now two mirrors, striped), and I see OK performance, but it falls back to 50-ish MB/s and occasionally 0.
(screenshot: file transfer speed graph)

Peak is 400-500MB/s, so about what you'd expect for two of these drives if they were RAID0'd, maybe a bit less. But it's not consistent, as you can see. Sync is set to always, and I verified RAM usage doesn't change during writes. I've also tested a bunch of different block sizes (512, 4k, 8k, 16k, 256k, 512k, 1k, 4k), and all of them would have this issue eventually.

And the above screenshot is a file transfer over SMB, but I see the same behavior with local file transfers (SSD-backed C: drive to ZFS-backed D: drive) and using CrystalDiskMark.

The only thing I can think of is that maybe the SATA controller is bad? Not defective, just bad quality. I've heard that built-in SATA controllers can be less than ideal, but with my motherboard being one of those ASUS CSM boards with ECC support, I figured I'd at least get OK speeds. What motherboard do you have, if you don't mind sharing? Or do you use a SATA card for your drives?
 
OK, I looked into this a bit and tried disabling write caching in Windows. When I did that before, it would cause the transfer to drop to 0 way more often, and SMB transfer speeds dropped. So ATTO is showing 250MB/s at the higher IO sizes:

(screenshot: ATTO results)
However, this should be closer to 400-500MB/s. iostat 1 shows just one drive being written to during these operations.

I was being dumb and was testing against the SSD in the above screenshot. Speeds are 20MB/s with write caching disabled no matter what: CrystalDiskMark, ATTO, SMB, local file transfer. The drives get written to appropriately.


SMB transfers are also back to ~20MB/s, same for local (C: -> D:) transfers. So that's why I had write caching turned on. This same behavior happens in Linux VMs and on another Proxmox server. The one I'm on now is a fresh build from when I was testing this last time.

SMB transfer resource monitor and iostat summary. Looks the same no matter how we transfer.
(screenshot: SMB transfer resource monitor and iostat summary)
ATTO transfer and iostat summary: a solid 200MB/s, but writing the benchmark file in the first place happens at 20MB/s.
(screenshot: ATTO transfer and iostat summary)
 

I wouldn't say anything is wrong with your hardware, as I have exactly the same problems. New hardware, a bit different from yours, but exactly the same observations on data transfer, regardless of whether it's local or over SMB.

Did you try any of what Dunuin suggests in the link I sent?

"First you would need to increase your pool's blocksize (volblocksize) from the default 8K to 16K, or otherwise you will lose 50% of your raw capacity instead of just 33%, because you would get an additional 17% padding overhead. And volblocksize can't be changed after creation of your virtual disks, so you would need to destroy and recreate all virtual disks so the new blocksize will be used. You can do that by editing the blocksize (Datacenter -> Storage -> YourZFSPool -> Edit -> Blocksize) and then backing up all the VMs and restoring them, replacing the originals.
And I would disable atime and maybe enable relatime for your pool for better performance, so not every single read operation will cause an additional write."
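For the atime part, that would be something like this (the pool name is a placeholder):

Code:
zfs set atime=off tank
# or keep atime but cut down the extra writes:
zfs set atime=on tank
zfs set relatime=on tank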
 
I think this would be a good use case for an enterprise SSD acting as an SLOG, as it would reduce the write amplification on your mirror a fair amount; it would certainly reduce the IOPS required per operation. It wouldn't need to be big or overwhelmingly expensive; it would just need to be enough to buffer the incoming data and allow for slower writes to the main mirror. (I use an enterprise 256GB Samsung unit for this, in conjunction with 2x8TB Seagate EXOS drives, and it helped a lot on my end.)
Similar situation:
https://forum.proxmox.com/threads/need-help-for-setting-up-slog-and-zil.107581/
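Adding the SLOG after the fact is a one-liner; a sketch, with placeholder pool and device names (use the /dev/disk/by-id/ path of the SSD in practice):

Code:
zpool add tank log /dev/disk/by-id/ata-SAMSUNG_SSD_EXAMPLE
zpool status tank   # the device should now show up under a "logs" section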
 
