Slow Dual ZFS Mirror Write Performance

LnxBil · Oct 13, 2023

Keep in mind that a SLOG device only helps with sync writes. Copying files is normally asynchronous, so the copy job is always lightning fast at the beginning, until the cache on the destination fills up and writing to disk starts. Then the throughput drops significantly, often almost to a halt. Just monitor with arcstat what your ZFS is doing in order to understand the write pattern better.

I can recommend using Intel Optane NVMe for a SLOG device, which is MUCH MUCH faster than SATA/SAS SSDs. The latency is much better and therefore the ZFS pool feels much faster on sync writes. For most workloads, 16 or 32 GB is enough. SLOG is normally just about 5 sec of data.
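
A rough back-of-the-envelope sketch of that sizing rule, for illustration only. It assumes the SLOG only has to hold about 5 seconds of incoming sync writes (the figure given above); `slog_size_mb` is a hypothetical helper, not a ZFS tool, and the 10 GbE rate is just an example number.

```shell
#!/bin/sh
# Hypothetical helper: estimate the SLOG capacity needed, in MB.
# $1 = peak sync write rate in MB/s (assumption: measured or guessed by you)
# $2 = seconds of data the SLOG should hold (assumption: about 5 s)
slog_size_mb() {
    echo $(( $1 * $2 ))
}

# Example: a saturated 10 GbE link delivers roughly 1250 MB/s.
slog_size_mb 1250 5    # prints 6250, i.e. about 6 GB
```

Even with generous headroom, this stays well below 16 GB, which fits the statement that a small device is enough for most workloads.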

CalebSnell · Oct 13, 2023

SlimTom, thank you for your response! Yes, I've previously tried varying block sizes, including 16K. atime has been disabled. I'll try enabling relatime as well and let you know how that goes.

Nuke Bloodaxe, thank you as well. Yes, I've considered adding an enterprise SSD as a SLOG but for now, it's not in the budget. My problem is that I used to be able to do file level backups over SMB without running into this stop/starting and consistent 20/40MB/s write speed issues. Interestingly, I see similar issues on a Dell R710 w/ 8 SAS drives in a RAIDZ2. Write speeds low, read speeds perfect. So I'm not sure that the issue we're seeing is related to the mirror itself, though I'm sure that wasn't your main point, lol.

I'll see if I can test a SLOG with a spare consumer SSD I have now, and if I see improved performance (as in, iostat is actually showing acceptable writes to the drives, i.e. not 20MB/s split between the two striped mirrors) I'll plan to get an enterprise SSD and call the issue fixed.

Dunuin · Oct 13, 2023

CalebSnell said:
atime has been disabled. I'll try enabling relatime as well and let you know how that goes.

You don't need to try this with atime already disabled, unless you really need the atime. Not writing an atime at all causes less overhead than updating the atime once per day (which is what enabled atime + enabled relatime would do).

CalebSnell said:
I'll see if I can test a SLOG with a spare consumer SSD I have now, and if I see improved performance (as in, iostat is actually showing acceptable writes to the drives, i.e. not 20MB/s split between the two striped mirrors) I'll plan to get an enterprise SSD and call the issue fixed.

Doesn't make much sense. A consumer SSD isn't much faster than an HDD when doing sync writes. And sync writes are all a SLOG handles. So a consumer SSD won't be much better than having no SLOG.

CalebSnell · Oct 13, 2023

Dunuin said:
You don't need to try this with atime already disabled, unless you really need the atime. Not writing an atime at all causes less overhead than updating the atime once per day (which is what enabled atime + enabled relatime would do).

Doesn't make much sense. A consumer SSD isn't much faster than an HDD when doing sync writes. And sync writes are all a SLOG handles. So a consumer SSD won't be much better than having no SLOG.

Fair points! And honestly I have a hard time believing it's related to not having a SLOG. Even if I set sync=disabled, the writes slowly filter out of memory at the same speeds, around 20MB/s, same as doing a copy on a guest or running ATTO or FIO. I have 64GB of DDR4 memory on that node as well, and these speeds are seen both if the memory fills up and the transfer pauses, or when the transfer finishes. Also, these drives test, with the same types of data, at 200+MB/s outside of the mirror. I have 4 of them now so I could raidz1 them or something to see if it's an issue with mirrors, but I have the exact same issue on another box w/ 8 10k rpm SAS drives in raidz2, which also test at similar speeds outside of the ZFS pool, 150-200MB/s.

Interestingly, when testing w/ ATTO in Windows, iostat shows that zd0 is writing at 200+MB/s, mostly as we'd expect (it's a striped mirror so it should be higher, but whatever, it's good), but there is no file being written to the drive. Nothing shows up, it just seems like data is being streamed to it? Then of course a CrystalDiskMark or file transfer test shows the low 20MB/s speeds on zd0 and 10ish MB/s on the drives.

I know there are lots of confounding variables here, but the speeds I'm seeing on all the Proxmox boxes w/ a ZFS mirror or raidz2 are so low it just doesn't feel like a performance tuning problem.

Outside of getting an enterprise SSD, is there anything else I can test here? I'll put it on the list but I'm just hoping to be able to try something else in the meantime haha. If I just need to start from scratch and follow a specific setup path, I can do that: I can move the VMs to another node and rebuild the one we're testing on. I just need to know what to change since I've had this issue on several installs.

Appreciate all the help everyone.

CalebSnell · Oct 13, 2023

Migrating a VM to the host w/ a 16k block size and sync=standard instead of always, it seems to level out at around 150MB/s to 250MB/s, about the speed of one drive or a bit more.

Will see if SMB is fixed on the VM when it finishes migrating. So ZFS must be configured fine if migrations are working for the most part? ~~Again, with it being a striped mirror you'd expect higher, but onboard SATA and who knows what else, I'm not going to complain lol.~~ Being dumb again, the sender is only a single-drive mirror so these speeds are exactly what I'd expect. Now to see how the host behaves.

CalebSnell · Oct 14, 2023

Same behavior post migration using 16k recordsize and blocksize

Average 20-30MB/s, some peaks to 50-70MB/s. This is with sync set to always. With sync set to standard, I get great xfer speeds for 25% of the transfer, then it stops, then good again for a bit, then stops, then settles out back to around 70MB/s before it finishes. iodelay and writes continue to occur on the pool well after the transfer stops.

idk, weird stuff.

Dunuin · Oct 14, 2023

CalebSnell said:
Average 20-30MB/s, some peaks to 50-70MB/s. This is with sync set to always. With sync set to standard, I get great xfer speeds for 25% of the transfer, then it stops, then good again for a bit, then stops, then settles out back to around 70MB/s before it finishes. iodelay and writes continue to occur on the pool well after the transfer stops.

idk, weird stuff.

Just sounds like normal caching. "sync=always" is always slow as it won't cache. "sync=standard" will cache async writes so it's an up and down as the cache is filling and flushing.
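
To make the fill-and-flush pattern a bit more concrete, here is a minimal sketch. It reads two OpenZFS module parameters (`zfs_dirty_data_max` and `zfs_txg_timeout`) if the zfs module is loaded, and uses a hypothetical helper, `seconds_until_full`, to estimate how long a fast copy can run before the write buffer is full. The rates are made-up example numbers, and the real write throttle is more gradual than this simple model.

```shell
#!/bin/sh
# Print the OpenZFS write-buffer tunables if the zfs module is loaded.
show_zfs_write_tunables() {
    for p in zfs_dirty_data_max zfs_txg_timeout; do
        f="/sys/module/zfs/parameters/$p"
        if [ -r "$f" ]; then
            echo "$p = $(cat "$f")"
        else
            echo "$p: not available (zfs module not loaded?)"
        fi
    done
}

# Hypothetical helper: seconds until the dirty-data buffer is full.
# $1 = buffer size in MB
# $2 = incoming write rate in MB/s
# $3 = rate at which the pool can flush to disk in MB/s
seconds_until_full() {
    if [ "$2" -le "$3" ]; then
        echo "never"    # the pool keeps up, so the buffer never fills
    else
        echo $(( $1 / ($2 - $3) ))
    fi
}

show_zfs_write_tunables
seconds_until_full 4096 300 30   # example numbers only; prints 15
```

Under these assumptions the copy looks fast for about 15 seconds and then has to slow down to what the disks can actually absorb, which matches the up-and-down behaviour described above.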

SlimTom · Nov 17, 2023

CalebSnell said:
Same behavior post migration using 16k recordsize and blocksize
Average 20-30MB/s, some peaks to 50-70MB/s. This is with sync set to always. With sync set to standard, I get great xfer speeds for 25% of the transfer, then it stops, then good again for a bit, then stops, then settles out back to around 70MB/s before it finishes. iodelay and writes continue to occur on the pool well after the transfer stops.

idk, weird stuff.

Thank you for all your testing. You saved me a lot of time. But I still have the same/similar problems as you. I just bought two Intel Optane 16GB drives (not expensive at all!) and will try to add them as ZIL/SLOG. (But somehow I have a feeling that it won't help.)

CalebSnell · Nov 19, 2023

SlimTom said:
Thank you for all your testing. You saved me a lot of time. But I still have the same/similar problems as you. I just bought two Intel Optane 16GB drives (not expensive at all!) and will try to add them as ZIL/SLOG. (But somehow I have a feeling that it won't help.)

Sorry, still no resolution. I've tested with NVMe drives lately and saw the same exact slow write speeds as with any spinning drive. Peaks were obviously higher, but the drives are capable of sustained writes in excess of 300MB/s, especially since I'm only testing with 10G files. And if I were hitting RAM, one would expect higher writes than 30-70MB/s. RAM usage also doesn't increase on the host or guest...

Anyway, I ended up switching the NVMe drives to LVM and redirected my folders (pics, downloads, etc.) on my Windows guests to the SMB server that is still on ZFS, and I'll just deal with the poor write speeds. The NVMe drives now test at a sustained 300MB/s-3GB/s depending on the workload, as expected. I did have to enable write caching on Windows, but I've confirmed that memory usage on neither the host nor the guest increases when doing writes, and iostat shows the NVMe drives being written to, so something weird is going on with the write caching function in Windows.

I'll see if I can test writes to a ZFS backed drive in a Windows guest with write caching enabled again. I recently switched to 64k blocksize so maybe this will be the confluence of conditions needed to make things work. I'll let you know.

CalebSnell · Nov 19, 2023

SlimTom said:
Thank you for all your testing. You saved me a lot of time. But I still have the same/similar problems as you. I just bought two Intel Optane 16GB drives (not expensive at all!) and will try to add them as ZIL/SLOG. (But somehow I have a feeling that it won't help.)

Quick update. Disabling sync for the pool seems to fix the issue. I tested a 10GB file and it wrote at 300MB/s the whole time, and I can see that the drive is being written to at 160-200MB/s in iostat. Writes in iostat continue for 15s-30s after the write in file explorer stops, so as expected there is a definite risk of data loss here.

And unfortunately sync=standard still leads to sawtooth writes but with the same higher speeds, even though there is plenty of RAM available on the host and barely any is being used on the guest. Writes also continue in iostat for 15s-30s after the write in file explorer stops, so still seems like a risk of losing data.

Let me know how adding a ZIL/SLOG goes, and how it performs with sync=always.

Dunuin · Nov 19, 2023

A SLOG is not meant to boost general write performance. It's there to soften the big performance hit of sync writes. Not a good idea to force all fast async writes to be handled as slow sync writes just so that the SLOG will have to cache something.

SlimTom · Nov 19, 2023

Dunuin said:
Just sounds like normal caching. "sync=always" is always slow as it won't cache. "sync=standard" will cache async writes so it's an up and down as the cache is filling and flushing.

Do you have any suggestions to improve performance on our similar hardware, without the need to purchase SAS enterprise drives? CalebSnell and I have tried various recommendations with limited success. It's hard to believe we're the only ones facing this issue. Are there any monitoring tools to identify the root cause and aid in solving the problem?

LnxBil · Nov 20, 2023

SlimTom said:
Do you have any suggestions to improve performance on our similar hardware, without the need to purchase SAS enterprise drives? CalebSnell and I have tried various recommendations with limited success. It's hard to believe we're the only ones facing this issue. Are there any monitoring tools to identify the root cause and aid in solving the problem?

You already got your answer to your crosspost.

SlimTom · Nov 21, 2023

I was experimenting with different cache options and I got this error: WARN: iothread is only valid with virtio disk or virtio-scsi-single controller

So I've checked the difference between VIRTIO SCSI (which I had) and VIRTIO SCSI single in this post: VIRTIO SCSI vs VIRTIO SCSI single, changed my controller to VIRTIO SCSI single, and my writes to the server are more stable now. But I'm still not completely happy... Will implement Intel Optane as SLOG, and later enterprise grade SSDs, and will report back...

CalebSnell · Nov 23, 2023

SlimTom said:
I was experimenting with different cache options and I got this error: WARN: iothread is only valid with virtio disk or virtio-scsi-single controller

So I've checked the difference between VIRTIO SCSI (which I had) and VIRTIO SCSI single in this post: VIRTIO SCSI vs VIRTIO SCSI single, changed my controller to VIRTIO SCSI single, and my writes to the server are more stable now. But I'm still not completely happy... Will implement Intel Optane as SLOG, and later enterprise grade SSDs, and will report back...

Hey! I just purchased and received a used Intel 400GB S3710 SSD and set that up as a SLOG device. I see consistent 130MB/s write speeds to my ZFS pool w/ the 2 4TB drives in a mirror. I believe the Optane drives are significantly more performant than the Intel SSDs, so definitely let me know what you get with Optane. FWIW, I use the command pveperf [/poolnamehere] to get the "fsync" numbers. After adding a SLOG I get 3k fsync/second, and without it I was getting around 100 fsync/second.

130MB/s is still slower than I would expect, but it's constant and that's with a 128k blocksize so not ideal for an SSD anyway, though I'm unsure if blocksize matters for a SLOG... Anyway, I'll follow up if I find any issues or make any further improvements, but this is definitely better than 30-50MB/s!!

Quick update: sync=standard allows for much higher initial writes, at the expense of a 5s or so period in the middle where it drops to 0MB/s and then goes back to ~130MB/s.

SlimTom · Nov 26, 2023

...would you be so kind as to summarize all the ZFS settings that proved to work or are significant (sync, relatime, atime, cache, sector size, ashift, etc.)?

I didn't have a chance to install the Intel Optane as this is a production machine... and maybe I'll reinstall altogether...

Update: Installed Optane as SLOG, enterprise grade SSDs, L2ARC cache on Optane... As was already said, it didn't help much, as I mainly have async reads/writes... I also tested with an LVM partition and there the speed is as it should be. Only with ZFS I cannot figure out how to avoid transfer blackouts... It is obvious there are some cache flush issues, but... how to tune it...

SlimTom · Dec 13, 2023

Useful info:
ZFS tuning cheat sheet (https://jrs-s.net/2018/08/17/zfs-tuning-cheat-sheet/)

and additional reading:
https://www.reddit.com/r/zfs/comments/xtbadx/what_zfs_ashift_size_do_you_recommend_for_a/

So setting recordsize, ashift (and possibly volblocksize) matters most. Align those values to the VM filesystems / databases to reduce write amplification. A SLOG drive helps ONLY with SYNC writes (databases mainly), otherwise there is no real benefit. L2ARC: having more RAM helps much more than an L2ARC drive. If using HDDs you can set secondarycache=metadata for the L2ARC to speed up searches. A special device helps only if HDDs are used in the pool.
Usually: ashift=12 (4k), recordsize=4k / 16k, volblocksize=16k, atime off, compression LZ4.
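
As an illustration of why these values should line up, here is a small sketch. `ashift_to_bytes` just converts the ashift exponent to a sector size; `write_amplification` is a hypothetical and deliberately simplified estimate that ignores compression, RAIDZ padding and metadata.

```shell
#!/bin/sh
# ashift is the base-2 logarithm of the sector size ZFS uses for a vdev.
ashift_to_bytes() {
    echo $(( 1 << $1 ))
}

# Hypothetical, simplified estimate: a guest write smaller than the
# volblocksize forces ZFS to rewrite the whole block.
# $1 = volblocksize in bytes, $2 = guest I/O size in bytes
write_amplification() {
    if [ "$2" -ge "$1" ]; then
        echo 1
    else
        echo $(( $1 / $2 ))
    fi
}

ashift_to_bytes 12               # prints 4096
write_amplification 16384 4096   # prints 4
write_amplification 16384 65536  # prints 1
```

So with volblocksize=16k, a guest doing 4k random writes would cause roughly four times as much data to be written as it asked for, under these simplifying assumptions.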

One last note: When using NVMe, run nvme id-ctrl /dev/nvmeXnY. Look at the drive's MDTS value. The maximum amount of data it can transfer in a single DMA operation is <minimum-memory-page-size> * 2^(MDTS), where the minimum memory page size of the controller is typically 4 KiB. The result is probably a good candidate for the ZFS recordsize.

Let me summarize the commands that helped me check the settings:

Find physical sector/block size: fdisk -l
Disk cache: hdparm -W /dev/sdX
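
The MDTS arithmetic can be sketched like this. `mdts_bytes` is a hypothetical helper; the unit size of 4096 bytes and the MDTS value of 5 are assumptions chosen for the example, so substitute the values your own drive reports.

```shell
#!/bin/sh
# Hypothetical helper: maximum transfer size in bytes.
# $1 = unit size in bytes (assumption: 4096 in the example below)
# $2 = MDTS value reported by the drive (a power-of-two exponent)
mdts_bytes() {
    echo $(( $1 * (1 << $2) ))
}

# To read the value on a real system (requires nvme-cli and an NVMe drive):
#   nvme id-ctrl /dev/nvme0 | grep -i mdts
mdts_bytes 4096 5    # prints 131072, i.e. 128 KiB
```

With these example numbers the result happens to equal the default ZFS recordsize of 128k.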

zfs get recordsize volumename
zfs get atime volumename
zfs get relatime volumename
zfs get sync volumename
zfs get volblocksize volumename

zfs get dedup volumename
zfs get compression volumename
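
The individual queries above can also be combined, since zfs get accepts a comma-separated list of properties. A minimal sketch follows; `show_zfs_tuning_props` is a hypothetical wrapper name, and the dataset name in the usage comment is only an example.

```shell
#!/bin/sh
# Hypothetical wrapper: show the tuning-related properties in one table.
# $1 = pool, dataset or zvol name
show_zfs_tuning_props() {
    zfs get recordsize,volblocksize,atime,relatime,sync,dedup,compression "$1"
}

# Usage (example name, not taken from this thread):
#   show_zfs_tuning_props rpool/data
```

Properties that do not apply to the given object (for example volblocksize on a filesystem) are shown with a dash.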

Resize local to host ISOs (you can even delete LVM / LVM-thin if you have other disks/volumes to host VMs):
lvremove /dev/pve/data ("data" didn't exist on my system)
lvresize -l +100%FREE /dev/pve/root
resize2fs /dev/mapper/pve-root

SlimTom · Feb 9, 2024

Useful for understanding this topic:
https://youtu.be/5wUK-AxOfsg?si=adq5DYPcKa9iHIFC
The Z File Systems (ZFS)

batot · Aug 5, 2024

CalebSnell, I have this identical problem: 600MB/s read but 30MB/s write on RAIDZ with 4x4TB.
More information here: https://zfsonlinux.topicbox.com/gro...as2008-and-2000-write-degradation-performance

Can you tell me how you resolved the problem?

J0rn · Dec 17, 2024

batot said:
CalebSnell, I have this identical problem: 600MB/s read but 30MB/s write on RAIDZ with 4x4TB.
More information here: https://zfsonlinux.topicbox.com/gro...as2008-and-2000-write-degradation-performance

Can you tell me how you resolved the problem?

As mentioned all over the place here:

zfs set sync=disabled POOLNAME

This is dangerous, but fixed my issue. I disable it for high write periods and then enable it afterwards for safety.
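
One way to make that disable-then-re-enable routine less error-prone is a small wrapper. This is only a sketch under assumptions: `with_sync_disabled` is a hypothetical helper, it restores the previous value as a locally set property (even if it was inherited before), and it does not handle being interrupted with Ctrl-C. Data written while sync is disabled can still be lost on a crash or power failure.

```shell
#!/bin/sh
# Hypothetical helper: run a command with sync=disabled on a dataset,
# then put the previous setting back, even if the command exits with an error.
# $1 = pool or dataset name, remaining arguments = command to run
with_sync_disabled() {
    dataset="$1"; shift
    old=$(zfs get -H -o value sync "$dataset") || return 1
    zfs set sync=disabled "$dataset" || return 1
    "$@"
    rc=$?
    zfs set sync="$old" "$dataset"
    return $rc
}

# Usage (example names, not taken from this thread):
#   with_sync_disabled POOLNAME rsync -a /source/ /POOLNAME/backup/
```

The wrapper returns the exit status of the command it ran, so it can be used inside backup scripts that check for failures.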
