Slow Dual ZFS Mirror Write Performance

Keep in mind that a SLOG device only helps with sync writes. Copying files is normally asynchronous, so a copy job is always lightning fast in the beginning, until the cache on the destination fills and writing to disk starts. Then the throughput drops significantly, often almost to a halt. Monitor with arcstat to understand the write pattern better.

I can recommend using an Intel Optane NVMe for a SLOG device, which is MUCH MUCH faster than SATA/SAS SSDs. The latency is much lower, so the ZFS pool feels much faster on sync writes. For most workloads, 16 or 32 GB is enough; the SLOG normally holds only about 5 seconds of data.
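To actually watch that write pattern while a copy runs, something like the following works; the pool name `tank` is a placeholder:

```shell
# Watch ARC activity once per second (arcstat ships with zfsutils-linux)
arcstat 1

# Per-vdev read/write throughput, refreshed every second, so you can
# see when buffered async writes actually start hitting the disks
zpool iostat -v tank 1
```

Running `zpool iostat` in a second terminal during the copy makes the fill-then-flush pattern described above easy to spot.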
 
SlimTom, thank you for your response! Yes, I've previously tried varying block sizes, including 16K. atime has been disabled. I'll try enabling relatime as well and let you know how that goes.

Nuke Bloodaxe, thank you as well. Yes, I've considered adding an enterprise SSD as a SLOG, but for now it's not in the budget. My problem is that I used to be able to do file-level backups over SMB without running into this stop/start behavior and the consistent 20-40MB/s write speeds. Interestingly, I see similar issues on a Dell R710 w/ 8 SAS drives in a RAIDZ2: write speeds low, read speeds perfect. So I'm not sure the issue we're seeing is related to the mirror itself, though I'm sure that wasn't your main point, lol.

I'll see if I can test a SLOG with a spare consumer SSD I have now, and if I see improved performance (as in, iostat actually showing acceptable writes to the drives, i.e. not 20MB/s split across the striped mirrors) I'll plan to get an enterprise SSD and call the issue fixed.
 
atime has been disabled. I'll try enabling relatime as well and let you know how that goes.
You don't need to try this with atime already disabled, unless you really need atime. Not writing atime at all has less overhead than updating it once per day (which is what enabled atime + enabled relatime would do).
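For reference, checking and setting this looks like the following; `tank/data` is a placeholder dataset name:

```shell
# Check the current access-time settings for the dataset
zfs get atime,relatime tank/data

# Disable atime entirely -- cheaper than atime=on + relatime=on,
# which still rewrites the timestamp once per day
zfs set atime=off tank/data
```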

I'll see if I can test a SLOG with a spare consumer SSD I have now, and if I see improved performance (as in, iostat actually showing acceptable writes to the drives, i.e. not 20MB/s split across the striped mirrors) I'll plan to get an enterprise SSD and call the issue fixed.
That doesn't make much sense. A consumer SSD isn't much faster than a HDD for sync writes, and sync writes are all a SLOG handles. So a consumer SSD won't be much better than no SLOG at all.
 
Last edited:
You don't need to try this with atime already disabled, unless you really need atime. Not writing atime at all has less overhead than updating it once per day (which is what enabled atime + enabled relatime would do).


That doesn't make much sense. A consumer SSD isn't much faster than a HDD for sync writes, and sync writes are all a SLOG handles. So a consumer SSD won't be much better than no SLOG at all.
Fair points! And honestly I have a hard time believing it's related to not having a SLOG. Even if I set sync=disabled, the writes slowly filter out of memory at the same speeds, around 20MB/s, same as doing a copy on a guest or running ATTO or FIO. I have 64GB of DDR4 memory on that node as well, and these speeds appear both when the memory fills up and the transfer pauses, and when the transfer finishes. Also, these drives test, with the same types of data, at 200+MB/s outside of the mirror. I have 4 of them now, so I could raidz1 them or something to see if it's an issue with mirrors, but I have the exact same issue on another box w/ 8 10k rpm SAS drives in raidz2, which also test at similar speeds outside of the ZFS pool, 150-200MB/s.

Interesting: when testing w/ ATTO in Windows, iostat shows that zd0 is writing at 200+MB/s, mostly as we'd expect (it's a striped mirror so it should be higher, but whatever, it's good), but there is no file being written to the drive. Nothing shows up; it just seems like data is being streamed to it? Then of course CrystalDiskMark or a file transfer test shows the low 20MB/s speeds on zd0 and ~10MB/s on the drives.

I know there are lots of confounding variables here, but the speeds I'm seeing on all the Proxmox boxes w/ ZFS mirrors and raidz2 are so low it just doesn't feel like a performance-tuning problem.

Outside of getting an enterprise SSD, is there anything else I can test here? I'll put it on the list, but I'm just hoping to try something else in the meantime, haha. If I just need to start from scratch and follow a specific setup path, I can do that; I can move the VMs to another node and rebuild the one we're testing on. I just need to know what to change, since I've had this issue on several installs.

Appreciate all the help everyone.
 
Migrating a VM to the host w/ 16k blocksize and w/ sync=standard instead of always, it seems to level out at around 150-250MB/s, about the speed of one drive, or a bit more.
(screenshot attached)

Will see if SMB is fixed on the VM when it finishes migrating. So ZFS must be configured fine if migrations are mostly working? Again, with it being a striped mirror you'd expect higher, but with onboard SATA and who knows what else, I'm not going to complain, lol. Being dumb again: the sender is only a single-drive mirror, so these speeds are exactly what I'd expect. Now to see how the host behaves.
 
Last edited:
Same behavior post-migration using 16k recordsize and blocksize. (screenshot attached)
Average 20-30MB/s, with some peaks to 50-70MB/s. This is with sync set to always. With sync set to standard, I get great transfer speeds for 25% of the transfer, then it stops, then it's good again for a bit, then it stops, then it settles back to around 70MB/s before it finishes. iodelay and writes continue to occur on the pool well after the transfer stops.

idk, weird stuff.
 
Average 20-30MB/s, with some peaks to 50-70MB/s. This is with sync set to always. With sync set to standard, I get great transfer speeds for 25% of the transfer, then it stops, then it's good again for a bit, then it stops, then it settles back to around 70MB/s before it finishes. iodelay and writes continue to occur on the pool well after the transfer stops.

idk, weird stuff.
That just sounds like normal caching. sync=always is always slow, as it won't cache. sync=standard will cache async writes, so it's up and down as the cache fills and flushes.
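For anyone following along, the three sync modes being compared in this thread are set per dataset/zvol; `tank/vmdata` is a placeholder name:

```shell
# sync=standard: only writes the application requests as synchronous
# go through the ZIL; everything else is cached and flushed in batches
zfs set sync=standard tank/vmdata

# sync=always: every write is treated as sync -- slow without a fast SLOG
zfs set sync=always tank/vmdata

# sync=disabled: all writes treated as async -- fast, but up to a few
# seconds of acknowledged data can be lost on power failure
zfs set sync=disabled tank/vmdata
```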
 
Same behavior post-migration using 16k recordsize and blocksize.
Average 20-30MB/s, with some peaks to 50-70MB/s. This is with sync set to always. With sync set to standard, I get great transfer speeds for 25% of the transfer, then it stops, then it's good again for a bit, then it stops, then it settles back to around 70MB/s before it finishes. iodelay and writes continue to occur on the pool well after the transfer stops.

idk, weird stuff.

Thank you for all your testing; you saved me a lot of time. But I still have the same/similar problems as yours. I just bought two Intel Optane 16GB drives (not expensive at all!) and will try to add them as ZIL/SLOG. (But somehow I have a feeling it won't help :) )
 
Thank you for all your testing; you saved me a lot of time. But I still have the same/similar problems as yours. I just bought two Intel Optane 16GB drives (not expensive at all!) and will try to add them as ZIL/SLOG. (But somehow I have a feeling it won't help :) )
Sorry, still no resolution. I've tested with NVMe drives lately and saw the same exact slow write speeds as any other spinning drive. Peaks were obviously higher, but the drives are capable of sustained writes in excess of 300MB/s, especially since I'm only testing with 10G files. And if I were hitting RAM, one would expect higher writes than 30-70MB/s. RAM also doesn't increase on the host or guest...

Anyway, I ended up switching the NVMe drives to LVM and redirected my folders (pics, downloads, etc.) on my Windows guests to the SMB server that is still on ZFS, and I'll just deal with the poor write speeds. The NVMe drives now test at sustained 300MB/s-3GB/s depending on the workload, as expected. I did have to enable write caching in Windows, but I've confirmed that neither host nor guest memory increases when doing writes, and iostat shows the NVMe drives being written to, so something weird is going on with the write-caching function in Windows.

I'll see if I can test writes to a ZFS-backed drive in a Windows guest with write caching enabled again. I recently switched to 64k blocksize, so maybe this will be the confluence of conditions needed to make things work. I'll let you know.
 
Thank you for all your testing. You saved me a lot of time. But I still have same/similar problems as yours. I just bough two Intel Optane 16GB drives (not expensive at all!) and will try to add them as ZIL/SLOG. ( But somehow I have a feelling that it wont help :) )
Quick update: disabling sync for the pool seems to fix the issue. I tested a 10GB file and it wrote at 300MB/s the whole time, and I can see that the drive is being written to at 160-200MB/s in iostat. Writes in iostat continue for 15-30s after the write in File Explorer stops, so as expected there is a definite risk of data loss here.

And unfortunately sync=standard still leads to sawtooth writes, but with the same higher speeds, even though there is plenty of RAM available on the host and barely any is being used on the guest. Writes also continue in iostat for 15-30s after the write in File Explorer stops, so it still seems like a risk of losing data.
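One way to take SMB and the Windows guest out of the picture is a direct fio run on the host, comparing async against forced-sync throughput. A sketch, assuming fio is installed and the pool is mounted at /tank (placeholder path):

```shell
# Async sequential write: 4G test file, 1M blocks
fio --name=async-seq --rw=write --bs=1M --size=4G \
    --ioengine=libaio --iodepth=8 --filename=/tank/fio-test

# Same workload opened with O_SYNC -- this is the path a SLOG accelerates
fio --name=sync-seq --rw=write --bs=1M --size=4G \
    --ioengine=libaio --iodepth=8 --sync=1 --filename=/tank/fio-test
```

If the async run also shows the sawtooth pattern, the problem is below the guest/SMB layer.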

Let me know how adding a ZIL/SLOG goes, and how it performs with sync=always.
 
A SLOG is not meant to boost general write performance; it's there to soften the big performance hit of sync writes. It's not a good idea to force all fast async writes to be handled as slow sync writes just so the SLOG has something to cache.
 
That just sounds like normal caching. sync=always is always slow, as it won't cache. sync=standard will cache async writes, so it's up and down as the cache fills and flushes.

Do you have any suggestions to improve performance on our similar hardware, without the need to purchase SAS enterprise drives? CalebSnell and I have tried various recommendations with limited success. It's hard to believe we're the only ones facing this issue. Are there any monitoring tools to identify the root cause and aid in solving the problem?
 
Do you have any suggestions to improve performance on our similar hardware, without the need to purchase SAS enterprise drives? CalebSnell and I have tried various recommendations with limited success. It's hard to believe we're the only ones facing this issue. Are there any monitoring tools to identify the root cause and aid in solving the problem?
You already got an answer in your crosspost.
 
I was experimenting with different cache options and I got this error: WARN: iothread is only valid with virtio disk or virtio-scsi-single controller

So I checked the difference between VIRTIO SCSI (which I had) and VIRTIO SCSI single in this post: VIRTIO SCSI vs VIRTIO SCSI single, changed my controller to VIRTIO SCSI single, and my writes to the server are more stable now. But I'm still not completely happy... Will implement Intel Optane as SLOG, and later enterprise-grade SSDs, and will report back...



 
Last edited:
I was experimenting with different cache options and I got this error: WARN: iothread is only valid with virtio disk or virtio-scsi-single controller

So I checked the difference between VIRTIO SCSI (which I had) and VIRTIO SCSI single in this post: VIRTIO SCSI vs VIRTIO SCSI single, changed my controller to VIRTIO SCSI single, and my writes to the server are more stable now. But I'm still not completely happy... Will implement Intel Optane as SLOG, and later enterprise-grade SSDs, and will report back...



Hey! I just purchased and received a used Intel 400GB S3710 SSD and set it up as a SLOG device. I see consistent 130MB/s write speeds to my ZFS pool w/ the 2 4TB drives in a mirror. I believe the Optane drives are significantly more performant than the Intel SSDs, so definitely let me know what you get with Optane. FWIW, I use the command pveperf [/poolnamehere] to get the fsync numbers. After adding a SLOG I get 3k fsyncs/second; without it I was getting around 100 fsyncs/second.

130MB/s is still slower than I would expect, but it's consistent, and that's with a 128k blocksize, so not ideal for an SSD anyway, though I'm unsure if blocksize matters for a SLOG... Anyway, I'll follow up if I find any issues or make any further improvements, but this is definitely better than 30-50MB/s!!
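For anyone wanting to reproduce this, adding the SLOG and measuring fsync rate looks roughly like this; the pool name and device path are placeholders, and a mirrored log pair is safer for a pool you care about:

```shell
# Attach a single SSD as the pool's SLOG (use the stable by-id path)
zpool add tank log /dev/disk/by-id/ata-INTEL_SSDSC2BA400G4_EXAMPLE

# Measure fsyncs/second before and after adding the log device
pveperf /tank
```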

Quick update: sync=standard allows much higher initial writes, at the expense of a ~5s period in the middle where it drops to 0MB/s before going back to ~130MB/s.

(screenshot attached)
 
Last edited:
...would you be so kind as to summarize all the ZFS settings that proved to work / are significant? (sync, relatime, atime, cache, sector size, ashift, etc...)

I didn't have a chance to install the Intel Optane, as this is a production machine... and maybe I'll reinstall altogether...

Update: installed Optane as SLOG, enterprise-grade SSDs, L2ARC cache on Optane... As was already said, it didn't help much, as I mainly have async reads/writes... I also tested with an LVM partition, and there the speed is as it should be. Only with ZFS can I not figure out how to avoid transfer blackouts... It's obvious there are some cache-flush issues, but... how to tune them...
 
Last edited:
Useful info:
ZFS tuning cheat sheet (https://jrs-s.net/2018/08/17/zfs-tuning-cheat-sheet/)

and additional reading:
https://www.reddit.com/r/zfs/comments/xtbadx/what_zfs_ashift_size_do_you_recommend_for_a/



So setting recordsize, ashift (and eventually volblocksize) matters most. Align those values to the VM filesystems / databases to reduce write amplification. A SLOG drive helps ONLY with sync writes (mainly databases); otherwise there is no real benefit. L2ARC: having more RAM helps much more than an L2ARC drive. If using HDDs, you can set the L2ARC to secondarycache=metadata to speed up searches. A special device helps only when HDDs are used in the pool.
Usually: ashift=12 (4k), recordsize=4k / 16k, volblocksize=16k, atime off, compression lz4.
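As a rough sketch, those recommendations translate to commands like the following; all pool/dataset names and device paths are placeholders:

```shell
# ashift is fixed at pool creation time and cannot be changed later
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

# Dataset-level settings can be changed any time
zfs set atime=off tank
zfs set compression=lz4 tank
zfs set recordsize=16k tank/vmdata   # match guest FS / DB page size

# volblocksize is fixed when the zvol is created
zfs create -V 100G -o volblocksize=16k tank/vm-100-disk-0
```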


One last note: when using NVMe, run nvme id-ctrl /dev/nvmeXnY and look at the drive's MDTS field. The maximum amount of data it can transfer in a single DMA operation is 2^MDTS transfer units (the NVMe spec defines the unit as the controller's minimum memory page size, typically 4 KiB), which is probably a good candidate for the ZFS recordsize.
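A worked example of that calculation with hypothetical numbers (an MDTS of 5 and a 4 KiB minimum page size; your drive will report its own values):

```shell
# Suppose `nvme id-ctrl` reports mdts=5 and the controller's minimum
# memory page size (CAP.MPSMIN) is 4 KiB
mdts=5
page=4096
max_transfer=$(( page * (1 << mdts) ))
echo "max single transfer: ${max_transfer} bytes"   # 131072 = 128 KiB
```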

Let me summarize the commands that helped me inspect the settings:

Find physical sector/block size: fdisk -l
Check disk cache: hdparm -W /dev/sdX

zfs get recordsize volumename
zfs get atime volumename
zfs get relatime volumename
zfs get sync volumename
zfs get volblocksize volumename

zfs get dedup volumename
zfs get compression volumename
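For reference, those per-property queries can be collapsed into a single call (dataset name is a placeholder; properties that don't apply, like volblocksize on a filesystem, just show "-"):

```shell
zfs get recordsize,volblocksize,atime,relatime,sync,dedup,compression tank/data
```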

Resize local storage to host ISOs (you can even delete the LVM-thin volume if you have other disks/volumes to host VMs):

lvremove /dev/pve/data ("data" didn't exist on my system)
lvresize -l +100%FREE /dev/pve/root
resize2fs /dev/mapper/pve-root
 
Last edited: