Slow Dual ZFS Mirror Write Performance

CalebSnell

Hello everyone,

I've been working on an issue for a little while now: I'm not seeing the expected write performance from a ZFS array. First, I'd like to go over the system setup and specs:

  • I have two servers in a cluster, both with a ZFS mirror named local-vm-zfs.
  • The node in question has two mirrors, local-vm-zfs and local-vm-zfs2. local-vm-zfs has two Toshiba 4TB N300 CMR drives and local-vm-zfs2 has two Seagate IronWolf Pro drives.
  • This node has an AMD Ryzen 5500 and 64GB of DDR4 2400MHz ECC RAM
  • All nodes and the sending PC are connected to a TP-Link 10G managed switch
  • SATA cables are all new

This is a temporary set-up. I have a Windows Veeam VM (data drive on local-vm-zfs) that takes hourly back-ups of an Ubuntu SMB VM (data drive on local-vm-zfs2). In the permanent setup they live on separate nodes; since they're on a single node for now, I put their storage drives on two different ZFS mirrors.

Is this set-up the problem in and of itself? My read and write speeds do seem to be about half of what I'd expect for sequential writes:

[screenshot: file-transfer speed graph]

It ranges from 30-110MB/s. iotop on the host shows it ranging from 80-180MB/s, sometimes lower; overall it's very sporadic. I do not believe the drives are the issue, as I have tried offlining each of them in turn to see if one had failed. They were purchased a year apart from each other, so a batch problem is unlikely.

dd is somehow even slower, and the same goes for a CrystalDiskMark benchmark in the Windows Veeam VM: around 10-20MB/s.

[screenshots: dd output and CrystalDiskMark results]

I have tested this after rebooting and with every VM other than the SMB VM turned off. The fastest I've seen so far is a consistent 70MB/s; otherwise it spikes between 30-100MB/s.

If I set sync=disabled, I can verify that the issue is not the network, as speeds spike to 700MB/s and I see the RAM on the node filling up and then slowly draining to the drive. If the ZFS cache fills before the write is done, the upload speed drops (sometimes to 0) and eventually goes back to spiking between 30 and 100MB/s.
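(For reference, here is how the sync property can be checked and toggled on a pool; the pool name below matches this setup, and sync=disabled is only safe for testing:)

zfs get sync local-vm-zfs                # show the current value
zfs set sync=disabled local-vm-zfs       # fully async - testing only
zfs set sync=standard local-vm-zfs       # back to the default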

I believe this started after I created the second ZFS mirror and migrated the VMs over. Also, the drives are all plugged directly into the SATA ports on the motherboard. ZFS recordsize for both pools is 128k; block size for the drives:

[screenshots: drive logical/physical block sizes]
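(A note on checking these: recordsize only applies to datasets, while Proxmox stores VM disks as zvols, which use volblocksize instead. The zvol name below is just an example:)

zfs get recordsize local-vm-zfs
zfs get volblocksize local-vm-zfs/vm-100-disk-0    # example zvol name
zpool get ashift local-vm-zfs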

Let me know if there is any more information I can provide.

Thanks!
Caleb
 
Also, I've tested creating a new container w/ an Ubuntu install and Samba, and I see somewhat better results. Not faster but significantly more consistent:
[screenshot: container transfer speed graph]
Prior to this issue, I was seeing speeds of 120-140MB/s. So not a ton faster, and a lot less than the rated 220MB/s seq write, but perfect world and all that.

*Edit: resilvers happen at 160MB/s, so that would seem to be the upper limit of what I could expect for write speeds.

Thanks!
 
Some more info:

Pool is only 30% full
[screenshot: pool usage]

ZFS pool status:
[screenshot: zpool status output]

Attached is a .txt file (zfsgetall.txt) with the output of zfs get all.

Thank you
 

Another quick follow-up. Interestingly, when I moved the SMB VM's disk from local-vm-zfs2 to local-vm-zfs, that move happened at the expected speed, ~160MB/s. Since I'm seeing the slow write speeds at the VM level on both local-vm-zfs2 and local-vm-zfs, but not at the ZFS level when transferring the disks, the issue seems to be somewhere between the guest OS and the subvol that holds the guest OS disk.

Are there parameters that need to be set at OS install time to make sure the guest disk is set up with a 4k block size? Here is what fdisk says in the SMB guest:
[screenshot: fdisk output in the SMB guest]

And in the Veeam Windows guest:
[screenshot: fdisk output in the Veeam guest]

Could this be a write amplification issue? How would it be remedied when spinning up new VMs? Is there a way to fix it without creating a new VM? Either way works; I just want to make sure I don't keep doing this to myself if this is the issue.

Thanks!
 
Bump. Let me know if there are any commands, tests, or anything I can do to help troubleshoot this. Thanks everyone.
 
Some more testing:

I created a new Ubuntu Server VM, installed with LVM and everything else default. I looked for block size options but didn't see any. dd reports the same speeds, around 14-70MB/s, depending on whether I read from /dev/urandom or /dev/zero. I got 70MB/s when I copied the test file to a new location using dd.
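(Worth noting: /dev/urandom itself often caps dd well below disk speed, so a fairer sketch is to stage the test data first and force a flush at the end; paths here are placeholders:)

dd if=/dev/urandom of=/tmp/testfile bs=1M count=4096                          # stage 4GB of data
dd if=/tmp/testfile of=/mnt/pool/testfile bs=1M conv=fsync status=progress    # timed write, flushed at the end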

I also have another Proxmox node where I thought this issue wasn't happening, but I hadn't confirmed it. I used a Windows 11 VM to test: copying a file to the SMB server got the same speeds, around 60MB/s, and copying a file from the C: drive (backed by an NVMe drive) to the D: drive (running off a ZFS mirror) also got around 60MB/s. Here are the disk settings:
[screenshot: VM disk settings]
The C: drive settings are the same, except discard and SSD emulation are turned on.

And finally, I moved the SMB server disk to local-vm-zfs and deleted local-vm-zfs2, rebooted the node, and I still see the same low speeds. iodelay at this time is around 60%.

I ran updates on both nodes so they're fully up to date now. Anyone have any ideas?
 
I guess you are seeing write amplification issues, plus the fact that ultimately you end up with random IO, which an HDD is not particularly good at.
A single vdev in the pool means everything goes onto it, including ZFS's own logging...
Does the achieved speed give you trouble?
Or are you just concerned about the numbers?
 
Thanks for the response! If it were a consistent 70MB/s I'd have less of an issue (I can accept a 50% performance loss given the info you provided), but it's how it fluctuates during sequential writes that is causing problems.

What actually prompted me to look into this was daily backups from my PC to the SMB server taking a lot longer than expected. It fluctuates from 14MB/s to 70MB/s when looking at the network graphs on the sender side, and varies further if I look in iostat: anywhere from nothing to 140MB/s, like it's writing in bursts.

Also, this issue started recently; I used to get pretty consistent 120-140MB/s speeds. The biggest change was creating the second ZFS pool, but I've undone that. I also created a new VM to rule out some issue from having migrated the VMs around before.

Thanks again!!
 
Some more information. I ran iostat and arcstat while copying a large file. You can see that transfers stall on the sender side, but on the pool side it's writing pretty consistently at 100-160MB/s:

[screenshots: iostat and arcstat output]
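(For anyone following along, this is roughly the monitoring setup; pool name as in this thread:)

zpool iostat -v local-vm-zfs 2    # per-vdev throughput every 2 seconds
arcstat 2                         # ARC size and hit rate alongside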

One thing that jumps out to me is that the avail column only shows 1.1G and it went as low as 76MB after I took the screenshot. There's definitely enough memory on the system. Could I have a setting misconfigured?
[screenshot: host memory usage]

*Edit: I forgot that I posted the fdisk output of the VMs earlier, and it shows 512-byte sectors, so write amplification definitely seems the culprit... Is there some way to make sure that whatever drive I create for a VM presents a 4k sector size to the guest? That's the only thing I can think of as the root issue... Thank you!!

*Edit again: it looks like 512-byte sector sizes in VMs are expected; 4k isn't really something that is done, at least as of the last couple of years. So I guess it's not for sure a write amp issue. And like I said, it started recently, so I don't know.
 
*Edit: I forgot that I posted the fdisk output of the VMs earlier, and it shows 512-byte sectors, so write amplification definitely seems the culprit... Is there some way to make sure that whatever drive I create for a VM presents a 4k sector size to the guest? That's the only thing I can think of as the root issue... Thank you!!
See here:
https://forum.proxmox.com/threads/how-to-do-4kn-virtio-disks.96809/post-419427

But I wasn't seeing a noticeable performance or write amplification difference in benchmarks when setting the sector size of a virtio SCSI disk from 512B/512B to 512B/4K.
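For reference, that thread's approach boils down to overriding the emulated disk's sector size via QEMU device properties, roughly along these lines (the VM ID and device name here are examples; check your own config):

qm set 100 --args '-set device.scsi0.logical_block_size=4096 -set device.scsi0.physical_block_size=4096'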
 
It fluctuates from 14MB/s to 70MB/s
This kind of saw-tooth graph typically indicates some caching taking place. Once the cache is saturated, performance drops; once the cache is emptied or reaches its low watermark, it starts buffering again. I have seen this literally a dozen times. Caching can happen at the filesystem level, the host level, or even the guest level.
Also this issue started recently, I used to get pretty consistent 120-140MBs speeds. The biggest change is how I created the second zfs pool but I've undone that. I also created a new vm to rule out some issue with having migrated the VMs around before.
It is just a feeling, but I don't think this is related to your second pool. It is more likely related to your primary pool filling up. Disks also get slower once the inner areas of the platter are accessed, and that can be quite a lot: I have measured up to a 30% decrease in performance on a particular disk, so this is nothing "minor". Of course it all depends on the disk, its size, RPM, whether it is short-stroked, etc.

that transfers stall on the sender side but on the pool side its writing pretty consistently at 100-160MB/s
To me this would indicate that the pool is busy and can't accept more IO. So if the sender stops or slows down, there is a throttle in between causing it. Maybe one more point for my theory about the file system cache that wants to be emptied.
How does your copy chain work? From Windows (the client being backed up) to what running on the PVE? Is it a Linux guest?

One thing that jumps out to me is that the avail column only shows 1.1G and it went as low as 76MB after I took the screenshot. There's definitely enough memory on the system. Could I have a setting misconfigured?
Again this would back my theory for file system caching.
On my end I have limited the Linux memory usage (guest, backup server on PVE) so it starts purging data down to disk earlier. Might be worth a try:
https://www.baeldung.com/linux/file-system-caching
/e: https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
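The gist of the second link, for reference (illustrative values; the usual defaults are 10/20):

# /etc/sysctl.d/99-dirty.conf
vm.dirty_background_ratio = 5    # start background writeback at 5% of RAM
vm.dirty_ratio = 10              # block writers once dirty pages hit 10% of RAM

Apply with sysctl --system.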
 
Thanks again apoc, appreciate the insight.

So as far as caching is concerned: for testing, I have sync set to always, so the speeds we're seeing and testing are directly against the drive. I recreated the second pool since it's unlikely to be the issue, and it had sync set to standard. Uploads cruise along at 700MB/s until the ZFS cache fills up and then it fluctuates between 70-100MB/s and nothing until it's done. This new pool is also only 30% full, but that's still pretty far into the platter, so I agree there.

For my copy chain, I've tested a few different things. The first is the backup: a Windows 11 client using SMB to copy files to the SMB server, which is an Ubuntu 18.04.6 server. I also tested it with an Ubuntu 22.04 server and a Windows 11 Veeam server. The second is a copy from the C: drive to the D: drive on a Windows 11 desktop I have virtualized on another node; C: is an SSD-backed virtual disk and D: is the ZFS-backed virtual disk.

Honestly, it seems more like the cache is taking a long time to empty, so even if it starts flushing early, it's still going to fill up and show the same low, fluctuating write speeds. When I check arcstat I can see that I have 10G available, and it slowly fills up while iostat shows that the disk is being written to at expected speeds, but the actual transfer is much slower. This makes me think it's something to do with how data is being written to the drive by the cache or ZFS, but block-level transfers (like moving a disk from one pool to another) happen at 140+MB/s, so clearly ZFS can write to the drives fine, right?

I'll do some more testing with those links you sent me and see if I can strike some balance, wouldn't be surprised if I changed something and just forgot about it. Will let y'all know.

Thanks!!
 
Okay, so after looking into this some more, I don't think the cache referenced in those links is related. I have sync=always, so writes should flush immediately to disk, right? You're definitely right that the pool isn't accepting more IO, though. As I mentioned above, I tested copying on Linux and Windows, both from a remote client and from the VM itself. When I tested a copy from the VM to itself, I sourced the file from another drive that is SSD-backed. And the pools are around 30% full.

Let me know if I can clear anything up, or provide any more info.
 
have sync set to always, so the speeds we're seeing and testing are directly against the drive.
On all layers?
There are settings in the guest as well as on the host and also on your client. Just to make sure...
Also, some people suggest turning off NCQ on the disks (on the host) when using ZFS, as well as disabling queueing. All of that typically makes things slower but more predictable.
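(If you want to try that, the queue depth is exposed per disk in sysfs; sda is a placeholder, and the setting does not survive a reboot:)

echo 1 > /sys/block/sda/device/queue_depth    # depth 1 effectively disables NCQ
cat /sys/block/sda/device/queue_depth         # verify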
This makes me think it's something to do with how data is being written to the drive by the cache or ZFS,
I'm not a ZFS master, but from my understanding I'd expect "production" IO to be handled somewhat differently by the cache compared to ZFS internals (e.g. send/receive, etc.). In every pool there is logging happening, which leads to write amplification. That's the reason the concept of SLOG devices exists.
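(For completeness, a SLOG is just a log vdev added to the pool, typically a power-loss-protected SSD partition; the device path is a placeholder:)

zpool add local-vm-zfs log /dev/disk/by-id/<ssd-partition>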
Uploads cruise along at 700MB/s until the ZFS cache fills
Yeah, that's memory speed. No spinning disk will achieve that.
cache fills up and then it fluctuates between 70-100MB/s and nothing until it's done.
That's the kind of saw-tooth pattern that comes from a full cache...
A part is destaged, frees up memory, then it fills up again, and so on...
 
Very new to Proxmox, but a longtime ZFS user here.

1.) If you've got ZFS on the host and you're trying to run ZFS in a guest, that will produce extremely poor disk performance.

2.) Use -o ashift=12 when creating pools. ashift is the sector size as a power of two: the old default of ashift=9 means 2^9 = 512-byte sectors, while ashift=12 means 2^12 = 4096-byte sectors. This is the way for all modern drives, even those that advertise 512-byte access. I've never seen any bigger number used; intuition says larger blocks are better, but this is ZFS, its own sovereign kingdom.
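Something like this at pool creation (pool name and device paths are examples; ashift cannot be changed on an existing vdev):

zpool create -o ashift=12 tank mirror /dev/disk/by-id/<disk1> /dev/disk/by-id/<disk2>
zpool get ashift tank    # verify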

3.) What's in /etc/modprobe.d/zfs.conf? Here is what's in mine; this sets the ARC size in memory. Make it large, it will auto-surrender RAM to the OS if that becomes necessary.

options zfs zfs_arc_max=34359738368
options zfs zfs_prefetch_disable=0
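(The same limit can be applied at runtime without a reboot; the value is in bytes. On Proxmox, a change to the modprobe file may also need an update-initramfs -u to stick across boots:)

echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max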

4.) You didn't post a zpool status - are these spindles being used with an SSD for cache? Here's mine for comparison:

  pool: frantic

        NAME                                          STATE     READ WRITE CKSUM
        frantic                                       ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-WDC_WD10JFCX-68N6GN0_WD-WXE1A256KF6U  ONLINE       0     0     0
            ata-WDC_WD10JFCX-68N6GN0_WD-WXB1A899SZCU  ONLINE       0     0     0
        cache
          nvme0n1p2                                   ONLINE       0     0     0

5.) For details on what is happening, Netdata can see inside ZFS - try it once and you'll never go without it.

https://github.com/netdata

6.) Netdata will whine endlessly unless you do something like this in /etc/sysctl.conf:

net.core.netdev_budget=5500
net.core.netdev_budget_usecs=55000
net.core.rmem_default=134217728
net.core.rmem_max=134217728
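Reload without a reboot:

sysctl -p /etc/sysctl.conf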

7.) Become a ZFS ninja by reading this, even though it's eleven years old. There have been changes in booting and encryption, but the tuning advice is still golden.

https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux/
 
Hi CGC,

This is great info - thank you! Just to answer your direct questions first:

1). ZFS is only configured on the hosts. The guests use whatever the guest OS suggests as recommended defaults, usually LVM for Ubuntu.
2). I do use ashift=12. I've also tried ashift=9 for kicks but got the same performance - no surprise there.
3). I did not have this file. I created it and added your suggested values. I have 64GB of RAM on this node, so I'll start w/ 32GB.
4). [screenshot: zpool status]
I do not use an SSD cache, as the main purpose of this setup is to act as a simple back-up solution. I also do tons of large backups (several hundred gigabytes if I decide to change the file structure, for example), so I don't want to toast an SSD just trying to tune this in, lol.

But yeah, I don't need crazy speeds, just something in the ballpark of what one would expect from a spinning drive. For this reason I have sync set to always; however, I have tested with sync=standard and I get great speeds (700MB/s) until the ZFS cache fills and the speeds drop back to whatever write bottleneck is happening between the guest and the spinning drive. For reference, I see speeds of around 30-70MB/s for the file transfer, and about twice that for the pool when I check iotop. I used to get double that in the guest, and double that again in iotop.

Again, not expecting miracles, but back-ups definitely take significantly longer now because they're averaging 30-70MB/s instead of 100MB/s or more, for large files.

5/6/7). This is very cool, I'll do some research on this and try to get some good info. I'll also review the document you linked. Thank you!

I'll test the changes I made to the max arc size and change sync back to standard just to see how the normal behavior plays out now that I've made that change.

Thanks again!
Caleb
 
OK, tested the suggested changes. Left side is with sync=always, right is sync=standard.
[screenshot: transfer graphs, sync=always (left) vs sync=standard (right)]
Speeds peak a little higher, but there are more frequent hard stops where no data is transmitted, and they're still below what I used to see for write speeds on the same drives. Interestingly, making changes to ZFS parameters during this time takes a while to finish, usually until the transfer is done. I'm guessing that is expected behavior, as it doesn't want to make a change while a write is happening, but I wanted to note it.

Again, I'll review those docs you posted and see if I can find anything else. Thanks!
 
OK, did not realize the goal here was a large scale backup.

My desktop boots from a 240GB SSD, but it's a Seagate Nytro, good for one full drive write per day over its five-year span. I wish I'd got a 480GB a couple of years ago, because that model is no longer on the market. The 480s are a symmetric 535MB/sec; the 240s are write-limited to about 300MB/sec. There are 12 cores and 128GB of RAM, and I give ZFS 32GB to start. My /home is a pair of WD Red 1TB 2.5" NAS drives. I have a single IronWolf 4TB for stuff I can replace, albeit at a crawl via cable modem, if I lose the disk.

I have an NVMe card with a 256GB SSD on it that I've decided to sacrifice - so it's swap and a pair of 64GB cache partitions. I did that config four years ago and it just keeps grinding along. So much for my data center drive fetish, I guess.


I recently started playing with Proxmox. I got out my spare workstation and found a 240GB Nytro that's only good for 0.7 DWPD. Proxmox seemed really nice, so after a week I got a cheap 10TB refurb HGST drive. The 10TB is all ZFS and I gave it 64GB of cache.

That being said, the experience you have with the ARC filling and then speed dropping is both intuitively normal and what I experience. The 4TB in my system sometimes gets copies of 50-100GB virtual machine backups. The L2ARC on SSD does NOT NOT NOT behave like memory and I've never got my head around it. I guess it's read => RAM at full throttle, then the SSD writes are limited by the drain rate of data going to the spindles. Since I'm reading from a pair of not-very-speedy 2.5" drives, then writing to a 5400RPM NAS device ... dunno.

ZFS's never-make-a-mistake file systems come at a cost. Are you in a position to try this with a single drive instead of a mirror? Are you in a position to use a single smaller spindle, large enough to hold one of your backups, as a staging area ahead of the mirror? If a cache fails you take a performance hit, but not a data-loss hit. That's why I keep waiting for that NVMe storage to die but am not worried about it.

And if you are doing this for backup/restore ... have you explored how snapshots work? If you're rolling backwards and forwards and copying data both ways, that's painful compared to the less-than-a-couple-of-seconds a ZFS snapshot takes, with equally zippy performance when you want to roll back.
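(The whole snapshot workflow is a couple of one-liners; the dataset name is an example:)

zfs snapshot local-vm-zfs/vm-data@before-restructure    # near-instant
zfs list -t snapshot                                    # list what exists
zfs rollback local-vm-zfs/vm-data@before-restructure    # equally fast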

ARC is flexible but if you have things that aren't, there's no point to employing more for that pool than is required to buffer the data in flight.

This is just a funny feeling, not specific recall, but there might be some tweaks within zfs.conf given that you're doing read/write streaming.

Are there non-ZFS storage pools where the VMs live, or is one of those a workspace and the other a frequent mirror of the workspace?

Proxmox offers a nice graphical abstraction layer over things I'm already familiar with, and I really like that it's a lightweight skin over familiar ZFS commands. I have yet to encounter a situation where I ignore Proxmox and go at things directly, but I have only been using it for a couple of weeks ...
 
Once again, thank you for the info, CGC!

I just tested one of the drives outside the mirror using LVM and I see around 220MB/s peak, probably closer to 180MB/s sustained, which is in line with the manufacturer's specs and above what I would expect. So the drives are at least working properly.
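(For a more controlled sequential-write number than dd, something like fio works; the file path is a placeholder:)

fio --name=seqwrite --filename=/mnt/test/fio.tmp --rw=write --bs=1M --size=4G --ioengine=libaio --direct=1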

For some context, I've had this mirror set-up for a few years now, and originally I purchased SMR drives by mistake instead of CMR drives. I assume you're familiar, but if not: SMR drives take many more operations to write a sector, since the tracks are shingled over each other to increase platter density. Real-world speeds on these drives, once you get out of the drive's cache region, are around 20-40MB/s, so I upgraded to a pair of IronWolf Pro drives and a pair of Toshiba N300 drives. I think my sore point here is that if I were fine with the 40MB/s speeds I see now, I could have kept the slow SMR drives, lol.

When I do backups, it's from a normal workstation (not a VM) using SMB to the SMB server on one of the nodes. With sync=standard, I get 700MB/s right up until the cache fills, as we discussed earlier, so I know the 10G link is good.

And yeah, I do plan to look into ZFS snapshots. Right now one of my nodes is just running as a host for a Windows VM so I can play games on my TV without using my main workstation. Eventually that node will be a hardware mirror of the other node, so I can use ZFS snapshots to revert and to keep a copy of the VMs on the other node.

So most VMs have a split set-up: boot drive on an SSD and data drive on the ZFS pool. Some have both on the ZFS pool (primarily the SMB server). I've also tested with all VMs powered off and made a fresh one with its boot drive on the SSD and data on the ZFS pool, and I see the same speeds, unfortunately.

Also, this morning I noticed that both of my ZFS pools were reporting read/write errors and were offline. I ran a scrub and I'm not seeing any errors so far, but it's definitely weird. The other node is fine and did not report any errors - it actually has the same motherboard, and I see the same speeds there. So I think that was probably a coincidence, but I'll definitely be monitoring, and I'm getting my Veeam VM off this node just in case anything happens.

I haven't forgotten about the docs either; I plan to look into those and see if I can find any tweaks. I wouldn't be surprised if a default changed somewhere that made performance worse for spinning drives in my environment. Is there any chance you could share the zfs.conf for your pair of WD Reds? Or do you run defaults there?

Thanks!
 