ZFS write stops after a few seconds

zenowl77

Feb 22, 2024
ZFS writes run at about 300-700 MB/s for a few seconds, then stop completely; after a minute or so they resume at 6-18 MB/s (sometimes as high as 50-70 MB/s) with random drops again, because the drive cannot keep up. (The LSI card only writes to the drives at about 160 MB/s.)

This is similar to what this person is experiencing on Reddit.

The first reply summarises it quite well.

“I've had a similar experience with ZFS and iSCSI. I'm guessing that when you start copying you get really fast speeds for a short time, then the whole thing stops completely? As best I can tell, this has to do with async writes not working well with iSCSI. What seems to be happening is that the copy starts and pulls in data as fast as it can, which is much faster than the drives can keep up with. Then once the cache fills up, the whole transfer stops until enough of the cached data has been written to disk. And then some more data comes in and it gets jammed again. Then other reads and writes start freezing as well while they wait for the cache to be drained, which only exacerbates the problem. On really bad occasions, I've had the iSCSI system get completely overwhelmed, to the point where I had to restart the system.
You can try setting sync=always on the zvol, which will unfortunately hurt your write performance on small transfers, but should keep things from getting to the point where it's causing the system to fall apart.
I'm not really an expert on any of this, just sharing my experience, so I could be totally wrong on the details, but I'd suggest forcing sync writes and see if that fixes the issue. At a minimum, doing so should provide more info as to how to actually solve the problem.”
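For reference, the sync=always suggestion from that quote is a single property change; the pool/zvol name below is just a placeholder:

Code:
# force every write to the zvol to be a synchronous write (zvol name hypothetical)
zfs set sync=always tank/my-zvol
# revert to the default behaviour later
zfs inherit sync tank/my-zvol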

I have tried ZFS a few times, and this seems to happen every time I use it.

Any ideas on how to solve this issue?
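(A quick way to watch the stall happen live is zpool iostat; the pool name "twelve" is taken from the property dump later in this thread.)

Code:
# per-second read/write throughput for the pool; writes dropping to 0 shows the stall
zpool iostat -v twelve 1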
 
That looks like the classic symptom of unsuitable disks, e.g. consumer SSDs/NVMe drives, or HDDs with SMR technology.
 
The disk with ZFS is an HGST 12 TB 7200 rpm enterprise drive (in my case, at least).

Edit:
It should probably also be noted that I seem to be having these issues inside VMs with virtual disks (set to no cache), and I have seen it in the past on SSH file transfers, etc.

When I tried direct drive-to-drive transfers with rsync in the Proxmox console, it did not seem to have the same problem.
 
ZFS writes run at about 300-700 MB/s for a few seconds, then stop completely
Written somewhere else in this forum:

Most times it boils down to buffer bloat: too-large buffers, enabled write caches, and/or a slow destination with consumer SSDs which accept data, fill their internal buffers, and need to write it back to the actual storage area some 5 to 30 seconds later, blocking I/O for a surprisingly long time and leading to a "stuttering" effect.

So the generic recommendation is to check all physical connections, check the SMART data and the error log on all disks, disable write caches for the virtual PVE VM disks, and use enterprise-class SSDs.
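A minimal sketch of those checks from the PVE shell (device names are placeholders):

Code:
# SMART health, error log and attributes for one disk
smartctl -a /dev/sda
# short pass/fail health verdict only
smartctl -H /dev/sda
# kernel log entries for I/O errors that SMART may not show
dmesg | grep -iE 'ata error|i/o error'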
 
That's exactly what I thought was the case, and what the first post detailed. I just wanted to find a way to mitigate it with settings somehow.

I am using enterprise-grade disks and no write cache on the VM disks. The connections are also good, and there are no errors in the SMART data.
 
Try setting zfs_dirty_data_max.
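For anyone finding this later, a sketch of how to set it, using 256 MiB as an example value (not a recommendation):

Code:
# cap the dirty data ZFS buffers before throttling writes (value in bytes)
echo $((256*1024*1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max
# persist the setting across reboots
echo "options zfs zfs_dirty_data_max=268435456" >> /etc/modprobe.d/zfs.conf
update-initramfs -u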

 
Try setting zfs_dirty_data_max.

Thank you for this suggestion, I appreciate it. I tried it, and it just causes the problem to happen nearly instantly instead of after 5-8 seconds or so. But I suppose that is useful for debugging at least, as now I will see instantly when it is resolved, and I can always remove that option later.
 
I tried it, and it just causes the problem to happen nearly instantly instead of after 5-8 seconds or so.
The disk with ZFS is an HGST 12 TB 7200 rpm enterprise drive (in my case, at least).

Edit:
It should probably also be noted that I seem to be having these issues inside VMs with virtual disks (set to no cache), and I have seen it in the past on SSH file transfers, etc.

When I tried direct drive-to-drive transfers with rsync in the Proxmox console, it did not seem to have the same problem.
Sounds like the drive can't deal with the writes (from VMs) unless it buffers (which only delays the problem).
Do you run ZFS on a single HDD, or is this a raidz1 with multiple drives (which tends to behave much worse than a BBU RAID5)?
Might there be a terrible mismatch between the block size inside the VM, the volblocksize, the ashift of the ZFS pool, and the sector size of the HDD? That could cause terrible write amplification (and additional reads and seeks for each write).
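A sketch of how to gather those values from the PVE shell; the zvol and device names are placeholders:

Code:
# sector size ZFS assumed at pool creation (sector = 2^ashift bytes)
zpool get ashift twelve
# block size of the VM's virtual disk on ZFS (zvol name hypothetical)
zfs get volblocksize twelve/vm-100-disk-0
# logical and physical sector sizes of the underlying HDD (device hypothetical)
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sda

Inside the Windows guest, fsutil fsinfo ntfsinfo C: reports the NTFS cluster size, which is the remaining value in that comparison.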
 
If it is a Windows guest, update the SCSI driver to version 266.
It is in fact a Windows guest; I hadn't realized a new version had come out. Thank you for the suggestion, downloading now. I will report back on whether this solves it.
Sounds like the drive can't deal with the writes (from VMs) unless it buffers (which only delays the problem).
Do you run ZFS on a single HDD, or is this a raidz1 with multiple drives (which tends to behave much worse than a BBU RAID5)?
Might there be a terrible mismatch between the block size inside the VM, the volblocksize, the ashift of the ZFS pool, and the sector size of the HDD? That could cause terrible write amplification (and additional reads and seeks for each write).
It is a single disk (too much stuff, and I can't really afford more drives yet; I would like to get some cheap enterprise drives to do a RAID 5 or something in the future, I just haven't been able to afford it quite yet), and the VM disk is only a single secondary storage disk; I am not actually running the VM off of it. (For the VM, I installed to an SSD and made a second virtual disk on another, then mirrored it inside Windows. Probably not optimal, but it boosts speeds and removes a lot of the stuttering Windows tends to do.) I set the block size of the VM disk to the same as the ZFS disk on the host when formatting the disk, so hopefully there aren't any issues there.
 
So far, after tweaks and driver updates, it is slightly smoother (though slower), but writes are still hitting a wall and stopping.

It seems to stop for around a minute at a time, which is far slower than the disk (160 MB/s), with zfs_dirty_data_max set to 256 MB, since it is stopping for a minute every few hundred MB. It should be able to write 256 MB in under 2 seconds (5-10 if it is going a bit slow); in a minute it should be able to do almost 10 GB.

The system also has 96 GB of RAM (40 GB currently free), so it isn't as if it is running out of RAM.

IO delay is jumping between 15, 50, and 70% a lot.
 
More pool info (note: compression is off for the dataset currently being written to):


Code:
NAME                                  PROPERTY              VALUE                    SOURCE
twelve                                type                  filesystem               -
twelve                                creation              Sat Nov 30  9:21 2024    -
twelve                                used                  7.79T                    -
twelve                                available             3.15T                    -
twelve                                referenced            2.62M                    -
twelve                                compressratio         1.03x                    -
twelve                                mounted               yes                      -
twelve                                quota                 none                     default
twelve                                reservation           none                     default
twelve                                recordsize            128K                     default
twelve                                mountpoint            /twelve                  default
twelve                                sharenfs              off                      default
twelve                                checksum              blake3                   local
twelve                                compression           on                       local
twelve                                atime                 off                      local
twelve                                devices               on                       default
twelve                                exec                  on                       default
twelve                                setuid                on                       default
twelve                                readonly              off                      default
twelve                                zoned                 off                      default
twelve                                snapdir               hidden                   default
twelve                                aclmode               discard                  default
twelve                                aclinherit            restricted               default
twelve                                createtxg             1                        -
twelve                                canmount              on                       default
twelve                                xattr                 on                       default
twelve                                copies                1                        default
twelve                                version               5                        -
twelve                                utf8only              off                      -
twelve                                normalization         none                     -
twelve                                casesensitivity       sensitive                -
twelve                                vscan                 off                      default
twelve                                nbmand                off                      default
twelve                                sharesmb              off                      default
twelve                                refquota              none                     default
twelve                                refreservation        none                     default
twelve                                guid                  5779471076221037442      -
twelve                                primarycache          metadata                 local
twelve                                secondarycache        all                      default
twelve                                usedbysnapshots       0B                       -
twelve                                usedbydataset         2.62M                    -
twelve                                usedbychildren        7.79T                    -
twelve                                usedbyrefreservation  0B                       -
twelve                                logbias               latency                  default
twelve                                objsetid              54                       -
twelve                                dedup                 off                      default
twelve                                mlslabel              none                     default
twelve                                sync                  standard                 default
twelve                                dnodesize             legacy                   default
twelve                                refcompressratio      1.38x                    -
twelve                                written               2.62M                    -
twelve                                logicalused           8.06T                    -
twelve                                logicalreferenced     3.43M                    -
twelve                                volmode               default                  default
twelve                                filesystem_limit      none                     default
twelve                                snapshot_limit        none                     default
twelve                                filesystem_count      none                     default
twelve                                snapshot_count        none                     default
twelve                                snapdev               hidden                   default
twelve                                acltype               off                      default
twelve                                context               none                     default
twelve                                fscontext             none                     default
twelve                                defcontext            none                     default
twelve                                rootcontext           none                     default
twelve                                relatime              on                       default
twelve                                redundant_metadata    all                      default
twelve                                overlay               on                       default
twelve                                encryption            off                      default
twelve                                keylocation           none                     default
twelve                                keyformat             none                     default
twelve                                pbkdf2iters           0                        default
twelve                                special_small_blocks  0                        default
twelve                                prefetch              all                      default
 
Your primary cache is set to metadata (only)?! That will cause many more reads for misaligned writes. redundant_metadata could be set to most in order to reduce the amount of sync writes. You could also try a faster checksum, but I assume it's not CPU overhead that's causing the problem.
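Both of those are plain property changes on the pool's root dataset, e.g.:

Code:
# cache data as well as metadata in ARC again (the default)
zfs set primarycache=all twelve
# keep extra copies only of the most important metadata, reducing sync writes
zfs set redundant_metadata=most twelve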

This is not Proxmox-specific: ZFS on a single HDD is terrible and not fixable, AFAIK. Use a different filesystem, or an SSD with PLP.

In your case, with such a big HDD, try a small enterprise SSD as a special device for all the metadata. Since you don't have redundancy anyway, you only need one. This will make everything quicker, and only the large actual data needs to go to the HDD. https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954
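A sketch of what adding a special device looks like; the device path is a placeholder, and note that without redundancy, losing the special vdev loses the whole pool:

Code:
# add a single SSD as a special (metadata) vdev
zpool add twelve special /dev/disk/by-id/ata-EXAMPLE_SSD
# optionally route small blocks (up to 16K here) to the SSD as well
zfs set special_small_blocks=16K twelve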
 
Yeah, I had just done that for testing purposes, to see if it would help the writes; it did not. I will try that and see if it makes a difference, thank you. And yes, it is definitely not the CPU: I am running an i7-7820X (8c/16t), and the file writes are usually just single files written one at a time. It can handle that for sure, and it has almost no load from anything except the one VM running and a few LXCs idling at 0.whatever% usage.

I am considering a few really cheap SAS drives for a SAS RAID 0 for ZFS (looking at maybe 4x3 TB or something, if I can), just to avoid using the 12 TB for anything except backups.

I will consider that one for sure. Thank you for the suggestion; that could be really helpful in this situation.
 
