High iowait during NFS transfers on VM

ronin4518

New Member
Apr 22, 2026
Hi all,

I'm running a 3-node Proxmox VE cluster with shared storage on TrueNAS via NFS, and I'm experiencing high iowait on my Nextcloud VM during concurrent uploads. The VM suffers from 70-85% iowait as soon as 5 concurrent uploads happen.
I can reproduce the high iowait with fio as well (using the sync option).
Some articles I found say you should use iSCSI for databases (block storage = better performance than NFS for that workload).

Do you have any ideas about this kind of problem?

Infrastructure


  • 3-node Proxmox VE cluster (latest 9.x)
  • OVS-based networking with dedicated VLANs (management, storage, DMZ)
  • 10 GbE storage network (iperf3 confirms ~9.3 Gb/s between VM and NAS)
  • TrueNAS SCALE hot NAS
    • Pool DATA-POOL : RAIDZ1, 4× SSD
    • SLOG mirror on 2× SSD (active, used during sync writes)
    • L2ARC 130 GB
  • All VM disks stored on the NAS via NFS datastore (no local storage)
  • Datasets served to the VM directly via NFS mounts as well (for app data)

NFS mount options: (rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,nconnect=8,timeo=600,retrans=2,sec=sys)
 
Since iperf3 looks fine, but the same issue can also be reproduced with fio using synchronous writes, this looks less like a simple 10 GbE bandwidth problem and more like a storage latency / synchronous write bottleneck.
On Linux NFS, sync has a significant performance cost because writes must be flushed before the system call returns.
On the TrueNAS/OpenZFS side, ZIL/SLOG mainly helps with synchronous writes, so unless the workload is actually sync-heavy, adding a SLOG alone is unlikely to improve things.
Also, nconnect=8 can sometimes help improve throughput by using multiple parallel TCP connections, but it does not eliminate commit latency itself.

To narrow it down, I think it would be useful to benchmark the following three paths separately:
  • Proxmox host -> NFS datastore
  • Guest VM virtual disk only
  • Direct NFS mount inside the guest only
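A rough way to benchmark those three paths separately with fio, using small synchronous writes to expose commit latency (directory paths and job names below are placeholders, adjust to your actual mount points):

```shell
# 1. Proxmox host -> NFS datastore (run on the PVE host)
fio --name=host-nfs --directory=/mnt/pve/truenas-nfs --rw=randwrite \
    --bs=4k --size=1G --sync=1 --runtime=60 --time_based

# 2. Guest VM virtual disk only (run inside the VM, on a local filesystem path)
fio --name=vdisk --directory=/var/tmp --rw=randwrite \
    --bs=4k --size=1G --sync=1 --runtime=60 --time_based

# 3. Direct NFS mount inside the guest (run inside the VM, on the app-data mount)
fio --name=guest-nfs --directory=/mnt/nextcloud-data --rw=randwrite \
    --bs=4k --size=1G --sync=1 --runtime=60 --time_based
```

Compare the latency percentiles (clat) across the three runs; whichever path shows the multi-millisecond sync latency is where the bottleneck lives.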

I would also like to check the VM disk settings.
Specifically, the following settings might help:
  • Bus/Device: SCSI
  • SCSI controller: VirtIO SCSI single
  • IO Thread: enabled
  • Cache: none or writeback
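Those settings can be applied from the PVE host CLI; a sketch, assuming VMID 100 and a datastore named `truenas-nfs` with an existing disk `vm-100-disk-0` (all placeholders):

```shell
# Set the controller type, then re-attach the disk with iothread and cache options
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 truenas-nfs:vm-100-disk-0,iothread=1,cache=writeback

# Verify the resulting config
qm config 100 | grep -E 'scsihw|scsi0'
```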
There may also be some Nextcloud-specific factors involved, but I am not very familiar with Nextcloud, so I probably cannot comment much on that part.
 
There are a number of things in your setup that are suboptimal. First and foremost, if you can get the Nextcloud database off NFS and onto any kind of local storage, you will see a massive improvement. Running databases over NFS is really the wrong architecture: database engines call fsync() on every write, and with five concurrent uploads driving five concurrent transactions over NFS, each of those fsync() calls becomes a synchronous ZFS write that must commit through the full SLOG/RAIDZ1 path.

Second, running VMs on NFS is also suboptimal. Every Nextcloud upload crosses NFS twice: once through the in-VM app data mount, and again through the Proxmox NFS datastore hosting the VM disk itself. The resulting iowait compounds across both round-trip stall points.

Finally, RAIDZ1 is the wrong topology for the storage behind all of this. RAIDZ1 penalizes IOPS: every write requires a full-stripe parity calculation, and concurrent writes serialize rather than parallelize. With 5 concurrent NFS clients hammering sync writes, the single RAIDZ1 vdev becomes a bottleneck where mirror vdevs would handle the parallel I/O far more effectively.

Here's what I would try changing, roughly in order of priority and ease of implementation:
1. Use the async export option on the NFS shares serving the app data. This should be relatively low risk if you have a proper UPS set up.
2. Move the Nextcloud DB off NFS. Fast local storage will give the best results; otherwise, a zvol presented via iSCSI or virtio-blk. The reason iSCSI/zvol works better is that block-level access lets the guest OS own the fsync() semantics, whereas NFS forces synchronous behavior all the way down to ZFS.
3. Stop storing VM disks on NFS and transition to block storage to eliminate the NFS overhead, or, if you can, add NVMe disks locally for VM storage and use NFS for bulk storage only.
4. Ditch RAIDZ1 and move to mirrored vdevs for your NFS storage pool.
5. If your SLOG disks are SATA or SAS, move to NVMe, or even better, Optane.
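To make points 1 and 2 concrete, here is a sketch of the underlying ZFS commands (pool and dataset names are assumptions based on your post; on TrueNAS you would normally do this through the UI):

```shell
# Point 2: carve a zvol for the DB on the existing pool.
# 16k volblocksize is a common choice to match InnoDB's page size.
zfs create -V 50G -o volblocksize=16k DATA-POOL/nextcloud-db
# ...then export DATA-POOL/nextcloud-db as an iSCSI extent via the TrueNAS sharing UI.

# Point 1: the blunt ZFS-side equivalent of async is disabling sync on the dataset.
# WARNING: this trades crash consistency for speed -- only do it with a UPS
# and good backups, and only on the bulk-data dataset, never under the DB.
zfs set sync=disabled DATA-POOL/nextcloud-data
```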
 
Hi both,

Thanks for the thorough architectural review. Your analysis is spot on in principle, but I have some hard constraints that shape our options:

  1. HA requires shared storage — 3-node Proxmox cluster uses live migration and failover, so local storage isn't an option for anything in the HA pool.
  2. No free drive bays in the NAS — I can't add a dedicated mirror pool or a special vdev without replacing existing disks, which would require a full migration.
  3. Capacity requirement — I need the ~12 TB usable from RAIDZ1. Switching the existing pool to 2× mirrors would halve capacity, which isn't viable with our current data footprint.
I confirmed the VM disk config matches best practices: VirtIO SCSI single, iothread=1, cache=writeback, aio=io_uring.

I can migrate MariaDB to an iSCSI zvol on the same pool (block-level access, proper fsync semantics from the guest OS, no need for new hardware). This addresses your point about "databases on NFS is the wrong architecture" — I keep NFS for the Nextcloud datadir (bulk / throughput workload) and move the DB to iSCSI (IOPS / sync workload).

If I want to use my local storage, could I point Nextcloud's temp directory at local NVMe for better performance?
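For reference, a sketch of how that could be set (paths and the Nextcloud install location are assumptions): Nextcloud assembles chunked uploads in its `tempdirectory` before moving them to the datadir, so putting it on fast local storage can take NFS out of part of the upload hot path.

```shell
# Create the temp dir on local NVMe and hand it to the webserver user
mkdir -p /mnt/nvme/nextcloud-tmp
chown www-data:www-data /mnt/nvme/nextcloud-tmp

# Set it in Nextcloud's config via occ
sudo -u www-data php /var/www/nextcloud/occ \
    config:system:set tempdirectory --value="/mnt/nvme/nextcloud-tmp"
```

One caveat: if the datadir stays on NFS, the final move from temp dir to datadir becomes a cross-filesystem copy, so the upload still crosses NFS once at the end.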
 
  1. HA requires shared storage — 3-node Proxmox cluster uses live migration and failover, so local storage isn't an option for anything in the HA pool.
Note that you can use HA with local ZFS replicated between nodes (though you can lose the data written since the last synced snapshot).
You could also end up with better resilience than a single NAS (or do you have 2 TrueNAS nodes with a shared array?).
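For the replicated-local-ZFS approach, a sketch of a Proxmox storage replication job (VMID 100 and node name pve2 are placeholders):

```shell
# Replicate VM 100's ZFS disks to node pve2 every 5 minutes.
# On failover you lose at most the writes since the last sync (non-zero RPO).
pvesr create-local-job 100-0 pve2 --schedule "*/5"
pvesr status
```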


About your TrueNAS: what are the SSD models? Enterprise SSDs with supercapacitor/PLP, and not cheap consumer drives? (Consumer drives will be very slow with ZFS, and even worse with RAIDZ1.)
 
Thanks for the follow-up. Two things:

On local ZFS replicated storage:

I considered it, but for Nextcloud I need RPO = 0. Users sync files from desktop clients continuously, so even a 1-5 minute data loss window on a node failure would mean re-sync chaos on the client side. A shared NAS with RPO = 0 is the right trade-off for this specific workload. No secondary TrueNAS with a shared array — just a single hot NAS for production plus a cold NAS for backups/archives.

SSD model : WD Red SA500To

I'm actually already looking into Proxmox training (and English lessons too :))
 
SSD model : WD Red SA500To
They are not going to work well with ZFS, sorry.
You really need enterprise drives for ZFS (at minimum for the SLOG devices).


Just look at this thread with the same disks (it was with Ceph, but it's the same story with ZFS: you need drives with fast sync writes).


The WD Reds do around 150 IOPS for 4k synchronous writes...
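To put that 150 IOPS figure in perspective, a quick back-of-the-envelope calculation (assuming each fsync() maps to one 4k synchronous write):

```python
# What 150 sync-write IOPS means for an fsync-heavy workload
iops = 150
latency_ms = 1000 / iops            # average time per synchronous 4k write
throughput_mb = iops * 4 / 1024     # 4 KiB per write, in MiB/s

print(f"~{latency_ms:.1f} ms per fsync")         # ~6.7 ms per fsync
print(f"~{throughput_mb:.2f} MiB/s sync writes") # ~0.59 MiB/s
```

That is spinning-disk territory: every database commit stalls for milliseconds, which is exactly the iowait pattern described in the original post.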
 
There's a misunderstanding of NFS capabilities here. With 400 Gbit Ethernet on an NFS file server backed by a good set of NVMe drives, you could easily reach >30 GB/s combined read+write throughput, which would easily serve VM images and shares to your NFS clients. Even with just 100 Gbit Ethernet, most applications (as opposed to benchmark tools) could not consume data that fast. A single-socket NFS mount reaches around 3 GB/s, roughly 2.5× what a 10 Gbit interface can deliver, so on a 10 GbE link nconnect buys you little; nconnect=8 only pays off when the client needs to fill a server's 100 Gbit connection and the server itself can read+write >13 GB/s locally. ZFS isn't great at sync writes at all, and a mirrored SSD SLOG pair that looks like SATA/SAS will limit sync writes to perhaps 500 MB/s, far less than half the potential of the server's 10 Gbit interface. The general concept isn't bad, but the implementation and hardware limit its real-world performance.
 
Personally, I would add local storage to each server for local ZFS or even Ceph, and put only bulk storage on the NAS. And maybe Proxmox HA isn't the way to go at all: a three-node K3S or K8S cluster with Longhorn for shared storage might be a better fit. You can still attach a NAS to K3S/K8S for bulk storage.