Spotty Disk Performance with NFS Share

jschlager008

New Member
Jun 21, 2023
7
0
1
I'm running a Proxmox VE cluster with multiple Ubuntu VMs whose disks are hosted on a TrueNAS NFS share. Occasionally one or more of the VMs experiences drastically reduced disk performance, and I'm having a hard time pinpointing the cause.
Code:
pve-manager/7.4-3/9002ab8a (running kernel: 5.15.102-1-pve)

Average performance:
Code:
ubuntu@vm1:~ $ sudo dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.53294 s, 424 MB/s

Reduced performance:
Code:
ubuntu@vm2:~ $ sudo dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 12.2573 s, 87.6 MB/s

All reported statistics that I can see on both the Proxmox server and the TrueNAS server appear fine. Can someone help guide me in the right direction here?
 
Hello, what drives are you using on the NFS share?
 
Could you please be more specific about the model of the drives? I ask because you generally need good datacenter drives for ZFS or Proxmox VE in general; a consumer-grade SMR HDD (in contrast to a datacenter-grade drive) will quickly fill its cache and become extremely slow.
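If it helps, smartctl (from smartmontools) run on the TrueNAS host should report the exact drive model, which you can then check against the vendor's CMR/SMR lists. The device name below is just an example and will differ on your system:
Code:
smartctl -i /dev/ada0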
 
AFAIK these drives are relatively old (dating from 2013) and don't have the typical amount of cache (only 64 MB) compared to modern HDDs of that class. Although they are classified as enterprise drives, they no longer seem suitable for today's workloads, especially for ZFS. Just my 2 cents ;-)
 
Wouldn't a drive cache issue affect all VMs equally? This issue only seems to affect one or two VMs at a time.

Is there a way to confirm if this is the actual issue?
 
Wouldn't a drive cache issue affect all VMs equally? This issue only seems to affect one or two VMs at a time.
It's hard to predict when the drive cache and/or the NAS cache is exhausted or flushed.
sudo dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync
"dd" is not the best way to measure performance. "fio" is much more reliable and preferred.
Is there a way to confirm if this is the actual issue?
You need to perform many more tests, average out the results, and eliminate any inconsistent ambient traffic and CPU activity.

While the KB article below is not NFS oriented, it should give you an idea of the complexity of performance measuring and some methods to do so.
In particular, the I/O path from the VM's kernel to the network port is complex and can be highly affected by CPU contention. If your VMs are not NUMA-aligned, constant context switching could be skewing your results:
https://kb.blockbridge.com/technote/proxmox-vs-vmware-nvmetcp/#proxmox--virtio-scsi--raw
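As a quick sanity check, you can inspect a VM's CPU/NUMA settings from the Proxmox host (the VMID 101 below is just a placeholder):
Code:
root@pve:~# qm config 101 | grep -E '^(numa|cores|sockets|cpu):'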

I'd recommend establishing a baseline first. Confirm that all of the VMs you are comparing perform as expected with local storage. Then follow up with pure network testing, e.g. via iperf, and then use fio to get 5-10 runs on each VM while maintaining the same conditions/workload.
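A minimal sketch of the network-baseline step, assuming iperf3 is installed on both ends and 192.168.1.10 stands in for your TrueNAS address:
Code:
# on the TrueNAS server
iperf3 -s
# on each VM (and on the Proxmox host itself)
iperf3 -c 192.168.1.10 -t 30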

Otherwise, sorry to say, your results are meaningless, since we don't know the differences in the VMs' properties, their load, the overall load of the system, etc.

You also want to baseline your NFS storage directly from the hypervisor.
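Proxmox normally mounts NFS storage under /mnt/pve/<storage-name>, so a hypervisor-side baseline could look roughly like this (storage name, block size and runtime are placeholders):
Code:
root@pve:~# fio --name=nfs-baseline --directory=/mnt/pve/truenas-nfs --rw=randwrite --bs=4k --size=1G --ioengine=libaio --direct=1 --iodepth=16 --runtime=60 --time_based --group_reporting
If the hypervisor-side numbers are consistent but the in-guest numbers are not, the problem is more likely inside the VMs (CPU contention, scheduling) than on the NAS.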
As you can see, measuring performance is a science.

Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
