Slow NFS Performance

Nick Coons

Member
Oct 13, 2018
To start, the issue that we're having is slow NFS performance, but only in one instance.

We have two storage servers, let's call them sto01 and sto02. They're identical, with a RAID6 array made up of 2TB SSDs.

We have two Proxmox servers, let's call them pve01 and pve02. They're also both identical.

Both Proxmox servers are set up with NFS mounts to both storage servers. Everything is connected to a 10GbE switch with 10GbE NICs. We've been running this setup for a few months, with each Proxmox server having VMs stored on both storage servers, and everything has been great.

Until two weeks ago, that is. One morning, all VMs stored on sto01 started running very slowly. To get things working again, we shut down the VMs, moved the images over to sto02, reconfigured pve01 and pve02 to reference them on sto02, then booted them back up, and all is well... except that sto01 is now out of commission.

Here's the testing that we've done so far:
  • R/W tests on the individual SSDs, all as fast as we expect.
  • R/W tests on the RAID array as a whole, >300MB/sec.
  • R/W tests over NFS from each of pve01 and pve02, close to 300MB/sec (sample commands for these tests are sketched below, after the error output).
  • Creating a 500GB VM from pve01 or pve02, we run into errors, specifically:
Code:
Formatting '/netfs/nfs/sto01/proxmox/images/3000/vm-3000-disk-0.qcow2', fmt=qcow2 cluster_size=65536 preallocation=metadata compression_type=zlib size=536870912000 lazy_refcounts=off refcount_bits=16
TASK ERROR: unable to create VM 3000 - error during cfs-locked 'storage-sto01' operation: unable to create image: got lock timeout - aborting command

When we received this error, it had gotten about 59GB into creating the 500GB qcow2 file. When it errored, it appeared on the front end to have given up. However, the file kept growing, but was renamed to .nfs00000000059b082b00000002, and once it reached 500GB it was deleted. (That .nfs* name is the NFS client's "silly rename": the file was deleted while a process still had it open, so the client renamed it and the writer kept going until it finished.)
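
For anyone wanting to reproduce the R/W tests above: we didn't paste our exact commands, but a minimal sketch with fio (one common tool for this; the paths are the ones from this thread, the job sizes are placeholders) would look like:

Code:
# Sequential write directly against the RAID array (run on sto01 itself):
fio --name=array-seqwrite --filename=/opt/storage/network/fio.test \
    --rw=write --bs=1M --size=10G --direct=1 --ioengine=libaio
# The same job against the NFS mount (run from pve01 or pve02):
fio --name=nfs-seqwrite --filename=/netfs/nfs/sto01/fio.test \
    --rw=write --bs=1M --size=10G --direct=1 --ioengine=libaio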

Because we're mounting the NFS shares manually (through autofs) on the Proxmox servers, the storage in Proxmox is just set up as a directory. When we initially set these up, we tried configuring them as NFS storage directly in Proxmox. We ran into an issue with that, but it was a while ago so I don't recall the details (I think the mounts were disappearing regularly).
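
In Proxmox terms, that means a directory storage entry in /etc/pve/storage.cfg along these lines (the storage ID and content list here are illustrative, but the path matches the error output above; "shared 1" is what tells Proxmox the directory is visible from every node):

Code:
dir: sto01
        path /netfs/nfs/sto01/proxmox
        content images
        shared 1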

What's confusing is that each pair of servers (both Proxmox and storage) is set up identically, so I don't know A) why this is happening with sto01 but not sto02, and B) why it started all of a sudden (which worries me, because if we haven't actually resolved it and it happens to sto02 for some reason, then we're out of storage servers for the VMs).

My /etc/exports file on the storage servers looks like this:

Code:
/opt/storage/network         172.31.2.0/255.255.255.0(rw,no_root_squash,subtree_check)
/opt/storage/network/proxmox 172.31.2.0/255.255.255.0(rw,no_root_squash,subtree_check)
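
(For what it's worth, exports(5) discourages subtree_check, and no_subtree_check has been the default since nfs-utils 1.1.0, so one variant that could be tested, purely as a suggestion, is:)

Code:
/opt/storage/network         172.31.2.0/255.255.255.0(rw,no_root_squash,no_subtree_check)
/opt/storage/network/proxmox 172.31.2.0/255.255.255.0(rw,no_root_squash,no_subtree_check)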

And my autofs configs to mount them on the Proxmox servers look like this:

Code:
sto01    -fstype=nfs4,rw,soft,intr,rsize=8192,wsize=8192    172.31.2.17:/opt/storage/network
sto02    -fstype=nfs4,rw,soft,intr,rsize=8192,wsize=8192    172.31.2.18:/opt/storage/network
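
(Three things stand out in those options: soft means the client eventually gives up and returns an error to the application, which lines up with the "lock timeout - aborting" symptom above; intr has been a no-op since Linux 2.6.25; and rsize/wsize=8192 is far below the 1MB a modern NFSv4 client typically negotiates, which matters on 10GbE. A hypothetical variant to test:)

Code:
sto01    -fstype=nfs4,rw,hard,rsize=1048576,wsize=1048576    172.31.2.17:/opt/storage/network
sto02    -fstype=nfs4,rw,hard,rsize=1048576,wsize=1048576    172.31.2.18:/opt/storage/network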

Any insight into how I can change my configuration to get the expected performance would be greatly appreciated! And if there's any information I missed that I can provide, please let me know.
 
Just the performance tests I indicated in my initial post (the individual SSDs, the RAID array as a whole, and NFS throughput). All of those did well, which would seem to rule out any issues with the SSDs, the controller, or the network. Do you know of any hardware issues I should check for that would reveal themselves only through Proxmox over NFS while all of the underlying performance tests look solid?
 
Nothing specific for now. I can only suggest the following:
1] monitor sys values
2] regular fast checks of NFS availability (rough sketch below)
3] regular fast checks of data-disk availability
4] regular fast checks of RAID controller / array availability
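
A rough sketch of what I mean by a fast check, meant for cron every minute or so (the mount path is the one from this thread; the timeouts are arbitrary):

Code:
#!/bin/sh
# Quick NFS availability probe.
MNT=/netfs/nfs/sto01
# Reachability: a stat that hangs more than 5s is a red flag.
timeout 5 stat "$MNT" >/dev/null 2>&1 || logger "nfs-check: $MNT stat timed out"
# Write round-trip: catches a server that mounts fine but stalls on I/O.
timeout 10 sh -c "dd if=/dev/zero of=$MNT/.probe bs=4k count=1 conv=fsync 2>/dev/null && rm -f $MNT/.probe" \
    || logger "nfs-check: $MNT write probe failed"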

For now it looks like the problem is not performance but size. And since you can reproduce the problem, have you tried creating a very big file on the storage (directly, over the network, etc.)?
 
I don't understand when you say "monitor sys values" (which sys values?), or what a "regular fast check" is. Can you clarify?

Yes, I've created a large (500GB) file over NFS outside of Proxmox, which completed with no errors at a transfer rate of 278MB/sec. It's only when I do something within Proxmox that I run into either performance issues (inside a running VM I might get 9MB/sec) or errors (creating a new VM image). And only when the storage is sto01; sto02 works fine (and sto01 worked fine until about two weeks ago).
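
(For reference, that kind of test boils down to something like the dd below; the flags shown are representative rather than the exact invocation, and oflag=direct keeps the client page cache from inflating the number:)

Code:
# Write 500GB over the NFS mount, bypassing the client page cache.
dd if=/dev/zero of=/netfs/nfs/sto01/bigfile.test bs=1M count=512000 oflag=direct status=progress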
 
