To start, the issue that we're having is slow NFS performance, but only in one instance.
We have two storage servers, let's call them sto01 and sto02. They're identical, with a RAID6 array made up of 2TB SSDs.
We have two Proxmox servers, let's call them pve01 and pve02. They're also both identical.
Both Proxmox servers are set up with NFS mounts to both storage servers. Everything is connected to a 10GbE switch with 10GbE NICs. We've been running this setup for a few months, with each Proxmox server having VMs stored on both storage servers, and everything has been great.
Until two weeks ago, that is. One morning, all VMs stored on sto01 started running very slowly. To restore performance, we shut down the VMs, moved the images over to sto02, reconfigured pve01 and pve02 to reference them on sto02, and booted them back up. Everything is fine now, except that sto01 is out of commission.
Here's the testing that we've done so far:
- R/W tests on the individual SSDs, all as fast as we expect.
- R/W tests on the RAID array as a whole, >300MB/sec.
- R/W tests over NFS from each of pve01 and pve02, close to 300MB/sec.
- Creating a 500GB VM from pve01 or pve02, we run into errors, specifically:
Code:
Formatting '/netfs/nfs/sto01/proxmox/images/3000/vm-3000-disk-0.qcow2', fmt=qcow2 cluster_size=65536 preallocation=metadata compression_type=zlib size=536870912000 lazy_refcounts=off refcount_bits=16
TASK ERROR: unable to create VM 3000 - error during cfs-locked 'storage-sto01' operation: unable to create image: got lock timeout - aborting command"
When we received this error, it had gotten about 59GB into creating the 500GB qcow2 file. When it errored, it appeared on the front end to have given up. However, the file kept growing, but it was renamed to .nfs00000000059b082b00000002, and once it reached 500GB it was deleted.
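For what it's worth, the same operation can be reproduced outside the Proxmox GUI. This is roughly what we plan to run against the sto01 mount from one of the Proxmox servers, to see whether the stall is in NFS itself or in Proxmox's cfs lock handling (the path and size are copied from the failing task above, so treat this as a sketch):

Code:
# Roughly reproduce the failing operation by hand over the sto01 mount and time it
# (path and size taken from the task log above; run from pve01 or pve02)
time qemu-img create -f qcow2 -o preallocation=metadata \
    /netfs/nfs/sto01/proxmox/images/3000/vm-3000-disk-0.qcow2 500G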
Because we're mounting the NFS shares manually (through autofs) on the Proxmox servers, the storage in Proxmox is just set up as a directory. When we initially set these up, we tried to configure them as NFS storage directly in Proxmox. We ran into an issue with that, but it was a while ago so I don't recall the details (I think the mounts were disappearing regularly).
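For completeness, defining the storage natively in Proxmox instead of via autofs would look roughly like the following; the storage ID and mount path are just placeholders, and this is a sketch of the pvesm syntax rather than our exact old configuration:

Code:
# Sketch: add sto01 as native NFS storage in Proxmox (storage ID and path are placeholders)
pvesm add nfs sto01-nfs --path /mnt/pve/sto01-nfs --server 172.31.2.17 \
    --export /opt/storage/network/proxmox --content images --options vers=4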
What's confusing is that each pair of servers (both Proxmox and storage) is set up identically, so I don't know A) why this is happening with sto01 but not sto02, and B) why it started all of a sudden (which worries me, because if we haven't resolved the root cause and it happens to sto02 for some reason, then we're out of storage servers for the VMs).
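In case it helps narrow that down, this is the sort of side-by-side comparison we can run on sto01 and sto02 (assuming software RAID via mdadm and that the array is /dev/md0; a hardware controller would have its own tooling):

Code:
# Compare NFS server activity and array/drive health on both storage servers
nfsstat -s                  # server-side NFS operation counters
cat /proc/mdstat            # quick RAID state / resync check
mdadm --detail /dev/md0     # assumes the RAID6 array is /dev/md0
smartctl -a /dev/sda        # repeat for each SSD member; watch wear, reallocated, and error counters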
My /etc/exports files on the storage servers look like this:
Code:
/opt/storage/network 172.31.2.0/255.255.255.0(rw,no_root_squash,subtree_check)
/opt/storage/network/proxmox 172.31.2.0/255.255.255.0(rw,no_root_squash,subtree_check)
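One change we're considering on the export side (a sketch, not something we've tested yet) is switching to no_subtree_check, which is the exportfs default and avoids the overhead and rename quirks of subtree checking, then re-exporting with exportfs -ra:

Code:
# Sketch: same exports with subtree checking disabled; apply with `exportfs -ra`
/opt/storage/network         172.31.2.0/255.255.255.0(rw,no_root_squash,no_subtree_check)
/opt/storage/network/proxmox 172.31.2.0/255.255.255.0(rw,no_root_squash,no_subtree_check)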
And my autofs configs to mount them on the Proxmox servers look like this:
Code:
sto01 -fstype=nfs4,rw,soft,intr,rsize=8192,wsize=8192 172.31.2.17:/opt/storage/network
sto02 -fstype=nfs4,rw,soft,intr,rsize=8192,wsize=8192 172.31.2.18:/opt/storage/network
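On the client side, the main things I'm looking at changing (again, a sketch we haven't rolled out yet) are the 8K rsize/wsize, which is very small for 10GbE, and the soft option, which can surface I/O errors to the VMs on timeouts; intr has been a no-op on modern kernels anyway. Dropping the explicit rsize/wsize lets NFSv4 negotiate them (typically 1MB over TCP):

Code:
# Sketch: hard mounts, no pinned rsize/wsize (NFSv4 negotiates transfer sizes, typically 1MB)
sto01 -fstype=nfs4,rw,hard,timeo=600 172.31.2.17:/opt/storage/network
sto02 -fstype=nfs4,rw,hard,timeo=600 172.31.2.18:/opt/storage/network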
Any insight into how I can change my configuration to get the expected performance would be greatly appreciated! And if there's any information I missed that I can provide, please let me know.