PVE 8.4.1 NFS timeout with NetApp

M Anon

Mar 11, 2025
We have a NetApp box that we use to store VM disks for both VMware and Proxmox (we're transitioning away from VMware). Each platform has its own NFS share: VMware VM disks go on ESX_VMs, and Proxmox VM disks go on a separate share named PVE_VMs. Originally, the export settings for all shares were the same (NFS v3, v4, v4.1 and v4.2 allowed). We didn't notice any problems during initial testing with a couple of VMs on the Proxmox NFS share.

Now that we're moving more and more of our VMs from VMware to PVE, we've started seeing blue screens on our Windows VMs running on PVE. In the logs we see repeated "pvestatd[1889]: storage 'PVE_VMs' is not online" messages, roughly every 20 minutes. After some Googling, we tried the following steps, but neither changed anything:

1. Forced the PVE host to use NFSv3 while keeping the NetApp share on NFSv3/v4/v4.1/v4.2
2. Forced both the PVE host and the NetApp share to NFSv3 only
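To see how regular the "is not online" messages quoted above really are, one option is to pull the pvestatd lines out of the journal and measure the gaps between them. This is a minimal sketch, assuming journalctl's default short output format ("Mar 11 10:00:01 pve1 pvestatd[1889]: ..."), which has no year, so the year is passed in explicitly (2025 here is our assumption). The function names `offline_events` and `gaps` are hypothetical helpers, not part of Proxmox.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape (journalctl default "short" output), e.g.
# "Mar 11 10:00:01 pve1 pvestatd[1889]: storage 'PVE_VMs' is not online"
LINE_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ "
    r"pvestatd\[\d+\]: storage '(?P<storage>[^']+)' is not online"
)


def offline_events(lines, storage: str, year: int = 2025) -> list[datetime]:
    """Timestamps of pvestatd 'is not online' messages for one storage ID."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("storage") == storage:
            # Collapse padding such as "Mar  1"; the syslog prefix carries no year.
            stamp = " ".join(m.group("stamp").split())
            events.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return events


def gaps(events: list[datetime]) -> list[timedelta]:
    """Intervals between consecutive events, to check how periodic the outages are."""
    return [later - earlier for earlier, later in zip(events, events[1:])]
```

For example, after `journalctl -u pvestatd > pvestatd.log`, calling `gaps(offline_events(open("pvestatd.log"), "PVE_VMs"))` would list the intervals. A very regular spacing would hint at something periodic on the network or storage side rather than random load.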

We don't have any problems with the VMware hosts, only the Proxmox ones, so right now we're convinced it's Proxmox. Note that the server we're using for Proxmox was originally a VMware host and is cabled the same way as the remaining ESXi hosts. We also have a mirror of this setup in another server room, with its own PVE host, ESXi host, NetApp and switch (an exact copy of ours), and it shows the same symptoms. MTUs are all kept at 1500. We use one LACP bond for storage (2 x 10Gbps), a separate LACP bond for management, backup and migration (2 x 1Gbps), and another separate LACP bond for production (2 x 1Gbps).

Has anyone encountered this before? Any pointers where else to look or what else to try?

We do have a support subscription, but their timezone is very different from ours, so I thought I'd ask here while waiting for support to reply.
 
Try a simpler bond mode, e.g. mode 0 (balance-rr), set on both the NetApp and PVE, and remove the corresponding LACP bonding from the switch ports.
By the way, I would still mount with v4.2 on PVE while allowing v3 through v4.2 on the NetApp, because PVE sends NFSv3 "showmount" requests for pvestatd. Instead of a NetApp, we use Rocky Linux 9 to serve all VM images via v4.2 to our PVE cluster.
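Because the reply points at the NFSv3 showmount requests behind pvestatd's status check, one way to catch intermittent failures independently of pvestatd is to run the same kind of query in a loop from the PVE host and log every failure. This is a minimal sketch, assuming the `showmount` tool from nfs-common is installed. The 10-second timeout and the helper names (`showmount_cmd`, `probe`, `watch`) are our own choices and are not taken from Proxmox's code.

```python
import subprocess
import time
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ProbeResult:
    ok: bool
    elapsed: float  # seconds until the command finished or was killed
    reason: str


def showmount_cmd(server: str) -> list[str]:
    """List the server's exports over the NFSv3 MOUNT protocol."""
    return ["showmount", "--no-headers", "--exports", server]


def probe(cmd: list[str], timeout: float = 10.0) -> ProbeResult:
    """Run one check; a timeout, missing binary or non-zero exit counts as failure."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return ProbeResult(False, time.monotonic() - start, "timeout")
    except FileNotFoundError:
        return ProbeResult(False, time.monotonic() - start, "command not found")
    elapsed = time.monotonic() - start
    if proc.returncode != 0:
        return ProbeResult(False, elapsed, f"exit {proc.returncode}")
    return ProbeResult(True, elapsed, "ok")


def watch(cmd: list[str], interval: float = 10.0, count: int = 6,
          timeout: float = 10.0) -> list[ProbeResult]:
    """Probe `count` times, `interval` seconds apart, printing each failure."""
    results = []
    for i in range(count):
        result = probe(cmd, timeout)
        results.append(result)
        if not result.ok:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} FAIL {result.reason} after {result.elapsed:.1f}s")
        if i < count - 1:
            time.sleep(interval)
    return results
```

For example, `watch(showmount_cmd("netapp-01"), interval=10, count=360)` would probe for about an hour, where `netapp-01` is a placeholder for the NetApp's NFS address. If the printed failure times line up with the "is not online" messages, that would suggest the MOUNT/portmapper path rather than the data path is what drops out.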