NetApp & Proxmox VE

To answer the ask re: storage.cfg: IPs have been removed, and the NFS version varies between shares purely for troubleshooting at this point.
I see no glaring issues. Going to test dd ASAP (see the sketch after the listings below).

root@PVE-VX01:~# cat /etc/pve/storage.cfg
dir: local
    path /var/lib/vz
    content iso,backup,vztmpl

lvmthin: local-lvm
    thinpool data
    vgname pve
    content images,rootdir

rbd: Containers
    content rootdir,images
    krbd 0
    pool performance_pool

rbd: Container2
    content images
    krbd 0
    pool bulk_pool

cephfs: LogsISOsTemplatesEtc
    path /mnt/pve/LogsISOsTemplatesEtc
    content vztmpl,backup,iso
    fs-name LogsISOsTemplatesEtc

nfs: Priority1-001
    export /NFS_P1_001
    path /mnt/pve/Priority1-001
    server x.x.x.x
    content images,rootdir
    options vers=4.2
    prune-backups keep-all=1

nfs: Priority1-002
    export /NFS_P1_002
    path /mnt/pve/Priority1-002
    server x.x.x.x
    content images
    options vers=4.2
    prune-backups keep-all=1

nfs: Priority1-003
    export /NFS_P1_003
    path /mnt/pve/Priority1-003
    server x.x.x.x
    content images
    prune-backups keep-all=1

nfs: Priority2-001
    export /NFS_P2_001
    path /mnt/pve/Priority2-001
    server x.x.x.x
    content images
    options vers=4.2
    prune-backups keep-all=1

nfs: Priority2-002
    export /NFS_P2_002
    path /mnt/pve/Priority2-002
    server x.x.x.x
    content images
    options vers=4.2
    prune-backups keep-all=1

nfs: Priority2-003
    export /NFS_P2_003
    path /mnt/pve/Priority2-003
    server x.x.x.x
    content images
    options vers=4.2
    prune-backups keep-all=1

nfs: Priority2-004
    export /NFS_P2_004
    path /mnt/pve/Priority2-004
    server x.x.x.x
    content images
    prune-backups keep-all=1

nfs: Priority2-005
    export /NFS_P2_005
    path /mnt/pve/Priority2-005
    server x.x.x.x
    content images
    options vers=4.2
    prune-backups keep-all=1

nfs: Priority2-006
    export /NFS_P2_006
    path /mnt/pve/Priority2-006
    server x.x.x.x
    content images
    options vers=4.2
    prune-backups keep-all=1

nfs: Priority2-007
    export /NFS_P2_007
    path /mnt/pve/Priority2-007
    server x.x.x.x
    content images
    prune-backups keep-all=1

nfs: Priority3-001
    export /NFS_P3_001
    path /mnt/pve/Priority3-001
    server x.x.x.x
    content images
    options vers=4.1
    preallocation off
    prune-backups keep-all=1

nfs: Priority3-002
    export /NFS_P3_002
    path /mnt/pve/Priority3-002
    server x.x.x.x
    content images
    prune-backups keep-all=1

nfs: Priority3-003
    export /NFS_P3_003
    path /mnt/pve/Priority3-003
    server x.x.x.x
    content images
    options vers=4.2
    prune-backups keep-all=1

nfs: Priority3-004
    export /NFS_P3_004
    path /mnt/pve/Priority3-004
    server x.x.x.x
    content images
    options vers=4.2
    prune-backups keep-all=1

showmount shows:
Export list for
/NFS_P1_001 (everyone)
/NFS_P1_002 (everyone)
/NFS_P1_003 (everyone)
/NFS_P2_001 (everyone)
/NFS_P2_002 (everyone)
/NFS_P2_003 (everyone)
/NFS_P2_004 (everyone)
/NFS_P2_005 (everyone)
/NFS_P2_006 (everyone)
/NFS_P2_007 (everyone)
/NFS_P3_001 (everyone)
/NFS_P3_002 (everyone)
/NFS_P3_003 (everyone)
/NFS_P3_004 (everyone)
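
For the dd test mentioned up top, a minimal sketch of what I have in mind: a sequential write and read-back against one of the NFS mounts (paths are from the config above; block size, count, and the direct-I/O flags are my choices, not anything prescribed):

root@PVE-VX01:~# # sequential write with O_DIRECT so the page cache doesn't mask NFS throughput
root@PVE-VX01:~# dd if=/dev/zero of=/mnt/pve/Priority1-001/ddtest.bin bs=1M count=4096 oflag=direct status=progress
root@PVE-VX01:~# # read it back the same way
root@PVE-VX01:~# dd if=/mnt/pve/Priority1-001/ddtest.bin of=/dev/null bs=1M iflag=direct status=progress
root@PVE-VX01:~# rm /mnt/pve/Priority1-001/ddtest.bin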
 
Chiming in; I work with Brad. MTUs are 9216 everywhere on our "storage" VLAN. All shares were added via the GUI. We've tried disabling Kerberos as suggested in another thread (and disabling only certain NFS versions). We've also set individual share permissions to 777, with no change.

The initial add was weird: it would sit for a long time, around 20-30 seconds, before showing the NFS share list, sometimes only after restarting the query by re-opening the drop-down. Once the list popped up, the share added immediately.
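
For what it's worth, the export list the GUI shows comes from the same scan you can run by hand, so timing it from the CLI shows whether the stall is in the scan itself (a sketch; <netapp-ip> stands in for the removed address):

root@PVE-VX01:~# # export scan the storage dialog relies on
root@PVE-VX01:~# time pvesm scan nfs <netapp-ip>
root@PVE-VX01:~# # raw NFS export listing for comparison
root@PVE-VX01:~# time showmount -e <netapp-ip>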
When you say MTU 9216 everywhere, is that only on the switches, or also on the interfaces? Many systems only allow an MTU of 9000 for data packets.
Have you ever done ping tests with a full-size packet, e.g. with MTU 9000, a ping with packet size 8972?
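
A sketch of that test from a PVE node: -M do sets the don't-fragment flag, and 8972 is 9000 minus the 20-byte IP header and 8-byte ICMP header (<netapp-ip> stands in for the removed address):

root@PVE-VX01:~# # full-size frame, fragmentation forbidden; should get replies end to end
root@PVE-VX01:~# ping -M do -s 8972 -c 4 <netapp-ip>
root@PVE-VX01:~# # one byte over; should fail locally with "message too long" if the MTU is 9000
root@PVE-VX01:~# ping -M do -s 8973 -c 4 <netapp-ip>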
 
When you say MTU 9216 everywhere, is that only on the switches, or also on the interfaces? Many systems only allow an MTU of 9000 for data packets.
Have you ever done ping tests with a full-size packet, e.g. with MTU 9000, a ping with packet size 8972?
If their nodes are set to 9216, I would even go up to 9188 in the ping test.

Are you using LACP or another network HA technology? Drop down to a single cable/path.
Can you install vanilla Ubuntu or Debian on the same type of server with the exact same network config? Does it work?

We've seen some very bizarre network issues; some were caused by a bad NIC chip, others by a bad MLAG cable between core switches that affected only one particular flow in one rack.

Best of luck in your troubleshooting.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
When you say MTU 9216 everywhere, is that only on the switches, or also on the interfaces? Many systems only allow an MTU of 9000 for data packets.
Have you ever done ping tests with a full-size packet, e.g. with MTU 9000, a ping with packet size 8972?
We have solved this; Falk was right. My first thought was that it couldn't be an issue, since we mount NFS shares from the NetApp on Linux VMs regularly in our environment, but then it dawned on me that those VMs likely weren't configured for jumbo frames. @Bradomski ran with this first thing this morning with our NetEng, and the test you recommended showed immediate issues.

Our general setup:
- We are running bonded 802.3ad 25 Gbps DACs at 9216 with no issues for cluster traffic and the Ceph VLAN (tested extensively during setup with dozens of parallel iperf streams; see the sketch below)
- Bonded 10 Gbps fiber for corporate network access, MTU 1500
- The storage network VLAN is configured identically to the cluster traffic VLAN, and on the switch everything showed 9216
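
For reference, the iperf testing mentioned above was along these lines (a sketch; iperf3 with parallel streams, <peer-ip> standing in for another node on the storage bond):

root@PVE-VX01:~# # far node runs: iperf3 -s
root@PVE-VX01:~# # 12 parallel streams for 30 seconds to saturate the bond
root@PVE-VX01:~# iperf3 -c <peer-ip> -P 12 -t 30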

What happened:
The switchports for the storage VLAN all *showed* 9216, but that was a failure to standardize/communicate on our part. Our primary NetEng hadn't reduced those interfaces to 9000, even though the NetApp and our production ESXi cluster had their interfaces on that VLAN set to 9000. You would think an MTU mismatch would just cause fragmentation rather than break connections, but there is no feedback mechanism for the receiving interface to tell the sender to fragment, so oversized frames are simply dropped and the connection breaks.

Changed the bond1 storage interface MTU to 9000 and it works beautifully. We are migrating our low-impact VMs as we speak. Good catch, Falk!
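
For anyone finding this later, the change amounts to setting the MTU on the bond in /etc/network/interfaces and reloading; a sketch with names partly guessed (the slave NIC names are hypothetical, and if the storage address lives on a bridge or VLAN interface above the bond, its mtu has to drop to 9000 as well):

# /etc/network/interfaces (excerpt)
auto bond1
iface bond1 inet manual
    bond-slaves enp65s0f0 enp65s0f1   # hypothetical 25G slave NICs
    bond-mode 802.3ad
    bond-miimon 100
    mtu 9000

root@PVE-VX01:~# ifreload -a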

We decided not to change the NetApp MTU to match as it's a live production system, and the difference shouldn't have a noticeable impact on performance.
 
If their nodes are set to 9216, I would even go up to 9188 in the ping test.

Are you using LACP or another network HA technology? Drop down to a single cable/path.
Can you install vanilla Ubuntu or Debian on the same type of server with the exact same network config? Does it work?

We've seen some very bizarre network issues; some were caused by a bad NIC chip, others by a bad MLAG cable between core switches that affected only one particular flow in one rack.

Best of luck in your troubleshooting.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Yeah, we were very certain that wasn't an issue; this is an under-warranty VxRail cluster that was previously running ESXi and vSAN using all ports in our production environment with no issues. Our 640s that now run production might have been suspect, as they're a generation old, but it's unlikely we had card issues from the VxRail servers sitting in a rack doing nothing for months.
 
You would think an MTU mismatch would just cause fragmentation rather than break connections, but there is no feedback mechanism for the receiving interface to tell the sender to fragment, so oversized frames are simply dropped and the connection breaks.
I would not discount any sort of bizarre symptoms when there is an MTU mismatch.

Glad the issue in the network was easily fixable and did not require a re-architecture of the infrastructure.

Enjoy Proxmox


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I would not discount any sort of bizarre symptoms when there is an MTU mismatch.

Glad the issue in the network was easily fixable and did not require a re-architecture of the infrastructure.

Enjoy Proxmox


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Yeah, it can get weird. Our company's product involves moving live networks, and MTU is something we deal with regularly. It really was just a bad assumption on my part, based on switchport configurations that I didn't double-check against the prod-side stacks.
 
