NFS Problems

smitty4x4
Dec 2, 2021
Hey all,

I am trying to get Proxmox set up in order to migrate off of VMware, but I'm running into issues with our NFS mounts.

Currently we have two NFS mounts, a FreeNAS and a NetApp. The FreeNAS one mounts just fine. The NetApp mounts, but gives constant hangs and timeouts when we try to access it. We are not seeing this on the VMware side. Any help would be appreciated. The storage is on an isolated 10G network.

Nov 30 12:27:18 ntp-clo1-proxmox-01 kernel: [11883.267099] nfs: server 10.10.0.55 not responding, still trying
Nov 30 12:27:18 ntp-clo1-proxmox-01 kernel: [11883.267101] nfs: server 10.10.0.55 not responding, still trying
Nov 30 12:58:00 ntp-clo1-proxmox-01 kernel: [13725.003257] nfs: RPC call returned error 512
Nov 30 12:58:00 ntp-clo1-proxmox-01 kernel: [13725.003257] nfs: RPC call returned error 512
Nov 30 12:58:00 ntp-clo1-proxmox-01 kernel: [13725.003276] nfs: RPC call returned error 512
Nov 30 14:20:23 ntp-clo1-proxmox-01 kernel: [18668.248337] nfs: server 10.10.0.55 not responding, still trying
Nov 30 14:20:55 ntp-clo1-proxmox-01 kernel: [18700.015082] nfs: server 10.10.0.55 OK

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.0.52 netmask 255.255.255.0 broadcast 0.0.0.0
inet6 fe80::eef4:bbff:fed6:acb0 prefixlen 64 scopeid 0x20<link>
ether ec:f4:bb:d6:ac:b0 txqueuelen 1000 (Ethernet)
RX packets 40711809 bytes 59097891957 (55.0 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 20307064 bytes 28209187404 (26.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

root@ntp-clo1-proxmox-01:~# nfsstat -rc
Client rpc stats:
calls retrans authrefrsh
717806 100 717808

showmount -e 10.10.0.55
Export list for 10.10.0.55:
/ (everyone)
 
Which PVE version are you running? Please provide the complete output of `pveversion -v`.
 
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-1-pve)
pve-manager: 7.1-4 (running version: 7.1-4/ca457116)
pve-kernel-5.13: 7.1-4
pve-kernel-helper: 7.1-4
pve-kernel-5.13.19-1-pve: 5.13.19-2
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.14-1
proxmox-backup-file-restore: 2.0.14-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.4-2
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-1
pve-qemu-kvm: 6.1.0-2
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-3
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3
 
Do you specify the NFS version when mounting the storage?
If you've configured it via the GUI you can select the version in the `Advanced` options.
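For completeness, the same setting can also be applied from the CLI; a minimal sketch, assuming the storage is named ColoDS01 as in the later posts (the version ends up as an `options` line in /etc/pve/storage.cfg):

Code:
# Pin the existing NFS storage to protocol version 3 (storage name is an example)
pvesm set ColoDS01 --options vers=3

# Verify what was written to the storage definition
grep -A6 'nfs: ColoDS01' /etc/pve/storage.cfg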
 
I have tried forcing NFS v3 from the `Advanced` options and still no joy. It mounts fine; it's just when I go to `ls` inside the mounted directory that it hangs.
 
Could you provide your network config? (`cat /etc/network/interfaces`)
Is the firewall active?
 
The firewall is not running. eno1 is the storage NIC. I haven't set up redundant links yet; that will wait until I can get this sorted. MTU is 9000 across the switch and everything connected (a jumbo-frame check is sketched at the end of this post).

Code:
auto eno1
iface eno1 inet static
    address 10.10.0.52/24
    mtu 9000
#Storage

iface eno2 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.7.0.23/24
    gateway 10.7.0.1
    bridge-ports eno4
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

Code:
pve-firewall status
Status: disabled/running
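Since the whole storage path is supposed to run with MTU 9000, a quick sanity check that jumbo frames actually pass end to end; a sketch, assuming the NetApp is 10.10.0.55 (8972 bytes of ICMP payload plus IP/ICMP headers adds up to 9000):

Code:
# Send non-fragmentable jumbo-sized pings to the NFS server; this fails if any hop drops jumbo frames
ping -M do -s 8972 -c 4 10.10.0.55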
 
Is there anything in the logs on the NFS server?
You could try running `tcpdump -envi eno1 -w client.pcap` on the host, and the same on the server with the right interface instead of `eno1`. Then mount the storage and run some commands.
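A minimal capture sketch along those lines, assuming the NetApp is 10.10.0.55 and the storage NIC is eno1 as above (the filter keeps the file small; drop the port match if you also want portmapper/mountd traffic for NFSv3):

Code:
# Capture only traffic to/from the NFS server on the storage NIC, full packets, into client.pcap
tcpdump -ni eno1 -s 0 -w client.pcap host 10.10.0.55 and port 2049
# ...reproduce the hang (mount, ls, etc.), then stop the capture with Ctrl+C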
 
Nothing jumps out on the NetApp side. I see the following on the Proxmox side:

[520550.054265] nfs: server 10.10.0.55 not responding, still trying
[520599.186645] nfs: server 10.10.0.55 OK
[520979.470599] nfs: RPC call returned error 512
[520979.470606] nfs: RPC call returned error 512
[521075.011006] nfs: RPC call returned error 512


I did a tcpdump and I see communication and the like, but this jumped out:
V3 Access Reply [Access Denied: XAR XAW XAL], [Allowed: RD LU MD XT DL]

What doesn't make sense is that this host WAS a VMware host before I put Proxmox on it, and it had no trouble accessing the NFS storage on the NetApp.
Same IP, same network port configuration, etc. It was simply rebuilt as Proxmox.

If you have any other ideas, that would be great.
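To dig further into those ACCESS replies, the capture from the tcpdump suggestion above can be filtered down to the decoded NFS traffic; a sketch using tshark, the CLI companion to Wireshark (client.pcap matches the earlier example):

Code:
# Show only NFS calls/replies from the capture; look for the ACCESS procedure and its denied bits
tshark -r client.pcap -Y nfs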
 
Another odd thing: I rebooted and then issued:

Code:
pvesm status

Name              Type     Status           Total            Used       Available        %
ColoDS01           nfs     active     22548578304     21256283072      1292295232   94.27%
FREEnasDS2         nfs     active     49301499392     10922451200     38379048192   22.15%
local              dir     active        98497780         3074756        90373476    3.12%
local-lvm      lvmthin     active      1793060864               0      1793060864    0.00%

Code:
ls /mnt/pve/ColoDS01/template
cache  iso

Then it hung when I did `ls /mnt/pve/ColoDS01/`

Code:
pvesm status
got timeout
unable to activate storage 'ColoDS01' - directory '/mnt/pve/ColoDS01' does not exist or is unreachable
Name              Type     Status           Total            Used       Available        %
ColoDS01           nfs   inactive               0               0               0    0.00%
FREEnasDS2         nfs     active     49301498240     10923109888     38378388352   22.16%
local              dir     active        98497780         3074772        90373460    3.12%
local-lvm      lvmthin     active      1793060864               0      1793060864    0.00%
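When the storage flips to inactive like this, it can help to look at the kernel's view of the mount directly instead of going through pvesm; a minimal sketch (storage path as in the output above):

Code:
# Show the NFS mounts and the options the kernel actually negotiated
findmnt -t nfs,nfs4
nfsstat -m

# Check whether a plain listing hangs, with a timeout so the shell does not get stuck
timeout 10 ls /mnt/pve/ColoDS01 || echo 'listing timed out'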
 
Does it work if you mount it manually via the mount command?
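For reference, a manual mount sketch against the same export (mount point and options are examples; if it hangs, a lazy unmount may be needed):

Code:
mkdir -p /mnt/test-netapp
mount -t nfs -o vers=3,tcp 10.10.0.55:/ /mnt/test-netapp
ls /mnt/test-netapp
umount /mnt/test-netapp   # or: umount -l /mnt/test-netapp if it is stuck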
 
Same result if I mount it manually. After looking at the storage, I believe this may be a bug on the NetApp side; well, a bug plus the fact that Proxmox is using a newer kernel. Confirmed we are still on ONTAP 9.1. Looking to schedule an upgrade and will report back if this fixes things.
 
Not sure if this is a Proxmox or TrueNAS/Ganesha issue, but I just switched my TrueNAS Core server over to TrueNAS Scale (22.02-RC.1-2, kernel 5.10.70+truenas) for testing and ran into a similar issue.
I mount an NFS share (NFSv4 with Kerberos) in Proxmox and my VM backup job targets this storage. The share is mounted successfully and I can see content with `ls` and the Proxmox web GUI. Copying a file from the NFS share to local storage works fine. But my backups stop after "INFO: transferred XX GiB in xx seconds" and before getting to the "INFO: archive file size: XX GiB" line.
The zst archive is never created, but a .dat file is present on the share (slightly larger than the local zst backup). In this state, trying to `ls` the share does not work either (it hangs as well), and neither does listing the contents in the Proxmox web GUI.
Some log lines:
Code:
kernel: rpc_check_timeout: 35 callbacks suppressed
nfs: server truenas01.ipa.mydomain.com not responding, still trying
nfs: RPC call returned error 512
#  this one could have been when I force unmounted the share:
pvedaemon[2189736]: Warning: unable to close filehandle GEN21432 properly: Bad file descriptor at /usr/share/perl5/PVE/VZDump/QemuServer.pm line 764.
It also hangs if I create a backup on local storage and then try to rsync/move it to the NFS share manually. It seems to get stuck after the transfer, during validation: the file has the same size as the source but is still shown with a temp filename, i.e. `.<filename>.XXXXX`.
When I was still on TrueNAS Core this ran fine, so it does not seem to be a network or hardware issue (compatibility, maybe?). Also, I can push and pull files to/from the same NFS share using the same mount options without an issue from a current Arch install (kernel 5.15.7), but it also hangs on a current CentOS 8 Stream install (4.18.0-348.2.1.el8_5.x86_64, actually a VM on the Proxmox host).
I had two backups finish while trying various things but could not replicate that reliably; usually the share just hangs. I tried mounting with NFSv4, 4.1, and 4.2, and tried adding the soft mount option, but no change.
It is as if the information that the transfer is finished never reaches the client (my uneducated guess).
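If the hang really is in the final flush/close (which is what the call trace further down suggests), it should be reproducible outside of vzdump with a plain write that forces a flush; a sketch, with the mount path being an example:

Code:
# Write ~1 GiB to the share and force it to be flushed before dd returns
dd if=/dev/zero of=/mnt/pve/mynfs/flushtest bs=1M count=1024 conv=fsync status=progress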
Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-8 (running version: 7.1-8/5b267f33)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-4.2.6-1-pve: 4.2.6-36
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

There have been many kernel commits regarding NFS since 5.13.x, so maybe that is why it works with 5.15.7.
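If you want to test that theory on the PVE side, there is an opt-in 5.15 kernel package, assuming it is already available in your configured repositories; a sketch:

Code:
apt update
apt install pve-kernel-5.15
# reboot into the new kernel, then confirm with:
uname -r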
edit:
Just did another test with the kernel 5.15.7 client and it got stuck as well when uploading.
First I downloaded a ~60GB file from the share, which worked fine. I then re-uploaded the same file to the share (different path/name), and when the transfer was done (local size = size on share) it got stuck. The client log shows:
Code:
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: INFO: task rsync:7398 blocked for more than 122 seconds.
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:       Tainted: G           OE     5.15.7 #1
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: task:rsync           state:D stack:    0 pid: 7398 ppid:  7397 flags:0x00000000
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: Call Trace:
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  <TASK>
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  __schedule+0x30f/0x930
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  schedule+0x59/0xc0
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  io_schedule+0x42/0x70
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  wait_on_page_bit_common+0x10e/0x390
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  ? filemap_invalidate_unlock_two+0x40/0x40
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  wait_on_page_writeback+0x22/0x80
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  __filemap_fdatawait_range+0x8b/0xf0
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  filemap_write_and_wait_range+0x85/0xf0
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  nfs_wb_all+0x22/0x120 [nfs]
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  nfs4_file_flush+0x6b/0xa0 [nfsv4]
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  filp_close+0x2f/0x70
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  __x64_sys_close+0xd/0x40
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  do_syscall_64+0x38/0x90
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RIP: 0033:0x7f2d91212fe7
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RSP: 002b:00007ffe4b97c888 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RAX: ffffffffffffffda RBX: 00007f2d90d76fe8 RCX: 00007f2d91212fe7
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: R10: 0000000000000080 R11: 0000000000000246 R12: 00007ffe4b97c9a0
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel: R13: 00000000ffffffff R14: 00007ffe4b97d9a0 R15: 0000000000000000
Dec 12 17:02:47 arch01.ipa.mydomain.com kernel:  </TASK>
 
Quick update: it appears to be caused by using sec=krb5p for the NFS share; krb5 and krb5i work fine. Possibly a Ganesha issue.
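For anyone hitting the same thing, the security flavour is just a mount option, so switching from krb5p to krb5i is a one-line change; a sketch with an example export path (the server name is the one from the logs above):

Code:
# krb5 = authentication only, krb5i = + integrity checking, krb5p = + privacy (encryption)
mkdir -p /mnt/test
mount -t nfs4 -o sec=krb5i truenas01.ipa.mydomain.com:/mnt/tank/backups /mnt/test   # export path is an example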
 
