NFS mount drops randomly - and won't reconnect

guy_123
I'm running a TrueNAS SCALE server on another machine that serves NFS shares to my network. One Proxmox server on my network (v8.0.3) suddenly drops one NFS mount and won't reconnect it. Every other machine on my network can connect to and use that exact same share without issue.

I'm not sure where to look for the problem. Rebooting the machine makes no difference.

I thought it might be a hardware issue, so I replaced the NIC, but the issue persists.

And it seems to be Proxmox itself (something with the NFS system, I'd guess), because VMs on that server can still connect to that TrueNAS server, and iperf3 works correctly.

Syslog shows:
Code:
Oct 15 08:54:31 pve pvestatd[3071]: unable to activate storage 'TrueNas-PVE' - directory '/mnt/pve/TrueNas-PVE' does not exist or is unreachable

What gives?

Thanks in advance.
 
What are the outputs of:
cat /etc/pve/storage.cfg
mount
showmount -e 'truenasip'
showmount -a 'truenasip'
rpcinfo 'truenasip'
Can you mount manually? e.g.: mkdir /tmp/test; mount truenasip:/mnt/test /tmp/test
If you are using hostname in storage.cfg, does the hostname resolve correctly?
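For a quick end-to-end test, something like the following exercises each of those layers in one go (a sketch; substitute your own server address and export path):
Code:
NAS=truenasip                       # TrueNAS IP or hostname
showmount -e "$NAS"                 # are the exports visible from this client?
rpcinfo -p "$NAS"                   # is rpcbind answering, with mountd/nfs registered?
mkdir -p /tmp/test
mount -t nfs "$NAS:/mnt/test" /tmp/test && ls /tmp/test && umount /tmp/test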


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Code:
root@pve:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content iso,backup,vztmpl

zfspool: local-zfs
        pool rpool/data
        content rootdir,images
        sparse 1

dir: ProxmoxVMs
        path /mnt/pve/NvmeDisk1
        content snippets,images,iso,rootdir,vztmpl
        prune-backups keep-all=1
        shared 0

zfspool: pvetank
        pool pvetank
        content images,rootdir
        mountpoint /pvetank
        sparse 0

nfs: TrueNas-PVE
        export /mnt/Tank/NFSshare/PVE
        path /mnt/pve/TrueNas-PVE
        server 192.168.0.108
        content vztmpl,rootdir,backup,snippets,images,iso
        prune-backups keep-all=1
Code:
root@pve:~# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=32842540k,nr_inodes=8210635,mode=755,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=6575368k,mode=755,inode64)
rpool/ROOT/pve-1 on / type zfs (rw,relatime,xattr,noacl)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k,inode64)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=23196)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
ramfs on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
ramfs on /run/credentials/systemd-sysctl.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
ramfs on /run/credentials/systemd-tmpfiles-setup-dev.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
/dev/nvme0n1p1 on /mnt/pve/NvmeDisk1 type ext4 (rw,relatime)
rpool on /rpool type zfs (rw,relatime,xattr,noacl)
rpool/ROOT on /rpool/ROOT type zfs (rw,relatime,xattr,noacl)
rpool/data on /rpool/data type zfs (rw,relatime,xattr,noacl)
pvetank on /pvetank type zfs (rw,xattr,noacl)
pvetank/subvol-104-disk-0 on /pvetank/subvol-104-disk-0 type zfs (rw,xattr,posixacl)
pvetank/vzdump on /mnt/vztmp type zfs (rw,xattr,posixacl)
ramfs on /run/credentials/systemd-tmpfiles-setup.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
sunrpc on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
192.168.0.108:/mnt/Tank/NFSshare/PVE on /mnt/pve/TrueNas-PVE type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.0.108,mountvers=3,mountport=35091,mountproto=udp,local_lock=none,addr=192.168.0.108)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=6575364k,nr_inodes=1643841,mode=700,inode64)
Code:
root@pve:~# showmount -e 192.168.0.108
Export list for 192.168.0.108:
/mnt/HugeStorage/VMDisks   *
/mnt/HugeStorage/Music     *
/mnt/SmallerStorage/NFL    *
/mnt/Tank/4kMovies         *
/mnt/HugeStorage/Downloads *
/mnt/Shucked8TB/Movies     *
/mnt/Tank/TV_Shows         *
/mnt/Tank/NFSshare         *
Code:
root@pve:~# showmount -a 192.168.0.108
All mount points on 192.168.0.108:
192.168.0.125:/mnt/Shucked8TB/Movies
192.168.0.125:/mnt/Tank/4kMovies
192.168.0.12:/mnt/Tank/NFSshare
192.168.0.12:/mnt/Tank/NFSshare/PVE
192.168.0.27:/mnt/Tank/NFSshare/PVE
192.168.0.2:/mnt/HugeStorage/Downloads
192.168.0.2:/mnt/Shucked8TB/Movies
192.168.0.2:/mnt/Tank/TV_Shows
192.168.0.5:/mnt/HugeStorage/Music
192.168.0.5:/mnt/Tank/NFSshare
Code:
root@pve:~# rpcinfo 192.168.0.108
rpcinfo: can't contact rpcbind: : RPC: Timed out

And manually mounting the share with mount -t nfs 192.168.0.108:/mnt/Tank/NFSshare/PVE /mnt/test/ timed out and never completed.

Not sure what to do next... it seems there's a problem with RPC timing out?
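For reference, rpcbind reachability can be probed directly over each transport, which helps separate "port filtered" from "service down" (a sketch):
Code:
# Is TCP port 111 (rpcbind) reachable at all?
nc -vz 192.168.0.108 111
# Query the portmapper explicitly over TCP, then UDP
rpcinfo -T tcp 192.168.0.108 portmapper
rpcinfo -T udp 192.168.0.108 portmapper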
 
Yes, that would be an issue because pvedaemon is using "rpcinfo" to query the health of the NFS server.
The cause can be anywhere in the path: bad cable, bad port, bad channel config, firewall, etc.
You may need to use tcpdump to see where packets are being dropped.
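A starting point for that capture, run on the PVE host (a sketch; vmbr0 is an example, use whichever interface carries the storage traffic):
Code:
# Watch rpcbind (111) and NFS (2049) traffic between the host and the NAS
tcpdump -ni vmbr0 host 192.168.0.108 and '(port 111 or port 2049)'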


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
OK, but if VMs on that machine can connect and transfer data from that same server, it can't be a cable or switch issue, can it?
Is there anything else that could cause the RPC timeout?
And thanks for your help so far! Really appreciate it.
 
I don't know what your network looks like. Maybe you have a dedicated NIC for the VMs? Maybe they are on a different subnet? Maybe there is an LACP channel and the VMs hash to a different port than the host? Maybe there is a bad chip on the NIC that drops packets only for the PVE/NAS combination? I've seen things... Maybe you need to reboot your NAS or restart NFS there.

My goal was to demonstrate that this could be caused by a wide range of underlying network issues or misconfigurations. It's up to you to eliminate things as you see fit.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thanks, but I'm trying to narrow down what the issue might be using a logical process. It can't be cable/hardware related if the VMs assigned to that interface work but Proxmox itself doesn't. Yes, I have a dedicated interface for the VMs and another for the management interface; one is 10Gb, the other a regular gigabit NIC.

It's got to be something to do with the NFS client on Proxmox, I'm guessing, or a firewall issue... but I don't have the firewall enabled, and iptables -L shows an ACCEPT policy across the board.
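For completeness, rules can also live directly in nftables, which iptables -L won't show, so a quick check of both backends looks something like this (a sketch):
Code:
nft list ruleset        # any nftables rules loaded?
pve-firewall status     # state of Proxmox's own firewall service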

Here's an interesting thing... running iperf3 -c 192.168.0.108 (my TrueNAS server) gives the following results:
Code:
Connecting to host 192.168.0.108, port 5201
[  5] local 192.168.0.27 port 50584 connected to 192.168.0.108 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   323 KBytes  2.65 Mbits/sec    3   8.74 KBytes
[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   8.74 KBytes
[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   8.74 KBytes
[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    1   8.74 KBytes
[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   8.74 KBytes
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   8.74 KBytes
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   8.74 KBytes
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   8.74 KBytes
[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   8.74 KBytes
[  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec    0   8.74 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   323 KBytes   265 Kbits/sec    6             sender
[  5]   0.00-10.04  sec  0.00 Bytes  0.00 bits/sec                  receiver

This does the exact same thing after repeated attempts.

But from one of my VMs on that machine I get this:

Code:
Connecting to host 192.168.0.108, port 5201
[  4] local 192.168.0.66 port 50196 connected to 192.168.0.108 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   273 MBytes  2.29 Gbits/sec
[  4]   1.00-2.00   sec   473 MBytes  3.97 Gbits/sec
[  4]   2.00-3.00   sec   501 MBytes  4.20 Gbits/sec
[  4]   3.00-4.00   sec   468 MBytes  3.92 Gbits/sec
[  4]   4.00-5.00   sec   486 MBytes  4.08 Gbits/sec
[  4]   5.00-6.00   sec   219 MBytes  1.84 Gbits/sec
[  4]   6.00-7.00   sec   378 MBytes  3.18 Gbits/sec
[  4]   7.00-8.00   sec   471 MBytes  3.94 Gbits/sec
[  4]   8.00-9.00   sec   419 MBytes  3.52 Gbits/sec
[  4]   9.00-10.00  sec   342 MBytes  2.87 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-10.00  sec  3.94 GBytes  3.38 Gbits/sec                  sender
[  4]   0.00-10.00  sec  3.94 GBytes  3.38 Gbits/sec                  receiver

So clearly the NIC itself (and therefore the cable and switch) is working.

It's only NFS and traffic from Proxmox itself that isn't... this really has me stumped. Any additional thoughts?
 
You are getting 2.65 Mbits/sec on a gigabit interface, with retries (lost packets) - that, to me, does not indicate that everything on that path is working...

Keep in mind that PVE uses Debian as its underlying OS. The NFS client is a standard Linux package used on millions of hosts at this moment. I'd buy a Powerball ticket if you think you've found an NFS client bug.

If your VMs sit on a 10G bridge and your PVE sits on 1G to the same network (multiple switches?), then they are not using the same paths...

good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I didn't say I thought there was a bug in the NFS client... more likely there's a software issue on my server, maybe causing a conflict somehow.

And I'm NOT getting 2.65 Mbits/sec on a 10G interface... that's what reveals the problem: no data is transferred from Proxmox to my NAS. It starts out OK, but then drops to zero, as in the results below (from the NAS side).

And yes, I have multiple switches, but they are all on the same network.

Do you have any idea why I might see the following when running iperf3 from the Proxmox server to my NAS?

Code:
Accepted connection from 192.168.0.27, port 58328
[  5] local 192.168.0.108 port 5201 connected to 192.168.0.27 port 58330
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  0.00 Bytes  0.00 bits/sec
[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec
[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec
[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec
[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec
[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec
[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec
[  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  0.00 Bytes  0.00 bits/sec                  receiver

It makes a connection, but no data is transferred... I also moved the cable from my 10G switch to my 1G switch just to eliminate that as a probable cause, but got the exact same results.

And I've also replaced the 10G NIC on both ends of that connection (my NAS and the Proxmox server in question); again, same results.

So strange.
 
No, I don't have a definitive answer for you. There are many variables in play, even more now than when you started the thread: multiple switches, inter-switch communication, multiple NICs, routing, etc. Is this a home or a business environment? If the latter, is there a network group who can assist?

If it's the former, start with a process of elimination and stick to basic TCP/IP, i.e. iperf3 and ICMP ping. Run direct cables between your servers for a baseline. Check your MTU across the entire path.
Look for a guide on using tcpdump for troubleshooting network connectivity.
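For the MTU check, non-fragmenting pings of increasing payload size will show where the path breaks (a sketch; the sizes assume 1500 and 9000 MTU):
Code:
# 1472-byte payload + 28 bytes of headers = 1500; should pass on a standard-MTU path
ping -c 3 -M do -s 1472 192.168.0.108
# 8972 + 28 = 9000; only passes if jumbo frames are configured end to end
ping -c 3 -M do -s 8972 192.168.0.108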

Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Just wanted to say thank you for all your help. I did figure out the issue: it turned out to be an IP address/gateway assigned to both NICs in my server when only one should have had it, since they are all on the same subnet.
 
My pleasure. I should have asked for your network configuration output; we might have arrived at a solution more quickly.
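For anyone who lands here with the same symptoms: a duplicate address or gateway usually shows up immediately in output like this (a sketch):
Code:
ip -br addr                     # one line per NIC with its addresses
ip route                        # duplicate default/same-subnet routes stand out here
cat /etc/network/interfaces     # the persistent PVE network config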


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
