[SOLVED] NFS not online after NFS server reboot

freakits_jino
Sep 17, 2021
Hello,

We are currently facing an issue with an NFS mount on our cluster; we only use it for VM backups.

The issue started after our NFS server (Debian 11) was rebooted.

pvestatd is flooding the log with:
Sep 14 23:06:00 node01 pvestatd[3331]: storage 'nfsbackup' is not online
Sep 14 23:06:00 node01 pvestatd[3331]: status update time (10.175 seconds)
Sep 14 23:06:10 node01 pvestatd[3331]: storage 'nfsbackup' is not online
Sep 14 23:06:10 node01 pvestatd[3331]: status update time (10.176 seconds)
Sep 14 23:06:21 node01 pvestatd[3331]: storage 'nfsbackup' is not online
Sep 14 23:06:21 node01 pvestatd[3331]: status update time (10.177 seconds)


1. STORAGE.CFG file
root@node01:~# cat /etc/pve/storage.cfg
dir: local
path /var/lib/vz
content vztmpl,iso,backup

zfspool: local-zfs
pool rpool/data
content rootdir,images
sparse 1

rbd: ceph-NVMe
content rootdir,images
krbd 0
pool ceph-NVMe

nfs: nfsbackup
export /opt/DISK2/CruxCluster2_VM_BACKUP
path /mnt/pve/nfsbackup
server 192.168.13.16
content iso,backup
prune-backups keep-last=3

2. PVE VERSION
# pveversion
pve-manager/7.3-3/c3928077 (running kernel: 5.15.74-1-pve)

Kindly help us resolve this with minimum downtime, as this is a production cluster.
 
The first thing to do is to check that the NFS services are running properly and that you have network connectivity between the servers. There are no PVE-specific troubleshooting steps to take here; basic network/NFS connectivity between Linux servers is your target.

Things to do (example commands below):
- ping between hosts; if the MTU is non-standard, also test large ping sizes and check MTU consistency
- showmount -e NFS_IP
- showmount -a NFS_IP
- rpcinfo NFS_IP
- a manual NFS mount
etc.
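
For reference, a minimal sketch of these checks from a PVE node, using the server IP and export path from the storage.cfg above (the test mount point is illustrative):

ping -c 4 192.168.13.16
# if jumbo frames are in use, also test a large non-fragmenting payload
# (8972 bytes = 9000 MTU minus 28 bytes of ICMP/IP headers)
ping -c 4 -M do -s 8972 192.168.13.16
showmount -e 192.168.13.16        # list exports via the MOUNT protocol
showmount -a 192.168.13.16        # list current client mounts
rpcinfo -p 192.168.13.16          # list RPC services the portmapper advertises
mkdir -p /mnt/nfstest             # illustrative mount point
mount -t nfs 192.168.13.16:/opt/DISK2/CruxCluster2_VM_BACKUP /mnt/nfstest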

good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
So, I am able to reach the NFS server from all my cluster nodes. Ping and telnet work; showmount gets a timeout.
I am able to mount the share manually on all my nodes, but the logs are still flooded with storage 'nfsbackup' is not online. Kindly refer to the outputs below, as requested.

root@node01:~# ping -c 4 192.168.13.16
PING 192.168.13.16 (192.168.13.16) 56(84) bytes of data.
64 bytes from 192.168.13.16: icmp_seq=1 ttl=64 time=0.158 ms
64 bytes from 192.168.13.16: icmp_seq=2 ttl=64 time=0.213 ms
64 bytes from 192.168.13.16: icmp_seq=3 ttl=64 time=0.259 ms
64 bytes from 192.168.13.16: icmp_seq=4 ttl=64 time=0.193 ms

--- 192.168.13.16 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3053ms
rtt min/avg/max/mdev = 0.158/0.205/0.259/0.036 ms

root@node01:~# ping -c 4 -s 1500 192.168.13.16
PING 192.168.13.16 (192.168.13.16) 1500(1528) bytes of data.
1508 bytes from 192.168.13.16: icmp_seq=1 ttl=64 time=0.306 ms
1508 bytes from 192.168.13.16: icmp_seq=2 ttl=64 time=0.324 ms
1508 bytes from 192.168.13.16: icmp_seq=3 ttl=64 time=0.282 ms
1508 bytes from 192.168.13.16: icmp_seq=4 ttl=64 time=0.297 ms

--- 192.168.13.16 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3074ms
rtt min/avg/max/mdev = 0.282/0.302/0.324/0.015 ms
----------------------------------------------------------------
root@node01:~# telnet 192.168.13.16 2049
Trying 192.168.13.16...
Connected to 192.168.13.16.
Escape character is '^]'.
^CConnection closed by foreign host.
----------------------------------------------------------------
root@node01:~# showmount -e 192.168.13.16
rpc mount export: RPC: Timed out

real 2m30.606s
user 0m0.000s
sys 0m0.003s
----------------------------------------------------------------
root@node01:~# showmount -a 192.168.13.16
rpc mount dump: RPC: Timed out

real 2m29.517s
user 0m0.003s
sys 0m0.000s
----------------------------------------------------------------
root@node01:/etc/pve# pvesm nfsscan 192.168.13.16
rpc mount export: RPC: Timed out
command '/sbin/showmount --no-headers --exports 192.168.13.16' failed: exit code 1
----------------------------------------------------------------
Output of the mount command after I mounted the NFS share manually:
[screenshot of mount output attached]
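
For reference, the manual mount was presumably something along these lines (using PVE's standard mount point for this storage, per the storage.cfg above):

mount -t nfs 192.168.13.16:/opt/DISK2/CruxCluster2_VM_BACKUP /mnt/pve/nfsbackup
mount | grep nfsbackup            # confirm the share is mounted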
 
showmount is what PVE uses to probe the health of the storage. If that times out, you will continue having problems; you need to resolve it. Run showmount locally on the NFS server against itself. If that doesn't work, either some services are not running/have failed, or a firewall rule is blocking them.
If it works locally, then a device on the network is blocking the traffic.


 
After running showmount locally on the NFS server:

jignesh@bkp2:~$ time sudo showmount -e localhost
Export list for localhost:
/opt/DISK1/backuphdd/OTHERS/elasticsearch-backup 192.168.13.102
/opt/DISK2/XXXXCluster2_VM_BACKUP 192.168.13.33,192.168.13.32,192.168.13.31

real 0m0.016s
user 0m0.010s
sys 0m0.001s

The output remains the same even if I use the IP of the server instead of localhost.
On the other client, 192.168.13.102, the mount is working fine; it is a plain Debian server.
 
Do you use a non-standard MTU anywhere along the path?
We know that the mount itself works on PVE as well. Does showmount work on the other server?

A default vanilla PVE installation does not block outgoing NFS traffic. It worked, according to you, until the recent NFS server reboot. Logically, the loss of the ability to query RPC information (showmount) could not have been induced by PVE, which didn't change, according to you.

As a next step, you may want to use a network sniffer to determine where the breakdown occurs.
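
For example, something like this on the NFS server would show whether the mount-protocol traffic arrives at all (the interface name and node IP are assumptions; showmount first asks the portmapper on port 111 where mountd lives, then connects to mountd on whatever port it registered):

# capture everything from one PVE node except SSH noise
tcpdump -ni eth0 host 192.168.13.31 and not port 22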



 
I just resolved this issue, and it turned out to be a very simple fix.

I ran a packet capture on the NFS server: rpcinfo traffic was captured successfully, but showmount, which uses the same RPC mechanism, did not work, and the sniffer captured no packets for it.

I then checked rpcinfo and showmount from another plain Unix server: rpcinfo worked, but showmount failed with a timeout. I tried the same with another NFS server and client on the same network, and there both rpcinfo and showmount worked. This led me to investigate further, because we had already ruled out any physical-firewall-related issues.

As it turns out, the problem was local to the NFS server. After examining the output of rpcinfo and ss/netstat, I concluded that the ports the RPC services were listening on (the Local Address:Port column) had changed after the reboot.
[screenshot of rpcinfo/ss output attached]

I simply allowed those ports in UFW, and everything went back to normal on my Proxmox cluster.
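
For anyone hitting the same problem: unless pinned, mountd and the other auxiliary NFS services register on random ports at each start, so a firewall rule that matched last boot's ports silently breaks after a reboot. A sketch of the check and a more permanent fix on a Debian NFS server follows; the port number 32767 and the subnet are illustrative choices, not values from this thread:

rpcinfo -p localhost | grep mountd   # which ports mountd registered this boot
ss -tlnp | grep rpc.mountd           # confirm the listening TCP port/process

# pin mountd to a fixed port in /etc/default/nfs-kernel-server, e.g.
#   RPCMOUNTDOPTS="--manage-gids --port 32767"
systemctl restart nfs-kernel-server

# then allow the portmapper, NFS, and the pinned mountd port in UFW
ufw allow from 192.168.13.0/24 to any port 111
ufw allow from 192.168.13.0/24 to any port 2049 proto tcp
ufw allow from 192.168.13.0/24 to any port 32767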

Thanks a ton, @bbgeek17, for the support during this troubleshooting process.
 
