NFS storage not online?

Cameron L

Member
Apr 20, 2018
I have a new proxmox cluster undergoing testing.

# pveversion
pve-manager/5.1-51/96be5354 (running kernel: 4.13.16-2-pve)

Every minute, syslog reports that an NFS storage is not online. It's not always the same storage -- I have 3 exports and 2 NFS servers (Ubuntu 16.04). Sometimes it reports one storage offline, sometimes two. I haven't seen all three in the same minute, but it usually complains about at least one each minute.

The nodes and storage are in a dedicated storage vlan on a 10GbT switch that also handles other traffic.

pvestatd[56318]: storage 'nfs01' is not online
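To see which storages get flagged and how often, the messages can be tallied per storage name. This is only a sketch: the function name count_offline is made up for illustration, the pattern is taken from the log line quoted above, and the syslog path in the usage comment is an assumption.

```shell
#!/bin/sh
# Count "is not online" messages per storage name.
# Reads syslog-style lines on stdin and prints "<count> <storage>",
# most frequently reported storage first.
count_offline() {
    sed -n "s/.*pvestatd\[[0-9]*\]: storage '\([^']*\)' is not online.*/\1/p" \
        | sort | uniq -c | sort -rn
}

# Usage (log path is an assumption, adjust for your system):
#   count_offline < /var/log/syslog
```

If one storage clearly dominates the tally, that points at a specific export or server rather than at the client.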

Despite this, I'm not seeing any other symptoms. The storage is accessible and seems performant. It seems some process is checking overzealously and reporting errors where none exist.

rpcinfo -p <ip of nfs server> shows expected output:
program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100005 1 udp 4002 mountd
100005 1 tcp 4002 mountd
100005 2 udp 4002 mountd
100005 2 tcp 4002 mountd
100005 3 udp 4002 mountd
100005 3 tcp 4002 mountd
100003 2 tcp 2049 nfs
100003 3 tcp 2049 nfs
100003 4 tcp 2049 nfs
100227 2 tcp 2049
100227 3 tcp 2049
100003 2 udp 2049 nfs
100003 3 udp 2049 nfs
100003 4 udp 2049 nfs
100227 2 udp 2049
100227 3 udp 2049
100021 1 udp 47296 nlockmgr
100021 3 udp 47296 nlockmgr
100021 4 udp 47296 nlockmgr
100021 1 tcp 31439 nlockmgr
100021 3 tcp 31439 nlockmgr
100021 4 tcp 31439 nlockmgr

pvesm consistently provides the expected output:
nfs01:101/vm-101-disk-1.qcow2 qcow2 34359738368 5101
nfs01:101/vm-101-state-ONE.raw raw 1299887104 5101
nfs01:iso/bionic-desktop-amd64-20180417.iso iso 1871708160

Any clue what's causing the syslog entries?
 
We run:

# /sbin/showmount --no-headers --exports <server>

with a timeout of 2 seconds. The warning is logged if that fails.
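For anyone who wants to reproduce that check by hand, here is a rough stand-in using timeout(1) from coreutils. It is not the actual Proxmox code, just an approximation of "run showmount, give up after 2 seconds". The function names and the SHOWMOUNT and CHECK_TIMEOUT variables are hypothetical and exist only for this sketch.

```shell
#!/bin/sh
# Rough stand-in for the check described above; NOT the actual Proxmox code.
# SHOWMOUNT and CHECK_TIMEOUT are hypothetical override variables.

nfs_is_online() {
    server="$1"
    # timeout(1) from GNU coreutils exits with status 124 if the command
    # is still running when the limit expires.
    timeout "${CHECK_TIMEOUT:-2}" "${SHOWMOUNT:-/sbin/showmount}" \
        --no-headers --exports "$server" >/dev/null 2>&1
}

report_storage() {
    storeid="$1"
    server="$2"
    if nfs_is_online "$server"; then
        echo "storage '$storeid' is online"
    else
        echo "storage '$storeid' is not online"
    fi
}

# Usage:
#   report_storage nfs01 <ip of nfs server>
```

A showmount call that eventually succeeds but takes longer than the limit is reported as "not online" here, the same as one that fails outright.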
 
Still chasing this as time permits. Here's what I know so far.

NFS servers are Ubuntu 16.04. I've disabled the firewall between the NFS servers and the proxmox nodes (all ports allowed between those IPs).

I set up a while loop to run '/sbin/showmount --no-headers --exports <NFS server IP>' every 2 seconds and watched it. The command *always* returns valid information. Usually it will return quickly (under 1 second). Sometimes it takes just over 15 seconds. I don't understand why this is. [Assuming that I should "always blame DNS", I've ensured all interfaces, public and private, are in both DNS and /etc/hosts. The delay persists.]
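A loop along these lines can be used to time each call and spot the slow ones. This is a sketch, not the exact loop from the post: probe_once, probe_loop, SHOWMOUNT and PROBE_INTERVAL are names made up for illustration, and timing is at one-second resolution, which is enough to tell a sub-second reply from a 15-second one.

```shell
#!/bin/sh
# Time repeated showmount calls so slow responses stand out.
# SHOWMOUNT and PROBE_INTERVAL are hypothetical override variables.

probe_once() {
    server="$1"
    start=$(date +%s)
    "${SHOWMOUNT:-/sbin/showmount}" --no-headers --exports "$server" >/dev/null 2>&1
    rc=$?
    end=$(date +%s)
    # One line per probe: timestamp, server, exit status, elapsed seconds.
    echo "$(date '+%F %T') server=$server rc=$rc seconds=$((end - start))"
}

probe_loop() {
    server="$1"
    count="${2:-30}"
    i=0
    while [ "$i" -lt "$count" ]; do
        probe_once "$server"
        sleep "${PROBE_INTERVAL:-2}"
        i=$((i + 1))
    done
}

# Usage:
#   probe_loop <NFS server IP> 100
```

Logging the timestamp with each probe makes it possible to line up the slow calls with the "is not online" entries in syslog.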

NFS performance is fine for normal operations (300MB/s - 500MB/s, depending on operation -- probably limited by RAID and/or disks) without any obvious problems.

For now, I've changed the timeout near the end of /usr/share/perl5/PVE/Storage/NFSPlugin.pm from 2 seconds to 20 seconds, which masks the symptom, but doesn't solve the underlying problem.
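The exact line isn't quoted here, so the sketch below only locates the showmount call in the plugin and keeps a copy of the file before it is edited by hand. The helper names and the PLUGIN variable are hypothetical. As far as I know, pvestatd has to be restarted to pick up an edited module (systemctl restart pvestatd), and a package upgrade will replace the file and undo the change.

```shell
#!/bin/sh
# Locate the showmount check and back up the plugin before hand-editing it.
# PLUGIN is a hypothetical override; the default is the path named in the post.
PLUGIN="${PLUGIN:-/usr/share/perl5/PVE/Storage/NFSPlugin.pm}"

show_check_code() {
    # Print the showmount call plus the following lines, with line numbers.
    grep -n -A 8 'showmount' "${1:-$PLUGIN}"
}

backup_plugin() {
    # Keep a dated copy next to the original so the edit can be reverted.
    cp -p "${1:-$PLUGIN}" "${1:-$PLUGIN}.orig.$(date +%Y%m%d)"
}

# Usage:
#   backup_plugin && show_check_code
```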

My older cluster (proxmox 4, with Ubuntu 14.04 NFS servers) was set up using the same procedures and doesn't exhibit this problem.

Something presumably differs in the NFS server on Ubuntu 16.04 or the NFS client on Debian 9.x and is causing this delay.

Mainly following up to benefit anyone else who sees a similar problem, but would welcome any other troubleshooting thoughts.

At this point, I'm planning to try to replicate the setup with an Ubuntu 14.04 NFS server to see if I still see the timeout.
 
