[SOLVED] NFS hangs after some time

j0k4b0

Hello,

for a few days now I have been having issues with my NFS backup shares.

Maybe this isn't a Proxmox issue, but I only have this problem on my PVE cluster, so maybe someone has an idea.


Basically I have two NFS servers and two Proxmox clusters: the first in DC1 and another in DC2.

In the first datacenter (DC1) I have the issue with my NFS shares; the second datacenter works perfectly with the same NFS shares.
This leads me to the conclusion that the NFS server is working correctly and the issue must be on the client side.

In DC1 I have 3 active Proxmox nodes; I have added the NFS shares on all of them, and the problem exists on all nodes.


The Problem:
When I add an NFS storage, Proxmox creates the mount successfully. Via the UI and SSH I can access the files and directories inside the share.
After 1 or 2 minutes, the share seems to hang. I can't use df -h or ls -la or the Proxmox UI anymore. I only get timeouts, with the following log messages:

Code:
Mar 19 08:35:56 fra1-pvec01-m03 pvestatd[2409]: unable to activate storage 'fra1-nfs1-pvec01' - directory '/mnt/pve/fra1-nfs1-pvec01' does not exist or is unreachable


The problem persists after rebooting, deleting all storages and recreating them, and rebooting the NFS server.
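For reference, a hung NFS mount can usually be cleared without a full reboot; this is just a workaround sketch, using the mount path from the error message above:

Code:
# force a lazy unmount of the stuck share so the node becomes responsive again
umount -f -l /mnt/pve/fra1-nfs1-pvec01
# Proxmox (pvestatd) should then try to re-activate and re-mount the storage on its own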

After I enabled debugging for the NFS client (rpcdebug -m nfs -s all) I got the following errors:

Code:
Mar 19 08:35:56 fra1-pvec01-m03 kernel: [30713.680764] NFS reply getattr: -512
Mar 19 08:35:56 fra1-pvec01-m03 kernel: [30713.680767] nfs_revalidate_inode: (0:72/164888578) getattr failed, error=-512
….
Mar 19 08:36:04 fra1-pvec01-m03 kernel: [30721.866531] NFS: nfs_weak_revalidate: inode 164888578 is valid
Mar 19 08:36:04 fra1-pvec01-m03 kernel: [30721.866536] NFS: revalidating (0:72/164888578)
Mar 19 08:36:04 fra1-pvec01-m03 kernel: [30721.866538] NFS call  getattr
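
For completeness, the debug output above can be reproduced and switched off again like this (a sketch using the standard rpcdebug tool):

Code:
# enable all NFS client debug messages (as done above)
rpcdebug -m nfs -s all
# follow the kernel log while reproducing the hang
dmesg -wT | grep -i nfs
# switch the debug messages off again afterwards
rpcdebug -m nfs -c all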


This is the current mount:

Code:
/mnt/pve/fra1-nfs1-pvec01 from 10.0.0.60:/backups/proxmox/pvec01
Flags: rw,sync,noexec,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.60,mountvers=3,mountport=20048,mountproto=udp,local_lock=all,addr=10.0.0.60


Any idea? Or any idea how to debug this further?

Thanks!
 
Try to mount it with the soft option. I guess it's the connection between DC1 and your NFS shares, but I can't say much about it without knowing the topology and network utilization.
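
A sketch of what that could look like, assuming the storage ID from the log above (the NFS storage's 'options' field takes extra mount options), plus a manual test mount to rule Proxmox out:

Code:
# add 'soft' to the storage's NFS mount options
pvesm set fra1-nfs1-pvec01 --options soft,timeo=600,retrans=2
# or try a plain manual mount first
mkdir -p /mnt/nfstest
mount -t nfs -o vers=3,soft,timeo=600,retrans=2 10.0.0.60:/backups/proxmox/pvec01 /mnt/nfstest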
 
Hello,
I tried with soft - no luck.

The network could be an issue, but basically it's working. Showmount, for example, works the whole time, and the other DC can access the NFS share the whole time without issues.
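
For reference, these are the checks that keep working the whole time (server address taken from the mount output above):

Code:
# exports stay visible from the client
showmount -e 10.0.0.60
# the RPC services on the server also keep answering
rpcinfo -p 10.0.0.60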

Here is a little picture of my network:

[Attachment: 1584630482590.png (network diagram)]
 
Doesn't look very complicated. The soft option should at least prevent the requests from getting stuck indefinitely, but yeah, NFS is sometimes really a pain. Are the servers all the same (kernel version)? I guess this is maybe some sort of incompatibility.
 
Hi,
it's not complicated - I think.

NFS Server:
Code:
Linux fra1-nfsbackup1.infra.<name>.intern 4.4.216-1.el7.elrepo.x86_64 #1 SMP Wed Mar 11 09:13:43 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

PVE Nodes:
Code:
Linux fra1-pvec01-m03 5.3.10-1-pve #1 SMP PVE 5.3.10-1 (Thu, 14 Nov 2019 10:43:13 +0100) x86_64 GNU/Linux

The most confusing part is: the other DC (same kernel) is working fine, and 4 VMs inside the DC1 cluster can also work with the NFS share. No issues.

Showmount, ping, etc. are working without any issues or delay.

Some more log info I found:

Code:
Mar 19 16:42:01 fra1-nfsbackup1.infra.<name>.intern kernel: NFSD: laundromat service - starting
Mar 19 16:43:12 fra1-nfsbackup1.infra.<name>.intern kernel: NFSD: laundromat_main - sleeping for 56 seconds

Any more debug ideas?
 
The next step would be tcpdump/Wireshark to see what's going on.
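
Something along these lines, for example; the interface name is just an example and has to match the node's uplink/bridge:

Code:
# capture the NFS traffic between the node and the NFS server,
# then open the file in Wireshark afterwards
tcpdump -i vmbr0 -s 0 -w /tmp/nfs-hang.pcap host 10.0.0.60 and port 2049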
 
Hi,
solved by disabling the bond interface on the NFS server.

I have no idea why this was the problem, because other systems had no issues.
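
For anyone hitting the same thing: the bond state on the NFS server can be inspected like this (bond0 is an assumed interface name):

Code:
# show bonding mode, slaves and link state
cat /proc/net/bonding/bond0
# detailed link settings of the bond
ip -d link show bond0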
 
Hello everyone,

I had similar problems to those described in this post and was able to find the error after two full days.

I had a TrueNAS server running with link aggregation and had additionally set an MTU of 9000 on the interface. After I set the MTU back to the default (1500), all NFS problems disappeared. Strangely enough, a test environment with the same hardware had no problems with an MTU of 9000.
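
A quick way to check whether jumbo frames really make it end-to-end is a ping with the don't-fragment bit set (the address is a placeholder):

Code:
# 8972 = 9000 bytes MTU - 20 bytes IP header - 8 bytes ICMP header
ping -M do -s 8972 <nfs-server-ip>
# if this fails while normal pings work, some hop drops the jumbo frames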

Just for info, in case someone has similar problems.
 
That is probably a solution to a different problem.

I realize the original post is two years old; nevertheless: the diagram from j0k4b0 shows the clients and the server in different networks. Under some conditions this will work partially - and that's what he experienced. (ARP will do its best to allow communication by pulling IP queries down to the Ethernet protocol level.) All devices are connected to a single switch, probably without segmentation via VLANs. The missing part is a router connecting those networks at the IP level.

Code:
~$ ipcalc 10.0.0.128/26
Address:   10.0.0.128           00001010.00000000.00000000.10 000000
Netmask:   255.255.255.192 = 26 11111111.11111111.11111111.11 000000
Wildcard:  0.0.0.63             00000000.00000000.00000000.00 111111
=>
Network:   10.0.0.128/26        00001010.00000000.00000000.10 000000
HostMin:   10.0.0.129           00001010.00000000.00000000.10 000001
HostMax:   10.0.0.190           00001010.00000000.00000000.10 111110
Broadcast: 10.0.0.191           00001010.00000000.00000000.10 111111
Hosts/Net: 62                    Class A, Private Internet

~$ ipcalc 10.0.0.60/26
Address:   10.0.0.60            00001010.00000000.00000000.00 111100
Netmask:   255.255.255.192 = 26 11111111.11111111.11111111.11 000000
Wildcard:  0.0.0.63             00000000.00000000.00000000.00 111111
=>
Network:   10.0.0.0/26          00001010.00000000.00000000.00 000000
HostMin:   10.0.0.1             00001010.00000000.00000000.00 000001
HostMax:   10.0.0.62            00001010.00000000.00000000.00 111110
Broadcast: 10.0.0.63            00001010.00000000.00000000.00 111111
Hosts/Net: 62                    Class A, Private Internet
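
A quick way to see which route (if any) a client in 10.0.0.128/26 would use to reach the server:

Code:
~$ ip route get 10.0.0.60
# without a router between the two /26 networks this falls back to the
# default route (or fails), even though both hosts sit on the same switch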

 
