pvestatd - NFS mount "storage is not online"

Giovanni

I'm trying to debug a condition where, with NFS v4.2, there seem to be brief connection timeouts between Proxmox and my NFS server.

Looks like `pvestatd` monitors storage mounts and prints some useful messages - can these be made more verbose?

Code:
Nov 23 15:43:16 centrix pvestatd[2100]: got timeout
Nov 23 15:43:38 centrix pvedaemon[1697453]: <root@pam> successful auth for user 'root@pam'
Nov 23 15:43:54 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:43:54 centrix pvestatd[2100]: status update time (10.040 seconds)
Nov 23 15:44:04 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:44:04 centrix pvestatd[2100]: status update time (10.040 seconds)
Nov 23 15:44:14 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:44:14 centrix pvestatd[2100]: status update time (10.040 seconds)
Nov 23 15:44:24 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:44:24 centrix pvestatd[2100]: status update time (10.040 seconds)
Nov 23 15:44:34 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:44:34 centrix pvestatd[2100]: status update time (10.040 seconds)
Nov 23 15:44:44 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:44:44 centrix pvestatd[2100]: status update time (10.039 seconds)
Nov 23 15:44:54 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:44:54 centrix pvestatd[2100]: status update time (10.048 seconds)
Nov 23 15:45:25 centrix pvestatd[2100]: storage 'unraid' is not online

I am not exactly sure what the default "timeout settings" are for NFS mounts set up via Proxmox, and whether these are configurable?

Code:
nas.gfm:/mnt/cached on /mnt/pve/unraid type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.254,local_lock=none,addr=192.168.1.54)

timeo=600 seems to be the relevant number. Are NFS timeouts common/expected? So far my logs show the timeout, but it looks like it's recovering within the timeout period... I do want to fix this if it's not normal. My NFS server isn't hosting anything else but this Proxmox node.
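
For reference, this storage is defined in /etc/pve/storage.cfg; I'm assuming any custom mount options (timeo, retrans, ...) would be appended to the options field there. A hypothetical entry reconstructed from my mount output (content types just as an example):

Code:
nfs: unraid
        server nas.gfm
        export /mnt/cached
        path /mnt/pve/unraid
        content images,backup
        options vers=4.2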
 
In a stable network/infrastructure NFS should not experience timeouts at all.

"man 5 nfs":
Code:
The time in deciseconds (tenths of a second) the NFS client waits for a response before it retries an NFS request.

For NFS over TCP the default timeo value is 600 (60 seconds). The NFS client performs linear backoff: After each retransmission the timeout is increased by timeo up to the maximum of 600 seconds.
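
Applied to the mount options you posted (timeo=600, retrans=2), my reading of the retry schedule is roughly the following (a sketch of the arithmetic, not man-page text):

Code:
#!/bin/bash
# rough illustration: linear backoff with timeo=600 deciseconds and retrans=2
timeo_ds=600
total=0
for attempt in 1 2 3; do
    wait_s=$(( attempt * timeo_ds / 10 ))   # 60s, 120s, 180s
    total=$(( total + wait_s ))
    echo "attempt ${attempt}: wait ${wait_s}s (cumulative ${total}s)"
done

In other words, a hard-mounted request would only be flagged as "not responding" after several minutes, far longer than anything pvestatd is measuring.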

That said, the timeo value is completely unrelated to the healthcheck failure you are experiencing. I am going to copy my reply to a similar issue from here:
https://forum.proxmox.com/threads/delay-of-1-node-for-reconnection-on-nfs.107981/#post-464088


The error you are getting is based on the check_connection() response from the NFS plugin.

The NFS plugin does:
Code:
sub check_connection {
    my ($class, $storeid, $scfg) = @_;

    my $server = $scfg->{server};
    my $opts = $scfg->{options};

    my $cmd;
    if (defined($opts) && $opts =~ /vers=4.*/) {
        my $ip = PVE::JSONSchema::pve_verify_ip($server, 1);
        if (!defined($ip)) {
            $ip = PVE::Network::get_ip_from_hostname($server);
        }

        my $transport = PVE::JSONSchema::pve_verify_ipv4($ip, 1) ? 'tcp' : 'tcp6';

        # nfsv4 uses a pseudo-filesystem always beginning with /
        # no exports are listed
        $cmd = ['/usr/sbin/rpcinfo', '-T', $transport, $ip, 'nfs', '4'];
    } else {
        $cmd = ['/sbin/showmount', '--no-headers', '--exports', $server];
    }

    eval { run_command($cmd, timeout => 10, outfunc => sub {}, errfunc => sub {}) };
    if (my $err = $@) {
        return 0;
    }

    return 1;
}

Find out what happens to rpcinfo and showmount. These are not PVE specific; it's the generic TCP/NFS layer of Linux. If you have 3 nodes and a NAS, and only one node has an issue, the most logical explanation is that the problem is with that node.
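
You can reproduce the check by hand with the same 10-second limit the plugin uses (a sketch; substitute your own NAS address):

Code:
# roughly what pvestatd runs for a vers=4 storage
timeout 10 /usr/sbin/rpcinfo -T tcp 192.168.1.54 nfs 4 && echo online || echo "not online"
# NFSv3 equivalent
timeout 10 /sbin/showmount --no-headers --exports 192.168.1.54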

 
Thanks for the pointer. I exclusively use NFS version 4.2. Taking a hint from the code you shared, I wrote an endless-loop script.

Code:
#!/bin/bash
# Basic script to run in an endless loop on an NFS client.
# The idea is to catch a failure of rpcinfo and log the timestamp.

counter=1
while :
do
    echo "Run " $counter + $(date)
    # same check the PVE NFS plugin runs for a vers=4 storage
    if ! /usr/sbin/rpcinfo -T tcp 192.168.1.54 nfs 4; then
        echo "NFS server unresponsive."
        break
    fi
    ((counter++))
    sleep 1
done
echo "NFS monitor script finished."

I'll run this in another session and watch for failures. If you have any other suggestions, let me know.
 
Hi,
please note that the rpcinfo command is run with a timeout of 10 seconds, and since pvestatd reports a status update time of slightly more than 10 seconds, I'd suspect that you are hitting that timeout (rather than the command really failing).
 
Hi,
please note that the rpcinfo command is run with a timeout of 10 seconds, and since pvestatd reports a status update time of slightly more than 10 seconds, I'd suspect that you are hitting that timeout (rather than the command really failing).

Just to make sure I understand: there is a separate RPC made by pvestatd that waits for a response and times out after 10 seconds, causing the `pvestatd` message in syslog?

OK, my script caught an instance and logged a timestamp for me to try to correlate.

Code:
Run  234 + Thu 24 Nov 2022 03:33:03 AM EST
rpcinfo: RPC: Timed out
program 100003 version 4 is not available
NFS server unresponsive.
NFS monitor script finished.
root@centrix:~# ./nfs.sh
Run  1 + Thu 24 Nov 2022 03:33:55 AM EST
rpcinfo: RPC: Timed out
program 100003 version 4 is not available
NFS server unresponsive.
NFS monitor script finished.


On the NFS server side, syslog shows:
Code:
Nov 24 03:30:44 nas systemd[1]: Started Checkmk agent (PID 1068/UID 996).
Nov 24 03:30:47 nas systemd[1]: check-mk-agent@51-1068-996.service: Deactivated successfully.
Nov 24 03:31:47 nas systemd[1]: Started Checkmk agent (PID 1068/UID 996).
Nov 24 03:31:49 nas systemd[1]: check-mk-agent@52-1068-996.service: Deactivated successfully.
Nov 24 03:32:49 nas systemd[1]: Started Checkmk agent (PID 1068/UID 996).
Nov 24 03:32:51 nas systemd[1]: check-mk-agent@53-1068-996.service: Deactivated successfully.
Nov 24 03:33:52 nas systemd[1]: Started Checkmk agent (PID 1068/UID 996).
Nov 24 03:33:54 nas systemd[1]: check-mk-agent@54-1068-996.service: Deactivated successfully.
Nov 24 03:34:15 nas pcp-pmie[2263]: High 1-minute load average 12.4load@nas
Nov 24 03:34:32 nas rpc.mountd[1162]: v4.2 client detached: 0x1f64f3b0637f1247 from "192.168.1.254:828"

The Proxmox node says:
Code:
Nov 24 03:30:27 centrix pveproxy[2147]: worker 753346 started
Nov 24 03:32:09 centrix pvestatd[3110168]: status update time (8.123 seconds)
Nov 24 03:33:04 centrix pvestatd[3110168]: got timeout
Nov 24 03:33:20 centrix pvestatd[3110168]: storage 'unraid' is not online
Nov 24 03:33:20 centrix pvestatd[3110168]: status update time (10.045 seconds)
Nov 24 03:33:30 centrix pvestatd[3110168]: storage 'unraid' is not online
Nov 24 03:33:30 centrix pvestatd[3110168]: status update time (10.042 seconds)
Nov 24 03:33:40 centrix pvestatd[3110168]: storage 'unraid' is not online
Nov 24 03:33:40 centrix pvestatd[3110168]: status update time (10.042 seconds)
Nov 24 03:33:50 centrix pvestatd[3110168]: storage 'unraid' is not online
Nov 24 03:33:50 centrix pvestatd[3110168]: status update time (10.041 seconds)
Nov 24 03:34:00 centrix pvestatd[3110168]: storage 'unraid' is not online
Nov 24 03:34:00 centrix pvestatd[3110168]: status update time (10.043 seconds)

Recovery (on the NFS server) is to restart mountd - otherwise the Proxmox host can't re-establish NFS. The weird thing is that there is no kernel crash in the logs on the server side.
Code:
root@nas:/opt/snapraid-runner# /usr/bin/rpcinfo -u bee mountd
bee: RPC: Unknown host
root@nas:/opt/snapraid-runner# systemctl status nfs-mountd.service
● nfs-mountd.service - NFS Mount Daemon
     Loaded: loaded (/lib/systemd/system/nfs-mountd.service; static)
     Active: active (running) since Thu 2022-11-24 03:39:29 EST; 1min 55s ago
    Process: 82985 ExecStart=/usr/sbin/rpc.mountd (code=exited, status=0/SUCCESS)
   Main PID: 82988 (rpc.mountd)
      Tasks: 1 (limit: 9362)
     Memory: 576.0K
        CPU: 2ms
     CGroup: /system.slice/nfs-mountd.service
             └─82988 /usr/sbin/rpc.mountd

Nov 24 03:39:29 nas systemd[1]: Starting NFS Mount Daemon...
Nov 24 03:39:29 nas rpc.mountd[82988]: Version 2.6.1 starting
Nov 24 03:39:29 nas systemd[1]: Started NFS Mount Daemon.
root@nas:/opt/snapraid-runner# systemctl restart nfs-mountd.service
root@nas:/opt/snapraid-runner# systemctl status nfs-mountd.service
● nfs-mountd.service - NFS Mount Daemon
     Loaded: loaded (/lib/systemd/system/nfs-mountd.service; static)
     Active: active (running) since Thu 2022-11-24 03:41:30 EST; 1s ago
    Process: 84102 ExecStart=/usr/sbin/rpc.mountd (code=exited, status=0/SUCCESS)
   Main PID: 84106 (rpc.mountd)
      Tasks: 1 (limit: 9362)
     Memory: 780.0K
        CPU: 4ms
     CGroup: /system.slice/nfs-mountd.service
             └─84106 /usr/sbin/rpc.mountd

Nov 24 03:41:30 nas systemd[1]: nfs-mountd.service: Deactivated successfully.
Nov 24 03:41:30 nas systemd[1]: Stopped NFS Mount Daemon.
Nov 24 03:41:30 nas systemd[1]: Starting NFS Mount Daemon...
Nov 24 03:41:30 nas rpc.mountd[84106]: Version 2.6.1 starting
Nov 24 03:41:30 nas systemd[1]: Started NFS Mount Daemon.
Nov 24 03:41:31 nas rpc.mountd[84106]: v4.2 client attached: 0x1f64f3b1637f2e3a from "192.168.1.254:871"

I'm a bit baffled here, because if I go by what the NFS server logs say, the Proxmox client disconnected? What would you suggest I do to debug this further?
 
Just to make sure I understand, there is a different RPC made by pvestatd that is waiting for response that is timing out after 10 seconds causing the `pvestatd` message in syslog?
The log message is printed whenever the status update takes more than 5 seconds. The 10-second timeout is within the check_connection call in the NFS plugin:
Code:
eval { run_command($cmd, timeout => 10, outfunc => sub {}, errfunc => sub {}) };

One thing I do note is that under high load on the NFS server, the command `/usr/sbin/rpcinfo -T tcp 192.168.1.54 nfs 4`, which usually returns immediately, may take up to 3 seconds to come back.

Not sure if this is the trigger for pvestatd, since you mentioned it waits 10 seconds?
Were the 3 seconds taken from the timings of your script? Did the "storage 'unraid' is not online" messages re-appear when you increased the load?
 
Hi,
please note that the rpcinfo command is run with a timeout of 10 seconds, and since pvestatd reports a status update time of slightly more than 10 seconds, I'd suspect that you are hitting that timeout (rather than the command really failing).

I'm debugging a similar "storage '...' is not online" error, and this forum thread came up as the first result.

This check is not correct. It's totally valid to use NFS v4 without rpcbind. The only thing required by NFS v4 is TCP port 2049. I have an NFS server that is not running rpcbind and thus won't respond to rpcinfo calls. It works fine for everything else, but Proxmox incorrectly states that the storage is not online.
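
For example, a plain TCP reachability check against the NFS port already tells you whether an NFSv4 server is up - a sketch, assuming the default port and a hypothetical hostname:

Code:
# NFSv4 needs only TCP 2049; no rpcbind/rpcinfo involved
nc -z -w 10 nfs-server.example 2049 && echo online || echo "not online"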
 
Hi,
I'm debugging a similar "storage '...' is not online" error, and this forum thread came up as the first result.

This check is not correct. It's totally valid to use NFS v4 without rpcbind. The only thing required by NFS v4 is TCP port 2049. I have an NFS server that is not running rpcbind and thus won't respond to rpcinfo calls. It works fine for everything else, but Proxmox incorrectly states that the storage is not online.
If you have a better alternative check, we're happy to consider it. The check should still work for setups that don't use the default port, of course. Please see https://pve.proxmox.com/wiki/Developer_Documentation if you'd like to send a patch for it.
 
Thank you! I don't know Perl but this seems good to me.

There's a comment "defaults to port 7 when passing in 0" but I don't see port 7 anywhere in that patch... Is that a typo in the comment?
That's what the tcp_ping() function uses when the argument is 0 (or undefined); here is the implementation:
Code:
if (!$port) {
    # Net::Ping defaults to the echo port
    $port = 7;
}

It's just a secondary reason why 0 shouldn't be passed along. It's not directly relevant, but I added the comment so that future people working on the code will know.
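
In other words, the check boils down to probing the storage's NFS TCP port. Roughly like this - a simplified sketch, not the actual patch, assuming tcp_ping(host, port, timeout) and a per-storage port option:

Code:
# sketch: probe the configured NFS port (2049 by default) instead of relying on rpcinfo
my $port = $scfg->{port} || 2049;
return PVE::Network::tcp_ping($server, $port, 3) ? 1 : 0;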
 
Hi,
I have Proxmox VE 7.3.6 and am trying to use my PetaSAN 3.1 with NFS 4.2. I used the patch from this thread, but it does not work for me. I'd like to use this in my production environment, and I have no problem buying a subscription if there is a solution in the enterprise repository. Any hints for me?

Best Ben
 
Hi,
did you run systemctl reload-or-restart pvedaemon.service pveproxy.service pvestatd.service or similar after applying the patch? What is the output of pveversion -v and what is the exact error you get? Does it work if you try to mount the NFS manually on the CLI? There might be additional hints in /var/log/syslog.
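
For the manual mount test, something along these lines works (a sketch; adjust server, export, and options to your PetaSAN setup):

Code:
mkdir -p /mnt/nfs-test
mount -t nfs -o vers=4.2 <petasan-server>:/<export> /mnt/nfs-test
ls /mnt/nfs-test
umount /mnt/nfs-test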
 
Hi Fiona,
thanks for your help. I did a second reload with systemctl and now it accepts my NFS share. Don't know why :)

Best regards

Ben
 
