pvestatd - NFS mount "storage is not online"

Giovanni

I'm trying to debug a condition where, with NFS v4.2, there seem to be brief connection timeouts between Proxmox and my NFS server.

It looks like `pvestatd` monitors storage mounts and prints some useful messages - can these be made more verbose?

Code:
Nov 23 15:43:16 centrix pvestatd[2100]: got timeout
Nov 23 15:43:38 centrix pvedaemon[1697453]: <root@pam> successful auth for user 'root@pam'
Nov 23 15:43:54 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:43:54 centrix pvestatd[2100]: status update time (10.040 seconds)
Nov 23 15:44:04 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:44:04 centrix pvestatd[2100]: status update time (10.040 seconds)
Nov 23 15:44:14 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:44:14 centrix pvestatd[2100]: status update time (10.040 seconds)
Nov 23 15:44:24 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:44:24 centrix pvestatd[2100]: status update time (10.040 seconds)
Nov 23 15:44:34 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:44:34 centrix pvestatd[2100]: status update time (10.040 seconds)
Nov 23 15:44:44 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:44:44 centrix pvestatd[2100]: status update time (10.039 seconds)
Nov 23 15:44:54 centrix pvestatd[2100]: storage 'unraid' is not online
Nov 23 15:44:54 centrix pvestatd[2100]: status update time (10.048 seconds)
Nov 23 15:45:25 centrix pvestatd[2100]: storage 'unraid' is not online

I am not exactly sure what the default timeout settings are for NFS mounts set up via Proxmox, or whether these are configurable.

Code:
nas.gfm:/mnt/cached on /mnt/pve/unraid type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.254,local_lock=none,addr=192.168.1.54)

timeo=600 seems to be the relevant number. Are NFS timeouts common/expected? So far my logs show the timeout, but it looks like it's recovering within the timeout period... I do want to fix this if it's not normal. My NFS server isn't hosting anything other than this Proxmox node.
 
In a stable network/infrastructure NFS should not experience timeouts at all.

"man 5 nfs":
Code:
The time in deciseconds (tenths of a second) the NFS client waits for a response before it retries an NFS request.

For NFS over TCP the default timeo value is 600 (60 seconds). The NFS client performs linear backoff: After each retransmission the timeout is increased by timeo up to the maximum of 600 seconds.
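
For reference, these mount options come from the storage definition, so they can be adjusted there. A minimal sketch, assuming an NFS storage named 'unraid' and the standard pvesm tooling (the options value is only an example, not a recommendation):

Code:
# the NFS storage definition lives in the cluster-wide storage config
grep -A 5 "nfs: unraid" /etc/pve/storage.cfg

# illustrative only: pass extra NFS mount options via the 'options' property;
# they take effect the next time the storage is mounted
pvesm set unraid --options vers=4.2,timeo=600,retrans=2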

That said, the timeo value is completely unrelated to the health check failure you are experiencing. I am going to copy my reply to a similar issue from here:
https://forum.proxmox.com/threads/delay-of-1-node-for-reconnection-on-nfs.107981/#post-464088


The error you are getting is based on the check_connection() response from the NFS plugin.

The NFS plugin does:
Code:
sub check_connection {
    my ($class, $storeid, $scfg) = @_;

    my $server = $scfg->{server};
    my $opts = $scfg->{options};

    my $cmd;
    if (defined($opts) && $opts =~ /vers=4.*/) {
        my $ip = PVE::JSONSchema::pve_verify_ip($server, 1);
        if (!defined($ip)) {
            $ip = PVE::Network::get_ip_from_hostname($server);
        }

        my $transport = PVE::JSONSchema::pve_verify_ipv4($ip, 1) ? 'tcp' : 'tcp6';

        # nfsv4 uses a pseudo-filesystem always beginning with /
        # no exports are listed
        $cmd = ['/usr/sbin/rpcinfo', '-T', $transport, $ip, 'nfs', '4'];
    } else {
        $cmd = ['/sbin/showmount', '--no-headers', '--exports', $server];
    }

    eval { run_command($cmd, timeout => 10, outfunc => sub {}, errfunc => sub {}) };
    if (my $err = $@) {
        return 0;
    }

    return 1;
}

Find out what happens to rpcinfo and showmount. These are not PVE specific; it's the generic TCP/NFS layer of Linux. If you have 3 nodes and a NAS, and only one node has an issue, the most logical explanation is that the problem is with that node.
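
To narrow that down, the same two probes can be run by hand and timed; a sketch, using the server address from your mount output:

Code:
# NFSv4 path: the exact probe the plugin runs, timed
time /usr/sbin/rpcinfo -T tcp 192.168.1.54 nfs 4

# NFSv3 path: the export listing the plugin would use otherwise
time /sbin/showmount --no-headers --exports 192.168.1.54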



 
Thanks for the pointer. I exclusively use NFS version 4.2. Taking a hint from the code you shared, I wrote an endless-loop script.

Code:
#!/bin/bash
# Basic script to run in an endless loop on an NFS client.
# The idea is to catch a failure in rpcinfo and log a timestamp.

counter=1
while :
do
    echo "Run  $counter + $(date)"
    # same check the PVE NFS plugin performs for NFSv4 storages
    if ! /usr/sbin/rpcinfo -T tcp 192.168.1.54 nfs 4; then
        echo "NFS server unresponsive."
        break
    fi
    ((counter++))
    sleep 1
done
echo "NFS monitor script finished."

I'll run this in another session and watch to see if it fails. If you have any other suggestions, let me know.
 
Hi,
please note that the rpcinfo command is run with a timeout of 10 seconds, and since pvestatd reports a status update time of slightly more than 10 seconds, I'd suspect that you are hitting that timeout (rather than the command really failing).
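
One way to check how close you are to that limit is to wrap the same probe in the same 10-second timeout; a sketch, with the address taken from the earlier mount output:

Code:
# timeout(1) exits with status 124 when the 10 second limit is hit,
# roughly mirroring the timeout inside check_connection
time timeout 10 /usr/sbin/rpcinfo -T tcp 192.168.1.54 nfs 4
echo "exit status: $?"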
 
Hi,
please note that the rpcinfo command is run with a timeout of 10 seconds, and since pvestatd reports a status update time of slightly more than 10 seconds, I'd suspect that you are hitting that timeout (rather than the command really failing).

Just to make sure I understand: is there a different RPC made by pvestatd that is waiting for a response and timing out after 10 seconds, causing the `pvestatd` message in syslog?

OK, my script caught an instance with a timestamp I can try to correlate.

Code:
Run  234 + Thu 24 Nov 2022 03:33:03 AM EST
rpcinfo: RPC: Timed out
program 100003 version 4 is not available
NFS server unresponsive.
NFS monitor script finished.
root@centrix:~# ./nfs.sh
Run  1 + Thu 24 Nov 2022 03:33:55 AM EST
rpcinfo: RPC: Timed out
program 100003 version 4 is not available
NFS server unresponsive.
NFS monitor script finished.


In the syslog on my NFS server I see:
Code:
Nov 24 03:30:44 nas systemd[1]: Started Checkmk agent (PID 1068/UID 996).
Nov 24 03:30:47 nas systemd[1]: check-mk-agent@51-1068-996.service: Deactivated successfully.
Nov 24 03:31:47 nas systemd[1]: Started Checkmk agent (PID 1068/UID 996).
Nov 24 03:31:49 nas systemd[1]: check-mk-agent@52-1068-996.service: Deactivated successfully.
Nov 24 03:32:49 nas systemd[1]: Started Checkmk agent (PID 1068/UID 996).
Nov 24 03:32:51 nas systemd[1]: check-mk-agent@53-1068-996.service: Deactivated successfully.
Nov 24 03:33:52 nas systemd[1]: Started Checkmk agent (PID 1068/UID 996).
Nov 24 03:33:54 nas systemd[1]: check-mk-agent@54-1068-996.service: Deactivated successfully.
Nov 24 03:34:15 nas pcp-pmie[2263]: High 1-minute load average 12.4load@nas
Nov 24 03:34:32 nas rpc.mountd[1162]: v4.2 client detached: 0x1f64f3b0637f1247 from "192.168.1.254:828"

On the Proxmox node it says:
Code:
Nov 24 03:30:27 centrix pveproxy[2147]: worker 753346 started
Nov 24 03:32:09 centrix pvestatd[3110168]: status update time (8.123 seconds)
Nov 24 03:33:04 centrix pvestatd[3110168]: got timeout
Nov 24 03:33:20 centrix pvestatd[3110168]: storage 'unraid' is not online
Nov 24 03:33:20 centrix pvestatd[3110168]: status update time (10.045 seconds)
Nov 24 03:33:30 centrix pvestatd[3110168]: storage 'unraid' is not online
Nov 24 03:33:30 centrix pvestatd[3110168]: status update time (10.042 seconds)
Nov 24 03:33:40 centrix pvestatd[3110168]: storage 'unraid' is not online
Nov 24 03:33:40 centrix pvestatd[3110168]: status update time (10.042 seconds)
Nov 24 03:33:50 centrix pvestatd[3110168]: storage 'unraid' is not online
Nov 24 03:33:50 centrix pvestatd[3110168]: status update time (10.041 seconds)
Nov 24 03:34:00 centrix pvestatd[3110168]: storage 'unraid' is not online
Nov 24 03:34:00 centrix pvestatd[3110168]: status update time (10.043 seconds)

Recovery (on the NFS server) is to restart mountd - otherwise the Proxmox host can't re-establish NFS. The weird thing is that there is no kernel crash or similar in the logs on the server side.
Code:
root@nas:/opt/snapraid-runner# /usr/bin/rpcinfo -u bee mountd
bee: RPC: Unknown host
root@nas:/opt/snapraid-runner# systemctl status nfs-mountd.service
● nfs-mountd.service - NFS Mount Daemon
     Loaded: loaded (/lib/systemd/system/nfs-mountd.service; static)
     Active: active (running) since Thu 2022-11-24 03:39:29 EST; 1min 55s ago
    Process: 82985 ExecStart=/usr/sbin/rpc.mountd (code=exited, status=0/SUCCESS)
   Main PID: 82988 (rpc.mountd)
      Tasks: 1 (limit: 9362)
     Memory: 576.0K
        CPU: 2ms
     CGroup: /system.slice/nfs-mountd.service
             └─82988 /usr/sbin/rpc.mountd

Nov 24 03:39:29 nas systemd[1]: Starting NFS Mount Daemon...
Nov 24 03:39:29 nas rpc.mountd[82988]: Version 2.6.1 starting
Nov 24 03:39:29 nas systemd[1]: Started NFS Mount Daemon.
root@nas:/opt/snapraid-runner# systemctl restart nfs-mountd.service
root@nas:/opt/snapraid-runner# systemctl status nfs-mountd.service
● nfs-mountd.service - NFS Mount Daemon
     Loaded: loaded (/lib/systemd/system/nfs-mountd.service; static)
     Active: active (running) since Thu 2022-11-24 03:41:30 EST; 1s ago
    Process: 84102 ExecStart=/usr/sbin/rpc.mountd (code=exited, status=0/SUCCESS)
   Main PID: 84106 (rpc.mountd)
      Tasks: 1 (limit: 9362)
     Memory: 780.0K
        CPU: 4ms
     CGroup: /system.slice/nfs-mountd.service
             └─84106 /usr/sbin/rpc.mountd

Nov 24 03:41:30 nas systemd[1]: nfs-mountd.service: Deactivated successfully.
Nov 24 03:41:30 nas systemd[1]: Stopped NFS Mount Daemon.
Nov 24 03:41:30 nas systemd[1]: Starting NFS Mount Daemon...
Nov 24 03:41:30 nas rpc.mountd[84106]: Version 2.6.1 starting
Nov 24 03:41:30 nas systemd[1]: Started NFS Mount Daemon.
Nov 24 03:41:31 nas rpc.mountd[84106]: v4.2 client attached: 0x1f64f3b1637f2e3a from "192.168.1.254:871"

I'm a bit baffled here, because going by what the NFS server logs say, the Proxmox client disconnected? What would you suggest I do to debug this further?
 
Just to make sure I understand: is there a different RPC made by pvestatd that is waiting for a response and timing out after 10 seconds, causing the `pvestatd` message in syslog?
The log message is printed whenever updating all statuses takes more than 5 seconds. The 10-second timeout is within the check_connection call in the NFS plugin:
Code:
eval { run_command($cmd, timeout => 10, outfunc => sub {}, errfunc => sub {}) };

The one thing I do note is that under high load on the NFS server, the command `/usr/sbin/rpcinfo -T tcp 192.168.1.54 nfs 4`, which usually returns immediately, may take up to 3 seconds to come back.

Not sure if this is the trigger for pvestatd, since you mentioned it waits 10 seconds?
Were the 3 seconds taken from the timings in your script? Did the storage 'unraid' is not online messages reappear when you increased the load?
 
Hi,
please note that the rpcinfo command is run with a timeout of 10 seconds, and since pvestatd reports a status update time of slightly more than 10 seconds, I'd suspect that you are hitting that timeout (rather than the command really failing).

I'm debugging a similar "storage '...' is not online" error, and this forum thread came up as the first result.

This check is not correct. It's totally valid to use NFS v4 without rpcbind. The only thing required by NFS v4 is TCP port 2049. I have an NFS server that is not running rpcbind and thus won't respond to rpcinfo calls. It works fine for everything else, but Proxmox incorrectly states that the storage is not online.
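
For illustration, the kind of check being argued for here is a plain TCP connect to the NFSv4 port; a sketch using bash's /dev/tcp (the server address is an example, 2049 being the default NFS port):

Code:
# succeeds if something accepts a TCP connection on port 2049, regardless
# of whether rpcbind/rpcinfo is available on the server
if timeout 5 bash -c 'exec 3<>/dev/tcp/192.168.1.54/2049' 2>/dev/null; then
    echo "NFS port reachable"
else
    echo "NFS port not reachable"
fi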
 
Hi,
I'm debugging a similar "storage '...' is not online" error, and this forum thread came up as the first result.

This check is not correct. It's totally valid to use NFS v4 without rpcbind. The only thing required by NFS v4 is TCP port 2049. I have an NFS server that is not running rpcbind and thus won't respond to rpcinfo calls. It works fine for everything else, but Proxmox incorrectly states that the storage is not online.
if you have a better alternative check, we're happy to consider it. The check should still work for setups that don't use the default port, of course. Please see https://pve.proxmox.com/wiki/Developer_Documentation if you'd like to send a patch for it.
 
Thank you! I don't know Perl but this seems good to me.

There's a comment "defaults to port 7 when passing in 0" but I don't see port 7 anywhere in that patch... Is that a typo in the comment?
That's what the tcp_ping() function uses when the argument is 0 (or undefined); here is the implementation:
Code:
if (!$port) {
    # Net::Ping defaults to the echo port
    $port = 7;

It's just a secondary reason why 0 shouldn't be passed along. It's not directly relevant, but I added the comment so that future people working on the code will know.
 
Hi,
I have Proxmox VE 7.3.6 and am trying to use my PetaSAN 3.1 with NFS 4.2. I used the patch from this thread, but it does not work for me. I'd like to use it in my production environment. I have no problem buying a subscription if there is a solution in the enterprise repository. Any hints for me?

Best,
Ben
 
Hi,
did you run systemctl reload-or-restart pvedaemon.service pveproxy.service pvestatd.service or similar after applying the patch? What is the output of pveversion -v, and what is the exact error you get? Does it work if you try to mount the NFS share manually on the CLI? There might be additional hints in /var/log/syslog.
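
Roughly, the checks being asked for, collected in one place (the server name and export path below are placeholders):

Code:
# pick up the patched code in the PVE services
systemctl reload-or-restart pvedaemon.service pveproxy.service pvestatd.service

# record the installed package versions
pveversion -v

# try an NFSv4.2 mount outside of PVE
mkdir -p /mnt/test
mount -t nfs4 -o vers=4.2 your-nfs-server:/your/export /mnt/test

# look for related messages
tail -n 50 /var/log/syslog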
 
Hi Fiona,
thanks for your help. I did a second reload via systemctl and now it accepts my NFS share. Don't know why :)

Best regards

Ben
 
