Backup failed with storage 'proxbackup' is not online - Problem with rpcinfo?

crs369

New Member
Sep 15, 2022
I'm getting an error from my backup jobs that back up my VMs using NFS storage on my TrueNAS:

VMID NAME STATUS TIME SIZE FILENAME
107 pihole-02 FAILED 00:01:09 storage 'proxbackup' is not online


The same message appears in syslog:

Oct 1 14:10:05 proxmox-00 pvestatd[1246]: storage 'proxsharedssd' is not online
Oct 1 14:11:49 proxmox-00 pvestatd[1246]: storage 'proxshared' is not online
Oct 1 14:11:50 proxmox-00 pvestatd[1246]: storage 'proxbackup' is not online
Oct 1 14:12:00 proxmox-00 pvestatd[1246]: storage 'proxsharedssd' is not online

The problem is discussed in the forum, but the tips there don't work for me, so I dug deeper.

I looked at the code in /usr/share/perl5/PVE/Storage/NFSPlugin.pm, where the error is detected, and ran the following command on the Proxmox node:

while true; do date; /usr/sbin/rpcinfo -T tcp 192.168.0.10 nfs 4; sleep 1; done

When the backup job starts, I get the following output:
...
Sat 01 Oct 2022 02:06:45 PM CEST
program 100003 version 4 ready and waiting
Sat 01 Oct 2022 02:06:46 PM CEST
rpcinfo: RPC: Unable to receive; errno = Connection reset by peer
program 100003 version 4 is not available

Sat 01 Oct 2022 02:06:47 PM CEST
program 100003 version 4 ready and waiting
...

It seems that the program rpcinfo fails from time to time to get the state from the NFS-Mounts.
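To put a number on "from time to time", the probe loop above could be wrapped so that it prints only the failed probes and a summary at the end. This is a minimal sketch, not part of Proxmox: the function name `probe_loop` and the `PROBE_CMD` variable are made up for illustration, and the server address is the one from the post.

```shell
#!/bin/sh
# Hypothetical helper (not part of PVE): run the rpcinfo probe from the post
# a fixed number of times and report only the failures plus a summary.

NFS_SERVER="${NFS_SERVER:-192.168.0.10}"   # address taken from the post
# Same check as in the loop above; overridable for testing.
PROBE_CMD="${PROBE_CMD:-/usr/sbin/rpcinfo -T tcp $NFS_SERVER nfs 4}"

# probe_loop COUNT [INTERVAL_SECONDS]
probe_loop() {
    count="$1"
    interval="${2:-1}"
    ok=0
    failed=0
    i=0
    while [ "$i" -lt "$count" ]; do
        # $PROBE_CMD is intentionally unquoted so it splits into words.
        if $PROBE_CMD >/dev/null 2>&1; then
            ok=$((ok + 1))
        else
            failed=$((failed + 1))
            echo "$(date '+%F %T') probe failed"
        fi
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    echo "ok=$ok failed=$failed"
}
```

Running `probe_loop 600 1` once while a backup job is active and once while the node is idle would give two failure counts to compare.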

But I didn't find any useful information to track this down further. I hope somebody can point me in the right direction.

I'm using the latest Proxmox version with kernel 5.19.7-1-pve (the problem also exists on 5.15.x). My shared storage is on a TrueNAS 12.0-U8.1 with NFS version 4.

A workaround (not really a fix, but it works) that I'm using is to modify NFSPlugin.pm as shown below.
Code:
sub check_connection {
    my ($class, $storeid, $scfg) = @_;

    my $server = $scfg->{server};
    my $opts = $scfg->{options};

    my $cmd;
    if (defined($opts) && $opts =~ /vers=4.*/) {
        my $ip = PVE::JSONSchema::pve_verify_ip($server, 1);
        if (!defined($ip)) {
            $ip = PVE::Network::get_ip_from_hostname($server);
        }

        my $transport = PVE::JSONSchema::pve_verify_ipv4($ip, 1) ? 'tcp' : 'tcp6';

        # nfsv4 uses a pseudo-filesystem always beginning with /
        # no exports are listed
        $cmd = ['/usr/sbin/rpcinfo', '-T', $transport, $ip, 'nfs', '4'];
    } else {
        $cmd = ['/sbin/showmount', '--no-headers', '--exports', $server];
    }

    # WORKAROUND: always report the storage as online.
    # The actual check below is never reached.
    return 1;

    eval { run_command($cmd, timeout => 10, outfunc => sub {}, errfunc => sub {}) };
    if (my $err = $@) {
        return 0;
    }

    return 1;
}

It seems that the program rpcinfo fails from time to time to get the state from the NFS-Mounts.
rpcinfo is part of the standard Linux toolset, and there are no PVE-specific troubleshooting steps. You have a multitude of things to determine and research:
- Did anything change in the environment?
- Are there any errors on the NFS server side?
- Are there network layer errors as reported by netstat?
- Does the switch/router report any errors?
- Does the failure happen at any time or only at specific load times?
- Does it happen on other clients?
- Rule out physical infrastructure: try a different cable, port, NIC, or a direct connection.
- Capture a network trace on both sides: do failed requests make it to the server? Does the server reply in time? Do replies make it to the client?
- Is the NFS server up to date? What does the NFS vendor say?
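For the network-layer item in the list, one way to look at error counters on the Linux client is to read the per-interface statistics that the kernel exposes under /sys/class/net. This is only a sketch: the function name `iface_errors` is made up for illustration, and the interface name has to be replaced with the one that carries the NFS traffic.

```shell
#!/bin/sh
# Hypothetical helper: print the error and drop counters of one interface.
# The optional second argument overrides the sysfs root (default /sys/class/net).

# iface_errors IFACE [SYSFS_ROOT]
iface_errors() {
    iface="$1"
    stats_dir="${2:-/sys/class/net}/$iface/statistics"
    for counter in rx_errors tx_errors rx_dropped tx_dropped; do
        if [ -r "$stats_dir/$counter" ]; then
            echo "$counter=$(cat "$stats_dir/$counter")"
        else
            echo "$counter=unavailable"
        fi
    done
}
```

Calling it before and after a backup run, for example `iface_errors vmbr0` (the bridge name is an assumption), shows whether any counter increased while the probe was failing.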

There are many other avenues to explore. Masking the issue by modifying a basic health check is, in my opinion, last on the list.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thanks for your fast reply.

I know that rpcinfo is not part of Proxmox, but the Proxmox scripts use it to check the connection.

I think there is a problem with the connection from rpcinfo to the local RPC server. The error code says that at the moment no service is listening on that port. Could this be rate limiting on the connection or the firewall?

I had already checked most of the points above: all logs on client and server are clean, and the packet counters on the NIC and switch show no errors. The problem has been there from the beginning, about 3 months ago: it is rare without backup jobs and always occurs with them. The hardware is running fine, with no other problems.

My hack is not the right solution, but I think NFS itself works reliably and only the connection check with rpcinfo fails.
 
I think there is a problem with the connection from rpcinfo to the local RPC server
There is no local RPC server running on the PVE side. rpcinfo connects to the RPC service on your NFS server.

Could this be rate limiting on the connection or the firewall?
Extremely unlikely. The RPC query is not even a blip among the hundreds of thousands, if not millions, of packets being sent.

Collect a network trace on both sides during the issue to further isolate the culprit.
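A capture on the PVE side could be started with tcpdump, limited to traffic between the node and the NFS server. The sketch below only builds the command line so it can be reviewed before it is run as root. The function name `trace_cmd` is made up, the server address is the one from the first post, and the port filter assumes the standard rpcbind (111) and NFS (2049) ports.

```shell
#!/bin/sh
# Hypothetical helper: print a tcpdump command that records rpcbind and NFS
# traffic to and from the NFS server into a pcap file.

NFS_SERVER="${NFS_SERVER:-192.168.0.10}"   # address taken from the first post

# trace_cmd IFACE OUTFILE
trace_cmd() {
    iface="$1"
    outfile="$2"
    # -n: no name resolution, -i: capture interface, -w: write raw packets
    printf "tcpdump -n -i %s -w %s 'host %s and (port 111 or port 2049)'\n" \
        "$iface" "$outfile" "$NFS_SERVER"
}
```

For example, `trace_cmd vmbr0 /tmp/pve-nfs.pcap` prints the command for the bridge `vmbr0` (an assumed interface name). A matching capture on the TrueNAS side, filtered on the PVE node's address, would show whether the failed requests arrive there and where the connection reset comes from.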

