Error with NFS Storage

rahvin

New Member
Mar 18, 2024
I'm trying to configure some NFS storage for use by PVE and I've run into an issue that I likely created myself. Basically, NFS and I don't get along. :( Every time I try to configure it, I end up screwing it up the first five tries. So, as usual, I messed up the NFS config, and while trying to troubleshoot it I attached and deleted the NFS share from PVE Storage a few times.

I don't know if I went too fast or what, but it ended up orphaning a few directories that were blocking me from re-adding the share. I've fixed that thanks to a couple of helpful threads I was able to find, but the current issue blocking me has proved very difficult to search for a solution to. Either I can't hit on the right search terms or I've created a very unusual problem.

At this point I've connected the NFS storage to Proxmox, and PVE created the underlying directories (vzdump, etc.). I can also mount the NFS share on my local machine and create files fine, so I think NFS itself is fine at this point. But when I attempt to move a VM disk to the share, the share reports 0 available space (see the screenshot below).

[screenshot: proxmox_error.png]

I've been using Linux on servers for over 20 years (though I'm not in IT, so not professionally), and it's been my daily driver for the last 5 or 6 years, but I've only just started using Proxmox. I'm not familiar with the layout, how the virtualization overlay works, or even where things are logged, compared to the base Debian I'm used to.

Can anyone point me in the direction of where to look for the problem here?
 
I've concealed some domain names and account names. With the prior error I had some orphaned directories in /etc/pve/(something) and an orphaned PID file that was preventing pve-manager from restarting; those problems are fixed. I just don't have any idea how to troubleshoot this, because I don't even know where Proxmox mounts anything, how the control system works, or what each of the individual daemons does. I'm enjoying learning, but this is a bugger of an issue with a service that hates me. :) Any help is appreciated.
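For reference, the manual check I did from my desktop was just the basic mount-and-write test, something along these lines (hostname and mount point here are placeholders, typed from memory):
Code:
# mount the export by hand on a client machine
mount -t nfs omv.example.lan:/export/pve-vm /mnt/test

# confirm a file can be created on it
touch /mnt/test/testfile && ls -l /mnt/test/testfile

# check the size/free space the client sees
df -h /mnt/test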

/etc/pve/storage.cfg
Code:
dir: local
    path /var/lib/vz
    content vztmpl,iso,backup

lvmthin: local-lvm
    thinpool data
    vgname pve
    content rootdir,images

lvmthin: raid
    thinpool raid
    vgname raid
    content images,rootdir
    nodes hppve

pbs: backup
    datastore zpool
    server 192.168.74.5
    content backup
    fingerprint b0:40::xxxxx
    prune-backups keep-all=1
    username root@pam

nfs: pve-nfs
    export /export/pve-vm
    path /mnt/pve/pve-nfs
    server omv.xxxx.xxxx
    content images,rootdir
    prune-backups keep-all=1
pvesm status
Code:
Name             Type     Status           Total            Used       Available        %
backup            pbs     active     19595477760     13532147072      6063330688   69.06%
local             dir     active       126348612        19073264       100811044   15.10%
local-lvm     lvmthin     active       313372672        32841456       280531215   10.48%
pve-nfs           nfs     active      1073216512         7515136      1065701376    0.70%
raid          lvmthin     active     78095708160     13822940344     64272767815   17.70%
mount
Code:
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=32851780k,nr_inodes=8212945,mode=755,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=6580632k,mode=755,inode64)
/dev/mapper/pve-root on / type ext4 (rw,relatime,errors=remount-ro)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k,inode64)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=75168)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
ramfs on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
ramfs on /run/credentials/systemd-tmpfiles-setup-dev.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
/dev/nvme1n1p2 on /boot/efi type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
/dev/md0 on /home type ext4 (rw,noatime,nodiratime)
/dev/sdb2 on /home/xxxxx type btrfs (rw,noatime,nodiratime,ssd,discard=async,space_cache,subvolid=5,subvol=/)
ramfs on /run/credentials/systemd-tmpfiles-setup.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
sunrpc on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
ramfs on /run/credentials/systemd-sysctl.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=6580628k,nr_inodes=1645157,mode=700,inode64)
/dev/sdf1 on /media/20T type ext4 (rw,relatime)
omv.xxxxxxxxxxx/export/pve-vm on /mnt/pve/pve-nfs type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.74.30,mountvers=3,mountport=49354,mountproto=udp,local_lock=none,addr=192.168.74.30)
df
Code:
Filesystem                          1K-blocks       Used   Available Use% Mounted on
udev                                 32851780          0    32851780   0% /dev
tmpfs                                 6580632       2516     6578116   1% /run
/dev/mapper/pve-root                126348612   19073568   100810740  16% /
tmpfs                                32903152      64404    32838748   1% /dev/shm
tmpfs                                    5120          0        5120   0% /run/lock
efivarfs                                  128         13         111  10% /sys/firmware/efi/efivars
/dev/nvme1n1p2                        1046512      11912     1034600   2% /boot/efi
/dev/md0                             83973216         80    79661540   1% /home
/dev/sdb2                           116243456     512796   114674612   1% /home/rahvin
/dev/fuse                              131072         36      131036   1% /etc/pve
tmpfs                                 6580628         16     6580612   1% /run/user/0
/dev/sdf1                         21398776916 3823127300 16501367224  19% /media/20T
omv.xxxxxxxxxx.xxx:/export/pve-vm  1073216512    7515136  1065701376   1% /mnt/pve/pve-nfs
 
I don't see a reason for the GUI not to show the sizing, based on the output in your post. Do you have more than one node in your setup? If you do, is the CLI output from the same node you were connected to in the GUI when you took the screenshot?

Is the GUI "error" still present? Can you see capacity reporting in other portions of the GUI (storage, etc)? Are there any errors in browser console?
Are there any errors in the output of "journalctl -n 500" immediately after you start the GUI wizard?
Have you tried to restart stats responsible daemon? : systemctl try-reload-or-restart pvedaemon pveproxy pvestatd pvescheduler
Does the problem persist through a reboot?
Please post "pveversion -v"



Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Posting this will probably explain as much about my config as anything else. mppve is also my PBS node. As you can see, the web console is still reporting 0 bytes.

[screenshot: hppve_webConsole.png]

I'm not seeing any errors on the console other than the apt-get failures caused by having no subscription. I'm not seeing anything in the logs on hppve that I know to check (messages, auth, kern, syslog, etc.; I did install syslog-ng).

This shows the first error I've seen. The main node I've been working from shows nothing, but my third node, lppve, is showing this in the logs.
journalctl -n 500
Code:
Mar 19 12:45:40 lppve pvestatd[1057]: status update time (11.057 seconds)
Mar 19 12:45:51 lppve pvestatd[1057]: storage 'pve-nfs' is not online
Mar 19 12:45:51 lppve pvestatd[1057]: status update time (11.090 seconds)
Mar 19 12:46:02 lppve pvestatd[1057]: storage 'pve-nfs' is not online
Mar 19 12:46:03 lppve pvestatd[1057]: status update time (11.084 seconds)
Mar 19 12:46:13 lppve pvestatd[1057]: storage 'pve-nfs' is not online
Mar 19 12:46:14 lppve pvestatd[1057]: status update time (11.059 seconds)
Mar 19 12:46:24 lppve pvestatd[1057]: storage 'pve-nfs' is not online
Mar 19 12:46:25 lppve pvestatd[1057]: status update time (11.074 seconds)
Mar 19 12:46:35 lppve pvestatd[1057]: storage 'pve-nfs' is not online
Mar 19 12:46:36 lppve pvestatd[1057]: status update time (11.090 seconds)
Mar 19 12:46:46 lppve pvestatd[1057]: storage 'pve-nfs' is not online
Mar 19 12:46:47 lppve pvestatd[1057]: status update time (11.062 seconds)
Mar 19 12:46:57 lppve pvestatd[1057]: storage 'pve-nfs' is not online
Mar 19 12:46:58 lppve pvestatd[1057]: status update time (11.075 seconds)
Mar 19 12:47:09 lppve pvestatd[1057]: storage 'pve-nfs' is not online
Mar 19 12:47:09 lppve pvestatd[1057]: status update time (11.098 seconds)

This at least narrows it down. I had been focused on the hppve server because it's hosting the OpenMediaVault instance.

On lppve, restarting the daemons accomplished nothing. I'm migrating the VMs to mppve so I can restart the server, but first I decided to try moving the disk from mppve to the NFS share and see what happened. Same 0-bytes error. Below is what the log showed.

[screenshot: 1710877363492.png]

Any further ideas?

I'm trying to go slow here with the system because I'm not familiar with it, and I think going fast is what caused the issue to begin with. I'm not very good with Perl, so I hope I don't need to read through that script to figure out what's wrong. On the bright side, at least it's Perl 5 and not 6. ;)
 
Forgot the pveversion output (you can see which node I ran it on, but they were all installed from the same version and are up to date).
Code:
proxmox-ve: 8.1.0 (running kernel: 6.5.11-7-pve)
pve-manager: 8.1.4 (running version: 8.1.4/ec5affc9e41f1d79)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
proxmox-kernel-6.5: 6.5.13-1
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
ceph: 18.2.1-pve2
ceph-fuse: 18.2.1-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.2
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.1
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.1.0
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve4
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.4-1
proxmox-backup-file-restore: 3.1.4-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.5
proxmox-widget-toolkit: 4.1.4
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.4
pve-edk2-firmware: 4.2023.08-4
pve-firewall: 5.0.3
pve-firmware: 3.9-2
pve-ha-manager: 4.0.3
pve-i18n: 3.2.1
pve-qemu-kvm: 8.1.5-3
pve-xtermjs: 5.3.0-3
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.2-pve2
root@mppve:~#
 
So in the GUI you show an "error" on nodeX, while the CLI output you posted in comment #3 is from nodeY? That's not helpful; I'd say it's counterproductive.
When working with a multi-node setup, please be specific about where the commands are being run or where the GUI operations are being performed.

The GUI clearly shows a "?" next to the NFS storage for two out of three nodes, supported by the "not online" messages in the system log.
This is an indication of either intermittent or completely broken connectivity from those two nodes. You need to investigate and fix it.

Each node has an independent NFS connection to the storage, and each one must be healthy.
The NFS statistics and health probes are based on "rpcinfo" and "showmount" output; test those from each node. You can also run "pvesm" commands on each node.
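For example, from each PVE node, something along these lines (the server name is a placeholder; substitute your NFS host):
Code:
# verify the NFS server's RPC services are reachable from this node
rpcinfo -p omv.example.lan

# list the exports the server is offering to this node
showmount -e omv.example.lan

# ask PVE itself what it sees for the storage
pvesm status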

In summary, you have network connectivity issues from the majority of the nodes in your cluster.

Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I was fiddling around and I just found something that may be relevant.

hppve is the main node; it was the node I started the cluster on. If I try to move a VM disk to the NFS share from either of the sub-nodes (mppve and lppve), it shows 0 bytes on the NFS share. But if I try the move from hppve, it sees the full TB available.

Is this a cluster issue?
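To compare the nodes outside the GUI, I believe the same check can be run per node with pvesm (filtering to just the NFS storage), along these lines:
Code:
# run on each node and compare; on my sub-nodes the storage reports as not online / 0 bytes
pvesm status --storage pve-nfs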
 
I apologize if I wasn't being clear. The node names are visible in the command-line prompts of both log outputs I posted, and between the code blocks I explained that I was moving the VM to a different node and trying again, which produced different error messages than on the first node. I thought that might be helpful information.

I thought this was clear enough, but apparently not. And I'm equally sorry that my eyesight isn't good enough to have spotted the little grey question mark on the node name until you pointed it out. Thank you.
 
So, for anyone who runs into a similar error in the future: on both sub-nodes, nfs-common was masked.

Why a freshly installed Debian base would have nfs-common masked is beyond me, and it's nothing like what I was expecting on a just-installed system. This has narrowed the error down enough that I can troubleshoot further, so thanks for the help.
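A quick way to see the masking, for anyone checking their own nodes, is something like:
Code:
# a masked unit reports "masked" here
systemctl is-enabled nfs-common
systemctl status nfs-common

# on my sub-nodes the unit file was a symlink pointing at /dev/null
ls -l /lib/systemd/system/nfs-common.service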
 
So, the final results for searchers in the future.

The nfs-common service was masked and would not start as a result. On both nodes the service would not enable with "systemctl enable nfs-common", nor could I unmask it via systemctl. I had to manually remove the symbolic link at /lib/systemd/system/nfs-common.service.

After removing the symlink, I enabled and started the service with "systemctl enable nfs-common" and "systemctl start nfs-common". Once nfs-common was running, I restarted corosync and the cluster service on each node. This cleared the error on the mppve node. While troubleshooting why that didn't fix the error on the lppve node, I discovered a typo in the DNS config on lppve. After fixing the typo and again restarting the services on that node, this error was cleared as well.
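Reconstructed as commands (from memory, so treat this as a rough sketch rather than an exact transcript), the fix on each affected node was roughly:
Code:
# the masked unit was a symlink to /dev/null; "systemctl unmask" didn't clear it, so remove it by hand
rm /lib/systemd/system/nfs-common.service

# have systemd re-read its unit files after the change
systemctl daemon-reload

# enable and start nfs-common
systemctl enable nfs-common
systemctl start nfs-common

# then restart corosync and the cluster service (pve-cluster) so the node re-checks the storage
systemctl restart corosync pve-cluster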

Thanks for the help.

Does anyone know if there is a flow chart or other diagram that shows how the PVE daemons communicate and what their individual roles are, something I could use to better visualize how the Proxmox system works?
 
