NFS storage breaks in 8.3.2

mmmzon

New Member
Mar 30, 2024
Good morning, forum.

I have two separate nodes at home, both running 8.3.2. All my VMs keep their storage on a local NAS via NFS shares attached through the Proxmox UI. This setup has been working for close to 2 years now, with no hiccups. After upgrading to 8.3.2, the NFS shares dropped dead on me yesterday morning, sending the nodes into a reboot frenzy. The NAS is working fine, nothing has changed in LAN connectivity, and I can attach the shares via NFS on other machines without any issues.

I rebuilt the nodes, since I suspected issues with the update itself. I re-installed version 8.2.2 (that was the ISO I had locally), re-added the NFS shares, and ran the upgrade.

This morning, one of the nodes (node 2) lost NFS share access again. Node 1 is fine for now. Before losing access, I noticed the status of all VMs, the node, storage, etc. on node 2 change to unknown, i.e., the individual elements showed a gray question mark instead of the expected green check.

Code:
root@mox2:~# pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-6-pve)
pve-manager: 8.3.2 (running version: 8.3.2/3e76eec21c4a14a7)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-6
proxmox-kernel-6.8.12-6-pve-signed: 6.8.12-6
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.1.2
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.2-1
proxmox-backup-file-restore: 3.3.2-2
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.3
pve-cluster: 8.0.10
pve-container: 5.2.3
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-2
pve-ha-manager: 4.0.6
pve-i18n: 3.3.2
pve-qemu-kvm: 9.0.2-4
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.3
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1

And here is the storage configuration:

Code:
root@mox2:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content vztmpl,backup,iso

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

dir: vm-qcow2
        path /mnt/data/vm-qcow2
        content images
        preallocation off
        prune-backups keep-all=1
        shared 0

dir: trunas-iso
        path /mnt/nas-iso
        content iso
        prune-backups keep-all=1
        shared 0

dir: trunas-vm
        path /mnt/nas-vm
        content images
        preallocation off
        prune-backups keep-all=1
        shared 0

dir: trunas-snapshot
        path /mnt/nas-snapshot
        content snippets
        prune-backups keep-all=1
        shared 0

Listing the VM images works fine on node 1, but produces nothing on node 2:

Code:
root@mox1:~# ls -lah /mnt/nas-vm/images/
total 37K
drwxr-xr-x 25 root root 25 Jan 11 15:02 .
drwxr-xr-x  3 root root  3 May  4  2024 ..
drwxr-----  2 root root  4 May  4  2024 100
drwxr-----  2 root root  5 Nov 28 11:49 1000
drwxr-----  2 root root  4 May 31  2024 1005
drwxr-----  2 root root  5 May  5  2024 101
drwxr-----  2 root root  4 May  4  2024 102
drwxr-----  2 root root  4 May  4  2024 103
drwxr-----  2 root root  4 May  4  2024 104
drwxr-----  2 root root  4 May  4  2024 105
drwxr-----  2 root root  2 Nov  2 14:10 106
drwxr-----  2 root root  4 May  4  2024 107
drwxr-----  2 root root  4 May  4  2024 108
drwxr-----  2 root root  4 May 30  2024 109
drwxr-----  2 root root  4 May 31  2024 110
drwxr-----  2 root root  5 Nov 20 11:37 111
drwxr-----  2 root root  4 May  4  2024 1111
drwxr-----  2 root root  4 May 30  2024 113
drwxr-----  2 root root  4 May 29  2024 114
drwxr-----  2 root root  4 Nov  6 19:58 117
drwxr-----  2 root root  4 Sep  1 08:32 118
drwxr-----  2 root root  3 Nov  2 09:17 119
drwxr-----  2 root root  2 Nov  2 17:32 120
drwxr-----  2 root root  3 Jan 14 15:09 127
drwxr-----  2 root root  2 Sep 14 17:18 201

Code:
root@mox2:~# ls -lah /mnt/nas-vm/images/
total 8.0K
drwxr-xr-x 2 root root 4.0K Jan 17 06:50 .
drwxr-xr-x 3 root root 4.0K Jan 17 06:50 ..

Any attempt to mount the NFS shares on node 2 just hangs

Code:
root@mox2:~# mount -a

with no result or error code. I did find the same kind of mounting error in boot logs as well.

At this time, I am not sure what to do next here. Re-installing again is a pain in the neck, and there is zero guarantee it will do any good. On the other hand, running an old version (8.2.x) is not something I'd fancy long term either.
 
After a good long while ...

Code:
root@mox2:~# mount -a
mount.nfs: Connection timed out

but the NAS remains accessible with no issues

Code:
root@mox2:~# ping 192.168.150.240
PING 192.168.150.240 (192.168.150.240) 56(84) bytes of data.
64 bytes from 192.168.150.240: icmp_seq=1 ttl=64 time=0.240 ms
64 bytes from 192.168.150.240: icmp_seq=2 ttl=64 time=0.214 ms
64 bytes from 192.168.150.240: icmp_seq=3 ttl=64 time=0.183 ms
^C
--- 192.168.150.240 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2081ms
rtt min/avg/max/mdev = 0.183/0.212/0.240/0.023 ms

and when running a packet capture in a shell on the NAS side, I can clearly see NFS communication between node 2 and the NAS, but the mount still fails for reasons unknown.
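
The capture was something along these lines (the interface name here is a placeholder for whatever the NAS actually uses):

Code:
# watch NFS (2049) and portmapper (111) traffic on the NAS side
# "igb0" is a placeholder; substitute the NAS's real interface
tcpdump -ni igb0 'port 2049 or port 111'

And here is the corresponding mount failure from the boot log: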

Code:
Jan 17 06:50:43 mox2 systemd[1]: mnt-nas\x2diso.mount: Mounting timed out. Terminating.
Jan 17 06:50:43 mox2 systemd[1]: mnt-nas\x2dvm.mount: Mounting timed out. Terminating.
Jan 17 06:50:43 mox2 systemd[1]: mnt-nas\x2diso.mount: Mount process exited, code=killed, status=15/TERM
Jan 17 06:50:43 mox2 systemd[1]: mnt-nas\x2diso.mount: Failed with result 'timeout'.
Jan 17 06:50:43 mox2 systemd[1]: mnt-nas\x2diso.mount: Unit process 1475 (mount.nfs) remains running after unit stopped.
Jan 17 06:50:43 mox2 systemd[1]: Failed to mount mnt-nas\x2diso.mount - /mnt/nas-iso.
Jan 17 06:50:43 mox2 systemd[1]: Dependency failed for remote-fs.target - Remote File Systems.
Jan 17 06:50:43 mox2 systemd[1]: remote-fs.target: Job remote-fs.target/start failed with result 'dependency'.
Jan 17 06:50:43 mox2 systemd[1]: mnt-nas\x2dvm.mount: Mount process exited, code=killed, status=15/TERM
Jan 17 06:50:43 mox2 systemd[1]: mnt-nas\x2dvm.mount: Failed with result 'timeout'.
Jan 17 06:50:43 mox2 systemd[1]: Failed to mount mnt-nas\x2dvm.mount - /mnt/nas-vm.
Jan 17 06:50:43 mox2 systemd[1]: mnt-nas\x2dsnapshot.mount: Mounting timed out. Terminating.
Jan 17 06:50:43 mox2 systemd[1]: Reached target pve-storage.target - PVE Storage Target.
Jan 17 06:50:43 mox2 systemd[1]: Starting lxc.service - LXC Container Initialization and Autoboot Code...
Jan 17 06:50:43 mox2 systemd[1]: Starting rrdcached.service - LSB: start or stop rrdcached...
Jan 17 06:50:43 mox2 systemd[1]: systemd-pcrphase.service - TPM2 PCR Barrier (User) was skipped because of an unmet condition check (ConditionPathExists=/sys/firmware/efi/efivars/StubPcrKernelImage-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f).
Jan 17 06:50:43 mox2 systemd[1]: Starting systemd-user-sessions.service - Permit User Sessions...
Jan 17 06:50:43 mox2 systemd[1]: mnt-nas\x2dsnapshot.mount: Mount process exited, code=killed, status=15/TERM
Jan 17 06:50:43 mox2 systemd[1]: mnt-nas\x2dsnapshot.mount: Failed with result 'timeout'.
Jan 17 06:50:43 mox2 systemd[1]: Failed to mount mnt-nas\x2dsnapshot.mount - /mnt/nas-snapshot.
 
I did find the same kind of mounting error in boot logs as well.
It may be useful to post that.
At this time, I am not sure what to do next here.
You really have two options: a) stay on a known-good software level, or b) troubleshoot the issue.

Since PVE uses the Linux kernel NFS client, you can: a) install 8.x and pin or downgrade the kernel, or b) install the opt-in kernel (6.11?).
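
A rough sketch of both options (the 6.8.4-2 kernel is taken from your pveversion output above, which shows it is still installed; the opt-in package name is an assumption, so verify it exists in your repository):

Code:
# option (a): pin the older kernel that pveversion -v shows is still installed
root@mox2:~# proxmox-boot-tool kernel pin 6.8.4-2-pve
root@mox2:~# reboot
# option (b): try the newer opt-in kernel instead
root@mox2:~# apt update && apt install proxmox-kernel-6.11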
Other than that: disable NFS in PVE to minimize boot/startup delays, and troubleshoot with manual operations.
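
By manual operations I mean, for example, mounting a single export by hand with verbose output and comparing NFS versions (the export path below is a placeholder for whatever TrueNAS actually exports):

Code:
# try one export by hand, verbose, forcing NFSv3; repeat with vers=4 to compare
root@mox2:~# mount -v -t nfs -o vers=3 192.168.150.240:/mnt/tank/vm /mnt/nas-vm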
I have not seen mass reports of broken NFS on the forum, so it is possible that the particular combination of your NAS and PVE is responsible. Try to isolate the working/non-working combination and report back.

You can always try a network trace. Additionally, PVE relies on "rpcinfo" output to probe the NAS, so make sure it's working. "showmount" is another good one.
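
For example, using the NAS IP from your ping output:

Code:
# check that the NAS portmapper answers and lists its RPC services
root@mox2:~# rpcinfo -p 192.168.150.240
# list the exports the NAS advertises to this client
root@mox2:~# showmount -e 192.168.150.240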

Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Wondering why you mount your NAS via fstab and then configure a dir storage, instead of skipping fstab and using the datacenter storage type nfs?
Still no NFS problems here with any PVE release from 7.4 through 8.3.2, kernel 6.8.12-6.
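
Roughly like this, in the UI or on the CLI (a sketch; the storage ID and export path are placeholders for your actual TrueNAS export):

Code:
# hypothetical example; replace the export path with the real one on the NAS
root@mox2:~# pvesm add nfs truenas-vm --server 192.168.150.240 --export /mnt/tank/vm --content images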
 
Wondering why you mount your NAS via fstab and then configure a dir storage, instead of skipping fstab and using the datacenter storage type nfs?
Do you mean mounting it through the UI itself? I tried to, but it times out with the error "cfs-lock 'file-storage_cfg' error: got lock request timeout".

I am using TrueNAS, and it has been working reliably for 2+ years with no hiccups; suddenly it just stopped, with no changes to the LAN configuration. I am spinning my wheels trying to understand what changed here, and the only thing that did change was the update to the Proxmox nodes.
 
