NFS storage suddenly has unexplainable "latency"/connection problems

kit-of4710
Mar 8, 2023
Hello,

After updating/upgrading our cluster 3 weeks ago (Linux 5.15.85-1-pve #1 SMP PVE 5.15.85-1 (2023-02-01T00:00Z) on all nodes) and adding another NFS storage, we ran into weird, unexplainable NFS issues that we haven't been able to troubleshoot since.

Short summary of the setup: a cluster of 3 asymmetric Proxmox nodes (all with non-production repositories) running VMs off an NFS (v4 before, now v3) storage backed by a bare-metal TrueNAS Core server equipped with HDDs and an L2ARC SSD. All systems are connected over 40G fiber Ethernet via an Extreme X870 switch, and the NFS traffic uses a dedicated VLAN. After discovering this issue, we disconnected the second NFS storage from Proxmox and its server from the NFS VLAN, because we first thought it was triggering the problem.

The problem is as follows: when something starts writing to the storage at roughly 10-100 MB/s (we still need to test further, but cannot right now because the VMs are in use), the affected Proxmox node almost immediately logs the following error:

Code:
pvestatd[5784]: unable to activate storage 'nfs-storage1' - directory '/mnt/pve/nfs-storage1' does not exist or is unreachable
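
For anyone hitting the same message, this is a minimal sketch of how to check by hand what pvestatd is complaining about (the server address is a placeholder, the mount path is the one from this thread):

Code:
# is the export still advertised by the NAS? (replace 192.0.2.10 with the TrueNAS IP)
showmount -e 192.0.2.10

# are the NFS/mountd RPC services answering at all?
rpcinfo -p 192.0.2.10

# how is the share currently mounted (version, timeo, retrans)?
nfsstat -m

# does a simple directory listing come back within a few seconds?
timeout 10 ls /mnt/pve/nfs-storage1 && echo "mount responsive" || echo "mount stalled"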

Depending on how long the write lasts, the behavior differs: if the write only takes a few seconds, the VMs on the affected node freeze for a few seconds while writing to the storage. If the write takes much longer, however, all nodes begin issuing the same error message continuously (one message every 10 seconds) until the write is aborted entirely, and the write speed drops to zero for most of that time. Worst of all: every VM that tries to access its disk during this "outage" either crashes completely (Windows guests) or forcibly unmounts its root partition, with every process touching the disk freezing permanently (Linux guests).
Yet while this happens, all nodes and the TrueNAS server sit idle at 1-2% CPU usage at most, the HDDs are doing nothing and respond quickly when accessed directly (since the write speed has dropped to zero), and the switch reports neither dropped packets nor any network utilization above idle. During these events, no server logs anything further about it.

Our setup in general (using HDDs) does not seem to be the problem, because everything was working flawlessly before: with our ZFS configuration, we could sustain write speeds of up to 200 MB/s, fully utilizing our disks, and still, thanks to the L2ARC and adaptive scheduling, nothing was really affected (no VM crashes, no error logs at all; concurrent writes were slow but kept working).

What we've tried/verified so far:
  • All servers are using the same NTP time server to synchronize their clocks.
  • We used NFSv4 before the issue came up and switched to NFSv3 because it made the setup stable enough for day-to-day business. With NFSv4, the issue was triggered more often, at least once a day; with NFSv3, the error happens once or twice a week on average (the relevant storage option is sketched below).
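
For context, forcing the NFS protocol version is just a mount option on the Proxmox storage definition. A sketch of the relevant /etc/pve/storage.cfg entry, with the server address and export path as placeholders (only the storage ID nfs-storage1 is taken from this thread):

Code:
nfs: nfs-storage1
        export /mnt/tank/proxmox
        path /mnt/pve/nfs-storage1
        server 192.0.2.10
        content images,rootdir
        options vers=3

The same option can be set in the GUI under Datacenter -> Storage -> Edit or, as far as I know, with pvesm set nfs-storage1 --options vers=3.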
Does anyone have ideas on what we can look into to find out what causes this issue, or how we might even be able to fix it?
 
It could be a variation or combination of this issue:
https://forum.proxmox.com/threads/pvestatd-nfs-mount-storage-is-not-online.118466/

It could also be some sort of incompatibility between the newer kernel and the NAS; you can try downgrading the kernel:
https://forum.proxmox.com/threads/how-to-pin-unpin-a-specific-kernel.111732/
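
Roughly, pinning a kernel on a reasonably recent PVE 7 install looks like this (a sketch based on the thread above; the version string is just an example from this thread and has to match a kernel actually installed on the node):

Code:
# show the kernels installed on this node
proxmox-boot-tool kernel list

# pin the last known-good kernel and reboot into it
proxmox-boot-tool kernel pin 5.15.64-1-pve
reboot

# later, "proxmox-boot-tool kernel unpin" reverts to booting the newest kernel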

Overall, there is probably not enough information in your post to pinpoint the issue with 100% certainty. Helpful information would be:
1) What did you upgrade from?
2) What is the current pveversion -v output?
3) Can you reproduce the issue with direct/manual NFS mounts and variations of dd/fio? (See the example commands below.)
4) Have you checked the network interfaces for errors?
etc.
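
To illustrate point 3, something along these lines would already tell a lot (a sketch only; the server address and export path are placeholders, not taken from your setup):

Code:
# manual NFS mount outside of the Proxmox storage layer
mkdir -p /mnt/nfs-test
mount -t nfs -o vers=3 192.0.2.10:/mnt/tank/proxmox /mnt/nfs-test

# sustained sequential write, similar to the 10-100 MB/s pattern that triggers the issue
fio --name=seqwrite --directory=/mnt/nfs-test --rw=write --bs=1M \
    --size=4G --direct=1 --time_based --runtime=120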

Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thanks, bbgeek17, for the fast answer and the suggestions.

Thanks for pointing it out, but I don't think so. That issue seems different, as they needed to restart their Unraid NFS server to restore the connection. In our case, the issues mostly "resolve themselves" once the load decreases (nonetheless, we tried restarting the NFS service on TrueNAS, without success).

1) What did you upgrade from?
Nodes 1 & 2 were upgraded to Linux 5.15.60-1-pve on 2022-10-03 and stayed on that version until very recently.
Node 3 was added on 2022-10-22 and stayed on 5.15.64-1-pve. This mix of kernel versions was fully stable, by the way.

On 2023-02-11 we upgraded all packages on node 1 (it mostly serves as the designated upgrade-testing / "emergency fallback in case both other nodes are down" host), but according to the systemd journal it stayed on the same kernel.

Then on 2023-02-15 (we had an explicit maintenance window) we decided to upgrade the other two nodes as well, since node 1 had been running stable. However, they upgraded straight to Linux 5.15.85-1-pve, which we didn't notice at first.
During that maintenance window we also introduced the 2nd TrueNAS storage server (based on NVMe SSDs), started copying some test VMs to it, ran them, and benchmarked the storage server.
That's when the error became apparent to us: not only did the test VM crash and Proxmox report the errors from above for the 2nd storage, but the errors also appeared for the 1st storage server, and some of the other VMs on the 1st storage started stalling or crashing. The errors for both storages appeared on all nodes, so even nodes that didn't run any VMs from the 2nd storage had connection issues to it (I assume Proxmox regularly checks the connection to every configured storage just to be sure).

During the first hours of debugging, we discovered that node 1 still had an older kernel, assumed the error might be caused by this difference, and upgraded it as well.

2) What is the current pveversion -v output?
All nodes now have exactly the same pveversion report (aside from different kernel histories; checked with pveversion -v | grep -vF pve-kernel-5 | sha512sum -). The pveversion -v output of node 3:
Code:
proxmox-ve: 7.3-1 (running kernel: 5.15.85-1-pve)
pve-manager: 7.3-6 (running version: 7.3-6/723bb6ec)
pve-kernel-helper: 7.3-4
pve-kernel-5.15: 7.3-2
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-1
lxcfs: 5.0.3-pve1
novnc-pve: 1.3.0-3
openvswitch-switch: 2.15.0+ds1-2+deb11u2.1
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.5.5
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-2
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1

3) Can you reproduce the issue with direct/manual NFS mounts and variations of dd/fio?
To avoid interfering with the current setup, we didn't try it from another machine, but when Proxmox starts having these issues (and logs them), we can confirm the latency simply by running ls /mnt/pve/nfs-storage1, which then takes anywhere between 1 and 25 seconds to complete. Reading files and printing them to the console has issues too: it stalls for about 4 seconds every 10 seconds.
We can also reproduce the issues by copying files to and from the NFS share with basic Linux tools (i.e. not using Proxmox tools).
We might try a proper dd/fio run in the next maintenance window.
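
If it helps anyone measuring the same thing, here is a minimal sketch of how those stalls could be recorded over time (the mount path is the one from above; the interval and output format are just examples):

Code:
# log how long each directory listing of the NFS mount takes, once per second
while true; do
    start=$(date +%s%3N)
    ls /mnt/pve/nfs-storage1 > /dev/null
    echo "$(date -Is) listing took $(( $(date +%s%3N) - start )) ms"
    sleep 1
done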

4) Have you checked the network interfaces for errors?
Honestly, we only checked for issues on the attached switch, not on the interfaces themselves. Do you happen to know how to access those statistics?


And also: thank you very, very much for making us look into the old logs from before the incident. While reading the logs to check the kernel versions prior to the upgrade, we discovered that the issue had already started appearing on 2023-02-07, before we introduced any changes (as far as we can remember now), but no VMs were failing back then, as the issue only happened once or twice per night.
I think we will try to reconstruct what changed on or before 2023-02-07, since we only started updating the nodes later.
 
This issue resolved itself after we upgraded both TrueNAS systems. Shortly before that, we found an entry in the TrueNAS forum describing exactly the issue we were seeing. So it was not an issue coming from Proxmox; it was just a coincidence that we first saw it after the Proxmox updates. Thankfully it is solved.

Thanks, bbgeek17, for helping.

PS/EDIT: I would like to mark this thread as solved, but it seems that I cannot edit my original post to do that because it is already too old.
 
