Hello,
after updating/upgrading our cluster 3 weeks ago (Linux 5.15.85-1-pve #1 SMP PVE 5.15.85-1 (2023-02-01T00:00Z) on all nodes) and adding another NFS storage, we ran into weird, unexplainable NFS issues that we haven't been able to resolve since.
Short summary of the setup: A cluster of 3 asymmetric Proxmox nodes (all with non-production repositories) running VMs off an NFS storage (v4 before, now v3) managed by a bare-metal TrueNAS Core server equipped with HDDs and an L2ARC SSD. All systems are connected via 40G fiber Ethernet through an Extreme X870 switch, and the NFS connections use a dedicated VLAN. After discovering this issue, we disconnected the second NFS storage from Proxmox and the server from the NFS VLAN because we first thought that it was triggering the issue.
The problem is as follows: When something starts to write to the storage at about 10-100 MB/s (we still need to test this further, but cannot because the VMs are in use right now), the affected Proxmox node will almost immediately log the following error:
Code:
pvestatd[5784]: unable to activate storage 'nfs-storage1' - directory '/mnt/pve/nfs-storage1' does not exist or is unreachable
What happens next depends on how long the write lasts. If it only takes a few seconds, the VMs on the given node freeze for a few seconds whenever they write to the storage. If it takes much longer, all nodes begin to log the same error message continuously (one message every 10 seconds) until the write is completely aborted. The write speed then also drops to zero for most of the time. Worst of all, each VM that tries to access its disk during this "outage" will either crash completely (Windows guests) or forcefully unmount its root partition, with every process accessing the disk freezing permanently (Linux guests).
While this happens, all nodes and the TrueNAS server sit idle at 1-2% CPU usage at most, the HDDs are doing nothing (because the write speed has dropped to zero) and respond quickly when accessed directly, and the switch reports neither dropped packets nor more network utilization than at idle. During these events, none of the servers logs anything further about it.
Our setup in general (using HDDs) does not seem to be the problem, because everything worked flawlessly before: with our ZFS configuration, we could reach write speeds of up to 200 MB/s, fully utilizing our disks, and, thanks to the L2ARC and adaptive scheduling, nothing was really affected (no VM crashes, no error logs at all; concurrent writes were slow but kept working).
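Because none of the machines logs anything useful during these events, a small timestamped probe on each node could help to correlate the stalls with the writes. The following is only a minimal sketch, not something we have run on the cluster: the function names `probe_path` and `format_probe` and the 5-second timeout are assumptions made for illustration, and the only value taken from our setup is the mount path from the pvestatd message. It calls `os.statvfs` in a daemon thread, so a hanging NFS call cannot block the probe itself. On an NFS mount, `statvfs` should normally require a round trip to the server, but treat that as an assumption as well.

```python
import os
import threading
import time


def probe_path(path, timeout_s=5.0):
    """Call statvfs on `path` in a daemon thread.

    Returns (status, elapsed_seconds), where status is "ok", "timeout",
    or "error: <message>". A daemon thread is used so that a call stuck
    on an unresponsive NFS server cannot block the caller.
    """
    result = {}

    def worker():
        try:
            os.statvfs(path)
            result["status"] = "ok"
        except OSError as exc:
            result["status"] = f"error: {exc.strerror}"

    start = time.monotonic()
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    elapsed = time.monotonic() - start
    if thread.is_alive():
        # The call has not returned yet; report it and leave the thread behind.
        return "timeout", elapsed
    return result["status"], elapsed


def format_probe(path, status, elapsed, now=None):
    """Build one log line: local timestamp, path, status, latency."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    return f"{stamp} {path} {status} {elapsed:.3f}s"
```

Calling `probe_path("/mnt/pve/nfs-storage1")` once per second on every node and printing `format_probe(...)` to a file on local disk (not on the NFS storage) would give a latency timeline that can be compared against the pvestatd messages.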
What we've tried/verified so far:
- All servers are using the same NTP time server to synchronize their clocks.
- We used NFSv4 before the issue came up and switched to NFSv3 because it made the setup stable enough for day-to-day business. With NFSv4, the issue was triggered more often, at least once a day. With NFSv3, the error happens once or twice a week on average.
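Since the behaviour differs between NFSv4 and NFSv3, it may be worth confirming which version and options each node is actually mounted with. A minimal sketch, assuming the usual `/proc/mounts` layout on Linux (device, mount point, filesystem type, comma-separated options): the helper names are made up for illustration, and the address and export path in the usage note are placeholders, not our real values.

```python
def parse_nfs_mounts(mounts_text):
    """Return one dict per NFS mount found in /proc/mounts-style text."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for opt in fields[3].split(","):
            key, _, value = opt.partition("=")
            options[key] = value  # plain flags such as "hard" map to ""
        if "soft" in options:
            mode = "soft"
        elif "hard" in options:
            mode = "hard"
        else:
            mode = None
        entries.append({
            "server_export": fields[0],
            "mountpoint": fields[1],
            "fstype": fields[2],
            "vers": options.get("vers"),
            "proto": options.get("proto"),
            "mode": mode,
            "timeo": options.get("timeo"),      # tenths of a second
            "retrans": options.get("retrans"),
        })
    return entries


def read_nfs_mounts(path="/proc/mounts"):
    """Read the live mount table and return its NFS entries."""
    with open(path, encoding="utf-8") as handle:
        return parse_nfs_mounts(handle.read())
```

For a placeholder line such as `192.0.2.10:/mnt/tank/vm /mnt/pve/nfs-storage1 nfs rw,vers=3,hard,proto=tcp,timeo=600,retrans=2 0 0`, the parser reports `vers` as `"3"`, `mode` as `"hard"` and `timeo` as `"600"`. Running `read_nfs_mounts()` on all three nodes and comparing the results would show whether they really use identical mount options.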