VM Internal Disk (ATA) Errors (NFS Share disconnecting?)

Gandalf1783

New Member
Oct 4, 2022
Germany
Hey everyone,

I have some of my VM disks on an NFS share on my Unraid NAS, connected via a 10 Gbit link.
After some time, I started noticing ATA errors in my Nextcloud VM. It has two disks, one for data and one for the OS.

While the VM itself runs fine, the disk for the data suddenly returns errors:

[Screenshots of the errors: 1664890515138.png, 1664890556753.png]

Since my NAS did not report any errors (e.g. parity errors), wasn't shut down, and didn't lose its connection, I cannot see why this is happening.
Upon trying to back up VMs in this state (using PBS), they exit with "Stale File handle" (I think it was error code -116). I noticed this because of the automatic backups.
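
For reference, error 116 on Linux is ESTALE ("Stale file handle"). A minimal sketch of how one could probe a disk image on the NFS mount from the PVE host and tell a stale handle apart from other failures. The function names and the example path are assumptions for illustration, not anything from the setup described here:

```python
import errno
import os


def describe_oserror(exc: OSError) -> str:
    """Map an OSError to a short label (ESTALE is what NFS returns for a stale handle)."""
    if exc.errno == errno.ESTALE:
        return "stale-handle"
    if exc.errno == errno.EIO:
        return "io-error"
    if exc.errno == errno.ENOENT:
        return "missing"
    return "other"


def probe_path(path: str) -> str:
    """Stat a file and report 'ok' or the class of error the kernel returned."""
    try:
        os.stat(path)
    except OSError as exc:
        return describe_oserror(exc)
    return "ok"


if __name__ == "__main__":
    # Hypothetical path: substitute the real disk image on the NFS storage.
    print(probe_path("/mnt/pve/unraid-nfs/images/100/vm-100-disk-1.qcow2"))
```

Running something like this on the host while the VM shows ATA errors would indicate whether the host itself already sees the stale handle, or only the guest is affected.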

This is not the only VM this has happened to. But for some VMs with disks on the exact same share, nothing happens; e.g. our Minecraft server runs fine without any of these disk errors. The errors have already occurred 3-4 times.

I noticed that my Unraid disks seem to spin down after *some time*, but shouldn't they spin back up and read the data?
Is this actually an issue with the NAS?

I am running Proxmox 7.2-11, and these are the package versions on the node:
proxmox-ve: 7.2-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-11
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.60-1-pve: 5.15.60-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-9
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.6-1
proxmox-backup-file-restore: 2.2.6-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-3
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1
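
One thing that can be read off this list: the node reports running kernel 5.13.19-2-pve while pve-kernel-5.15.60-1-pve is also installed, so a reboot into the newer kernel appears to be pending. Whether that is related to the errors is not established. A small sketch of how to check this automatically from text in this format; the helper names are made up for illustration, and `SAMPLE` is just an excerpt of the list:

```python
import re
from typing import List, Optional, Tuple

# Excerpt of the package list, in the same "name: version" format.
SAMPLE = (
    "proxmox-ve: 7.2-1 (running kernel: 5.13.19-2-pve) "
    "pve-kernel-5.15: 7.2-11 pve-kernel-5.13: 7.1-9 "
    "pve-kernel-5.15.60-1-pve: 5.15.60-1 "
    "pve-kernel-5.13.19-6-pve: 5.13.19-15 "
    "pve-kernel-5.13.19-2-pve: 5.13.19-4"
)


def running_kernel(text: str) -> Optional[str]:
    """Extract the kernel named in '(running kernel: ...)'."""
    match = re.search(r"running kernel: ([^)\s]+)", text)
    return match.group(1) if match else None


def installed_kernels(text: str) -> List[str]:
    """Find concrete kernel packages such as pve-kernel-5.15.60-1-pve."""
    return re.findall(r"pve-kernel-(\d+\.\d+\.\d+-\d+-pve):", text)


def version_key(kernel: str) -> Tuple[int, ...]:
    """Turn '5.15.60-1-pve' into (5, 15, 60, 1) for comparison."""
    return tuple(int(part) for part in re.findall(r"\d+", kernel))


def newer_kernel_installed(text: str) -> Optional[str]:
    """Return the newest installed kernel if it is newer than the running one."""
    running = running_kernel(text)
    installed = installed_kernels(text)
    if running is None or not installed:
        return None
    newest = max(installed, key=version_key)
    return newest if version_key(newest) > version_key(running) else None


if __name__ == "__main__":
    print(newer_kernel_installed(SAMPLE))
```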

After a reboot, the VM works again until a few hours or days later. In this case, the Nextcloud VM had been running for 1 day and 1:39 hours before I noticed this.

Thanks!
 
So you have Unraid running on physical hardware, providing an NFS share to PVE where your VM disks/qcow files are located, and you see the errors inside the VM?
If so, the VMs don't know anything about NFS, but errors such as stale NFS handles will bubble up as some sort of I/O error.

The problem can be anywhere along the path: PVE kernel, NIC, cable, switch, cable, Unraid NIC, Unraid software.
Start by looking at recent changes, e.g. kernel updates. There were some issues recently between certain NAS vendors and specific kernel versions on clients (PVE). Look at all logs/stats and try to isolate the issue.
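
As a rough starting point for the log side, a sketch that filters log text for lines that usually matter with NFS trouble. The keyword list is an assumption (the Linux NFS client typically logs messages of the form "nfs: server ... not responding" and "nfs: server ... OK"), and the function name is hypothetical:

```python
import sys
from typing import Iterable, List

# Lowercase substrings to look for; extend as needed for your setup.
PATTERNS = ("nfs: server", "not responding", "stale file handle")


def nfs_related(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain any of the NFS-related keywords."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(pattern in lowered for pattern in PATTERNS):
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python nfs_scan.py <logfile>
    with open(sys.argv[1], errors="replace") as handle:
        for hit in nfs_related(handle):
            print(hit)
```

One way to feed it would be to save the kernel log first, for example with `journalctl -k > kernel.log` on the PVE host, and compare the timestamps of any hits with the time the ATA errors appear in the guest.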


 
Upon trying to back up VMs in this state (using PBS), they exit with "Stale File handle"

I remember a thread from a while ago: [1]
That thread talks about Samba, but NFS also gets mentioned in the Unraid thread [2].

Not sure if this is the same problem as yours, but maybe [2] helps you in some way.

[1] https://forum.proxmox.com/threads/unable-to-activate-storage-cifs-mount-dies-randomly.58922
[2] https://forums.unraid.net/topic/86707-cifs-smb-stale-file-handle
 
I just found it interesting that it's only happening in the Nextcloud VM and not in others that are also on the same server over NFS.
Therefore I thought I'd ask here in case there are known problems with NFS shares or similar.



Answer to Neobin:
In [2], they talk about disabling the cache, which seemed to work. I don't know if they had a cache drive installed. I don't, but I will test that option.
 
I have taken a look at the spin-down settings. It's set to 30 minutes by default.
I can't see why a spin-up would take so long that it would cause the errors mentioned above.

The cache option is in my case already set to "None" for my Proxmox share on Unraid.
I am trying to retrieve the Unraid NFS logs now so I can debug it the next time it goes down.

Also, I have confirmed that the data disk I'm mentioning is indeed on a different disk that has to spin up whenever a Nextcloud user needs their data.
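
To get a feeling for how long a spin-up actually delays a read, one could time the first read of a file on that disk after it has spun down. A minimal sketch; the function name and the example path are hypothetical, and client-side caching may hide the delay if the data was read recently:

```python
import time
from typing import Tuple


def timed_read(path: str, nbytes: int = 4096) -> Tuple[float, int]:
    """Read up to nbytes from path and return (seconds taken, bytes read)."""
    start = time.monotonic()
    # buffering=0 avoids Python-level buffering; OS and NFS caches still apply.
    with open(path, "rb", buffering=0) as handle:
        data = handle.read(nbytes)
    return time.monotonic() - start, len(data)


if __name__ == "__main__":
    # Hypothetical path on the share backed by the spun-down disk.
    try:
        elapsed, count = timed_read("/mnt/pve/unraid-nfs/spinup-test.bin")
        print(f"read {count} bytes in {elapsed:.2f} s")
    except OSError as exc:
        print(f"read failed: {exc}")
```

If the measured delay approaches the timeouts in the guest or on the NFS mount, spin-down becomes a more plausible suspect; if it stays at a few seconds, it probably is not the whole story.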
 
It seems like a reboot has solved everything.
I plugged in a power strip, which seems to have caused a short, and, well, my breaker tripped.

So basically everything was forced to reboot.
I can only assume there was an update to the NFS driver for Proxmox that required a reboot.

It has now been running for multiple days without any problems. Gonna close this as solved now, though a direct solution could not be found.
 