VM Internal Disk (ATA) Errors (NFS Share disconnecting?)


New Member
Oct 4, 2022
Hey everyone,

I have some of my VM disks on an NFS share on my Unraid NAS, connected via a 10 Gbit link.
Now, after some time, I started noticing ATA errors coming up in my Nextcloud VM. It has two disks, one for data and one for the OS.

While the VM itself runs fine, the data disk suddenly starts returning errors:


Since my NAS did not report any errors (e.g. parity errors), wasn't shut down, and hasn't lost connection or anything like that, I cannot see why this happens.
Upon trying to back up VMs in this state (using PBS), the backups fail with "Stale file handle" (I think it was error code -116). (I noticed this through the automatic backups.)
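Side note: on Linux, error code -116 matches ESTALE, the errno the NFS client returns when the server no longer recognizes a file handle. You can confirm the mapping from Python's errno module:

```python
import errno
import os

# On Linux, errno 116 is ESTALE ("Stale file handle") -- the error the
# NFS client raises when a previously valid file handle is no longer
# recognized by the server, e.g. after the export went away and came back.
print(errno.ESTALE)                # -> 116 on Linux
print(os.strerror(errno.ESTALE))   # human-readable message
```

QEMU and PBS typically report kernel errnos negated, hence the -116 seen in the backup task log.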

Now, this is not the only VM this has happened to. At the same time, some VMs with disks on the exact same share are unaffected: our Minecraft server, for example, runs fine without any of these disk errors. The errors have already occurred 3-4 times.

I also started to notice that my Unraid disks seem to spin down after some time, but shouldn't they spin back up when the data is read?
Is this actually an issue with the NAS?

I am running Proxmox 7.2-11, and these are the package versions on the node:
proxmox-ve: 7.2-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-11
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.60-1-pve: 5.15.60-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-9
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.6-1
proxmox-backup-file-restore: 2.2.6-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-3
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1

After a reboot, the VM works again until a few hours or days later. In this case, the Nextcloud VM had been running for 1 day and 1:39 hours before I noticed this.

So you have Unraid running on physical hardware, providing an NFS share to PVE where your VM disks/qcow2 files are located, and you see the errors inside the VM?
If so: the VMs don't know anything about NFS, but errors such as stale NFS handles will bubble up as some sort of I/O errors.

The problem can be anywhere along the chain: PVE kernel, NIC, cable, switch, Unraid NIC, Unraid software.
Start by looking at recent changes, i.e. kernel updates. There were some issues recently between certain NAS vendors and specific kernel versions on clients (PVE). Look at all logs/stats and try to isolate the issue.
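One client-side knob worth checking while isolating this is how PVE mounts the share. A sketch of an NFS entry in /etc/pve/storage.cfg with explicit mount options (the storage name, server address, and export path here are made-up placeholders, not values from this thread):

```
nfs: unraid-vmstore
        server 192.168.1.50
        export /mnt/user/proxmox
        path /mnt/pve/unraid-vmstore
        content images
        options vers=4.1,hard,timeo=600,retrans=5
```

With a hard mount the NFS client keeps retrying an unresponsive server instead of returning I/O errors to the guest, which can change whether a brief hiccup on the NAS side surfaces as ATA errors inside the VM.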

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Upon trying to backup VMs in this state (using PBS), they exit with "Stale File handle"

I remember a thread from awhile ago: [1]
While that thread talks about Samba, NFS also gets mentioned in the Unraid thread [2].

Not sure if this is the same problem as yours, but maybe it ([2]) helps you in some way.

[1] https://forum.proxmox.com/threads/unable-to-activate-storage-cifs-mount-dies-randomly.58922
[2] https://forums.unraid.net/topic/86707-cifs-smb-stale-file-handle
I just found it interesting that it's only happening in the Nextcloud VM, and not in others that are also on the same server over NFS.
Therefore I thought I'd ask here in case there are known problems with NFS shares or similar.

Answer to Neobin:
In [2], they talk about disabling the cache, which seemed to work for them. I don't know if they had a cache drive installed; I don't, but I will test that option.
I have taken a look at the spin-down settings. It's set to 30 minutes by default.
I can't see why a spin-up would take so long that it would produce the errors mentioned above.

The cache option is in my case already set to "None" for my Proxmox share on Unraid.
I am now trying to retrieve the Unraid NFS logs so I can debug it the next time it goes down.

Also, I have confirmed that the data disk I'm mentioning is indeed on a different physical disk, one that has to spin up whenever a Nextcloud user needs their data.
It seems like a reboot has solved everything.
I plugged in a power strip, which apparently caused a short, and my breaker tripped.

So basically everything was forced to reboot.
I can only assume there was an update to the NFS driver for Proxmox that required a reboot.

It has now been running for multiple days without any problems. Going to close this as solved now, even though a direct solution could not be found.

