I have 5 NFS shares coming from my Unraid server into my 4-node Proxmox cluster (nodes named Prox-1 through Prox-4). These shares are:
- AppData
- Archives
- Backups
- ISOs
- VM-Datastore
I'm trying to have all shares auto-mount on boot and auto-reconnect if there's a network interruption or error. I saw in an older thread that running this script from cron every minute could automatically reconnect them:
Code:
#!/bin/bash
# Check each storage directory under /mnt/pve; if listing it reports
# a stale NFS file handle, unmount it so it can be remounted cleanly.
list=$(ls /mnt/pve)
for i in $list
do
    status=$(ls "/mnt/pve/$i" 2>&1)
    if [[ $status =~ .*Stale.* ]]
    then
        umount "/mnt/pve/$i"
    fi
done
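For completeness, this is the crontab entry I was planning to use to run it every minute (the script name and path are just my guess at where to keep it):
Code:
# /etc/cron.d/nfs-stale-check -- run the stale-handle check every minute as root
* * * * * root /usr/local/bin/nfs-stale-check.sh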
However, that post was from 2017, so please let me know if there's another/better way to do this.
Currently, when I boot the Prox hosts, they see the shares and connect to them with no problem. As mentioned above, though, the Backups share disconnects partway through the backup operation and doesn't automatically reconnect. I get this error when clicking on the share:
Code:
unable to activate storage 'Backups' - directory '/mnt/pve/Backups' does not exist or is unreachable (500)
Looking through the error messages, I see that the backup transfer starts, but the write operations then stop partway through. The shares themselves are set to private, with the Prox hosts' IP addresses and permissions specified in Unraid like this:
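Code:
192.168.2.50(sec=sys,rw) 192.168.2.51(sec=sys,rw) 192.168.2.52(sec=sys,rw) 192.168.2.53(sec=sys,rw)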
(Logs from VMID 100 & 201 attached)
(fstab and config.cfg info attached)
To my untrained eye, the logs seem to say that the disks can't keep up with the inflow of data and are erroring out. I do have an SSD cache disk that I can (and would prefer to) use in the Unraid server for the obvious transfer-speed benefit. However, looking through various forum posts, it seems that this can cause 'stale file handle' errors on the NFS shares when Mover runs to move the data from the cache disk to the array. Could someone point me to a workaround for this?
I'm open to using SMB if that would give better performance or prevent this issue; NFS just seemed like the logical choice since both systems are Linux-based.
On a possibly related note, I have a local directory on Prox-1 (named 'localbackups') that I'm trying to share with the other 3 nodes. It seems to be shared out, but I get this 500 error when I try to access it:
Code:
unable to activate storage 'localbackup' - directory is expected to be a mount point but is not mounted: '/mnt/pve/localbackup' (500)
Does this local backup directory need an entry in fstab, and if so, only on the main node or on all nodes? What is the syntax for it? Does PVE see it as a network share, so it would just use the host's IP?
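For example, would something like this be the right shape, assuming I export 'localbackups' from Prox-1 over NFS (I'm guessing at the paths and that Prox-1 is 192.168.2.50)?
Code:
# /etc/exports on Prox-1 (my guess at the path)
/mnt/pve/localbackups 192.168.2.0/24(rw,sync,no_subtree_check)

# /etc/fstab on Prox-2 through Prox-4
192.168.2.50:/mnt/pve/localbackups  /mnt/pve/localbackup  nfs  defaults,_netdev  0  0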
Lastly, is there a command to reconnect the NFS shares when they give this 500 error? Currently the only fix I know of is to reboot the entire cluster. After the Backups share disconnected, I ran mount -av, but it showed that everything was already mounted:
Code:
root@Prox-1:~# mount -av
/proc : already mounted
/mnt/pve/Archives : already mounted
/mnt/pve/AppData : already mounted
/mnt/pve/ISOs : already mounted
/mnt/pve/VM-Datastore : already mounted
/mnt/pve/Backups : already mounted
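Is the right fix to force off the stale mount and remount it from fstab, something like this?
Code:
# force/lazy unmount the stale share, then remount it from its fstab entry
umount -f -l /mnt/pve/Backups
mount /mnt/pve/Backups
Or will Proxmox remount it on its own once the stale mount is gone?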
Thank you so much; please let me know if I can provide any other information. For my own clarity, the things I'm looking to fix are:
- The Backups NFS share disconnecting during backup operations
- Possibly related: Unraid's Mover causing 500 disconnect errors
- The backup files seem to be large and mostly zeros (750G to 3.2G). Is there a way to transfer just the non-zero data, or is this the recommended process?
- The 'localbackup' directory not being shared with the other hosts in the Prox cluster
- Remounting the NFS shares without having to reboot the entire cluster