Troubleshooting NFS stale mount without rebooting... feasible?

May 23, 2012
Hi forum.

I've found a problem in my 5-node cluster hosted at OVH: one of my nodes somehow got cut off from the provider's NFS storage service that holds our backups, and its NFS mounts have gone stale.

The NFS servers are up, and the rest of the nodes have them mounted and working fine. Moreover, the connection from the affected node to the NFS servers works fine too... I'm fairly sure the root cause was a temporary break in the network routing from this particular node to the backup servers, which left the NFS mounts stale/hung.

The mount command shows the mounts as up, but they're not working. Tonight's backups have failed, and the storages appear as disabled.
I gave the pvesm command a shot just in case it worked, but as I expected, it didn't... the NFS mount is in a hung state.
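In case it helps, this is roughly how I confirmed the mounts are hung without blocking my shell (the mountpoint path is just an example; PVE normally mounts NFS storages under /mnt/pve/<storage-id>):
Code:
 # list the NFS mounts the kernel still considers active
 grep nfs /proc/mounts
 # stat blocks forever on a hung NFS mount, so wrap it in a timeout
 timeout 5 stat /mnt/pve/backup-nfs || echo "mount appears hung"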

Any advice on how to recover from this?... or is rebooting the node the only way to go? (I'm reluctant to consider rebooting a troubleshooting tool.)

Thanks, regards.
 
hi,

does unmounting the share and remounting it not work?
you can try umount -l -f /mountpoint and then remount the share
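something along these lines (the mountpoint is a placeholder; for a PVE-managed storage it is typically /mnt/pve/<storage-id>):
Code:
 # force + lazy unmount of the hung mountpoint
 umount -f -l /mnt/pve/backup-nfs
 # check whether the storage comes back
 pvesm status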
 
yep... seems it did the trick.

Now, since the mount isn't in fstab, a plain mount of the mountpoint doesn't work.
But I guess the troublesome part is done... now I have to figure out how to tell the node to re-mount it.

EDIT:
OK... now I realize pvestatd on the node didn't like that.
I had to restart it, but apparently it cannot mount the shares.
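For the record, this is roughly what I did to restart it and look for the mount error in its log (nothing here is specific to my setup):
Code:
 # restart the status daemon (older setups: service pvestatd restart)
 systemctl restart pvestatd
 # check its recent log output for the failed mount
 journalctl -u pvestatd -n 50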

Cheers.
 
In our KB article we recommend the following command when the storage configuration changes:
Code:
 systemctl try-reload-or-restart pvedaemon pveproxy pvestatd
https://kb.blockbridge.com/guide/proxmox/

We found that this ensures that all parts of PVE are properly refreshed. However, I have not tested it in your particular case.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi... thank you very much for the help.

Now I'm realizing that when I tried pvesm disable, it took effect cluster-wide.
Not only that, but since the affected node fails to mount the share, the mount is not re-established for the other nodes either.
As I feared from my initial investigation before posting, this part of Proxmox clustering is very delicate, quite a weak point... a single temporary NFS share interruption on one node, and the whole cluster is somehow in trouble.
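(For anyone wondering why a single pvesm disable hit every node: the storage definitions live in the shared cluster configuration, so any change there applies cluster-wide.)
Code:
 # /etc/pve is the cluster-wide pmxcfs filesystem; the storage definitions live here
 cat /etc/pve/storage.cfg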

I'm really scared to launch that command on every node in the cluster... I feel the more I struggle to avoid a node reboot, the worse things get cluster-wise :-(
I have to wait until customer activity on the cluster VMs gets lower, and plan/prepare for an incident scenario before I keep messing with this.

EDIT:
Managed to make the situation 'stable' again; unsolved, but 'stable'... Just in case someone reads this in the future: it's very easy, and only uses the GUI:
- I went to the cluster storage configuration and edited the NFS shares so that, instead of 'all nodes', I explicitly list every node except the troublesome one. This way, all nodes except the troublesome one got the shares up and operational... the problem is isolated again (a rough CLI equivalent is sketched after this list).
- I restarted pvestatd on the affected node... and now pvestatd stays up, so the node looks 'green', unaware of the underlying NFS problem.
- I disabled all backup tasks on the affected node, until it gets fixed, to prevent errors.
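For reference, I believe the same can be done from the CLI with pvesm set (the storage ID and node names below are just placeholders):
Code:
 # restrict the storage to the healthy nodes only
 pvesm set backup-nfs --nodes node1,node2,node3,node4
 # then restart the status daemon on the affected node
 systemctl restart pvestatd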

Now I'm considering either troubleshooting NFS on the node itself (outside of PVE) or rebooting it.
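If I go the troubleshooting route, these are the kinds of checks I have in mind on the affected node (the server name is a placeholder):
Code:
 # is the NFS server reachable and still exporting the share?
 rpcinfo -p nfs-backup.example.net
 showmount -e nfs-backup.example.net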
 
