Troubleshooting NFS stale mount without rebooting... feasible?

May 23, 2012
Hi forum.

I've found a problem in my 5-node cluster hosted at OVH: one of my nodes somehow got cut off from the provider's NFS storage service that holds our backups, and its NFS mounts have gone stale.

The NFS servers are up, and the rest of the nodes have them mounted and working fine. Moreover, the connection from the affected node to the NFS servers works fine too... I'm fairly sure the root cause was a temporary break in the network routing from this particular node to the backup servers, which left the NFS mounts stale/hung.

The mount command shows the mounts as up, but they're not working. Tonight's backups have failed, and the storages appear as disabled.
I gave the pvesm command a shot just in case it worked, but as I expected, it didn't... the NFS mount is in a hung state.
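In case it helps, this is roughly how I confirmed the mounts are hung without blocking my shell (the mountpoint path is just an example; PVE normally mounts NFS storages under /mnt/pve/<storage-id>):
Code:
 # list the NFS mounts the kernel still considers active
 grep nfs /proc/mounts
 # stat blocks forever on a hung NFS mount, so wrap it in a timeout
 timeout 5 stat /mnt/pve/backup-nfs || echo "mount appears hung"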

Any advice on how to recover from this?... or is rebooting the node the only way to go? (I'm reluctant to consider rebooting a troubleshooting tool.)

Thanks, regards.
 
hi,

does unmounting the share and remounting it not work?
you can try umount -l -f /mountpoint and then remount the share
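something along these lines (the mountpoint is a placeholder; for a PVE-managed storage it is typically /mnt/pve/<storage-id>):
Code:
 # force + lazy unmount of the hung mountpoint
 umount -f -l /mnt/pve/backup-nfs
 # check whether the storage comes back
 pvesm status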
 
yep... seems it did the trick.

Now, since the mount isn't in fstab, a plain mount of the mountpoint doesn't work.
But I guess the troublesome part is done... now I have to figure out how to tell the node to re-mount it.

EDIT:
OK... now I realize pvestatd on the node didn't like that.
I had to restart it, but apparently it cannot mount the shares.
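For the record, this is roughly what I did to restart it and look for the mount error in its log (nothing here is specific to my setup):
Code:
 # restart the status daemon (older setups: service pvestatd restart)
 systemctl restart pvestatd
 # check its recent log output for the failed mount
 journalctl -u pvestatd -n 50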

Cheers.
 
In our KB article we recommend the following command when the storage configuration changes:
Code:
 systemctl try-reload-or-restart pvedaemon pveproxy pvestatd
https://kb.blockbridge.com/guide/proxmox/

We found that this ensures that all parts of PVE are properly refreshed. However, I have not tested it in your particular case.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi... thank you very much for the help.

Now I'm realizing that when I tried pvesm disable, it took effect cluster-wide.
Not only that, but since the affected node fails to mount the share, the mount is not re-established for the other nodes either.
As I feared from my initial investigation before posting, this part of Proxmox clustering is very delicate, quite a weak point... a single temporary NFS share interruption on one node, and the whole cluster is somehow in trouble.
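(For anyone wondering why a single pvesm disable hit every node: the storage definitions live in the shared cluster configuration, so any change there applies cluster-wide.)
Code:
 # /etc/pve is the cluster-wide pmxcfs filesystem; the storage definitions live here
 cat /etc/pve/storage.cfg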

I'm really scared to launch that command on every node in the cluster... I feel the more I struggle to avoid a node reboot, the worse things get cluster-wise :-(
I have to wait until customer activity on the cluster VMs gets lower, and plan/prepare for an incident scenario before I keep messing with this.

EDIT:
Managed to make the situation 'stable' again; unsolved, but 'stable'... Just in case someone reads this in the future: it's very easy, and only uses the GUI:
- I went to the cluster storage configuration and edited the NFS shares so that, instead of 'all nodes', I explicitly list every node except the troublesome one. This way, all nodes except the troublesome one got the shares up and operational... the problem is isolated again (a rough CLI equivalent is sketched after this list).
- I restarted pvestatd on the affected node... and now pvestatd stays up, so the node looks 'green', unaware of the underlying NFS problem.
- I disabled all backup tasks on the affected node, until it gets fixed, to prevent errors.
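For reference, I believe the same can be done from the CLI with pvesm set (the storage ID and node names below are just placeholders):
Code:
 # restrict the storage to the healthy nodes only
 pvesm set backup-nfs --nodes node1,node2,node3,node4
 # then restart the status daemon on the affected node
 systemctl restart pvestatd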

Now I'm considering either troubleshooting NFS on the node itself (outside of PVE) or rebooting it.
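If I go the troubleshooting route, these are the kinds of checks I have in mind on the affected node (the server name is a placeholder):
Code:
 # is the NFS server reachable and still exporting the share?
 rpcinfo -p nfs-backup.example.net
 showmount -e nfs-backup.example.net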
 
