Graceful NFS/SMB recovery

boomam

Hi,
Does anyone know of a command I can run, or even a full fix, that will allow a PVE node to gracefully recover its SMB/NFS connections?
Every time I take my storage array offline for maintenance, PVE refuses to reconnect to the array until I reboot the node or delete and re-create the connections to the relevant shares.

Is there a way to work around this so I don't have to reboot each time?

Thanks in advance.
 
The traditional solution is the automounter (autofs).

https://manpages.debian.org/bullseye/autofs/autofs.8.en.html
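
As a minimal sketch of what that can look like (the server name, share names, and mount points below are placeholders, not taken from your setup): you define a mount root and an idle timeout in /etc/auto.master:

/mnt/auto /etc/auto.shares --timeout=60

and list the shares in /etc/auto.shares, one key per line:

myshare -fstype=nfs4 storage-server:/export/myshare
smbshare -fstype=cifs,credentials=/etc/smb-creds ://storage-server/smbshare

After restarting the service (systemctl restart autofs), accessing /mnt/auto/myshare mounts the share on demand, and autofs unmounts it again once it has been idle for 60 seconds.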
I'm not sure that's fully addressing what I am getting at, so to simplify and hopefully clarify:
If my SMB or NFS server goes offline, Proxmox does not auto-recover its connection to the shares; I have to reboot each node to get them to connect again.

As for your link, I am not trying to mount at boot; it already does that.

Hopefully that helps explain better?
 
It will also unmount after a timeout and mount again when something tries to use the share.

I have client machines where the home directories are on an NFS4 server. If I reboot the server eventually the client autofs remounts the shares and they start working again. Not necessarily quickly, but within a few minutes.

Don't know how/if this works with CIFS/SMB, I don't use that.
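
As a rough illustration of that behaviour, assuming an autofs map like the sketch above (names are placeholders):

ls /mnt/auto/myshare    # first access triggers the mount
mount | grep myshare    # the share now shows as mounted

After the idle timeout autofs unmounts it again, so if the server was rebooted in the meantime, the next access triggers a fresh mount instead of touching a stale one.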
 
Is your suggestion to run 'systemctl reload autofs' as a workaround, when needed?
 
Where did you get that idea? I guess I'm not being clear enough. My final attempt...

There are two popular automount solutions for Linux. Neither should require you to do anything manually but you do have to adopt some procedures.

The automounter built into systemd uses /etc/fstab to list the shares you want to mount. That one is the simple answer if you only have a few clients.

You put "noauto,x-systemd.automount" in the options for the share and systemd will mount it when someone accesses it. I do this on my Proxmox server for the external backup disk. I can disable the mount in Proxmox, umount it, swap disks and re-enable in Proxmox. The line in /etc/fstab is below. This works for NFS mounts too; you just change "LABEL=backup" to "shares:/myshare" or whatever, and "ext4" to "nfs":

LABEL=backup /mnt/backup ext4 noauto,x-systemd.automount 0 2
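
For instance, NFS and CIFS versions of that line might look like this (server names, shares, and mount points are placeholders; x-systemd.idle-timeout is optional and tells systemd to unmount the share once it has been idle, so the next access gets a clean remount):

shares:/myshare /mnt/myshare nfs noauto,x-systemd.automount,x-systemd.idle-timeout=60 0 0
//nas/backup /mnt/smbbackup cifs noauto,x-systemd.automount,credentials=/etc/smb-credentials 0 0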

There is also the traditional "autofs". I say traditional because it goes way back to the original Unix systems, but it is still popular for central management of the mount maps through various backends, including LDAP. That's the one I use on my NFS clients because I have a FreeIPA server VM that also handles Kerberos authentication (required for secure NFS) and LDAP directory services for the clients.

Either one should help with your problem if you take some care that the share isn't in use when you shut down the storage (users logged out, Proxmox mount point disabled, whatever). Which one you should use depends on your setup.
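
As a sketch of what that care could look like on a PVE node (the storage ID "mystore" is a placeholder; pvesm is the standard Proxmox storage CLI, and PVE mounts defined storages under /mnt/pve/<storage-id> by default):

pvesm set mystore --disable 1   # tell Proxmox to stop using the storage
umount /mnt/pve/mystore         # unmount it while the array is down
# ...do the storage maintenance...
pvesm set mystore --disable 0   # re-enable; the storage is mounted again on next use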
 
OK, we must be talking different languages.
What you suggest is theory from my point of view: interesting to note, but not a solution or workaround that's valid for everyday usage, scripting, or scheduling.

It's fine; hopefully someone else will be able to advise on a process in Proxmox to help with this issue.
I would assume there's 'something', as storage paths changing or going offline is not uncommon in the enterprise space, so having to restart every node to resolve it is amateurish.
 
You cannot just pull a disk out of your server and have it keep working without making preparations for that (e.g. RAID). Just because the shares are remote does not mean you can shut them off whenever you feel like it and have it work without preparation. The OS thinks it is a disk, the apps think it is a disk.
 
I don't know what relevance this has to my question, sorry. Did you intend it for this thread, or another?

The nearest I can think of is a point about path recovery, which, if so, I can say is 100% wrong.
Even with 10-15 year old enterprise storage shelves (either direct-connect or IP-based), recovery of connections is a minimum requirement, and not a concept even 'competing' platforms have issues with.
If Proxmox is marketing itself as an SME/enterprise-ready solution for virtual environments, then recovery of the storage path is a bare-minimum requirement in this day and age.
 
Remote shares look like disks to applications. They therefore must _act_ like disks as much as possible: in particular, when the write() call returns, the data is saved somewhere, possibly in cache, but will eventually get to storage. If you shut off the storage, what should happen to programs that are using the disk at that moment? How should those in-flight transactions be handled?

There are only two answers: the client waits for the server to return, possibly forever, or the client times out and returns an error. The former can lead to hung clients, the latter can lead to data corruption. NFS by default does the first thing. Those old "storage shelves" must have come with some kind of fail-over/redundancy scheme, which is maybe why you spent the big bucks on them. The Proxmox answer to that is a Ceph cluster.

Or, you know, you can work within the limitations. Make sure the mount points are not being used and unmount the share before shutting down the server. You can do that manually or you can semi-automate it which is what I have been saying. There is no "just works" without supplying either effort or money.
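
To put those two answers in concrete terms, they correspond to NFS's hard and soft mount options (server and paths here are placeholders):

mount -t nfs -o hard storage-server:/export /mnt/share
# default: I/O blocks until the server comes back, possibly forever

mount -t nfs -o soft,timeo=100,retrans=3 storage-server:/export /mnt/share
# gives up after a few retries and returns an I/O error to the application

Soft mounts avoid hung clients, but an application receiving an I/O error mid-write is exactly where the corruption risk comes from.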
 
I am still not sure of your point here, but said recovery (my point) is not exclusive to "old" hardware.
Disconnections and storage-layer issues are common, ranging from hardware failures to planned outages. Hypervisors should be able to recover from transient issues.

To be clear, the fact that Proxmox does not "retry" is an issue.
Whilst your point on corruption etc. may once have been valid, it's not in 2023, even in this market segment (i.e. free).

It's got nothing to do with cost; "retries" cost nothing.
Most modern hypervisors are intelligent enough to "pause" a VM or cache writes to prevent said corruption whilst they attempt a reconnect.

I do not see Proxmox attempting reconnects, which is the core issue I am having, where the only solution right now is a node restart.

To reiterate, my question is how to recover when storage blips off and on again, i.e. retry/recovery.

The theory and usage of technologies like Ceph are of no use to this discussion, as they assume legacy thinking about hypervisor and storage-layer relations, to be blunt.

To be clear, I do appreciate the effort to answer the question, but your responses don't seem relevant and, I feel, misunderstand the question in some way.

So I'll say thank you for helping, no need to continue further, and enjoy your weekend!

I'll await other opinions and workarounds/solutions.
 
The answer is to unmount the share before shutting down the server and remount it after. That's the actual answer. There are ways to make it more or less automatic, but you don't seem to be interested in "theory".
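
If a mount has already gone stale, one sketch for recovering without rebooting the node (the storage ID "mystore" is a placeholder, and this assumes the PVE default of mounting storages under /mnt/pve/<storage-id>):

umount -f -l /mnt/pve/mystore   # force a lazy unmount of the stale mount

My understanding is that pvestatd should then remount an enabled storage on its own within a short interval; treat that as an assumption to verify on a test node first.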
The problem is that this doesn't account for transient issues. Equally, removing all your NFS/SMB shares (even for a planned outage) is not a solution but a hacky workaround. I appreciate the notion, though; perhaps this is just a limitation of Proxmox, others can maybe confirm.
 
