Issues with NFS mounts on proxmox 5.1

Yanwoo
Oct 27, 2017
Hi

I'm having trouble with NFS mounts on proxmox 5.1.

I've followed the advice I've read on here and mounted my NFS share on the host, and then made that available to containers with mount points. The NFS share works fine on the host and is accessible on the containers. So far so good.
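For reference, this is roughly how I've set it up - the server address, export path and the path inside the container below are placeholders, and 221 is one of my container IDs:

Code:
# Add the NFS share as storage on the host; PVE mounts it under /mnt/pve/<storage-id>
pvesm add nfs remote-NN1-media --server 192.168.1.10 --export /export/media --content images

# Bind-mount the host directory into container 221 as a mount point
pct set 221 -mp0 /mnt/pve/remote-NN1-media,mp=/mnt/media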

The issue is with host shutdown and container autostart. After I've given the containers the mount points, when I try to restart, the host hangs on 'A stop job is running for PVE guests (*time* / no limit)'. After a hard reboot, the containers then fail to autostart with 'exit code 1' ('systemctl start pve-container... failed').

I can start them fine from the GUI. I can also shut them down from the GUI, and then the host restart works better (there's no hang on the 'PVE guests' stop job, although it fails to unmount the NFS shares). I can also unmount the shares at the CLI without issue.

The containers I've tried this with are running Ubuntu 16.10 and 17.04. They are unprivileged.

I've tried changing the NFS mounts to soft, and that has allowed the NFS to be unmounted when I restart the host (when the containers are already shut down), but the other issues remain.
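(In case it's relevant, I set the soft option on the storage definition in /etc/pve/storage.cfg - something like the following, where server and export path are placeholders:)

Code:
nfs: remote-NN1-media
        export /export/media
        path /mnt/pve/remote-NN1-media
        server 192.168.1.10
        content images
        options vers=3,soft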

I'm new to Proxmox (and virtualisation), so I'm not sure what to try next. Another container OS? Or is there a configuration on the NFS side (server or client) that might be causing this? What logs should I be looking at to see more detail about what's happened?

Any help much appreciated!
 
My pveversion -v is:
Code:
proxmox-ve: 5.1-25 (running kernel: 4.13.4-1-pve)
pve-manager: 5.1-36 (running version: 5.1-36/131401db)
pve-kernel-4.13.4-1-pve: 4.13.4-25
pve-kernel-4.10.17-2-pve: 4.10.17-20
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-15
qemu-server: 5.0-17
pve-firmware: 2.0-3
libpve-common-perl: 5.0-20
libpve-guest-common-perl: 2.0-13
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-16
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-2
pve-container: 2.0-17
pve-firewall: 3.0-3
pve-ha-manager: 2.0-3
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.0-2
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.2-pve1~bpo90
openvswitch-switch: 2.7.0-2
 
Syslog gives some more details on the startup issue:

Code:
Oct 28 20:55:08 pve systemd[1]: Starting PVE LXC Container: 221...
Oct 28 20:55:08 pve lxc-start[3351]: lxc-start: 221: lxccontainer.c: wait_on_daemonized_start: 751 No such file or directory - Failed to receive the container state
Oct 28 20:55:08 pve lxc-start[3351]: lxc-start: 221: tools/lxc_start.c: main: 368 The container failed to start.
Oct 28 20:55:08 pve lxc-start[3351]: lxc-start: 221: tools/lxc_start.c: main: 370 To get more details, run the container in foreground mode.
Oct 28 20:55:08 pve lxc-start[3351]: lxc-start: 221: tools/lxc_start.c: main: 372 Additional information can be obtained by setting the --logfile and --logpriority options.
Oct 28 20:55:08 pve systemd[1]: pve-container@221.service: Control process exited, code=exited status=1
Oct 28 20:55:08 pve systemd[1]: Failed to start PVE LXC Container: 221.
Oct 28 20:55:08 pve systemd[1]: pve-container@221.service: Unit entered failed state.
Oct 28 20:55:08 pve systemd[1]: pve-container@221.service: Failed with result 'exit-code'.
Oct 28 20:55:08 pve pve-guests[3349]: command 'systemctl start pve-container@221' failed: exit code 1
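
As the log suggests, more detail should be available by starting the container in the foreground with debug logging (221 being the container that fails here):

Code:
lxc-start -n 221 -F --logfile=/tmp/lxc-221.log --logpriority=DEBUG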

I'm also getting a handful of errors saying my two NFS shares aren't online (after the containers have failed to start):

Code:
Oct 28 20:55:18 pve pvestatd[3203]: storage 'remote-NN2-media' is not online
Oct 28 20:55:20 pve pvestatd[3203]: storage 'remote-NN1-media' is not online
Oct 28 20:55:28 pve pvestatd[3203]: storage 'remote-NN2-media' is not online
Oct 28 20:55:30 pve pvestatd[3203]: storage 'remote-NN1-media' is not online
Oct 28 20:55:38 pve pvestatd[3203]: storage 'remote-NN1-media' is not online
Oct 28 20:55:40 pve pvestatd[3203]: storage 'remote-NN2-media' is not online
 
When this happens, is the mount stale?

I have a problem with this under heavy I/O (NFS shares going stale, probably because of latency). My solution is simply to remove the stale mounts, since pvesm will remount them as necessary:

Code:
#!/bin/bash
# detect_stale.sh - unmount stale NFS mounts so pvesm can remount them
for mountpoint in $(grep nfs /etc/mtab | awk '{print $2}')
do
    # stat blocks on a stale NFS mount, so give it a one-second timeout
    read -t1 < <(stat -t "$mountpoint" 2>&-)
    if [ -z "$REPLY" ]; then
        echo "NFS mount $mountpoint is stale. Removing..."
        umount -f -l "$mountpoint"
    fi
done

This is reasonably effective via cron, though it could probably be more efficient as a systemd task. This really should be done automatically by pvesm.
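A minimal cron entry for it would be something like this (the script path is just where I happen to keep it):

Code:
# /etc/cron.d/detect-stale-nfs - check for stale NFS mounts every 5 minutes
*/5 * * * * root /usr/local/bin/detect_stale.sh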
 
Thanks for the reply!

I can access the shares fine - I think I would get a 'stale file handle' error when I try to list the contents, for example, if that were the issue? I'll keep your script handy though, in case I hit that later!

I'm a bit confused how the containers start up and shut down just fine from the GUI and CLI (and I can unmount the NFS shares at the CLI), but not during the shutdown/startup process - I'm guessing the same commands are being issued?
 

Thanks for the reference!

It looks like they're trying to mount the NFS share directly in the containers though? I'm mounting it on the host and creating mount points in the containers - which I believe, from other posts on here, is the preferred approach and doesn't involve AppArmor changes.

And my mounts and mount points work fine - it's just that they're causing a hang on shutdown and stopping the containers from starting automatically on boot.

I've now tried it using autofs instead, but I get the same results if the containers are running and the NFS shares are still mounted at shutdown ('a stop job is running for PVE guests (*time* / no limit)'). And they still don't start up.
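(For completeness, my autofs setup is along these lines - the server address and export path are placeholders:)

Code:
# /etc/auto.master - shares appear under /mnt/nfs and are unmounted after 60s idle
/mnt/nfs /etc/auto.nfs --timeout=60

# /etc/auto.nfs - one line per share
media -fstype=nfs,soft,vers=3 192.168.1.10:/export/media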

I've been searching around and can't find many others getting this error, whilst many are using NFS shares and mount points, so there must be something unusual about my configuration or the way I've set it up that's causing this behaviour?
 
What I continue to find most baffling is that the containers start up and shut down seamlessly from the GUI. And I can run the command that the log says is problematic when shutting down the container from the CLI, and it works instantly ("lxc-stop -n 221 --kill").
 
I have read that the startup issue could be a timing problem, with the network not up and running when the container tries to start. I'm using an OVS bond and VLANs, both of which I assume take more time to initialise - could that be the problem here?
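For reference, the relevant part of my /etc/network/interfaces looks roughly like this (NIC names, VLAN tag and address are placeholders):

Code:
auto bond0
iface bond0 inet manual
    ovs_bridge vmbr0
    ovs_type OVSBond
    ovs_bonds eno1 eno2
    ovs_options bond_mode=balance-slb

auto vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond0 vlan10

auto vlan10
iface vlan10 inet static
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_options tag=10
    address 192.168.10.5
    netmask 255.255.255.0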

I've tried adding startup delays, but they don't seem to be honoured, and I get the errors with no apparent delay.
 
I have mostly resolved this now. I did a fresh install and went more slowly through setting up and configuring. There was an error in my Open vSwitch configuration, which I believe was causing the shutdown problems. Shutdown is now quick and seamless.

For the startup problems, it was a timing issue. I now have a container starting up first that doesn't depend on mount points and have set a 60 second delay before the rest of the containers with mount points start. Although a bit of a kludge, it works fine now and is adequate for my needs.
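In case it helps anyone else, the ordering and delay are set per container. In my setup it looks something like this, where 220 and 222 are just example IDs (220 being the container without mount points):

Code:
# First container starts immediately and holds the startup queue for 60 seconds
pct set 220 -startup order=1,up=60

# Containers with NFS-backed mount points start afterwards
pct set 221 -startup order=2
pct set 222 -startup order=2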
 
