Stuck on shutdown

GeorgiL
Jan 18, 2019
Greetings

In Proxmox I recently had to perform a shutdown, but the system did not shut down and was no longer accessible from the web interface. When I connected a display to the server, it was stuck on "Reached target Shutdown". Before that I noticed a message along the lines of [FAILED] to unmount NFS. I don't know if it's related, but I have an NFS share mounted in Proxmox, and the share is provided by a VM running on the same server. Recently I edited storage.cfg and set "options vers=4" for the NFS storage, so that it uses NFSv4 over TCP instead of the NFSv3 over UDP it was defaulting to. Currently it looks like this:


Code:
nfsstat -m
/mnt/pve/PlexNFS from 192.168.0.30:/export/Plex-Storage/Media
 Flags: rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.50,local_lock=none,addr=192.168.0.30

I wonder if reducing "timeo" would help, or if the NFS share is the problem to begin with.
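
For reference, the corresponding entry in /etc/pve/storage.cfg would look roughly like the following; the storage ID, server, and export path are taken from the nfsstat output above, while the content type is only a guess for this setup:

Code:
nfs: PlexNFS
        export /export/Plex-Storage/Media
        path /mnt/pve/PlexNFS
        server 192.168.0.30
        content images
        options vers=4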
 
If the NFS share is provided by one of your VMs, you must unmount it before that VM shuts down, otherwise you will always run into issues. This is quite problematic with NFS, so having such cyclic dependencies is not recommended.

With the latest PVE 5.3 you could automate this with a VM hookscript that actively unmounts the share in the 'pre-stop' hook step; see "/usr/share/pve-docs/examples/guest-example-hookscript.pl" for an example hookscript.
 
A few things:

- The examples folder does not exist for me for some reason, but I did find this: https://code.forksand.com/proxmox/pve-docs/src/branch/master/examples/guest-example-hookscript.pl
- I made the following script:

Code:
#!/usr/bin/perl

use strict;
use warnings;

print "GUEST HOOK: " . join(' ', @ARGV) . "\n";

# first argument is the VM ID, second is the hook phase
my $vmid = shift;
my $phase = shift;

if ($phase eq 'pre-stop') {
        # lazily unmount the NFS share before the VM that exports it stops
        system("umount -l /mnt/pve/PlexNFS");
        print "PlexNFS unmounted\n";
}
# the other phases (pre-start, post-start, post-stop) need no action here

exit(0);


- As stated in the example file, I ran the following command, but the option doesn't seem to exist:
Code:
root@vm:~# qm set 100 -hookscript ~/snippets/PlexNFS-unmount.pl
Unknown option: hookscript
400 unable to parse option
qm set <vmid> [OPTIONS]


- I tested umount -l /mnt/pve/PlexNFS on its own and it does unmount the NFS share, provided the other CT that uses it has been shut down beforehand. That's all good, but then PVE automatically mounts it again as part of the NFS storage backend's availability check. That takes about 5 seconds, and my VM is not going to shut down in that amount of time, so I suspect the share will get mounted again before the VM is shut down, or at least before the share is stopped.

TL;DR: I can't hook the script, and even if I do, I don't think it will work with PVE auto-mounting the share. I'm not even sure where my script's "print" output goes so I can check.
 
- As stated in the example file, I ran the following command, but the option doesn't seem to exist:

This feature has been added quite recently (version 5.0-46). Did you update your system? What are your package versions? (pveversion -v)

Also if you read the manpage carefully, you'll see that you need to put the script in one of your storages (enable the 'Snippets' option for this storage) to be able to call it.

has been shut down beforehand. That's all good, but then PVE automatically mounts it again as part of the NFS storage backend's availability check. That takes about 5 seconds, and my VM is not going to shut down in that amount of time, so I suspect the share will get mounted again before the VM is shut down, or at least before the share is stopped.

You will need to disable the NFS share on the host to avoid that. Just add

Code:
pvesm set <storage> --disable 1

and make another script (or hook phase) for the VM boot, adding the same command there with "--disable 0" to enable the storage again when the VM boots.
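
Putting both pieces together, a combined hookscript could look roughly like this. It is only a sketch: the storage ID "PlexNFS" and the mount path come from earlier in this thread, and error handling is omitted:

Code:
#!/usr/bin/perl

use strict;
use warnings;

# the hookscript is called with the VM ID and the hook phase
my ($vmid, $phase) = @ARGV;

if ($phase eq 'post-start') {
        # the VM exporting the share is coming up, so re-enable the storage in PVE
        system("pvesm set PlexNFS --disable 0");
} elsif ($phase eq 'pre-stop') {
        # unmount the share and disable the storage so PVE does not re-mount it
        system("umount -l /mnt/pve/PlexNFS");
        system("pvesm set PlexNFS --disable 1");
}

exit(0);

As with the script above, it needs to be executable and live in a Snippets-enabled storage.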
 
This feature has been added quite recently (version 5.0-46). Did you update your system? What are your package versions? (pveversion -v)

This is a new, clean install of PVE 5.3 from mid-January 2019, and it has been allowed to update from the security repo:

Code:
root@vm:~# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-5 (running version: 5.3-5/97ae681d)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-33
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1

I don't know how to enable Snippets for a storage. I did move my script to /var/lib/vz/snippets and made both the script and the snippets folder executable.
 
You need at least qemu-server >= 5.0-46 for this feature.

You should run 'apt update && apt full-upgrade' to get all the new versions.

I don't know how to enable Snippets for a storage.

After you upgrade, you should see "Snippets" in the GUI in the storage settings.
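
If you prefer the command line, the content type can also be set with pvesm; a sketch, assuming the default 'local' directory storage with its usual content types and the script name used earlier in this thread:

Code:
pvesm set local --content iso,vztmpl,backup,snippets
qm set 100 -hookscript local:snippets/PlexNFS-unmount.pl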
 
This is an old topic, but there is new information about it.

I did update my PVE as you mentioned (though we are still talking about PVE 5.3, not 5.4) and used a hookscript, and it was working fine until I discovered another case where PVE fails to proceed with a shutdown, probably again because of the NFS mount. So here is my current setup. Launch order:

1) VM 100 launches; it has a hookscript with a post-start step that enables the NFS storage in Proxmox (the NFS export comes from this VM). It also has a pre-stop step that checks whether LXC 101 is running and shuts it down if it is, and in any case then unmounts the NFS share in Proxmox and disables the storage (see the sketch after this list).
2) After a period of time, LXC 101 starts; it has a bind mount to the NFS share (taken from Proxmox).
3) LXC 102 starts.
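
The pre-stop step described in 1) might look roughly like the following, replacing the pre-stop branch of the sketch earlier in the thread; the container ID 101 and the "PlexNFS" storage are the ones from this setup, so treat it as a sketch rather than the exact script:

Code:
if ($phase eq 'pre-stop') {
        # if the container that consumes the share is still running, stop it first
        my $ct_status = `pct status 101`;
        system("pct shutdown 101 --timeout 90") if $ct_status =~ /running/;

        # then unmount the share and disable the storage so PVE stops re-mounting it
        system("umount -l /mnt/pve/PlexNFS");
        system("pvesm set PlexNFS --disable 1");
}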


There seems to be a problem with the HDD, or something else went wrong, but at some point I saw that the NFS share had dropped and Proxmox was continuously checking whether it was available; there were messages about that in the syslog. I also requested a node shutdown, and the system started shutting the guests down in reverse order, 102 -> 101 -> 100, but it got stuck on 101, I assume because the share was missing. This seems to confuse Proxmox a lot: messages in the syslog said that the PID for LXC 101 is unknown, so it didn't know what the status was and couldn't do anything. I tried shutting down 101 using the GUI again, but it timed out while trying to get a lock on the LXC 101 config in /run (I did not write down the complete path). I did manage to shut down 100 using the GUI, which also removed the NFS share in Proxmox. This seems to have helped, because after another 10 minutes Proxmox decided to shut down 101 forcefully (which honestly it should have done long ago, because this took 40 minutes while 101 is set to wait only 90 seconds).

So if the NFS share drops for whatever reason, this seems to mess up the container where it is bind mounted, and that is a problem only if you decide to shut down or reboot the container at the particular time when the share is missing. I don't know why this is such a problem, considering that the bind mount is read-only for the container, so basically no information can be lost if the share is unavailable when the container using it is shut down or restarted. Even if the NFS were not provided by a VM on the same server as the client using it, there could be reasons for the share to simply drop, and that should be fine. Shouldn't Proxmox be able to handle this more gracefully, or am I missing something?

For right now, I'm considering making a hookscript for 101 that disables the share in Proxmox before shutting down the container, or making a script inside the container to get rid of it on its shutdown. I hope that helps.
 
I had a related problem with my installation. I run Proxmox 8.1.10 installed on top of a Debian 12.5 host. It holds multiple LXC containers and VMs. One LXC runs Debian 12 and a Docker environment. That Docker environment holds some ~20 containers that run mostly administrative chores and have no need for passing anything through. However, NFS shares are shared from the host to a mix of LXCs, VMs and Docker containers.

When running everything on Debian 11, it all ran fine with no problems with shutdowns and reboots. Debian 12 gave me a few issues with hung reboots and corrupted LXC containers. I whipped up a shell script that gracefully shuts down all the Docker containers first (and waits for them to do so) and then moves on to shut down the Proxmox LXCs and VMs (and waits for them to do so). It seems to have solved the problem.

For the host, a simple argument is given for whether it should reboot or shut down after the virtual machines have been stopped. It's a bit clunky, but it shows the status of the machines and refreshes it.

Maybe some of this code spaghetti can be of use to anyone who faces similar challenges:
https://github.com/Lennong/graceful-shutdown
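
As a rough illustration of that ordering (this is not the linked script; the Docker LXC ID 105 is a placeholder and the timeouts are arbitrary):

Code:
#!/bin/bash
# 1) stop all Docker containers inside the Docker LXC and wait for them
pct exec 105 -- sh -c 'docker ps -q | xargs -r docker stop'

# 2) shut down the remaining running LXCs and VMs, waiting for each one
for ct in $(pct list | awk 'NR>1 && $2=="running" {print $1}'); do
    pct shutdown "$ct" --timeout 120
done
for vm in $(qm list | awk 'NR>1 && $3=="running" {print $1}'); do
    qm shutdown "$vm" --timeout 120
done

# 3) finally reboot or power off the host, depending on the argument
if [ "$1" = "reboot" ]; then
    systemctl reboot
else
    systemctl poweroff
fi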
 