Unable to stop container, forced to reboot node manually

danny_h
I started having this issue after upgrading from v5 to v6.

About once a week, I notice that a service running in a container (Plex, UniFi, etc.) is no longer accessible or running. I load the Proxmox web interface and attempt to restart the container. I don't get an error message, but in the task list the loading icon just keeps spinning and nothing ever happens.

I have tried the following, with these results:

Code:
lxc-stop -n 109 --kill
=> never errors out, just sits there

Code:
pct stop 109
=> can't lock file '/run/lock/lxc/pve-config-109.lock' - got timeout

Code:
rm /run/lock/lxc/pve-config-109.lock
pct stop 109
=> never errors out, just sits there

I have also tried rebooting the node from the web interface. It starts to shut down my other containers, but it eventually gets stuck when it hits the hung one (109 here, though it's not always the same container). Eventually I have to walk my happy ass upstairs and press and hold the power button to fully shut down the node.

Code:
pvecm status
Quorum information
------------------
Date: Wed Aug 21 04:27:18 2019
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1/32
Quorate: Yes

Votequorum information
----------------------
Expected votes: 1
Highest expected: 1
Total votes: 1
Quorum: 1
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.118 (local)


Current info:

Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.18-1-pve)
pve-manager: 6.0-5 (running version: 6.0-5/f8a710d7)
pve-kernel-5.0: 6.0-6
pve-kernel-helper: 6.0-6
pve-kernel-4.15: 5.4-7
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-19-pve: 4.15.18-45
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.12-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve2
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-3
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

------

Any help would be greatly appreciated. My wife approval factor drops quickly every time Plex or one of the other services goes down, haha.
 
hi,

* does the container hang, or does the service running inside stop?
* do you see anything in the syslog when you try to stop the container?

it doesn't really help solve the underlying issue, but maybe you can try:

Code:
pct shutdown CTID --forceStop
 
So I believe it is an issue with the container itself. Looking over my downtime emails, it seems to happen at night when the containers are taken down for a backup. Most of my containers come back up, but occasionally one will have an issue in the morning, which leads me down the path of trying to shut down/stop the broken one. Ultimately, if I never stop a container, it continues running fine. However, if it is stopped for a backup and then tries to come back up, some of the containers have issues (though not always the same one).

As far as syslog goes, here are the logs from when I run `pct shutdown 109 --forceStop`:

Code:
root@proxmox:/var/log# tail -f syslog
Aug 21 06:11:49 proxmox pvedaemon[94499]: starting termproxy UPID:proxmox:00017123:01995694:5D5D5F35:vncshell::root@pam:
Aug 21 06:11:50 proxmox pvedaemon[40349]: <root@pam> successful auth for user 'root@pam'
Aug 21 06:11:50 proxmox systemd[1]: Started Session 7329 of user root.
Aug 21 06:12:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Aug 21 06:12:01 proxmox systemd[1]: pvesr.service: Succeeded.

Doesn't look like anything shows up in the logs.

Eventually I get the following message when running `pct shutdown 109 --forceStop`:

Code:
lxc-stop: 109: commands_utils.c: lxc_cmd_sock_rcv_state: 70 Resource temporarily unavailable - Failed to receive message
command 'lxc-stop -n 109 --timeout 60' failed: got timeout

Edit: Screenshot of backup policy.

 
is only this container affected? what does the container config look like?

do you see anything in the syslog around the time the container hangs?
 
It is looking like I might be having an issue with my NAS box (UnRaid). I am going to investigate further and I'll report back once I see what's going on. I am seeing some CIFS failures, which I believe may be causing the hang.

I do have a general question, though, about the CIFS mounts. I know that when creating my fstab entry, I can set a uid and gid. What I can't figure out is how to determine the uid/gid of the user inside the container. I have all of my mounts in /etc/fstab on the node, and those are then passed into the container like the following:

Code:
mp0: /media/music/,mp=/media/music/
mp1: /media/movies/,mp=/media/movies

Thanks!
 
hi,

I do have a general question, though, about the CIFS mounts. I know that when creating my fstab entry, I can set a uid and gid. What I can't figure out is how to determine the uid/gid of the user inside the container. I have all of my mounts in /etc/fstab on the node, and those are then passed into the container like the following:

the easiest way to pass a cifs mount to an lxc container is to use a bind mount[0] (the cifs share would be mounted on your pve host) and configure uid mappings[1] for unprivileged containers.
you need to configure the uid mappings with the uid/gid that has read/write access to the mount on your host.
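
for example, a minimal sketch, assuming the data should be owned by uid/gid 1000 inside the container, CT 109 is unprivileged, and the server name, paths and credentials file are placeholders; adjust to your setup:

Code:
# /etc/fstab on the pve host (hypothetical share and credentials file)
//nas/media  /mnt/media  cifs  credentials=/root/.smbcred,uid=1000,gid=1000  0  0

# bind-mount the host path into the container
pct set 109 -mp0 /mnt/media,mp=/media

# /etc/pve/lxc/109.conf: map uid/gid 1000 in the container to uid/gid 1000 on the host;
# all other ids keep the default 100000 offset (the three ranges must add up to 65536)
lxc.idmap: u 0 100000 1000
lxc.idmap: g 0 100000 1000
lxc.idmap: u 1000 1000 1
lxc.idmap: g 1000 1000 1
lxc.idmap: u 1001 101001 64535
lxc.idmap: g 1001 101001 64535

# /etc/subuid and /etc/subgid on the host must also allow root to map id 1000
root:1000:1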

let me know if you have any other questions.

[0]: https://pve.proxmox.com/wiki/Linux_Container#_bind_mount_points
[1]: https://pve.proxmox.com/wiki/Unprivileged_LXC_containers
 
FYI, I experience the exact same symptoms (the container never stops) with LXC containers that have NFS network mounts.
I decided not to investigate further, because there is a simple workaround, and it is just my home server.
Just run "ps faxuw" in the console and look for a process name with the VM ID in it. Usually it is just one or two processes (I can't remember exactly), with nothing from the actual LXC running under them.
Then kill it with "kill (-9) PID". Proxmox will then notice the container is stopped and you can start using it again. -9 is probably not needed.
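
A rough sketch of that approach, assuming CT 109 is the stuck container (the PID is a placeholder for whatever the ps output actually shows):

Code:
# look for leftover processes with the container ID in their name,
# e.g. the "[lxc monitor]" process for CT 109
ps faxuw | grep 109

# try a plain kill first; fall back to -9 only if it refuses to die
kill <PID>
kill -9 <PID>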
Also, my NFS and SMB server is just another LXC on the same machine, and it is still working at the time of the shutdown failure.
I have also set start/stop priorities, but that did not help.

Also, the problem shows up randomly, and sometimes, after waiting long enough (10 to 20 minutes), it actually completes by itself.
 
@oguz I wanted to provide an update as I am still working through my issue.

The container that always seems to have issues is my UniFi-Video container. Looking through the logs, it appears to break when I stop the containers for a scheduled backup. All of my other containers stop and start without issue; however, when UniFi-Video tries to come back up, it hangs, and it does so in a way that forces me to kill the node with the power button.

My UniFi-Video container seems to fail during the unmount of my CIFS share, which is mounted on the node and then passed through to the container. Since UniFi-Video is in a constant recording state, I am assuming it has issues unmounting the share because the share is technically "in use".

I am going to post over on the UniFi forums as well, but are there any known issues or configuration settings that would allow a container to unmount a share that is actively being written to? Would it be better to mount the share directly in the container instead of on the node and then into the container? Is that even possible?

In the meantime, I have switched my backup method from "stop" to "snapshot" in hopes of removing the trigger of my problem while I work through the root cause.
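
For anyone else trying this from the CLI, a snapshot-mode backup of a single container looks roughly like this (assuming "local" is a valid backup storage on the node):

Code:
vzdump 109 --mode snapshot --storage local --compress lzo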

@mailinglists - Next time I get a failure, I will give this a shot, thanks for the info!
 
