How to kill a container that doesn't stop

lince

Hello,

Is there a way to kill a container that won't stop?

I issued the stop command and it doesn't respond. Now the Proxmox node has a load average of 15 (2 CPUs) and it doesn't seem to respond to any virtualization tasks, such as starting other containers or opening a console...

Is there any way to troubleshoot or kill this container so the Proxmox node recovers without having to reboot it?

Thanks.
 
The question is why your container fails to stop. Most likely because your storage backend is offline? Maybe a hanging NFS mount?
What is the output of

# pvesm status
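
If pvesm status itself hangs, a rough way to probe for a hung NFS mount is to stat each NFS mountpoint with a short timeout (just a sketch; adjust the timeout and mount types to your setup):

Code:
# list NFS mounts and check each one with a 5-second timeout;
# a hung mount is reported as HUNG instead of blocking the shell
for m in $(findmnt -t nfs,nfs4 -n -o TARGET); do
    timeout 5 stat "$m" >/dev/null 2>&1 && echo "OK   $m" || echo "HUNG $m"
done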
 
Thanks for your reply. It was indeed a storage issue, but I think it was caused by the container.

I ended up restarting the node because I needed to have it running.

As for the container with the issue, the problem was that I tried to increase the disk size and it got corrupted somehow. The size shown in the web panel was wrong. I did a backup and restore, and now the size reported in the web panel is correct and the container seems to be working fine.
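
A rough sketch of that backup-and-restore cycle (the CT ID, storage name, and archive path below are placeholders):

Code:
# back up the container in stopped mode for a consistent archive
vzdump 101 --mode stop --storage <backup-storage>
# restore it from the archive just created, overwriting the broken CT
pct restore 101 <path-to-vzdump-archive> --force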

It would be nice to have a way to reload or re-read the disk information for a container in case things like this happen.

I will keep your command handy in case something like this happens again.

Regards.
 
Hello,

I have the same issue but don't want to restart the whole node.
pct list hangs, and stopping the container hangs as well.

Unfortunately, the storage seems to be okay, so I need help with further investigation.

Code:
root@pve:~# pvesm status
Name           Type     Status           Total            Used       Available        %
Images          lvm     active       488120320       226492416       261627904   46.40%
backup          nfs     active      1921801728       834254336      1087531008   43.41%
iso             nfs     active       159911424        94350848        65544192   59.00%
local           dir     active        20511312        10012276         9476024   48.81%
rep1_ct         rbd     active     10743709696      6270484480      4472176640   58.36%
rep1_vm         rbd     active     10743709696      6270484480      4472176640   58.36%
rep2_ct         rbd     active     10743709696      6270484480      4472176640   58.36%
rep2_vm         rbd     active     10743709696      6270484480      4472176640   58.36%
rep3_ct         rbd     active     10743709696      6270484480      4472176640   58.36%
rep3_vm         rbd     active     10743709696      6270484480      4472176640   58.36%
root@pve:~#

Cheers,
luphi
 
Same problem here
Code:
# pvesm status
Name           Type        Status           Total            Used       Available        %
NFS-NAS         nfs        active       961433728         2717952       909877632    0.28%
backup          dir        active      3844640564      1909543520      1739729784   49.67%
local           dir        active        98559220         6744956        86764716    6.84%
local-lvm   lvmthin        active       367558656        47231287       320327368   12.85%

Cannot kill the container, I have to hard-reset the machine!
pct stop 102 and nothing happens.
With top on the PVE machine:

Code:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
25624 root      20   0       0      0      0 R 100.0  0.0   4:55.15 kworker/u4+
 3795 root      20   0 3016408 2.024g   5748 R  13.6 26.4   3022:48 kvm
12726 root      20   0   45056   3944   3080 R   0.3  0.0   0:00.29 top
    1 root      20   0   57640   7084   4996 S   0.0  0.1   0:34.89 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.10 kthreadd
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:+
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_+

The attached picture shows what my server displays.
 

Attachment: proxmo1x.jpg
This monitor process seems to be finicky and keeps hanging. It seems that when I'm not careful and shut down containers sequentially and quickly, I can get into this situation. Every indication is that the container is actually shut down; however, Proxmox seems to think it is still in the process of shutting down, and this solution of killing the process seems to be the only way out without restarting the Proxmox host.
 
For anyone else who comes here looking for a solution: it seems that the [lxc monitor] process is to blame for an undead container.

Run 'ps aux | grep [container ID]', then kill the [lxc monitor] process with -9.
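
A minimal sketch of that for a hypothetical CT 110 (the exact process title may look slightly different on your host):

Code:
# e.g. for CT 110 -- look for a line containing "[lxc monitor]"
ps aux | grep 110
# then force-kill that monitor process by its PID
kill -9 <PID of the [lxc monitor] process>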

This just saved my ass. Is this a bug, or was this fixed in a newer version? I'm still a little behind on my updates.

Edit: I might have spoken too soon; trying to start the container in question now fails.

Code:
root@thor01:~# systemctl status pve-container@110.service
● pve-container@110.service - PVE LXC Container: 110
   Loaded: loaded (/lib/systemd/system/pve-container@.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2020-12-07 00:13:35 EST; 46s ago
     Docs: man:lxc-start
           man:lxc
           man:pct
  Process: 18270 ExecStart=/usr/bin/lxc-start -n 110 (code=exited, status=1/FAILURE)

Dec 07 00:13:34 thor01 systemd[1]: Starting PVE LXC Container: 110...
Dec 07 00:13:35 thor01 lxc-start[18270]: lxc-start: 110: lxccontainer.c: wait_on_daemonized_start: 874 Received container state "ABORTING" instead of "RUNNING"
Dec 07 00:13:35 thor01 lxc-start[18270]: lxc-start: 110: tools/lxc_start.c: main: 329 The container failed to start
Dec 07 00:13:35 thor01 lxc-start[18270]: lxc-start: 110: tools/lxc_start.c: main: 332 To get more details, run the container in foreground mode
Dec 07 00:13:35 thor01 lxc-start[18270]: lxc-start: 110: tools/lxc_start.c: main: 335 Additional information can be obtained by setting the --logfile and --logpriority
Dec 07 00:13:35 thor01 systemd[1]: pve-container@110.service: Control process exited, code=exited, status=1/FAILURE
Dec 07 00:13:35 thor01 systemd[1]: pve-container@110.service: Killing process 18281 (lxc-start) with signal SIGKILL.
Dec 07 00:13:35 thor01 systemd[1]: pve-container@110.service: Failed with result 'exit-code'.
Dec 07 00:13:35 thor01 systemd[1]: Failed to start PVE LXC Container: 110.

Edit 2: Well, rebooting the entire host "fixed" this, though the host itself was stuck shutting down waiting for some processes to die and the watchdog had to step in... definitely making me concerned now...
 
Sorry to revive this ancient thread, but it seems @dietmar has some insight into my current issue, as I just got a stuck container due to the NFS backend having somehow crashed.

Nonetheless, I was unable to solve the issue in any other way than ps aux | grep <containerID> and trying to kill whatever I found there...
And even then, I did stop the container, but it was unable to start again even after the backend storage was back up. I probably killed everything named after my container ID on the command line, but I did not actually kill the container... I had to fully reboot the PVE node itself...

But maybe dietmar asked this question because there is a better solution when NFS or CIFS backends are stuck... something that wouldn't require rebooting the PVE host ;)

Any input would be appreciated ;)
 
I am having the same issue and am also hoping for a resolution. I have several containers on this node, but it is always the same one that hangs up. I am using NFS, but I'm not seeing any issues. I'm running 6.3-2. Thanks.
 
Well, try patience. I rebooted the PVE host, then spent 4 hours until I found pct fsck to repair the corrupted filesystem of the container.
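
For reference, a rough sketch of that (CT 102 is just an example ID; run it on a stopped container):

Code:
pct fsck 102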
 
It seems an I/O-stuck container can block the whole host. It made the host show up with question marks in the console. Still, I didn't want to reboot the whole host, because the rest of the containers seemed to be working fine.
This thread helped me fix the issue without a host reboot.

First, identify the failing container with this command:
Code:
ps auxf | awk '{if($8~"D") print $0;}';

It will show processes stuck in uninterruptible (D) state, i.e. stale or blocked on I/O.

Then find the container by PID. I call this script `cont_pid2id`:
Code:
#!/bin/sh
# Map a PID to the PVE container (CT) ID it belongs to.

if [ ! -e "/proc/$1/cgroup" ]; then
   echo "no process with PID '$1' found!"
   exit 1
fi

# The cgroup path of an LXC process contains the CT ID after "pids:/lxc/".
grep -oP "(?<=pids:/lxc/)\d+" "/proc/$1/cgroup" || { echo "process with PID '$1' does not belong to a PVE CT"; exit 1; }
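
A usage sketch, assuming a hypothetical D-state PID 4321 that belongs to CT 102:

Code:
root@pve:~# ./cont_pid2id 4321
102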

And finally, check the status of the container:

Code:
systemctl status pve-container@CTID

and stop the container service with:

Code:
systemctl stop pve-container@CTID

This will wait for the graceful stop timeout and then automatically kill -9 all remaining processes.
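
Putting it together, the whole sequence for a hypothetical CT 102 looks roughly like this:

Code:
ps auxf | awk '{if($8~"D") print $0;}'    # find processes stuck in D state
./cont_pid2id <PID>                       # map a stuck PID to its CT ID, e.g. 102
systemctl status pve-container@102        # check the container service
systemctl stop pve-container@102          # graceful stop, then kill -9 after the timeout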
 
