LXC can't start after force stop

voarsh
Like in this post: https://forum.proxmox.com/threads/u...ced-to-reboot-node-manually.57148/post-263512
I had to actually kill the LXC start process myself, after finding it with ps faxuw.
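Roughly what that looked like (a sketch from memory; the grep pattern and the PID placeholder are illustrative, not copied from my actual session):

# locate the hung lxc-start / [lxc monitor] process for the container (140 in my case)
ps faxuw | grep "lxc.*140"
# kill it using the PID from the second column of the ps output
kill -9 <PID>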

Unfortunately, now it won't turn back on.
Any ideas for me?

debug log:
lxc-start 140 20210222202600.401 INFO lsm - lsm/lsm.c:lsm_init:29 - LSM security driver AppArmor
lxc-start 140 20210222202600.401 INFO conf - conf.c:run_script_argv:340 - Executing script "/usr/share/lxc/hooks/lxc-pve-prestart-hook" for container "140", config section "lxc"
lxc-start 140 20210222202601.996 DEBUG conf - conf.c:run_buffer:312 - Script exec /usr/share/lxc/hooks/lxc-pve-prestart-hook 140 lxc pre-start produced output: failed to remove directory '/sys/fs/cgroup/devices/lxc/140/ns/docker/153f724dd7b304a4a9042b652ff11c4c0bb238ecfeb2de94626bcbd13e646704': Device or resource busy
lxc-start 140 20210222202602.185 ERROR conf - conf.c:run_buffer:323 - Script exited with status 16
lxc-start 140 20210222202602.187 ERROR start - start.c:lxc_init:797 - Failed to run lxc.hook.pre-start for container "140"
lxc-start 140 20210222202602.188 ERROR start - start.c:__lxc_start:1896 - Failed to initialize container "140"
lxc-start 140 20210222202602.191 INFO conf - conf.c:run_script_argv:340 - Executing script "/usr/share/lxcfs/lxc.reboot.hook" for container "140", config section "lxc"
lxc-start 140 20210222202602.524 INFO conf - conf.c:run_script_argv:340 - Executing script "/usr/share/lxc/hooks/lxc-pve-poststop-hook" for container "140", config section "lxc"
lxc-start 140 20210222202603.984 DEBUG conf - conf.c:run_buffer:312 - Script exec /usr/share/lxc/hooks/lxc-pve-poststop-hook 140 lxc post-stop produced output: umount: /var/lib/lxc/140/rootfs: not mounted
lxc-start 140 20210222202603.984 DEBUG conf - conf.c:run_buffer:312 - Script exec /usr/share/lxc/hooks/lxc-pve-poststop-hook 140 lxc post-stop produced output: command 'umount --recursive -- /var/lib/lxc/140/rootfs' failed: exit code 1
lxc-start 140 20210222202604.728 ERROR conf - conf.c:run_buffer:323 - Script exited with status 1
lxc-start 140 20210222202604.739 ERROR start - start.c:lxc_end:964 - Failed to run lxc.hook.post-stop for container "140"
lxc-start 140 20210222202604.746 ERROR lxc_start - tools/lxc_start.c:main:308 - The container failed to start
lxc-start 140 20210222202604.750 ERROR lxc_start - tools/lxc_start.c:main:314 - Additional information can be obtained by setting the --logfile and --logpriority options
 
hi,

lxc-start 140 20210222202601.996 DEBUG conf - conf.c:run_buffer:312 - Script exec /usr/share/lxc/hooks/lxc-pve-prestart-hook 140 lxc pre-start produced output: failed to remove directory '/sys/fs/cgroup/devices/lxc/140/ns/docker/153f724dd7b304a4a9042b652ff11c4c0bb238ecfeb2de94626bcbd13e646704': Device or resource busy

here's the error message that seems most relevant.

could you also post the container configuration? pct config CTID
 
Yes, I was thinking that too.
I tried to remove /sys/fs/cgroup/devices/lxc/140/* but got 'Operation not permitted'.
I might have to reboot the host, which is not really something I want to do at this time.

arch: amd64
cores: 32
features: nesting=1
hostname: API
memory: 3412
net0: name=eth0,bridge=vmbr0,firewall=1,gw=192.168.100.1,hwaddr=2E:F6:27:66:51:34,ip=192.168.100.15/24,type=veth
onboot: 1
ostype: ubuntu
rootfs: FourTBpveIPC2Expansion:140/vm-140-disk-0.raw,size=25G
swap: 512
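As a side note on why the plain rm failed: on a cgroup filesystem the control files cannot be unlinked at all; only empty cgroup directories can be removed, with rmdir, and even that fails with 'Device or resource busy' while any process is still a member of the cgroup. A rough sketch (the long directory name is the leftover Docker sub-cgroup from the log above):

# fails with 'Operation not permitted': cgroupfs control files cannot be unlinked
rm -rf /sys/fs/cgroup/devices/lxc/140/*
# empty cgroup directories can be removed with rmdir, but this still fails with
# 'Device or resource busy' while any process remains a member of the cgroup
rmdir /sys/fs/cgroup/devices/lxc/140/ns/docker/153f724dd7b304a4a9042b652ff11c4c0bb238ecfeb2de94626bcbd13e646704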
 
are you sure the container is dead?

please post:
* pct list
* ps aux | grep CTID
 
root@HPProliantDL360PGen8:~# pct list
VMID       Status     Lock         Name
100        running                 bitwarden
102        stopped                 test
106        stopped                 test2
111        stopped                 gitlab
116        running                 photoprism
123        stopped                 nfsspeedtestFourteenTBExpansionUSB4
124        stopped                 nfsspeedtest2
128        stopped                 Seagate1TBSpeedtest
132        stopped                 python
135        running                 zab2
136        stopped                 power
137        stopped                 tt-unity
138        stopped                 prometheus
140        stopped                 API
141        running                 fileserver
143        running                 homelabos
145        running                 nextcloud
146        running                 mayan
151        stopped                 kibitzr
152        running                 beehive
153        running                 api2
1299       stopped                 HPBay8speedtest

ps aux | grep CTID

root@HPProliantDL360PGen8:~# ps aux | grep 140
root       140  0.0  0.0      0     0 ?        S    Feb13   2:47 [ksoftirqd/21]
root      1138  0.0  0.0   2140  1220 ?        Ss   Feb13   1:07 /usr/sbin/watchdog-mux
root      1161  0.0  0.0 166756  2140 ?        Ssl  Feb13   0:00 /usr/sbin/zed -F
100111    6187  0.1  0.0 114088  6364 ?        S    Feb20   4:08 /usr/sbin/zabbix_server: preprocessing worker #2 started
root     11140  0.3  0.0  22936  2072 ?        S    13:03   0:04 /lib/systemd/systemd-udevd
root     42355  0.0  0.0   6072  2452 pts/12   S+   13:30   0:00 grep 140
root     43331  0.0  0.0   9512  2140 ?        S    12:00   0:00 /usr/sbin/CRON -f
root     45243  0.0  0.0  15848  1408 ?        Ss   Feb18   0:00 /usr/sbin/sshd -D
daemon   62918  7.4  0.0 404140 15196 ?        Sl   Feb18 519:01 /usr/bin/python2.7 /usr/bin/pagekite --pidfile /var/run/pagekite.pid --clean --runas=daemon:daemon --logfile=/var/log/pagekite/pagekite.log --optdir=/etc/pagekite.d --noloop
100000   63578  0.0  0.0 108700   584 ?        Sl   Feb18   0:53 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/ecbf5b165181b3ac46def49f00cfc59be90ef62afd14005404d3fbf1cc7bfd33 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
100000   64955  0.0  0.0 169588  5140 ?        Ss   Feb20   2:51 /sbin/init
 
ok, can you also show the output of find /sys/fs/cgroup/devices/lxc/140?

you could probably remove these directories with find /sys/fs/cgroup/*/lxc/140* -depth -type d -print -delete if they're causing trouble for the container start
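A cautious way to apply that (a sketch; it's the same command as above, just split into a dry run first):

# preview which leftover cgroup directories would be removed
find /sys/fs/cgroup/*/lxc/140* -depth -type d -print
# then remove them; deletion only succeeds for cgroups with no remaining member processes
find /sys/fs/cgroup/*/lxc/140* -depth -type d -print -delete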
 
I solved that specific issue by restarting the host machine, which is not ideal by any means.

However, I am now having the same issue again (this time with container 159), and running find /sys/fs/cgroup/*/lxc/159* -depth -type d -print -delete does not help:
I keep getting 'Device or resource busy'.
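One way to narrow that down (a sketch, assuming cgroup v1 as in the log earlier in the thread; the 159 paths simply follow the command above): 'Device or resource busy' on a cgroup directory usually means some process is still a member of it, and those PIDs can be read from the cgroup.procs file inside each leftover directory.

# show which PIDs are still attached to the leftover cgroups of container 159
for d in $(find /sys/fs/cgroup/*/lxc/159* -depth -type d); do
    echo "== $d"
    cat "$d/cgroup.procs"
done
# once those processes are gone (e.g. killed with kill -9), the find ... -delete should succeed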
 
I have one particular Ubuntu container (my Plex media server) which, after more than two years of flawless performance, now just stops working at random. I can SSH into the container and run some basic commands, and everything is responsive, but when I tried an apt update/upgrade it just stopped responding. I SSH'd in again and issued 'sudo reboot'; it terminated my SSH session, yet 30 minutes later I still couldn't SSH back in. I fired up the PVE web portal, selected CT 101, opened the console... there's a cursor that moves around if you type, but the machine is deader than a doornail. I left it and came back 22 hours later; it still hadn't restarted. Three hours of googling, trying to figure out how the hell you can force-kill a container and successfully restart it without rebooting the entire node: zero success. Fine. Rebooted the node.

Cloned the container, got it up and running again, updated it, etc. Three days go by and it does exactly the same thing, AGAIN. Here I sit in the exact same predicament with absolutely no idea how to begin fixing this. Hours of looking through the PVE and container logs have turned up nothing. This box is a 2-CPU, 12-core Xeon with 128 GB of RAM (>50% free) that basically just idles all day long. Bare metal runs on two 2 TB SAS disks in a zpool mirror with 92% free, plus a 50 TB array of 8 TB SAS disks in mirrored vdevs with >10% free space.

Anyone have even a shot-in-the-dark clue how to fix this? If rebuilding the entire container from scratch is the only way to resolve it, I swear by Zeus I'm scrapping this entire goddamn homelab.
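For what it's worth, the usual escalation path before rebooting the whole node looks roughly like this (a sketch using CT 101 from the post above; the PID is a placeholder, and none of this is guaranteed to help if the processes are stuck in uninterruptible I/O):

# ask Proxmox to stop the container
pct stop 101
# if that hangs, try LXC's hard stop directly
lxc-stop -n 101 --kill
# if even that hangs, find the container's monitor / start process and kill it,
# as described earlier in this thread
ps faxuw | grep "lxc.*101"
kill -9 <PID>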
 
I've never had this happen. Do you have any backups from an earlier point where this doesn't happen?
 
Unfortunately, I migrated from an older node to this new machine a couple of months ago and forgot to configure periodic backups for this particular container. To be honest, the data is all still there, so it shouldn't be too difficult to migrate the databases from the existing container to a new one, but it's still something I'd rather not have to deal with. I guess I'll be monitoring this particular container for a bit and see what happens.
 
