Can't start CT after stopping it

breakaway9000

Renowned Member
Dec 20, 2015
Hi,

I am running a 3-node cluster.

I stopped a container to make some changes to its hardware. When I went to power it back on, I couldn't. When I press the "Start" button, I get a spinning 'Please wait' indicator for a while, and eventually it says "Timed Out".

I SSHed into the host itself and ran "pct start 130" (130 being the container ID), but I got this:

Code:
# pct start 130
trying to acquire lock...
can't lock file '/run/lock/lxc/pve-config-130.lock' - got timeout
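
A timeout here usually means another process (often a hung start or stop task) still holds the lock, rather than the file simply being left behind. Before deleting anything, a rough check of what is holding it might look like this (fuser comes from the psmisc package; the path is the one from the error message):

Code:
# fuser -v /run/lock/lxc/pve-config-130.lock
# ps faxww | grep -E '[l]xc|[p]ct'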

So I tried to unlock it.

Code:
# pct unlock 130
trying to acquire lock...
can't lock file '/run/lock/lxc/pve-config-130.lock' - got timeout

So then I went to actually delete the lock and then start the container.

Code:
# rm /var/lock/qemu-server/lock-130.conf
# pct start 130
trying to acquire lock...
can't lock file '/run/lock/lxc/pve-config-130.lock' - got timeout
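
Worth noting: /var/lock/qemu-server/lock-130.conf is the qemu-server (VM) lock, while the error message points at the container lock under /run/lock/lxc/. If that lock really were stale, the equivalent for the container would be something like the following, though only after confirming nothing is actually holding it:

Code:
# rm /run/lock/lxc/pve-config-130.lock
# pct start 130

Even then, removing the file only gets past the locking step; if a previous start or stop is hung in the kernel, the container will still fail to come up.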

So, what do I need to do to get this CT started?

My Versions:
Code:
proxmox-ve: 5.1-38 (running kernel: 4.13.13-5-pve)
pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.76-1-pve: 4.4.76-94
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.98-2-pve: 4.4.98-101
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.4.67-1-pve: 4.4.67-92
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-20
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.4-pve2~bpo9
ceph: 12.2.2-pve1
 
A host reboot has resolved this, but I'd like to know why it's happening. As you can imagine, the host has other VMs and CTs running on it, and restarting the host is a bit of a pain.
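
For what it's worth, the usual culprit in cases like this is a leftover lxc-start or monitor process for the container that is stuck in uninterruptible sleep (state D) inside the kernel; a process in that state ignores even SIGKILL, which is why only a reboot clears it. Before the next reboot it may be worth checking for one (130 is just the CT ID from this thread):

Code:
# ps -eo pid,stat,wchan:32,cmd | grep -E '[l]xc.*130'

A D in the STAT column together with a kernel function name in WCHAN means the task is stuck in the kernel and cannot be killed from userspace.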
 
Vasu, do you know the root cause of this issue? I had a read of that thread, and it doesn't look like the same issue: you're using ZFS, but I'm just using ext4. My Proxmox node is still responsive inside the cluster (I can access the consoles of running VMs/CTs), but I cannot log into the node itself (the login times out; the node isn't greyed out or anything).

Can this issue happen without ZFS?

Edit: as in, ZFS is installed but we are not using it for storing VMs. We do have a Ceph "test" cluster, though.
 
That information is in my post above but I'll put it here again:

Code:
# pveversion --verbose
proxmox-ve: 5.1-38 (running kernel: 4.13.13-5-pve)
pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-5-pve: 4.13.13-38
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-20
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.4-pve2~bpo9
 
I had two issues.

1. LXC failing to start; a node restart clears the issue, but it keeps happening.
2. LXC failing to start, where even a node restart didn't fix it. I had to do some patching for TUN/TAP to work, then it started. No more issues since.

Both were converted from OpenVZ.
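
In case it helps someone searching later, and assuming the "patching" above was the usual config-level TUN/TAP tweak (that detail isn't spelled out here), it generally amounts to adding these raw LXC lines to /etc/pve/lxc/<CTID>.conf (<CTID> is a placeholder; c 10:200 is the /dev/net/tun device), then restarting the container:

Code:
lxc.cgroup.devices.allow: c 10:200 rwm
lxc.mount.entry: /dev/net/tun dev/net/tun none bind,create=file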

Is your CT fresh, or was it converted from OpenVZ?
 
We only started using Proxmox at 4.1, and by then it was already LXC, so I suppose it was "fresh".
 
Is that likely to help if I don't have any ZFS volumes on my systems at all? My boot volume is ext4/LVM, and my data volumes are Ceph/ext4.
 
It's a kernel issue with network namespace cloning. Don't worry if that doesn't make sense. To confirm that you are having the same issue, wait for an LXC container to fail to start, then run:

Code:
grep copy_net_ns /proc/*/stack

If that returns anything, you're hitting the same issue. Proxmox is working with the community to get this fixed, but it doesn't affect everyone, and they haven't been able to reproduce it in the lab. If you think your setup is something they could reproduce, please describe it in this thread:
https://forum.proxmox.com/threads/lxc-container-reboot-fails-lxc-becomes-unusable.41264/#post-201579
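
If that grep does return matches, the PID is the middle part of each /proc/<pid>/stack path; a convenience one-liner (plain shell, not specific to Proxmox) to see which processes those are:

Code:
# grep -l copy_net_ns /proc/*/stack 2>/dev/null | cut -d/ -f3 | xargs -r -n1 ps -o pid,stat,cmd -p

The processes listed are the ones stuck in the kernel; they will typically be in state D and cannot be killed, so only a reboot clears them.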
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!