Can't start CT after stopping it

breakaway9000

Renowned Member
Dec 20, 2015
Hi,

I am running a 3-node cluster.

I stopped a container to make some changes to its hardware. When I went to power it back on, I couldn't. When I press the "Start" button, I get a spinning 'Please wait' indicator for a while, and eventually it says "Timed Out".

I SSHed into the host itself and ran "pct start 130" (130 being the container ID), but I got this:

Code:
# pct start 130
trying to acquire lock...
can't lock file '/run/lock/lxc/pve-config-130.lock' - got timeout
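
A timeout here usually means another process (often a hung start or stop task) still holds the lock, rather than the file simply being left behind. Before deleting anything, a rough check of what is holding it might look like this (fuser comes from the psmisc package; the path is the one from the error message):

Code:
# fuser -v /run/lock/lxc/pve-config-130.lock
# ps faxww | grep -E '[l]xc|[p]ct'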

So I tried to unlock it.

Code:
# pct unlock 130
trying to acquire lock...
can't lock file '/run/lock/lxc/pve-config-130.lock' - got timeout

So then I went to actually delete the lock and then start the container.

Code:
# rm /var/lock/qemu-server/lock-130.conf
# pct start 130
trying to acquire lock...
can't lock file '/run/lock/lxc/pve-config-130.lock' - got timeout
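
Worth noting: /var/lock/qemu-server/lock-130.conf is the qemu-server (VM) lock, while the error message points at the container lock under /run/lock/lxc/. If that lock really were stale, the equivalent for the container would be something like the following, though only after confirming nothing is actually holding it:

Code:
# rm /run/lock/lxc/pve-config-130.lock
# pct start 130

Even then, removing the file only gets past the locking step; if a previous start or stop is hung in the kernel, the container will still fail to come up.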

So, what do I need to do to get this CT started?

My Versions:
Code:
proxmox-ve: 5.1-38 (running kernel: 4.13.13-5-pve)
pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.76-1-pve: 4.4.76-94
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.4.49-1-pve: 4.4.49-86
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.98-2-pve: 4.4.98-101
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.4.67-1-pve: 4.4.67-92
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-20
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.4-pve2~bpo9
ceph: 12.2.2-pve1
 
A host reboot has resolved this, but I'd like to know why it's happening. As you can imagine, the host has other VMs and CTs running on it, and restarting the host is a bit of a pain.
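
For what it's worth, the usual culprit in cases like this is a leftover lxc-start or monitor process for the container that is stuck in uninterruptible sleep (state D) inside the kernel; a process in that state ignores even SIGKILL, which is why only a reboot clears it. Before the next reboot it may be worth checking for one (130 is just the CT ID from this thread):

Code:
# ps -eo pid,stat,wchan:32,cmd | grep -E '[l]xc.*130'

A D in the STAT column together with a kernel function name in WCHAN means the task is stuck in the kernel and cannot be killed from userspace.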
 
Vasu, do you know the root cause of this issue? I had a read of that thread, and it doesn't look like the same issue: you're using ZFS, but I'm just using ext4. My Proxmox node is still responsive inside the cluster (I can access the consoles of running VMs/CTs), but I cannot log into the node itself (the login times out; the node isn't greyed out or anything).

Can this issue happen without ZFS?

Edit: as in, ZFS is installed but we are not using it for storing VMs. We do have a Ceph "test" cluster, though.
 
That information is in my post above but I'll put it here again:

Code:
# pveversion --verbose
proxmox-ve: 5.1-38 (running kernel: 4.13.13-5-pve)
pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-5-pve: 4.13.13-38
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-20
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.4-pve2~bpo9
 
I had two issues.

1. LXC failing to start; a node restart clears the issue, but it keeps happening.
2. LXC failing to start, where even a node restart didn't fix it. I had to do some patching for TUN/TAP to work, then it started. No more issues since.

Both were converted from OpenVZ.
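
In case it helps someone searching later, and assuming the "patching" above was the usual config-level TUN/TAP tweak (that detail isn't spelled out here), it generally amounts to adding these raw LXC lines to /etc/pve/lxc/<CTID>.conf (<CTID> is a placeholder; c 10:200 is the /dev/net/tun device), then restarting the container:

Code:
lxc.cgroup.devices.allow: c 10:200 rwm
lxc.mount.entry: /dev/net/tun dev/net/tun none bind,create=file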

Is your CT fresh, or was it converted from OpenVZ?
 
We only started using Proxmox at 4.1, and by then it was already LXC, so I suppose it was "fresh".
 
Is that likely to help if I don't have any ZFS volumes on my systems at all? My boot volume is ext4/LVM, and my data volumes are Ceph/ext4.
 
It's a kernel issue with network namespace cloning. Don't worry if that doesn't make sense. To confirm that you are having the same issue, wait for an LXC container to fail to start, then run:

Code:
grep copy_net_ns /proc/*/stack

If that returns anything, you're hitting the same issue. Proxmox is working with the community to get this fixed, but it doesn't affect everyone, and they haven't been able to reproduce it in the lab. If you think your setup is something they could reproduce, please describe it in this thread:
https://forum.proxmox.com/threads/lxc-container-reboot-fails-lxc-becomes-unusable.41264/#post-201579
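
If that grep does return matches, the PID is the middle part of each /proc/<pid>/stack path; a convenience one-liner (plain shell, not specific to Proxmox) to see which processes those are:

Code:
# grep -l copy_net_ns /proc/*/stack 2>/dev/null | cut -d/ -f3 | xargs -r -n1 ps -o pid,stat,cmd -p

The processes listed are the ones stuck in the kernel; they will typically be in state D and cannot be killed, so only a reboot clears them.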
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!