Help! I'm in deep trouble, can't start containers after reboot

jarcher

Hi All...

I'm in deep trouble here :-(

I would greatly appreciate any assistance!

I was having some trouble managing containers: I would get an error about a worker thread whenever I tried to use the web-based management interface. I was still able to stop containers using vzctl at the command line, and the containers were all operating okay, so I thought the issue was limited to the user interface. Foolishly, I rebooted the machine. It came back up, and now I can't start any of my containers. These are critical!

If I try to use vzctl to start the containers, I get:

root@pmmaster:/# vzctl start 101
Container config file does not exist
root@pmmaster:/#

If I try to list the containers:

root@pmmaster:/# vzlist -a
Container(s) not found
root@pmmaster:/#


Thanks very much!

Here is my version information:


root@pmmaster:/# pveversion -v
proxmox-ve-2.6.32: 3.1-109 (running kernel: 2.6.32-23-pve)
pve-manager: 3.1-3 (running version: 3.1-3/dc0e9b0e)
pve-kernel-2.6.32-20-pve: 2.6.32-100
pve-kernel-2.6.32-19-pve: 2.6.32-95
pve-kernel-2.6.32-16-pve: 2.6.32-82
pve-kernel-2.6.32-17-pve: 2.6.32-83
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.0-1
pve-cluster: 3.0-7
qemu-server: 3.1-1
pve-firmware: 1.0-23
libpve-common-perl: 3.0-6
libpve-access-control: 3.0-6
libpve-storage-perl: 3.0-10
pve-libspice-server1: 0.12.4-1
vncterm: 1.1-4
vzctl: 4.0-1pve3
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.4-17
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.0-2
 
Digging further, I see that my containers are still in /var/lib/vz/private

For example, container 101 has its data there.

But if I look in /var/lib/vz/root/101, there are no files at all. I think this is normal when containers are not running. Since my data is still there, I guess there is hope... Now I'm trying to track down where the config files are supposed to be.
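
For anyone following along, this is roughly where I have been looking. As far as I can tell, a plain OpenVZ install reads configs from /etc/vz/conf/<CTID>.conf, while Proxmox keeps them under /etc/pve; the symlink detail below is my assumption, so treat this as a sketch:

Code:
ls -l /etc/vz/conf              # where plain OpenVZ expects <CTID>.conf
ls -l /etc/pve/openvz           # on Proxmox, should point at the local node's config dir (assumption)
ls -l /etc/pve/nodes/*/openvz   # per-node config directories in the cluster filesystem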
 
I was able to use the web interface to create a new container, number 3000. There is a config file for it:

root@pmmaster:/etc/pve/nodes/pmmaster/openvz# ls -l
total 1
-rw-r----- 1 root www-data 923 Feb 27 17:07 3000.conf
root@pmmaster:/etc/pve/nodes/pmmaster/openvz#


But that is the only config file in that directory, so apparently all my container config files have vanished. If this is correct, is there a way to recreate them? I have been backing up the containers, so can I maybe extract just the config files from the backups?
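
If I understand the vzdump format correctly, the container config is stored inside the backup archive itself (I believe as ./etc/vzdump/vps.conf, but that path is my assumption, and the dump filename below is just an example), so something like this should pull it out:

Code:
# confirm where the config sits inside the archive (dump filename is only an example)
tar -tzf /var/lib/vz/dump/vzdump-openvz-101-2014_02_27-00_00_01.tar.gz | grep vzdump
# write just that config to a scratch file for inspection
tar -xzOf /var/lib/vz/dump/vzdump-openvz-101-2014_02_27-00_00_01.tar.gz ./etc/vzdump/vps.conf > /root/101.conf.recovered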
 
As a quick fix, since I absolutely had to have some containers running, I copied the config files from the other (now inactive) node to /etc/pve/nodes/pmmaster/openvz under different file names. This allowed me to start the containers, even though the names of the config files don't match the container IDs. They seem to be running, but it can't stay this way.

I did some reading about FUSE, and as near as I can tell, the contents of /etc/pve are actually backed by an SQLite database. So perhaps there is some corruption in the SQLite storage?
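
In case anyone wants to check the same thing, this is roughly how I have been poking at it; the database path and table name are what I gathered from reading about pmxcfs, so treat this as a sketch rather than gospel:

Code:
mount | grep /etc/pve            # should show the FUSE mount provided by pve-cluster (pmxcfs)
ls -l /var/lib/pve-cluster/      # the backing SQLite database should live here (assumption)
sqlite3 /var/lib/pve-cluster/config.db "SELECT name FROM tree;"   # list stored entries (table name assumed)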
 
Hi,
is the rebooted node part of a cluster?
Do you still have the configs on the other node, in /etc/pve/nodes/NODENAME/openvz/ ?
What about backups?

Udo
 
Hi Udo, thank you for the reply.

Well, this node was part of a cluster a while ago. I upgraded this one but not the other node, since I didn't need the cluster any more. I now see the other node in the "Datacenter" list, and it has the containers listed and grayed out. And yes, the config files are on the other node; that's how I was able to recover them. I tried to copy them from the old node to the current one, but it would not let me, saying those names are already in use. This is why I renamed them (for example, I renamed 101.conf to 3010.conf). In most cases this worked, although not all.

Each container is backed up using the Proxmox backup capability. I have not attempted to restore them because, at the time, I was under the impression that there was a file system issue. However, after reading about the cluster file system under FUSE, I am thinking it's normal that I cannot write those files, since they already exist on the other node.

I have not wanted to risk restoring from backup since I really do not understand what is going on.
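
For reference, this is how I have been checking which node currently holds a given config (plain shell; adjust the CTID as needed):

Code:
# every container config the cluster filesystem knows about, per node
find /etc/pve/nodes -name '*.conf'
# check whether a particular CTID is already taken somewhere
find /etc/pve/nodes -name '101.conf'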
 
Hi,
this sounds like you have a broken cluster - normally both nodes (or however many nodes are in your cluster) should mount /etc/pve read-only, because they don't have quorum...

You know best what you did to get quorum...

What is the output of the following commands (on all nodes)?
Code:
pvecm nodes
pvecm status
mount | grep pve
find /etc/pve

Perhaps the cluster filesystem is only mounted over your old /etc/pve?
Try stopping pve-cluster, check the mounts, and take a look at /etc/pve.
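For example, something like this (the service name is the usual one on PVE 3.x; "pvecm expected 1" is only meant for a node that is intentionally running alone and just lacks quorum):

Code:
service pve-cluster stop     # stop pmxcfs so /etc/pve gets unmounted
mount | grep pve             # verify nothing is still mounted on /etc/pve
ls -la /etc/pve              # see what the underlying local directory contains
service pve-cluster start    # start it again and re-check
# only if the node is deliberately running alone and quorum is what blocks writes:
pvecm expected 1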

Udo
 
