Entire Xen Node has grey question mark + database container gone

dburleson · Jun 7, 2018

Hi Everyone,

Firstly, i've recently taken on the management of a proxmox cluster which I have had no experience managing previously (i'm completely new to cluster management, but not too bad at linux).

Code:

pve-manager/5.1-46/ae8241d4 (running kernel: 4.13.13-6-pve)

I have 2 xen nodes which run a number of containers and VMs within them. Yesterday, a container on Xen2, which runs a mysql database, stopped responding. I was able to log in to the container via ssh and attempted to restart mysql only to receive an error along the lines that it was unable to connect to the mysql.sock. So I decided to simply shutdown the container and start it back up. I chose 'shutdown' in proxmox UI for the container, which it then shutdown. Then I clicked 'start', in which proxmox logs recorded:

Code:

CT 110 - Start          ERROR: command 'systemctl start pve-container@110' failed: exit code 1

So, I've tried running the 'system start ...' via ssh. It takes a while, and then I get the following:

Code:

Job for pve-container@110.service failed because a timeout was exceeded.
See "systemctl status pve-container@110.service" and "journalctl -xe" for details.

Here is the output of 'systemctl status ...':

Code:

● pve-container@110.service - PVE LXC Container: 110
   Loaded: loaded (/lib/systemd/system/pve-container@.service; static; vendor preset: enabled)
   Active: failed (Result: timeout) since Thu 2018-06-07 08:35:22 BST; 43s ago
     Docs: man:lxc-start
           man:lxc
           man:pct
  Process: 1603366 ExecStart=/usr/bin/lxc-start -n 110 (code=killed, signal=TERM)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/system-pve\x2dcontainer.slice/pve-container@110.service
           └─1532500 [lxc monitor] /var/lib/lxc 110

Jun 07 08:33:52 xen2 systemd[1]: Starting PVE LXC Container: 110...
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Start operation timed out. Terminating.
Jun 07 08:35:22 xen2 systemd[1]: Failed to start PVE LXC Container: 110.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Unit entered failed state.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Failed with result 'timeout'.

and 'journalctl -xe':

Code:

Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Start operation timed out. Terminating.
Jun 07 08:35:22 xen2 systemd[1]: Failed to start PVE LXC Container: 110.
-- Subject: Unit pve-container@110.service has failed
-- Defined-By: systemd
--
-- Unit pve-container@110.service has failed.
--
-- The result is failed.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Unit entered failed state.
Jun 07 08:35:22 xen2 systemd[1]: pve-container@110.service: Failed with result 'timeout'.

Shortly after attempting to restart the container the first time, the entire xen2 node started displaying grey questions marks along side all it's VM/containers and they lost their labels (see screenshot):

imgur[dot]com/a/eBerw8q (can't link it due to anti-spam)

Despite this, all the other VMs/Containers within xen2 are still functioning fine. So, I've then decided to run the following commands to see what would happen:

service pvedaemon restart (nothing changed)
service pveproxy restart (nothing changed)
service pvestatd restart (The VMs started showing names within proxmox UI (but not containers), but this only lasted 10-15 minutes)

I'm hesitant to upgrade or restart the entire xen node due to the unknown side of configuration and what potential pitfalls may lie ahead and that its business critical to have at least something running. Furthermore, I don't have a test environment to run a test upgrade. I

've ran through /var/log/syslog and didn't see anything that indicated why the container crashed.

Ideally, I want to achieve:

Determine why the database container crashed (110)
Successfully start up the database container again
Determine why the xen2 node isn't reporting data to the UI about it's VM/Containers
Fix the reporting data in the UI for the node

Again, please appreciate i'm new to proxmox, but I do know my away around linux.

Thank you for any tips/knowledge on troubleshooting this problem. If there is any other info you'd like me to share, please let me know.

Cheers,
David

FastLaneJB · Jun 7, 2018

I think you'll probably find this is because of an issue in the 4.13 kernel your running. The newer 4.15 kernel should solve that. I had similar issues to you as well but nothing since moving to the 4.15 kernel. There's another thread, I think the 4.15 kernel thread itself which talked about this.

One other point but Proxmox uses KVM and LXC, not Xen. Just wanted to mention it

dburleson · Jun 7, 2018

Thanks for the response. Unfortunately, as mentioned, hesitant upgrading (as per reasons above) and because i've come across plenty of threads with similar issues which do the upgrade, and then report the problem isn't resolved. I'm more interested in trying to figure out what happened before jumping for the 'upgrade' option.

dburleson · Jun 7, 2018

I've done some further reading, and came across this post on here (can't find it) which indicated that there might be a process which is locking any other commands from being executed on the container. So I ran 'ps aux | grep 110' :

Code:

root@xen2:~# ps aux | grep 110
root         110  0.0  0.0      0     0 ?        S<   Apr21   0:00 [kworker/16:0H]
root     1532500  0.0  0.0  50216  3732 ?        Ds   Jun06   0:00 [lxc monitor] /var/lib/lxc 110
root     1532958  0.0  0.0  14688  1604 ?        Ss   Jun06   0:00 /usr/bin/dtach -A /var/run/dtach/vzctlconsole110 -r winch -z lxc-console -n 110 -e -1
root     1532959  0.0  0.0  41768  4128 pts/5    Ss+  Jun06   0:00 lxc-console -n 110 -e -1
root     1551261  0.0  0.0  41772  4084 ?        S    Jun06   0:00 lxc-info -n 110 -p
root     1557645  0.0  0.0  41772  4236 ?        S    Jun06   0:00 lxc-info -n 110 -p
root     1557993  0.0  0.0  41772  4248 ?        S    Jun06   0:00 lxc-info -n 110 -p
root     1558236  0.0  0.0  41772  4208 ?        S    Jun06   0:00 lxc-info -n 110 -p
root     1608962  0.0  0.0  12788   968 pts/4    S+   09:54   0:00 grep 110

Sureenough, 1532500 looked like the cuplrit. So I've slowly killed off the processes. Once I did that, my proxmox UI started reporting the right data again in terms of names and status (woohoo!).

Now I just need to get the container back up and running. So, i've ran 'systemctl status pve-container@110.service' :

Code:

● pve-container@110.service - PVE LXC Container: 110
   Loaded: loaded (/lib/systemd/system/pve-container@.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2018-06-07 10:01:34 BST; 1min 10s ago
     Docs: man:lxc-start
           man:lxc
           man:pct
  Process: 1612944 ExecStart=/usr/bin/lxc-start -n 110 (code=exited, status=1/FAILURE)

Jun 07 10:01:24 xen2 lxc-start[1612944]: lxc-start: 110: lxccontainer.c: wait_on_daemonized_start: 760 Received container state "STOPPING" instead of "RUNNING"
Jun 07 10:01:24 xen2 lxc-start[1612944]: lxc-start: 110: tools/lxc_start.c: main: 371 The container failed to start.
Jun 07 10:01:24 xen2 lxc-start[1612944]: lxc-start: 110: tools/lxc_start.c: main: 373 To get more details, run the container in foreground mode.
Jun 07 10:01:24 xen2 lxc-start[1612944]: lxc-start: 110: tools/lxc_start.c: main: 375 Additional information can be obtained by setting the --logfile and --logpriority options.
Jun 07 10:01:24 xen2 systemd[1]: pve-container@110.service: Control process exited, code=exited status=1
Jun 07 10:01:24 xen2 systemd[1]: pve-container@110.service: Killing process 1612946 (lxc-start) with signal SIGKILL.
Jun 07 10:01:24 xen2 systemd[1]: pve-container@110.service: Killing process 1612958 (sh) with signal SIGKILL.
Jun 07 10:01:34 xen2 systemd[1]: Failed to start PVE LXC Container: 110.
Jun 07 10:01:34 xen2 systemd[1]: pve-container@110.service: Unit entered failed state.
Jun 07 10:01:34 xen2 systemd[1]: pve-container@110.service: Failed with result 'exit-code'.

It recommends running it in foreground mode:

Code:

root@xen2:~# lxc-start --name 110 --foreground
lxc-start: 110: network.c: instantiate_veth: 130 Failed to create veth pair "veth110i0" and "vethU94HCC": File exists
                                                                                                                     lxc-start: 110: network.c: lxc_create_network_priv: 2407 Failed to create network device
                                                                                                                                                                                                             lxc-start: 110: start.c: lxc_spawn: 1206 Failed to create the network.
                                                                lxc-start: 110: start.c: __lxc_start: 1477 Failed to spawn container "110".
                                                                                                                                           lxc-start: 110: tools/lxc_start.c: main: 371 The container failed to start.
lxc-start: 110: tools/lxc_start.c: main: 375 Additional information can be obtained by setting the --logfile and --logpriority options.

So, there appears to be something to do with a network device.

r.jochum · Jun 7, 2018

An upgrade to PVE 5.2 with the 4.15 kernel should be done, its very easy just add the pve-no-subscription repo and "apt dist-upgrade" (or better ofc pve-enterprise if you willing to get a Subscription and give some thanks back).

back to your xen3 LXC container (was very confused by its name

):

hmm give the output of "ip link" please.

r.jochum · Jun 7, 2018

Killing an LXC process is never good, that wount cleanup everything ....

run:

ip link delete veth110i0
ip link delete vethU94HCC

Think maybe after that the container comes up again, or better as said upgrade and reboot to 4.15.

dburleson · Jun 7, 2018

Ok, i've done that, and re ran 'lxc-start --name 100 --foreground' and got the following now:

Code:

lxc-start: 110: cgroups/cgfsng.c: create_path_for_hierarchy: 1337 Path "/sys/fs/cgroup/systemd//lxc/110" already existed.
lxc-start: 110: cgroups/cgfsng.c: cgfsng_create: 1433 Failed to create "/sys/fs/cgroup/systemd//lxc/110"
lxc-start: 110: cgroups/cgfsng.c: create_path_for_hierarchy: 1337 Path "/sys/fs/cgroup/systemd//lxc/110-1" already existed.
lxc-start: 110: cgroups/cgfsng.c: cgfsng_create: 1433 Failed to create "/sys/fs/cgroup/systemd//lxc/110-1"
/code]

this is still running at the moment...so it must be doing something?

r.jochum · Jun 7, 2018

Any chance to reboot?

r.jochum · Jun 7, 2018

Seems theres still a process missing, NEVER kill the monitor but use "lxc-stop -n 110 --kill"

"ps fauxww" may helps if you still wanna cleanup.

dburleson · Jun 7, 2018

from what I can see, when it crashed (and everytime i've killed the process manually) it's added another set of files within /sys/fs/cgroup/lxc entitled '110'. if I rename one, and then try to start the container again, it moves ont othe next located file with the name of 110 that needs fixing. therefore, I need to clean up all instances of '110' before attempted to start the container up again.

btw, thanks for the help so far! You're really helping me understand how all this cluster stuff works

reboot not likely as it's mid day here and considering no one has documentation about how it was setup in the first place, there may be some unknown considerations that arise (again, thus the hesitation at the moment).

I'm just wondering if I find a way to rename all the instances of '110' within /sys/fs/cgroup to '110.bak', might this help start the container? When trying to start it, it's complaing of things wthing cgroup already existsing.

Or is there an easier way to 'clean up' cgroup?

r.jochum · Jun 7, 2018

Are there still processes spawning? What do you get from "lxc-info -n 110" ?

dburleson · Jun 7, 2018

I don't think so, I think it's happened from each time I've ran 'lxc-start --name 110 --foreground' and killed the process manually via 'kill -9'

So, I've renamed all instances of /lxc/110/ to /lxc/110-bak/ from within sys/fs/cgroup in hopes this might prevent the errors.

I've started running 'lxc-start --name 110 --foreground' again, but it's taking a really long time. Is there any way to see if it's actually doing anything or if it's died?

It looks like all the names/stats are missing again in the web ui, I guess it's stopped pvestatd

r.jochum · Jun 7, 2018

lxc-start -n ID -F -l DEBUG -o /tmp/lxc-ID.log

dburleson · Jun 7, 2018

Thanks, i've ran that command and it's hanging on this:

Code:

root@xen2:/var/log# tail -f lxc-110.log 
      lxc-start 110 20180607105405.697 DEBUG    lxc_start - start.c:setup_signal_fd:301 - Set SIGCHLD handler with file descriptor: 5.
      lxc-start 110 20180607105405.697 DEBUG    console - console.c:lxc_console_peer_default:459 - using "/dev/tty" as peer tty device
      lxc-start 110 20180607105405.697 DEBUG    console - console.c:lxc_console_sigwinch_init:151 - process 1655323 created signal fd 9 to handle SIGWINCH events
      lxc-start 110 20180607105405.697 DEBUG    console - console.c:lxc_console_winsz:71 - set winsz dstfd:6 cols:105 rows:50
      lxc-start 110 20180607105405.697 INFO     lxc_start - start.c:lxc_init:680 - container "110" is initialized
      lxc-start 110 20180607105405.698 INFO     lxc_conf - conf.c:run_script:507 - Executing script "/usr/share/lxc/lxcnetaddbr" for container "110", config section "net".
      lxc-start 110 20180607105406.116 DEBUG    lxc_network - network.c:instantiate_veth:219 - Instantiated veth "veth110i0/vethKIG9ED", index is "92"
      lxc-start 110 20180607105406.116 INFO     lxc_cgroup - cgroups/cgroup.c:cgroup_init:67 - cgroup driver cgroupfs-ng initing for 110
      lxc-start 110 20180607105406.116 DEBUG    lxc_cgfsng - cgroups/cgfsng.c:filter_and_set_cpus:469 - No isolated cpus detected.
      lxc-start 110 20180607105406.116 DEBUG    lxc_cgfsng - cgroups/cgfsng.c:handle_cpuset_hierarchy:640 - "cgroup.clone_children" was already set to "1".

Is it doing something? I should mention this is a database container for mysql.

r.jochum · Jun 7, 2018

I see no error.

dburleson · Jun 7, 2018

Me neither, apart from the fact that it's still not started (45 minutes now). I'm trying not to complicate things, I simply want to get this container restarted. However, would cloning the container might be a possible solution?

r.jochum · Jun 7, 2018

Not sure whats broken, whats not. Sounds like a good idea though.

dburleson · Jun 7, 2018

Unfortunatly, it looks like it's not just the container. I decided to migrate a copy of working container from another machine to the problematci machine and it still wouldn't boot with the same errors. However, loading the container on a known working machine works fine.

eliosh · May 8, 2019

Hi, I have the identical problem.
I created another lxc, but the problem is identical: it remains stuck there:

Code:

...
lxc-start 116 20190508074130.570 DEBUG    network - network.c:instantiate_veth:206 - Instantiated veth "veth116i0/vethE620E5", index is "41"
lxc-start 116 20190508074130.571 TRACE    cgfsng - cgroups/cgfsng.c:cg_legacy_filter_and_set_cpus:434 - No isolated cpus detected
lxc-start 116 20190508074130.571 DEBUG    cgfsng - cgroups/cgfsng.c:cg_legacy_handle_cpuset_hierarchy:619 - "cgroup.clone_children" was already set to "1"
lxc-start 116 20190508074130.571 INFO     cgfsng - cgroups/cgfsng.c:cgfsng_payload_create:1537 - The container uses "lxc/116" as cgroup

Stoiko Ivanov · May 8, 2019

* please post the container's config (`pct config $vmid`)
* do you have any relevant messages in the journal (`journalctl -r`)
* take a look at the output of `ps auxwf` (you should see whether the container has started any processes)

hope this helps!

Entire Xen Node has grey question mark + database container gone

New Member

Renowned Member

New Member

New Member

Renowned Member

Renowned Member

New Member

Renowned Member

Renowned Member

New Member

Renowned Member

New Member

Renowned Member

New Member

Renowned Member

New Member

Renowned Member

New Member

Member

Proxmox Staff Member

We value your privacy