[SOLVED] Proxmox 5.1.46 LXC cluster error Job for pve-container@101.service failed

Vasu Sreekumar

Mar 3, 2018
St Louis MO USA
Hi,


We have a very serious issue with a Proxmox 5.1.46 LXC cluster with ZFS, and we need urgent help.


When somebody stops an LXC container and then starts it again, it does not start and gives the following error:


Job for pve-container@101.service failed because the control process exited with error code.

See "systemctl status pve-container@101.service" and "journalctl -xe" for details.

TASK ERROR: command 'systemctl start pve-container@101' failed: exit code 1


At this point the Proxmox node and all other LXC containers still respond to ping, but the GUI shows the Proxmox node and the containers as grey.


Our current workaround:

1. From the Proxmox node console, run ps aux |grep xxx

2. Then locate the process ID (yyyy) of the lxc monitor process

3. Then run kill -9 yyyy

4. This makes the Proxmox node and the containers turn green in the GUI again, but the failed container stays grey and shows the same error when we start it.
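
Steps 1 to 3 could be scripted roughly as below. This is only a sketch under assumptions not stated in this post: it assumes the monitor shows up in the process list as "[lxc monitor] /var/lib/lxc <CTID>", it expects ps -eo pid,args output (so the PID is the first column, unlike ps aux), and find_lxc_monitor_pids is a made-up helper name.

```shell
# Print the PID(s) of the "[lxc monitor]" process for one container.
# Reads `ps -eo pid,args` style lines on stdin, so it also works on saved output.
# Assumption: the monitor's command line ends with the container ID,
# e.g. "[lxc monitor] /var/lib/lxc 101".
find_lxc_monitor_pids() {
    ctid="$1"
    awk -v id="$ctid" '
        index($0, "[lxc monitor]") && $NF == id { print $1 }
    '
}

# Usage on a node (101 is the container ID from the error above):
#   ps -eo pid,args | find_lxc_monitor_pids 101
#   kill -9 <PID>    # last resort, as in step 3
```

Checking the printed PID by eye before running kill -9 is safer than piping it straight into kill.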


If we restart the Proxmox node, all errors are cleared and we can start the failed container as well.


Please investigate and let us know a solution.


Thanks,


Vasu
 
You use the term 'cloud' a lot, but this term does not exist in Proxmox VE terminology. Please use container, node and VM to clarify what you mean.

As far as I can tell, the containers get stuck on reboot, is that right?
 
Yes, containers. LXC based.
Yes, after stopping the LXC container, I get this message when I start it again:

Job for pve-container@101.service failed because the control process exited with error code.
See "systemctl status pve-container@101.service" and "journalctl -xe" for details.
TASK ERROR: command 'systemctl start pve-container@101' failed: exit code 1

At this point the Proxmox IP and all other LXC containers still respond to ping, but the GUI shows grey for both the Proxmox node and the LXC containers.

I have 4 Proxmox clusters with 5 servers each, and all have the same issue.

Once I reboot the Proxmox server, everything is normal.
 
I expected to see a lot of other people running into this issue but there haven't been many "me too" posts. There could be many reasons:
* LXC container reboots may be infrequent in other environments.
* The issue may be specific to containers migrated from OpenVZ.
* Many people may be on earlier versions of Proxmox and/or may not have the latest patches.
* Other reasons I haven't thought of...

Proxmox is aware of the issue as they have responded to my original thread. I do expect that whatever resolved the issue in the 4.14.20+ kernels will eventually be backported into 4.13, but I can't wait as I'm responsible for ~1000 instances (LXC and KVM) and this was causing regular outages (i.e. having to reboot the hypervisors).

I will be checking new kernels as they're released but since we don't know the exact patch that resolved the issue, it's hard to say when pve-kernel will have the fix.
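
To see whether a node is still running a kernel older than the 4.14.20 mentioned above, a check along these lines might help. It's a rough sketch: kernel_older_than is a hypothetical helper name, it relies on GNU sort -V for version ordering, and the 4.14.20 threshold is only what was observed in this thread, not a confirmed fix version.

```shell
# Succeeds (exit 0) when kernel release $1 is older than version $2.
# Relies on GNU `sort -V` for version ordering.
kernel_older_than() {
    current="${1%%-*}"      # strip the suffix: "4.15.3-1-pve" -> "4.15.3"
    wanted="$2"
    [ "$current" != "$wanted" ] &&
        [ "$(printf '%s\n%s\n' "$current" "$wanted" | sort -V | head -n 1)" = "$current" ]
}

# Usage on a node:
#   if kernel_older_than "$(uname -r)" 4.14.20; then
#       echo "kernel predates 4.14.20"
#   fi
```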
 
The issue occurs on fresh containers as well. I took a fresh node, installed just 4 containers, and left it like that. When I tried to stop and start them, the issue happened.

I have 25 nodes, and the issue has happened on every node at least once in the last 5 days.
 
That's my bug report at LXC for this issue. It doesn't seem difficult for me to duplicate the issue either so it's strange that more people aren't complaining. Anyway, Proxmox is aware that the newer kernels resolve the issue so hopefully an updated pve-kernel in the future will take care of this.
 
We loaded the new 4.15 kernel.

We created 5 LXC guests and a cron job that stops and starts all 5 guests every 5 minutes.

It has now been running for 6 hours with no errors yet. We are still running the test.
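
A stop/start test like the one described above could look roughly like this. It is a sketch, not the script actually used here: the container IDs, the script path in the cron line, and the cycle_guests name are all hypothetical. It uses the pct stop and pct start commands, and PCT can be overridden (for example PCT=echo) for a dry run.

```shell
#!/bin/sh
# Stop and then start each container ID given on the command line, once.
# PCT is overridable so the loop can be dry-run with PCT=echo.
PCT="${PCT:-pct}"

cycle_guests() {
    for ctid in "$@"; do
        "$PCT" stop "$ctid" || echo "stop failed for CT $ctid" >&2
        "$PCT" start "$ctid" || echo "start failed for CT $ctid" >&2
    done
}

cycle_guests "$@"

# Hypothetical /etc/cron.d entry running it every 5 minutes:
#   */5 * * * * root /usr/local/bin/cycle-guests.sh 101 102 103 104 105
```

Failures are written to stderr, so cron's mail or a redirect to a log file shows the first time a start fails.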

CPU(s): 24 x Intel(R) Xeon(R) CPU L5639 @ 2.13GHz (2 Sockets)
Kernel Version: Linux 4.15.3-1-pve #1 SMP PVE 4.15.3-1 (Fri, 9 Mar 2018 14:45:34 +0100)
PVE Manager Version: pve-manager/5.1-46/ae8241d4

With the same setup on the 4.13 kernel, Proxmox produced the error within 30-40 minutes.
 
For what it's worth, yeah, me too.

Single node setup on pve 5.1 with the 4.13 kernel. All of my LXCs remain responsive and reachable, it's just the node itself for whatever reason.
 
