LXC containers randomly fail to start

raoulh

Dec 22, 2015
Swiss
Hi,

I have a few servers running Proxmox 4 (up to date) with a bunch of LXC containers on them. For about a week, some of the containers have no longer been starting correctly.

All containers are stopped at regular intervals for maintenance and then started again. This is done by a cron job that runs every night:

Code:
pct stop 150
# some work on rootfs
pct start 150

For some reason, since about a week ago, two or more containers randomly fail to restart, and the cron log shows:
Code:
command 'lxc-start -n 150' failed: exit code 1
<root@pam> end task UPID:svr-linux2:0000599F:0355671C:56E89B37:vzstart:150:root@pam: command 'lxc-start -n 150' failed: exit code 1

I then have to start the container manually, and after that it works correctly.
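
Since a later manual start succeeds, one possible stopgap for the nightly job is to retry the start a few times. This is only a sketch: `retry` is a hypothetical helper (not part of pct or LXC), and the attempt count and delay are assumed values, not something I have tuned.

```shell
#!/bin/sh
# retry MAX DELAY CMD...: run CMD up to MAX times, sleeping DELAY seconds
# after each failed attempt. Hypothetical helper, not part of pct or LXC.
retry() {
    max=$1; delay=$2; shift 2
    attempt=1
    while :; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        # Give up once the last allowed attempt has failed.
        [ "$attempt" -ge "$max" ] && return 1
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Usage in the nightly job (assumed values: 3 attempts, 10 s apart):
#   retry 3 10 pct start 150
```

This would only hide the symptom; it does not explain why the first start fails.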

After some digging, it seems to come from the veth network interface not being destroyed/created correctly:
Code:
─➤  tail /var/log/lxc/150.log
      lxc-start 1458083560.397 ERROR    lxc_conf - conf.c:instantiate_veth:2767 - failed to create veth pair (veth150i0 and veth1XGOX6): File exists
      lxc-start 1458083560.447 ERROR    lxc_conf - conf.c:lxc_create_network:3084 - failed to create netdev
      lxc-start 1458083560.447 ERROR    lxc_start - start.c:lxc_spawn:954 - failed to create the network
      lxc-start 1458083560.447 ERROR    lxc_start - start.c:__lxc_start:1211 - failed to spawn '150'
      lxc-start 1458083567.068 ERROR    lxc_start_ui - lxc_start.c:main:344 - The container failed to start.
      lxc-start 1458083567.068 ERROR    lxc_start_ui - lxc_start.c:main:346 - To get more details, run the container in foreground mode.
      lxc-start 1458083567.068 ERROR    lxc_start_ui - lxc_start.c:main:348 - Additional information can be obtained by setting the --logfile and --logpriority options.
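
The "File exists" error suggests the host-side interface from the previous run is still present when the container is started again. A minimal sketch of a check that could run between `pct stop` and `pct start` follows. The helper names `veth_name` and `wait_veth_gone` are hypothetical, and the `veth<CTID>i<N>` naming pattern is inferred only from the log above.

```shell
#!/bin/sh
# veth_name CTID INDEX: host-side interface name as seen in the log above
# (veth150i0 for container 150, interface 0). Pattern inferred from that log.
veth_name() {
    printf 'veth%si%s\n' "$1" "$2"
}

# wait_veth_gone IFACE TRIES: poll once per second until IFACE no longer exists.
# Returns 0 once it is gone, 1 if it is still present after TRIES checks.
wait_veth_gone() {
    iface=$1; tries=$2
    while [ "$tries" -gt 0 ]; do
        # "ip link show DEV" exits non-zero when the device does not exist.
        ip link show "$iface" >/dev/null 2>&1 || return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Usage between stop and start (30 checks is an assumed value):
#   wait_veth_gone "$(veth_name 150 0)" 30 || echo "veth150i0 still present" >&2
```

If the interface is still there after the wait, that would at least confirm the leftover veth as the cause before the start is attempted.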

This is the journald log for the relevant container:
Code:
Mar 16 00:11:39 svr-linux1 systemd-timesyncd[910]: interval/delta/delay/jitter/drift 2048s/-0.017s/0.052s/0.011s/-16ppm (ignored)
Mar 16 00:12:39 svr-linux1 pct[27324]: <root@pam> starting task UPID:svr-linux1:00006ABD:035363C8:56E896E7:vzstart:150:root@pam:
Mar 16 00:12:39 svr-linux1 pct[27325]: starting CT 150: UPID:svr-linux1:00006ABD:035363C8:56E896E7:vzstart:150:root@pam:
Mar 16 00:12:40 svr-linux1 kernel: EXT4-fs (loop4): mounted filesystem with ordered data mode. Opts: (null)
Mar 16 00:12:40 svr-linux1 kernel: vmbr0: port 6(veth150i0) entered disabled state
Mar 16 00:12:40 svr-linux1 kernel: device veth150i0 left promiscuous mode
Mar 16 00:12:40 svr-linux1 kernel: vmbr0: port 6(veth150i0) entered disabled state
Mar 16 00:12:47 svr-linux1 pct[27325]: command 'lxc-start -n 150' failed: exit code 1
Mar 16 00:12:47 svr-linux1 pct[27324]: <root@pam> end task UPID:svr-linux1:00006ABD:035363C8:56E896E7:vzstart:150:root@pam: command 'lxc-start -n 150' failed: exit code 1

My question is: what is preventing the containers from starting? Is it a bug somewhere? Where should I look?

Thanks,
Raoul