LXCs Won't Start At Boot - But Will Manually?

I'm on a fresh install of Proxmox, with all packages updated on the host:

Code:
proxmox-ve: 5.4-2 (running kernel: 4.15.18-18-pve)
pve-manager: 5.4-11 (running version: 5.4-11/6df3d8d0)
pve-kernel-4.15: 5.4-6

When my node boots up, all of my LXCs fail to start (the classic systemd 'exit-code 1'), but my one VM boots just fine. If I then start the containers manually through the GUI, they start right up.

Starting an LXC through the CLI in the foreground provides this output.

How can a container fail to start at boot, but then start immediately if I just click "Start"?
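
(For anyone following along: the failure can be inspected after boot with standard systemd tooling; CT 100 is just the example ID here.)

Code:
# status of the per-container service that PVE starts at boot
systemctl status pve-container@100

# everything journald logged for that unit during the current boot
journalctl -b -u pve-container@100.service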
 
Okay, the error starts at around line 1331 of that log: it does not find a file, but it does not state which file.

Could you please also post the output of df -h and pvesm status, in CODE tags?
 

Yeah, I saw that line too and was disappointed that it didn't mention which file, but I wasn't sure whether that would truly cause the start failure. Besides, what would be different about starting it from the GUI versus starting at boot, since starting it from the GUI works?

Output of df -h is below. For context, I have three physical drives:
  • 1TB SSD for pve-host (sda - ext4 - "local")
  • 400GB SSD for vm-storage where I store all of my VMs and LXCs, except for one (sdb - single disk zfs pool - "vm-storage")
  • 400GB SSD for Plex Media Center LXC, which is LXC100 from the line in syslog that you referenced (sdc - single disk zfs pool - "plex")
I also have a LAN CIFS share on a Windows machine where I store all my LXC backups (That's the "thunderdome" share).

Code:
root@pve:~# df -h
Filesystem                    Size  Used Avail Use% Mounted on
udev                           32G     0   32G   0% /dev
tmpfs                         6.3G  9.3M  6.3G   1% /run
/dev/mapper/pve-root          872G   16G  820G   2% /
tmpfs                          32G   40M   32G   1% /dev/shm
tmpfs                         5.0M     0  5.0M   0% /run/lock
tmpfs                          32G     0   32G   0% /sys/fs/cgroup
plex                          321G     0  321G   0% /plex
plex/subvol-100-disk-0         60G   40G   21G  66% /plex/subvol-100-disk-0
vm-storage                    307G  128K  307G   1% /vm-storage
vm-storage/subvol-101-disk-0  4.0G  809M  3.3G  20% /vm-storage/subvol-101-disk-0
vm-storage/subvol-102-disk-0  4.0G  480M  3.6G  12% /vm-storage/subvol-102-disk-0
vm-storage/subvol-104-disk-0  150G   28G  123G  19% /vm-storage/subvol-104-disk-0
vm-storage/subvol-105-disk-0   10G  6.6G  3.5G  66% /vm-storage/subvol-105-disk-0
vm-storage/subvol-106-disk-0  5.0G  968M  4.1G  19% /vm-storage/subvol-106-disk-0
vm-storage/subvol-108-disk-0  8.0G  1.6G  6.5G  20% /vm-storage/subvol-108-disk-0
vm-storage/subvol-110-disk-0  4.0G  2.2G  1.9G  55% /vm-storage/subvol-110-disk-0
local:remote                  1.1P  123T  820G 100% /mnt/unionfs
/dev/fuse                      30M   20K   30M   1% /etc/pve
google:                       1.0P  123T  1.0P  11% /mnt/remote
//192.168.1.38/Thunderdome     15T  965G   14T   7% /mnt/pve/thunderdome
tmpfs                         6.3G     0  6.3G   0% /run/user/0

Here is pvesm status:

Code:
root@pve:~# pvesm status
Name               Type     Status           Total            Used       Available        %
local               dir     active       913674576        20733324       854768008    2.27%
plex            zfspool     active       377880576        30139096       347741480    7.98%
thunderdome        cifs     active     15569124328      1010773684     14558350644    6.49%
vm-storage      zfspool     active       377880576        56805248       321075328   15.03%
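
For reference, those four storages map to roughly the following in /etc/pve/storage.cfg (a sketch from memory, not a verbatim copy; the path and content lines may not be exact):

Code:
dir: local
        path /var/lib/vz
        content iso,vztmpl,rootdir,images

zfspool: vm-storage
        pool vm-storage
        content rootdir,images

zfspool: plex
        pool plex
        content rootdir,images

cifs: thunderdome
        server 192.168.1.38
        share Thunderdome
        content backup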
 
Besides, what would be different about starting it from the GUI versus starting at boot, since starting it from the GUI works?

No, the same systemd service is started. The only explanation I can think of is a delay. If you have the time, could you try doubling the startup delay and see if it actually is a delay problem?

The other output looks fine. So, on the local boot SSD, there is no container? Could you create one there and see if you also have the problem there?
 
So, interesting development... originally I had all of my other containers set to wait for LXC100 (Plex) to start before starting themselves: LXC100 was priority 1 at boot, and all the other containers were priority 2. For troubleshooting purposes, I removed the priorities entirely and left all containers at "any" priority. That resulted in all the containers starting except two: LXC100 (Plex) and LXC102 (an Ubuntu 18 based LXC).

I then added a 30 second delay to LXC102 and a 65 second delay to LXC100, which yielded the same result and log output.
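
(For reference, I set those delays through the GUI's Start/Shutdown order dialog; as far as I understand it, that maps to the startup option, so the CLI equivalent would be roughly the following. Syntax per 'man pct'; I did not actually run these from the CLI myself.)

Code:
# delay only, boot order left at "any"
pct set 102 --startup up=30
pct set 100 --startup up=65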

And to confirm: that's correct, there are no LXCs installed on the local boot SSD. But LXC102 is on the same disk as the other LXCs that are working:
  • "local" ssd
    • pve host
  • "plex" ssd
    • LXC100 (fails)
  • "vm-storage" ssd
    • LXC101 (starts)
    • LXC102 (fails)
    • LXC104 (starts)
    • LXC105 (starts)
    • LXC106 (starts)
    • LXC108 (starts)
    • LXC110 (starts)
    • VM111 (starts)
I then added a test LXC on "local" to see if it would boot up, and it did just fine. Same old log output for LXC100 though:

Code:
Jul 16 00:00:30 pve systemd[1]: Starting PVE LXC Container: 100...
Jul 16 00:00:30 pve lxc-start[2713]: lxc-start: 100: lxccontainer.c: wait_on_daemonized_start: 856 No such file or directory - Failed to receive the container state
Jul 16 00:00:30 pve lxc-start[2713]: lxc-start: 100: tools/lxc_start.c: main: 330 The container failed to start
Jul 16 00:00:30 pve lxc-start[2713]: lxc-start: 100: tools/lxc_start.c: main: 333 To get more details, run the container in foreground mode
Jul 16 00:00:30 pve lxc-start[2713]: lxc-start: 100: tools/lxc_start.c: main: 336 Additional information can be obtained by setting the --logfile and --logpriority options
Jul 16 00:00:30 pve systemd[1]: pve-container@100.service: Control process exited, code=exited status=1
Jul 16 00:00:30 pve systemd[1]: Failed to start PVE LXC Container: 100.
Jul 16 00:00:30 pve systemd[1]: pve-container@100.service: Unit entered failed state.
Jul 16 00:00:30 pve systemd[1]: pve-container@100.service: Failed with result 'exit-code'.
Jul 16 00:00:30 pve pve-guests[2711]: command 'systemctl start pve-container@100' failed: exit code 1
Jul 16 00:00:31 pve pvesh[2568]: Starting CT 100 failed: command 'systemctl start pve-container@100' failed: exit code 1
 
That is really weird. It's really hard to debug if you cannot reproduce the error.

One trick you can try is to start the container manually in the foreground with debugging on boot and hope the error still persists. You could also try to create a systemd override that starts the container in debug mode, though I have not done either of these before and can only point you in that direction.
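
As a rough sketch of the override idea (untested on my side; the stock ExecStart line in pve-container@.service may differ, so check it first with systemctl cat pve-container@100):

Code:
# drop-in created with: systemctl edit pve-container@100
# -> /etc/systemd/system/pve-container@100.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/lxc-start -n %i -l DEBUG -o /var/log/lxc-%i-boot.log

After a systemctl daemon-reload and the next reboot, the LXC debug output for the failed start should then end up in /var/log/lxc-100-boot.log.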
 
Once Proxmox is booted and active, I can start the container manually in the GUI or in the foreground via the CLI. I'm not sure how to "start the container manually in the foreground with debugging on boot," but here's the full output file for LXC102 in the foreground when I run lxc-start -n 102 -F -l DEBUG -o /tmp/lxc-102.log. Here are some highlights I noticed in the CLI:

Code:
Set hostname to <ombi>.
Failed to attach 1 to compat systemd cgroup /init.scope: No such file or directory
Couldn't move remaining userspace processes, ignoring: Input/output error
[  OK  ] Listening on Journal Socket (/dev/log).
system.slice: Failed to reset devices.list: Operation not permitted
system-container\x2dgetty.slice: Failed to reset devices.list: Operation not permitted
[  OK  ] Created slice system-container\x2dgetty.slice.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Reached target Swap.
system-postfix.slice: Failed to reset devices.list: Operation not permitted
[  OK  ] Created slice system-postfix.slice.
[  OK  ] Listening on Syslog Socket.
[  OK  ] Listening on Journal Audit Socket.
[  OK  ] Reached target Remote File Systems.
[  OK  ] Listening on Journal Socket.
keyboard-setup.service: Failed to reset devices.list: Operation not permitted
         Starting Set the console keyboard layout...
systemd-sysctl.service: Failed to reset devices.list: Operation not permitted
         Starting Apply Kernel Variables...
ufw.service: Failed to reset devices.list: Operation not permitted
         Starting Uncomplicated firewall...
systemd-journald.service: Failed to reset devices.list: Operation not permitted
Failed to attach 47 to compat systemd cgroup /system.slice/systemd-journald.service: No such file or directory
         Starting Journal Service...
[  OK  ] Reached target User and Group Name Lookups.
[  OK  ] Started Forward Password Requests to Wall Directory Watch.
[  OK  ] Reached target Slices.
systemd-sysusers.service: Failed to reset devices.list: Operation not permitted
         Starting Create System Users...
dev-hugepages.mount: Failed to reset devices.list: Operation not permitted
Failed to attach 49 to compat systemd cgroup /system.slice/dev-hugepages.mount: No such file or directory
         Mounting Huge Pages File System...
Failed to attach 47 to compat systemd cgroup /system.slice/systemd-journald.service: No such file or directory
Failed to attach 49 to compat systemd cgroup /system.slice/dev-hugepages.mount: No such file or directory
[  OK  ] Started Uncomplicated firewall.
[  OK  ] Mounted Huge Pages File System.
[  OK  ] Started Create System Users.
[  OK  ] Started Apply Kernel Variables.
[  OK  ] Started Set the console keyboard layout.
[  OK  ] Started Dispatch Password Requests to Console Directory Watch.
[  OK  ] Reached target Local Encrypted Volumes.
[  OK  ] Reached target Paths.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Reached target Local File Systems.
plymouth-read-write.service: Failed to reset devices.list: Operation not permitted
         Starting Tell Plymouth To Write Out Runtime Data...
console-setup.service: Failed to reset devices.list: Operation not permitted
         Starting Set console font and keymap...
apparmor.service: Failed to reset devices.list: Operation not permitted
         Starting AppArmor initialization...
[  OK  ] Started Tell Plymouth To Write Out Runtime Data.
[  OK  ] Started Set console font and keymap.
[  OK  ] Started Journal Service.
         Starting Flush Journal to Persistent Storage...
[  OK  ] Started Flush Journal to Persistent Storage.
         Starting Create Volatile Files and Directories...
[  OK  ] Started Create Volatile Files and Directories.
[  OK  ] Reached target System Time Synchronized.
         Starting Network Service...
         Starting Update UTMP about System Boot/Shutdown...
[  OK  ] Started Update UTMP about System Boot/Shutdown.
[  OK  ] Started Network Service.
         Starting Network Name Resolution...
[  OK  ] Started Network Name Resolution.
[  OK  ] Reached target Host and Network Name Lookups.
[  OK  ] Reached target Network.
[  OK  ] Reached target Network is Online.
[FAILED] Failed to start AppArmor initialization.
See 'systemctl status apparmor.service' for details.
[  OK  ] Reached target System Initialization.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[  OK  ] Started Discard unused blocks once a week.
[  OK  ] Started Daily Cleanup of Temporary Directories.
[  OK  ] Started Daily apt download activities.
[  OK  ] Started Daily apt upgrade and clean activities.
[  OK  ] Listening on UUID daemon activation socket.
[  OK  ] Reached target Sockets.
[  OK  ] Reached target Basic System.
         Starting Postfix Mail Transport Agent (instance -)...
         Starting System Logging Service...
         Starting Dispatcher daemon for systemd-networkd...
         Starting OpenBSD Secure Shell server...
[  OK  ] Started D-Bus System Message Bus.
         Starting Accounts Service...
         Starting Login Service...
         Starting Permit User Sessions...
[  OK  ] Started Ombi - PMS Requests System.
[  OK  ] Started Message of the Day.
[  OK  ] Started Daily rotation of log files.
[  OK  ] Reached target Timers.
[  OK  ] Started Regular background program processing daemon.
[  OK  ] Started Permit User Sessions.
         Starting Hold until boot process finishes up...
         Starting Terminate Plymouth Boot Screen...
[  OK  ] Started System Logging Service.
[  OK  ] Started Hold until boot process finishes up.
[  OK  ] Started Container Getty on /dev/tty1.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Started Console Getty.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started Terminate Plymouth Boot Screen.
[  OK  ] Started Accounts Service.
[  OK  ] Started OpenBSD Secure Shell server.
[  OK  ] Started Dispatcher daemon for systemd-networkd.
[  OK  ] Started Login Service.
[  OK  ] Started Postfix Mail Transport Agent (instance -).
         Starting Postfix Mail Transport Agent...
[  OK  ] Started Postfix Mail Transport Agent.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Started Update UTMP about System Runlevel Changes.

I didn't notice anything that looked like a deal breaker, which makes sense, since I've always been able to start it manually after boot.



Is there a way to capture the same log when the container attempts to start at boot?
 