LXC container reboot fails - LXC becomes unusable

Overview
We're seeing a new issue since upgrading in place from Proxmox 4.x to Proxmox 5.x. For 2 different LXC containers, shutdown fails after 1-3 reboots. This occurs whether triggered inside the OS (shutdown -r now) or from the hypervisor using "pct shutdown XXX --forceStop 1". Once it occurs, pct and the underlying lxc-XXX commands all hang.

A kill -9 of the [lxc monitor] process for the CT allows pct and lxc-XXX commands to function again, but from that point forward it becomes impossible to start any LXC container without a hypervisor reboot.
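
For reference, a minimal sketch of that workaround, assuming CT 2022 (the pgrep pattern is an assumption based on the monitor's process title, not a command taken from the logs):
Code:
# Locate the stuck [lxc monitor] process for the CT, then kill it
pgrep -fa 'lxc monitor.*2022'
kill -9 "$(pgrep -f 'lxc monitor.*2022')"
# pct/lxc-* respond again afterwards, but no container will start until the host is rebooted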

Debug
pveversion
Code:
proxmox-ve: 5.1-38 (running kernel: 4.13.13-5-pve)
pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
pve-kernel-4.4.98-5-pve: 4.4.98-105
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.4.67-1-pve: 4.4.67-92
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-20
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.4-pve2~bpo9

An strace of a pct start just shows Proxmox waiting on the underlying thread to start, then failing on timeout.

"strace -f lxc-start -l trace -o /tmp/trace.log -n CTID", yields too much debug to post in it's entirety, but here is the end of the dump showing that it's hanging on a clone operation (12161 is the lxc-monitor for the container):
Code:
... snip ...
[pid 12161] write(2, "cgroups/cgfsng.c: cgfsng_create:"..., 38) = 38
[pid 12161] write(2, "Failed to create \"/sys/fs/cgroup"..., 51) = 51
[pid 12161] write(2, "\n", 1)           = 1
[pid 12161] stat("/sys/fs/cgroup/systemd//lxc/2022-1", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 12161] write(3, "      lxc-start 2022 20180207193"..., 171) = 171
[pid 12161] write(2, "lxc-start: 2022: ", 17) = 17
[pid 12161] write(2, "cgroups/cgfsng.c: create_path_fo"..., 50) = 50
[pid 12161] write(2, "Path \"/sys/fs/cgroup/systemd//lx"..., 58) = 58
[pid 12161] write(2, "\n", 1)           = 1
[pid 12161] write(3, "      lxc-start 2022 20180207193"..., 154) = 154
[pid 12161] write(2, "lxc-start: 2022: ", 17) = 17
[pid 12161] write(2, "cgroups/cgfsng.c: cgfsng_create:"..., 38) = 38
[pid 12161] write(2, "Failed to create \"/sys/fs/cgroup"..., 53) = 53
[pid 12161] write(2, "\n", 1)           = 1
[pid 12161] stat("/sys/fs/cgroup/systemd//lxc/2022-2", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 12161] write(3, "      lxc-start 2022 20180207193"..., 171) = 171
[pid 12161] write(2, "lxc-start: 2022: ", 17) = 17
[pid 12161] write(2, "cgroups/cgfsng.c: create_path_fo"..., 50) = 50
[pid 12161] write(2, "Path \"/sys/fs/cgroup/systemd//lx"..., 58) = 58
[pid 12161] write(2, "\n", 1)           = 1
[pid 12161] write(3, "      lxc-start 2022 20180207193"..., 154) = 154
[pid 12161] write(2, "lxc-start: 2022: ", 17) = 17
[pid 12161] write(2, "cgroups/cgfsng.c: cgfsng_create:"..., 38) = 38
[pid 12161] write(2, "Failed to create \"/sys/fs/cgroup"..., 53) = 53
[pid 12161] write(2, "\n", 1)           = 1
[pid 12161] stat("/sys/fs/cgroup/systemd//lxc/2022-3", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 12161] write(3, "      lxc-start 2022 20180207193"..., 171) = 171
[pid 12161] write(2, "lxc-start: 2022: ", 17) = 17
[pid 12161] write(2, "cgroups/cgfsng.c: create_path_fo"..., 50) = 50
[pid 12161] write(2, "Path \"/sys/fs/cgroup/systemd//lx"..., 58) = 58
[pid 12161] write(2, "\n", 1)           = 1
[pid 12161] write(3, "      lxc-start 2022 20180207193"..., 154) = 154
[pid 12161] write(2, "lxc-start: 2022: ", 17) = 17
[pid 12161] write(2, "cgroups/cgfsng.c: cgfsng_create:"..., 38) = 38
[pid 12161] write(2, "Failed to create \"/sys/fs/cgroup"..., 53) = 53
[pid 12161] write(2, "\n", 1)           = 1
[pid 12161] stat("/sys/fs/cgroup/systemd//lxc/2022-4", 0x7ffe5a3f1c50) = -1 ENOENT (No such file or directory)
[pid 12161] mkdir("/", 0755)            = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/", 0755)        = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/", 0755)     = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/systemd//", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/systemd//lxc/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/systemd//lxc/2022-4", 0755) = 0
[pid 12161] stat("/sys/fs/cgroup/memory//lxc/2022-4", 0x7ffe5a3f1c50) = -1 ENOENT (No such file or directory)
[pid 12161] mkdir("/", 0755)            = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/", 0755)        = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/", 0755)     = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/memory//", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/memory//lxc/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/memory//lxc/2022-4", 0755) = 0
[pid 12161] stat("/sys/fs/cgroup/freezer//lxc/2022-4", 0x7ffe5a3f1c50) = -1 ENOENT (No such file or directory)
[pid 12161] mkdir("/", 0755)            = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/", 0755)        = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/", 0755)     = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/freezer//", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/freezer//lxc/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/freezer//lxc/2022-4", 0755) = 0
[pid 12161] stat("/sys/fs/cgroup/cpu//lxc/2022-4", 0x7ffe5a3f1c50) = -1 ENOENT (No such file or directory)
[pid 12161] mkdir("/", 0755)            = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/", 0755)        = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/", 0755)     = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/cpu//", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/cpu//lxc/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/cpu//lxc/2022-4", 0755) = 0
[pid 12161] stat("/sys/fs/cgroup/pids//lxc/2022-4", 0x7ffe5a3f1c50) = -1 ENOENT (No such file or directory)
[pid 12161] mkdir("/", 0755)            = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/", 0755)        = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/", 0755)     = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/pids//", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/pids//lxc/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/pids//lxc/2022-4", 0755) = 0
[pid 12161] stat("/sys/fs/cgroup/hugetlb//lxc/2022-4", 0x7ffe5a3f1c50) = -1 ENOENT (No such file or directory)
[pid 12161] mkdir("/", 0755)            = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/", 0755)        = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/", 0755)     = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/hugetlb//", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/hugetlb//lxc/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/hugetlb//lxc/2022-4", 0755) = 0
[pid 12161] stat("/sys/fs/cgroup/net_cls//lxc/2022-4", 0x7ffe5a3f1c50) = -1 ENOENT (No such file or directory)
[pid 12161] mkdir("/", 0755)            = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/", 0755)        = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/", 0755)     = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/net_cls//", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/net_cls//lxc/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/net_cls//lxc/2022-4", 0755) = 0
[pid 12161] stat("/sys/fs/cgroup/blkio//lxc/2022-4", 0x7ffe5a3f1c50) = -1 ENOENT (No such file or directory)
[pid 12161] mkdir("/", 0755)            = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/", 0755)        = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/", 0755)     = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/blkio//", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/blkio//lxc/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/blkio//lxc/2022-4", 0755) = 0
[pid 12161] stat("/sys/fs/cgroup/cpuset//lxc/2022-4", 0x7ffe5a3f1c50) = -1 ENOENT (No such file or directory)
[pid 12161] mkdir("/sys/fs/cgroup/cpuset//lxc", 0755) = -1 EEXIST (File exists)
[pid 12161] stat("/sys/fs/cgroup/cpuset//lxc/cgroup.clone_children", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid 12161] open("/sys/fs/cgroup/cpuset//lxc/cgroup.clone_children", O_RDONLY|O_CLOEXEC) = 14
[pid 12161] read(14, "1", 1)            = 1
[pid 12161] close(14)                   = 0
[pid 12161] open("/sys/fs/cgroup/cpuset//cpuset.cpus", O_RDONLY) = 14
[pid 12161] fstat(14, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid 12161] read(14, "0-31\n", 4096)    = 5
[pid 12161] read(14, "", 4096)          = 0
[pid 12161] close(14)                   = 0
[pid 12161] stat("/sys/devices/system/cpu/isolated", {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
[pid 12161] open("/sys/devices/system/cpu/isolated", O_RDONLY) = 14
[pid 12161] fstat(14, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
[pid 12161] read(14, "\n", 4096)        = 1
[pid 12161] read(14, "", 4096)          = 0
[pid 12161] close(14)                   = 0
[pid 12161] write(3, "      lxc-start 2022 20180207193"..., 132) = 132
[pid 12161] write(3, "      lxc-start 2022 20180207193"..., 157) = 157
[pid 12161] mkdir("/", 0755)            = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/", 0755)        = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/", 0755)     = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/cpuset//", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/cpuset//lxc/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/cpuset//lxc/2022-4", 0755) = 0
[pid 12161] stat("/sys/fs/cgroup/devices//lxc/2022-4", 0x7ffe5a3f1c50) = -1 ENOENT (No such file or directory)
[pid 12161] mkdir("/", 0755)            = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/", 0755)        = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/", 0755)     = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/devices//", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/devices//lxc/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/devices//lxc/2022-4", 0755) = 0
[pid 12161] stat("/sys/fs/cgroup/rdma//lxc/2022-4", 0x7ffe5a3f1c50) = -1 ENOENT (No such file or directory)
[pid 12161] mkdir("/", 0755)            = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/", 0755)        = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/", 0755)     = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/rdma//", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/rdma//lxc/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/rdma//lxc/2022-4", 0755) = 0
[pid 12161] stat("/sys/fs/cgroup/perf_event//lxc/2022-4", 0x7ffe5a3f1c50) = -1 ENOENT (No such file or directory)
[pid 12161] mkdir("/", 0755)            = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/", 0755)        = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/", 0755)     = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/perf_event//", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/perf_event//lxc/", 0755) = -1 EEXIST (File exists)
[pid 12161] mkdir("/sys/fs/cgroup/perf_event//lxc/2022-4", 0755) = 0
[pid 12161] lstat("/var", {st_mode=S_IFDIR|0755, st_size=13, ...}) = 0
[pid 12161] lstat("/var/lib", {st_mode=S_IFDIR|0755, st_size=49, ...}) = 0
[pid 12161] lstat("/var/lib/lxc", {st_mode=S_IFDIR|0755, st_size=15, ...}) = 0
[pid 12161] lstat("/var/lib/lxc/2022", {st_mode=S_IFDIR|0755, st_size=4, ...}) = 0
[pid 12161] lstat("/var/lib/lxc/2022/rootfs", {st_mode=S_IFDIR|0755, st_size=36, ...}) = 0
[pid 12161] access("/var/lib/lxc/2022/rootfs", F_OK) = 0
[pid 12161] stat("/var/lib/lxc/2022/rootfs", {st_mode=S_IFDIR|0755, st_size=36, ...}) = 0
[pid 12161] open("/var/lib/lxc/2022/rootfs/lxc.hold", O_RDWR|O_CREAT, 0600) = 14
[pid 12161] unlink("/var/lib/lxc/2022/rootfs/lxc.hold") = 0
[pid 12161] getpid()                    = 12161
[pid 12161] open("/proc/12161/ns", O_RDONLY|O_CLOEXEC) = 15
[pid 12161] close(15)                   = 0
[pid 12161] clone(

netstat -xp shows a failure to connect to the command socket for the container:
Code:
root@dvm2:/proc/7600# netstat -xp | grep lxc | grep 2022
unix  3      [ ]         STREAM     CONNECTING    0        -                    @/var/lib/lxc/2022/command
unix  3      [ ]         STREAM     CONNECTING    0        -                    @/var/lib/lxc/2022/command
unix  3      [ ]         STREAM     CONNECTING    0        -                    @/var/lib/lxc/2022/command
unix  2      [ ]         STREAM     CONNECTING    0        -                    @/var/lib/lxc/2022/command
unix  3      [ ]         STREAM     CONNECTING    0        -                    @/var/lib/lxc/2022/command
unix  3      [ ]         STREAM     CONNECTING    0        -                    @/var/lib/lxc/2022/command

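For reference, the equivalent check with ss (a sketch; same abstract socket path as above):
Code:
# List unix sockets stuck CONNECTING to the CT's abstract command socket
ss -xp | grep '/var/lib/lxc/2022/command'
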
The trace-level lxc log from the lxc-start command that was strace'd:
Code:
      lxc-start 2022 20180207175421.764 INFO     lxc_start_ui - tools/lxc_start.c:main:280 - using rcfile /var/lib/lxc/2022/config
      lxc-start 2022 20180207175421.765 TRACE    lxc_commands - commands.c:lxc_cmd:290 - command get_init_pid tries to connect command socket
      lxc-start 2022 20180207175421.765 TRACE    lxc_commands - commands.c:lxc_cmd:295 - command get_init_pid failed to connect command socket: Connection refused
      lxc-start 2022 20180207175421.765 TRACE    lxc_commands - commands.c:lxc_cmd:290 - command get_cgroup tries to connect command socket
      lxc-start 2022 20180207175421.765 TRACE    lxc_commands - commands.c:lxc_cmd:295 - command get_cgroup failed to connect command socket: Connection refused
      lxc-start 2022 20180207175421.765 TRACE    lxc_commands - commands.c:do_lxc_cmd_get_cgroup_path:439 - command get_cgroup failed for container "2022": Connection refused.
      lxc-start 2022 20180207175421.765 TRACE    lxc_commands - commands.c:lxc_cmd:290 - command get_state tries to connect command socket
      lxc-start 2022 20180207175421.765 TRACE    lxc_commands - commands.c:lxc_cmd:295 - command get_state failed to connect command socket: Connection refused
      lxc-start 2022 20180207175421.765 TRACE    lxc_start - start.c:lxc_init_handler:583 - Created anonymous pair {4,5} of unix sockets
      lxc-start 2022 20180207175421.765 TRACE    lxc_start - start.c:lxc_init_handler:592 - unix domain socket 6 for command server is ready
      lxc-start 2022 20180207175421.765 INFO     lxc_container - lxccontainer.c:do_lxcapi_start:883 - Attempting to set proc title to [lxc monitor] /var/lib/lxc 2022
      lxc-start 2022 20180207175421.767 INFO     lxc_lsm - lsm/lsm.c:lsm_init:48 - LSM security driver AppArmor
      lxc-start 2022 20180207175421.767 TRACE    lxc_start - start.c:lxc_init:607 - initialized LSM
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .reject_force_umount  # comment this to allow umount -f;  not recommended.
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:610 - Adding native rule for reject_force_umount action 0(kill).
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:do_resolve_add_rule:276 - Setting Seccomp rule to reject force umounts.
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:614 - Adding compat rule for reject_force_umount action 0(kill).
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:do_resolve_add_rule:276 - Setting Seccomp rule to reject force umounts.
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:do_resolve_add_rule:276 - Setting Seccomp rule to reject force umounts.
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .[all].
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .kexec_load errno 1.
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:610 - Adding native rule for kexec_load action 327681(errno).
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:614 - Adding compat rule for kexec_load action 327681(errno).
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .open_by_handle_at errno 1.
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:610 - Adding native rule for open_by_handle_at action 327681(errno).
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:614 - Adding compat rule for open_by_handle_at action 327681(errno).
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .init_module errno 1.
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:610 - Adding native rule for init_module action 327681(errno).
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:614 - Adding compat rule for init_module action 327681(errno).
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .finit_module errno 1.
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:610 - Adding native rule for finit_module action 327681(errno).
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:614 - Adding compat rule for finit_module action 327681(errno).
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .delete_module errno 1.
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:610 - Adding native rule for delete_module action 327681(errno).
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:614 - Adding compat rule for delete_module action 327681(errno).
      lxc-start 2022 20180207175421.767 INFO     lxc_seccomp - seccomp.c:parse_config_v2:624 - Merging in the compat Seccomp ctx into the main one.
      lxc-start 2022 20180207175421.767 TRACE    lxc_start - start.c:lxc_init:613 - read seccomp policy
      lxc-start 2022 20180207175421.767 TRACE    lxc_start - start.c:lxc_serve_state_clients:373 - set container state to STARTING
      lxc-start 2022 20180207175421.767 TRACE    lxc_start - start.c:lxc_serve_state_clients:376 - no state clients registered
      lxc-start 2022 20180207175421.767 TRACE    lxc_start - start.c:lxc_init:620 - set container state to "STARTING"
      lxc-start 2022 20180207175421.767 TRACE    lxc_start - start.c:lxc_init:648 - set environment variables
      lxc-start 2022 20180207175421.767 INFO     lxc_conf - conf.c:run_script_argv:457 - Executing script "/usr/share/lxc/hooks/lxc-pve-prestart-hook" for container "2022", config section "lxc".
      lxc-start 2022 20180207175422.297 TRACE    lxc_start - start.c:lxc_init:654 - ran pre-start hooks
      lxc-start 2022 20180207175422.297 DEBUG    lxc_start - start.c:setup_signal_fd:301 - Set SIGCHLD handler with file descriptor: 7.
      lxc-start 2022 20180207175422.297 TRACE    lxc_start - start.c:lxc_init:665 - set up signal fd
      lxc-start 2022 20180207175422.297 DEBUG    console - console.c:lxc_console_peer_default:450 - process does not have a controlling terminal
      lxc-start 2022 20180207175422.297 TRACE    lxc_start - start.c:lxc_init:672 - created console
      lxc-start 2022 20180207175422.297 TRACE    lxc_start - start.c:lxc_init:678 - shifted tty ids
      lxc-start 2022 20180207175422.297 INFO     lxc_start - start.c:lxc_init:680 - container "2022" is initialized
      lxc-start 2022 20180207175422.299 ERROR    lxc_network - network.c:instantiate_veth:130 - Failed to create veth pair "veth2022i0" and "vethC5KAG8": File exists
      lxc-start 2022 20180207175422.299 ERROR    lxc_network - network.c:lxc_create_network_priv:2407 - Failed to create network device
      lxc-start 2022 20180207175422.299 ERROR    lxc_start - start.c:lxc_spawn:1206 - Failed to create the network.
      lxc-start 2022 20180207175422.299 ERROR    lxc_start - start.c:__lxc_start:1477 - Failed to spawn container "2022".
      lxc-start 2022 20180207175422.299 TRACE    lxc_start - start.c:lxc_serve_state_socket_pair:446 - Sent container state "STOPPING" to 5
      lxc-start 2022 20180207175422.299 TRACE    lxc_start - start.c:lxc_serve_state_clients:373 - set container state to STOPPING
      lxc-start 2022 20180207175422.299 TRACE    lxc_start - start.c:lxc_serve_state_clients:376 - no state clients registered
      lxc-start 2022 20180207175422.299 ERROR    lxc_container - lxccontainer.c:wait_on_daemonized_start:760 - Received container state "STOPPING" instead of "RUNNING"
      lxc-start 2022 20180207175422.299 TRACE    lxc_start - start.c:lxc_serve_state_clients:373 - set container state to STOPPED
      lxc-start 2022 20180207175422.299 TRACE    lxc_start - start.c:lxc_serve_state_clients:376 - no state clients registered
      lxc-start 2022 20180207175422.299 ERROR    lxc_start_ui - tools/lxc_start.c:main:371 - The container failed to start.
      lxc-start 2022 20180207175422.299 INFO     lxc_conf - conf.c:run_script_argv:457 - Executing script "/usr/share/lxc/hooks/lxc-pve-poststop-hook" for container "2022", config section "lxc".
      lxc-start 2022 20180207175422.299 ERROR    lxc_start_ui - tools/lxc_start.c:main:373 - To get more details, run the container in foreground mode.
      lxc-start 2022 20180207175422.299 ERROR    lxc_start_ui - tools/lxc_start.c:main:375 - Additional information can be obtained by setting the --logfile and --logpriority options.
      lxc-start 2022 20180207175437.694 INFO     lxc_conf - conf.c:run_script_argv:457 - Executing script "/usr/share/lxcfs/lxc.reboot.hook" for container "2022", config section "lxc".
      lxc-start 2022 20180207175505.412 INFO     lxc_start_ui - tools/lxc_start.c:main:280 - using rcfile /var/lib/lxc/2022/config
      lxc-start 2022 20180207175505.412 TRACE    lxc_commands - commands.c:lxc_cmd:290 - command get_init_pid tries to connect command socket
      lxc-start 2022 20180207175505.413 TRACE    lxc_commands - commands.c:lxc_cmd:295 - command get_init_pid failed to connect command socket: Connection refused
      lxc-start 2022 20180207175505.413 TRACE    lxc_commands - commands.c:lxc_cmd:290 - command get_cgroup tries to connect command socket
      lxc-start 2022 20180207175505.413 TRACE    lxc_commands - commands.c:lxc_cmd:295 - command get_cgroup failed to connect command socket: Connection refused
      lxc-start 2022 20180207175505.413 TRACE    lxc_commands - commands.c:do_lxc_cmd_get_cgroup_path:439 - command get_cgroup failed for container "2022": Connection refused.
      lxc-start 2022 20180207175505.413 TRACE    lxc_commands - commands.c:lxc_cmd:290 - command get_state tries to connect command socket
      lxc-start 2022 20180207175505.413 TRACE    lxc_commands - commands.c:lxc_cmd:295 - command get_state failed to connect command socket: Connection refused
      lxc-start 2022 20180207175505.413 TRACE    lxc_start - start.c:lxc_init_handler:592 - unix domain socket 4 for command server is ready
      lxc-start 2022 20180207175505.413 INFO     lxc_lsm - lsm/lsm.c:lsm_init:48 - LSM security driver AppArmor
      lxc-start 2022 20180207175505.413 TRACE    lxc_start - start.c:lxc_init:607 - initialized LSM
      lxc-start 2022 20180207175505.413 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .reject_force_umount  # comment this to allow umount -f;  not recommended.
      lxc-start 2022 20180207175505.413 INFO     lxc_seccomp - seccomp.c:parse_config_v2:610 - Adding native rule for reject_force_umount action 0(kill).
      lxc-start 2022 20180207175505.413 INFO     lxc_seccomp - seccomp.c:do_resolve_add_rule:276 - Setting Seccomp rule to reject force umounts.
      lxc-start 2022 20180207175505.413 INFO     lxc_seccomp - seccomp.c:parse_config_v2:614 - Adding compat rule for reject_force_umount action 0(kill).
      lxc-start 2022 20180207175505.413 INFO     lxc_seccomp - seccomp.c:do_resolve_add_rule:276 - Setting Seccomp rule to reject force umounts.
      lxc-start 2022 20180207175505.413 INFO     lxc_seccomp - seccomp.c:do_resolve_add_rule:276 - Setting Seccomp rule to reject force umounts.
      lxc-start 2022 20180207175505.413 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .[all].
      lxc-start 2022 20180207175505.413 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .kexec_load errno 1.
      lxc-start 2022 20180207175505.413 INFO     lxc_seccomp - seccomp.c:parse_config_v2:610 - Adding native rule for kexec_load action 327681(errno).
      lxc-start 2022 20180207175505.413 INFO     lxc_seccomp - seccomp.c:parse_config_v2:614 - Adding compat rule for kexec_load action 327681(errno).
      lxc-start 2022 20180207175505.413 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .open_by_handle_at errno 1.
      lxc-start 2022 20180207175505.413 INFO     lxc_seccomp - seccomp.c:parse_config_v2:610 - Adding native rule for open_by_handle_at action 327681(errno).
      lxc-start 2022 20180207175505.413 INFO     lxc_seccomp - seccomp.c:parse_config_v2:614 - Adding compat rule for open_by_handle_at action 327681(errno).
      lxc-start 2022 20180207175505.414 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .init_module errno 1.
      lxc-start 2022 20180207175505.414 INFO     lxc_seccomp - seccomp.c:parse_config_v2:610 - Adding native rule for init_module action 327681(errno).
      lxc-start 2022 20180207175505.414 INFO     lxc_seccomp - seccomp.c:parse_config_v2:614 - Adding compat rule for init_module action 327681(errno).
      lxc-start 2022 20180207175505.414 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .finit_module errno 1.
      lxc-start 2022 20180207175505.414 INFO     lxc_seccomp - seccomp.c:parse_config_v2:610 - Adding native rule for finit_module action 327681(errno).
      lxc-start 2022 20180207175505.414 INFO     lxc_seccomp - seccomp.c:parse_config_v2:614 - Adding compat rule for finit_module action 327681(errno).
      lxc-start 2022 20180207175505.414 INFO     lxc_seccomp - seccomp.c:parse_config_v2:435 - processing: .delete_module errno 1.
      lxc-start 2022 20180207175505.414 INFO     lxc_seccomp - seccomp.c:parse_config_v2:610 - Adding native rule for delete_module action 327681(errno).
      lxc-start 2022 20180207175505.414 INFO     lxc_seccomp - seccomp.c:parse_config_v2:614 - Adding compat rule for delete_module action 327681(errno).
      lxc-start 2022 20180207175505.414 INFO     lxc_seccomp - seccomp.c:parse_config_v2:624 - Merging in the compat Seccomp ctx into the main one.
      lxc-start 2022 20180207175505.414 TRACE    lxc_start - start.c:lxc_init:613 - read seccomp policy
      lxc-start 2022 20180207175505.414 TRACE    lxc_start - start.c:lxc_serve_state_clients:373 - set container state to STARTING
      lxc-start 2022 20180207175505.414 TRACE    lxc_start - start.c:lxc_serve_state_clients:376 - no state clients registered
      lxc-start 2022 20180207175505.414 TRACE    lxc_start - start.c:lxc_init:620 - set container state to "STARTING"
      lxc-start 2022 20180207175505.414 TRACE    lxc_start - start.c:lxc_init:648 - set environment variables
      lxc-start 2022 20180207175505.414 INFO     lxc_conf - conf.c:run_script_argv:457 - Executing script "/usr/share/lxc/hooks/lxc-pve-prestart-hook" for container "2022", config section "lxc".
      lxc-start 2022 20180207175505.945 TRACE    lxc_start - start.c:lxc_init:654 - ran pre-start hooks
      lxc-start 2022 20180207175505.945 DEBUG    lxc_start - start.c:setup_signal_fd:301 - Set SIGCHLD handler with file descriptor: 5.
      lxc-start 2022 20180207175505.945 TRACE    lxc_start - start.c:lxc_init:665 - set up signal fd
      lxc-start 2022 20180207175505.946 DEBUG    console - console.c:lxc_console_peer_default:459 - using "/dev/tty" as peer tty device
      lxc-start 2022 20180207175505.946 DEBUG    console - console.c:lxc_console_sigwinch_init:151 - process 42013 created signal fd 9 to handle SIGWINCH events
      lxc-start 2022 20180207175505.946 DEBUG    console - console.c:lxc_console_winsz:71 - set winsz dstfd:6 cols:89 rows:26
      lxc-start 2022 20180207175505.946 TRACE    lxc_start - start.c:lxc_init:672 - created console
      lxc-start 2022 20180207175505.946 TRACE    lxc_start - start.c:lxc_init:678 - shifted tty ids
      lxc-start 2022 20180207175505.946 INFO     lxc_start - start.c:lxc_init:680 - container "2022" is initialized
      lxc-start 2022 20180207175505.948 INFO     lxc_conf - conf.c:run_script:507 - Executing script "/usr/share/lxc/lxcnetaddbr" for container "2022", config section "net".
      lxc-start 2022 20180207175506.465 DEBUG    lxc_network - network.c:instantiate_veth:219 - Instantiated veth "veth2022i0/vethSUM8EJ", index is "50"
      lxc-start 2022 20180207175506.465 INFO     lxc_cgroup - cgroups/cgroup.c:cgroup_init:67 - cgroup driver cgroupfs-ng initing for 2022
      lxc-start 2022 20180207175506.465 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1337 - Path "/sys/fs/cgroup/systemd//lxc/2022" already existed.
      lxc-start 2022 20180207175506.465 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1433 - Failed to create "/sys/fs/cgroup/systemd//lxc/2022"
      lxc-start 2022 20180207175506.466 DEBUG    lxc_cgfsng - cgroups/cgfsng.c:filter_and_set_cpus:469 - No isolated cpus detected.
      lxc-start 2022 20180207175506.466 DEBUG    lxc_cgfsng - cgroups/cgfsng.c:handle_cpuset_hierarchy:640 - "cgroup.clone_children" was already set to "1".

Any attempt to start any container now hangs, even after a kill -9 of the monitor process (which only restores informational pct and lxc commands). The containers are CentOS 6 based images that were migrated from OpenVZ; they restarted without issue on Proxmox 4.x.

Any help would be greatly appreciated.
 
Could you please try a kernel 4.10 from our repository? AFAIK, to get container start working again, only a reboot of the host will work.
 
I have been trying different kernels, network changes, LXC container image changes, etc., without solving this problem. However, I regularly notice this message on the console:
Code:
kernel:[ 1987.452238] unregister_netdevice: waiting for lo to become free. Usage count = 1

I've seen that before without it causing problems, but some research shows that it has caused issues similar to mine for Docker:
https://github.com/coreos/bugs/issues/254
https://github.com/moby/moby/issues/5618

From the comments, it looks like there is a Jan/2018 patch for this issue in 4.4.114 (so it's not likely in 4.13.13):
https://github.com/moby/moby/issues/5618#issuecomment-365484617
Second commit in changelog:
https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.4.114

Also in 4.14.16
https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.14.16

Patch Details:
https://patchwork.kernel.org/patch/10174111/

Any chance this could be ported into 4.13.13 testing?
 
While reading kernel commits yesterday, I noticed multiple namespace fixes in 4.14.16 - 4.14.20, so I built a vanilla 4.14.20 kernel (using the config from 4.13.13-6 but none of its patchsets) and there have been no further namespace cloning issues after almost 1000 reboots.

I can confirm the issue is still present in 4.13.13-6.
 
Our newest 4.13.13-6 kernel has the patches included; please try the kernel from our pvetest repository.
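
For anyone following along, a sketch of pulling that kernel from pvetest on PVE 5.x (the repository line is the standard one for Debian stretch; the exact package name is an assumption following the pve-kernel naming above):
Code:
# Enable the pvetest repository and install the updated kernel
echo "deb http://download.proxmox.com/debian/pve stretch pvetest" > /etc/apt/sources.list.d/pvetest.list
apt update
apt install pve-kernel-4.13.13-6-pve   # verify the package version is 4.13.13-41
reboot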
 
The issue is still present in 4.13.13-6. I tried with kpti on and off -- both broken. So it must be another change that resolved it. 4.14.20 works.
 
The issue is still present in 4.13.13-6. I tried with kpti on and off -- both broken. So it must be another change that resolved it. 4.14.20 works.

Which exact kernel did you test? There are three kernel versions with the ABI 4.13.13-6-pve; only the latest (package version 4.13.13-41) includes the patches.

uname -a should output:
Code:
Linux host 4.13.13-6-pve #1 SMP PVE 4.13.13-41 (Wed, 21 Feb 2018 10:07:54 +0100) x86_64 GNU/Linux
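
To double-check which package build is actually installed (a sketch; the package name follows the naming shown in the pveversion output above):
Code:
dpkg -l 'pve-kernel-4.13.13-6-pve'   # the version column should read 4.13.13-41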
 
I didn't realize the pve-kernel ABI versions aren't unique. I evaluated pve-kernel-4.13.13-6 backed by kernel 4.13.13-40, which still had the issue. I see the update you referenced to pve-kernel-4.13.13-6 (4.13.13-41) and will test that shortly. Thanks for clarifying.
 
4.13.13-6 (4.13.13-41) still has the issue. I test for the issue by triggering a reboot every minute via cron in a CentOS 6 or Ubuntu 14.04 container. It died within a few minutes with the same symptoms (stuck cloning the network namespace).
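
For reference, the reboot trigger is just a root cron entry inside the test container; a minimal sketch (the exact schedule line is my wording, not copied from the container):
Code:
# /etc/crontab inside the CentOS 6 / Ubuntu 14.04 test container -- reboot once per minute
* * * * * root /sbin/shutdown -r now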

Here's the stack for the monitor process:

Code:
# cat /proc/23276/stack
[<ffffffff8e1dfd1b>] copy_net_ns+0xab/0x220
[<ffffffff8daa9beb>] create_new_namespaces+0x11b/0x1c0
[<ffffffff8daa9cfd>] copy_namespaces+0x6d/0xa0
[<ffffffff8da84cbc>] copy_process.part.32+0x91c/0x1a40
[<ffffffff8da85fbf>] _do_fork+0xdf/0x3f0
[<ffffffff8da86379>] SyS_clone+0x19/0x20
[<ffffffff8da03a91>] do_syscall_64+0x61/0xd0
[<ffffffff8e40024d>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
 
I also have exactly the same problem.
I have 5 clusters with 5 nodes each.
All clusters have the same issue.
I didn't know LXC had so many issues; I should have stayed with OpenVZ.
KVM has no issues in Proxmox 5.1, though.
 
If you're desperate for an immediate resolution, I have Debian packages built using GCC 7.3 (full generic retpoline) for 4.14.23, along with ZFS 0.7.6 DKMS packages. The issue is definitely resolved in any kernel 4.14.20 or later (possibly earlier, but that's where I started).

Things to consider:
* I'm a stranger.
* This is a different kernel branch (4.14) than what Proxmox is tracking (4.13) for the pve-kernel (i.e. untested).
* Although I've structured the changes so that new pve-kernels will still work, some future cleanup should be done once pve-kernel has this issue resolved.

If you still want to go ahead, send me a private message and I'll get you the files with some instructions.
 
If you're desperate for an immediate resolution, I have Debian packages built using GCC 7.3 (full generic retpoline) for 4.14.23, along with ZFS 0.7.6 DKMS packages. The issue is definitely resolved in any kernel 4.14.20 or later (possibly earlier, but that's where I started).

Things to consider:
* I'm a stranger.
* This is a different kernel branch (4.14) than what Proxmox is tracking (4.13) for the pve-kernel (i.e. untested).
* Although I've structured the changes so that new pve-kernels will still work, some future cleanup should be done once pve-kernel has this issue resolved.

If you still want to go ahead, send me a private message and I'll get you the files with some instructions.

If you can reliably reproduce the issue, it would be great to verify whether mainline 4.14 itself is affected or not. So far, we cannot reproduce the issue in our test lab, so we cannot bisect it either (we did cherry-pick some likely fixes in related areas to our 4.13 kernel, but it seems those did not fix the issue at hand).
 
If you're desperate for an immediate resolution, I have Debian packages built using GCC 7.3 (full generic retpoline) for 4.14.23, along with ZFS 0.7.6 DKMS packages. The issue is definitely resolved in any kernel 4.14.20 or later (possibly earlier, but that's where I started).

Things to consider:
* I'm a stranger.
* This is a different kernel branch (4.14) than what Proxmox is tracking (4.13) for the pve-kernel (i.e. untested).
* Although I've structured the changes so that new pve-kernels will still work, some future cleanup should be done once pve-kernel has this issue resolved.

If you still want to go ahead, send me a private message and I'll get you the files with some instructions.

Can you provide a test container (backup) where we can see this behavior? So far we cannot see this with our containers, so there must be something different in your containers.
(send a download link to office(at)proxmox.com, with no sensitive data inside)
 
If you can reliably reproduce the issue, it would be great to verify whether mainline 4.14 itself is affected or not. So far, we cannot reproduce the issue in our test lab, so we cannot bisect it either (we did cherry-pick some likely fixes in related areas to our 4.13 kernel, but it seems those did not fix the issue at hand).

The kernels (4.14.20 and 4.14.23 -- both resolve the issue) are from the source tars on kernel.org with no patches applied. I just copied over the Proxmox config from the 4.13.13 kernel (make oldconfig), then used the make target "deb-pkg" to get the kernel and headers packaged so I could distribute them to other servers.

By mainline, did you mean from source (tested - works) or using the Debian 4.14 backport (I haven't tested this)?
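
For reference, a sketch of that build (the config file path is an assumption; the oldconfig and deb-pkg steps are as described above):
Code:
# Build a vanilla 4.14.20 kernel reusing the Proxmox 4.13.13 config
wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.14.20.tar.xz
tar xf linux-4.14.20.tar.xz && cd linux-4.14.20
cp /boot/config-4.13.13-6-pve .config   # copy over the running Proxmox config
make oldconfig                          # accept defaults for new 4.14 options
make -j"$(nproc)" deb-pkg               # packages the kernel and headers as .debs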
 
Can you provide a test container (backup) where we can see this behavior? So far we cannot see this with our containers, so there must be something different in your containers.
(send a download link to office(at)proxmox.com, with no sensitive data inside)

There are 5.1 hypervisors (running pve-kernel) where I don't see the issue either, and I'm at a loss as to why some hypervisors reproduce this easily while others appear unaffected (at exactly the same patch level). My best guess is that it's an interaction or timing problem caused by a specific mix of containers.

Unfortunately, the container I test with is one of 17 on my home server and is currently very large (database). I'll investigate whether the issue is repeatable on the container without the database as that would be a size that could be shared.

Another user reported that they were able to reproduce this issue on a clean install with 4 generic containers. I'll encourage them to post the details to this thread.
 
I also have exactly the same problem.
I have 5 clusters with 5 nodes each.
All clusters have the same issue.
I didn't know LXC had so many issues; I should have stayed with OpenVZ.
KVM has no issues in Proxmox 5.1, though.

You mentioned in another thread that you were able to duplicate this issue on a clean install. Can you please share the details here so Proxmox can attempt to duplicate the environment? Thanks!
 
Yes, I took a fresh node and installed 5 CTs with CentOS 6.

I kept a loop running to stop and start each CT every minute.

The issue happened within 10-20 minutes.

Keep in mind I had ZFS, and the ARC cache was set to 8GB.

When I changed the ARC cache to 1GB, the issue was reduced to almost 10% of its previous frequency; it ran for 2-4 hours without issues.

On my production nodes (25 of them), I avoided stops and restarts as much as possible, and they have been running for the last 48 hours without an issue (with a 1GB ARC cache).
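
A minimal sketch of such a stop/start loop on the host (the CT IDs are placeholders):
Code:
# Hypothetical reproducer: stop/start five CTs once per minute
while true; do
    for ct in 101 102 103 104 105; do
        pct stop "$ct"
        pct start "$ct"
    done
    sleep 60
done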
 
Yes, I took a fresh node and installed 5 CTs with CentOS 6.

I kept a loop running to stop and start each CT every minute.

The issue happened within 10-20 minutes.

Keep in mind I had ZFS, and the ARC cache was set to 8GB.

When I changed the ARC cache to 1GB, the issue was reduced to almost 10% of its previous frequency; it ran for 2-4 hours without issues.

On my production nodes (25 of them), I avoided stops and restarts as much as possible, and they have been running for the last 48 hours without an issue (with a 1GB ARC cache).
This could indicate that the trigger for the issue is sensitive to timing/performance at the storage layer. With a smaller ARC (especially 1GB), performance will be degraded, and the added delays may reduce the occurrence of the issue. All of our environments run large ARCs (32GB minimum) tuned for very high cache hit rates; the servers that regularly experience this issue show ARC hit rates between 84% and 95%.
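
For reference, the ARC limit being discussed is the zfs_arc_max module parameter (value in bytes); a sketch of both ways to set it:
Code:
# Persistent: limit the ZFS ARC to 1 GiB at module load
echo "options zfs zfs_arc_max=1073741824" > /etc/modprobe.d/zfs.conf
update-initramfs -u
# Runtime: apply immediately without a reboot
echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max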
 
I can confirm 100% that reducing the ARC cache from 8GB to 1GB reduced the issue by almost 90%.

But you did test with 4.14.20 and the issue was not there even with a 32GB ARC cache, right?
 
