Out of ideas - PVE 5.2 lxc container restart issue

dw2412

Active Member
Mar 14, 2013
22
0
41
Dear technicians & Proxmox devs,

lets start it this way - I think I've found an issue and I'm more or less out of ideas on how to continue debugging / solving the issue at all.

In use:

Proxmox 5.2 details:

Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.18-5-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-8
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-3-pve: 4.15.18-22
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.4-1-pve: 4.13.4-26
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-29
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-27
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-35
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1

running on:
Code:
Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 2.1c
        Release Date: 08/03/2012
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 4096 kB
        Characteristics:
                ISA is supported
                PCI is supported
                PNP is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                ESCD support is available
                Boot from CD is supported
                Selectable boot is supported
                BIOS ROM is socketed
                EDD is supported
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                3.5"/2.88 MB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                CGA/mono video services are supported (int 10h)
                ACPI is supported
                USB legacy is supported
                LS-120 boot is supported
                ATAPI Zip drive boot is supported
                BIOS boot specification is supported
                Targeted content distribution is supported
        BIOS Revision: 8.16

Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: Supermicro
        Product Name: X8DTU
        Version: 1234567890
        Serial Number: 1234567890
        UUID: 12345678-0200-900A-2500-1234567890AB
        Wake-up Type: Power Switch
        SKU Number: To Be Filled By O.E.M.
        Family: Server

Inside the system Intel Corporation 82576 (igb module) being used. 2 Onboard, 2 on pcie card.

Now let's take a vanilla template from Proxmox site (Ubuntu 14.04, CentOS 7), create 2 lxc container with any of the OS and start them both. After starting just shutdown them and do a simple pct start again. This will work. Now lets start both and create an IPTables firewall rule like this inside the container:

iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT

Shutdown the container and try to start it again. It will result in a hanging lxc monitor process like this:

root 5036 1 0 13:41 ? 00:00:00 /usr/bin/lxc-start -n 9135
root 5040 1 2 13:41 ? 00:00:00 [lxc monitor] /var/lib/lxc 9135
root 5130 2569 0 13:41 ? 00:00:00 lxc-info -n 9135 -p

After doing a kill -9 5040 at least pct list command works again but only way found to solve this and get the container at least started again is to reboot the Proxmox node itself.

I've tried a lot of things like updating the igb module to the really latest version, tried to play with SR-IOV, used Ubuntu instead of CentOS, downgrading/upgrading kernel, used a container restore from another Proxmox 5 System known to shutdown and start there without any issues - no way. If there is a single iptables rule being created inside the container the container can be stopped but never be restarted without prior reboot of the node.

Last hope now is that someone in forum is getting an idea from the details on how to solve this issue.

Thanks for reading my lines and for help

Andreas
 
does you htop/top on the node by any chance show a kworker process with 100% cpu usage (see https://bugzilla.proxmox.com/show_bug.cgi?id=1943)?

kworker according to top is at (around) 100%. But instead of zfs lvm-thin is being used.

Can you start other containers when this happens?

No. Just tried it, after first shutdown of container with single iptables command issued occured no other container could be started.

can you try to start the container in foreground (https://pve.proxmox.com/pve-docs/pct.1.html - obtaining debug logs)?

EDIT: are the containers unprivileged or privileged ones?

Only thing I can say from this is that first start works (as expected) and after iptables set up and container shutdown the 2nd start does not work. It actually hangs in the network setup. The containers are all privileged.

Here is the debug log of the 2nd start that I fetched some days ago in early beginning of all this mess...

Code:
lxc-start 9136 20180927142826.580 INFO     lsm - lsm/lsm.c:lsm_init:47 - LSM security driver AppArmor
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:757 - Processing "reject_force_umount  # comment this to allow umount -f;  not recommended"
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:do_resolve_add_rule:503 - Set seccomp rule to reject force umounts
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:934 - Added native rule for arch 0 for reject_force_umount action 0(kill)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:do_resolve_add_rule:503 - Set seccomp rule to reject force umounts
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:943 - Added compat rule for arch 1073741827 for reject_force_umount action 0(kill)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:do_resolve_add_rule:503 - Set seccomp rule to reject force umounts
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:953 - Added compat rule for arch 1073741886 for reject_force_umount action 0(kill)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:do_resolve_add_rule:503 - Set seccomp rule to reject force umounts
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:963 - Added native rule for arch -1073741762 for reject_force_umount action 0(kill)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:757 - Processing "[all]"
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:757 - Processing "kexec_load errno 1"
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:934 - Added native rule for arch 0 for kexec_load action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:943 - Added compat rule for arch 1073741827 for kexec_load action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:953 - Added compat rule for arch 1073741886 for kexec_load action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:963 - Added native rule for arch -1073741762 for kexec_load action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:757 - Processing "open_by_handle_at errno 1"
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:934 - Added native rule for arch 0 for open_by_handle_at action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:943 - Added compat rule for arch 1073741827 for open_by_handle_at action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:953 - Added compat rule for arch 1073741886 for open_by_handle_at action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:963 - Added native rule for arch -1073741762 for open_by_handle_at action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:757 - Processing "init_module errno 1"
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:934 - Added native rule for arch 0 for init_module action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:943 - Added compat rule for arch 1073741827 for init_module action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:953 - Added compat rule for arch 1073741886 for init_module action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:963 - Added native rule for arch -1073741762 for init_module action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:757 - Processing "finit_module errno 1"
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:934 - Added native rule for arch 0 for finit_module action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:943 - Added compat rule for arch 1073741827 for finit_module action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:953 - Added compat rule for arch 1073741886 for finit_module action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:963 - Added native rule for arch -1073741762 for finit_module action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:757 - Processing "delete_module errno 1"
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:934 - Added native rule for arch 0 for delete_module action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:943 - Added compat rule for arch 1073741827 for delete_module action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:953 - Added compat rule for arch 1073741886 for delete_module action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:963 - Added native rule for arch -1073741762 for delete_module action 327681(errno)
lxc-start 9136 20180927142826.580 INFO     seccomp - seccomp.c:parse_config_v2:967 - Merging compat seccomp contexts into main context
lxc-start 9136 20180927142826.580 INFO     conf - conf.c:run_script_argv:374 - Executing script "/usr/share/lxc/hooks/lxc-pve-prestart-hook" for container "9136", config section "lxc"
lxc-start 9136 20180927142827.439 DEBUG    terminal - terminal.c:lxc_terminal_peer_default:711 - Using terminal "/dev/tty" as proxy
lxc-start 9136 20180927142827.439 DEBUG    terminal - terminal.c:lxc_terminal_signal_init:189 - Created signal fd 9
lxc-start 9136 20180927142827.439 DEBUG    terminal - terminal.c:lxc_terminal_winsz:87 - Set window size to 253 columns and 67 rows
lxc-start 9136 20180927142827.439 DEBUG    terminal - terminal.c:lxc_terminal_create_log_file:876 - Using "9136_1.log" as terminal log file
lxc-start 9136 20180927142827.439 INFO     start - start.c:lxc_init:873 - Container "9136" is initialized
lxc-start 9136 20180927142827.441 INFO     conf - conf.c:run_script:512 - Executing script "/usr/share/lxc/lxcnetaddbr" for container "9136", config section "net"
lxc-start 9136 20180927142828.207 DEBUG    network - network.c:instantiate_veth:205 - Instantiated veth "veth9136i0/veth6SILG1", index is "18"
lxc-start 9136 20180927142828.208 DEBUG    cgfsng - cgroups/cgfsng.c:cg_legacy_handle_cpuset_hierarchy:613 - "cgroup.clone_children" was already set to "1"
 
I could not reproduce the issue locally - even with adding iptables rules inside the containers.
* Could you provide the requested information gather via perf record? (see the bugzilla report for instructions)
* Additionally the tracing output of
Code:
echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
cat /sys/kernel/debug/tracing/trace_pipe > out.txt
(wait a few secs)
 ^C
could help - see [0].

* Additionally you could try upgrading the BIOS of the board (according to dmidecode it's from 2012)
* Can you restart the containers multiple times, when you don't add the iptables rules?

If you gather the reports, please attach them to the bugzilla report.


[0] https://lkml.org/lkml/2011/3/31/68
 
I could not reproduce the issue locally - even with adding iptables rules inside the containers.
* Could you provide the requested information gather via perf record? (see the bugzilla report for instructions)
* Additionally the tracing output of
Code:
echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
cat /sys/kernel/debug/tracing/trace_pipe > out.txt
(wait a few secs)
 ^C
could help - see [0].

Is there a chance that it contains privacy related informations (passwords / ssh keys)? If so please give me another way to provide you the debug infos that does not result in releasing this info to the public...

* Additionally you could try upgrading the BIOS of the board (according to dmidecode it's from 2012)

Sorry - none available... :-(

* Can you restart the containers multiple times, when you don't add the iptables rules?

Yes. I would say so... I did it in bash script 20 times with 10 seconds sleep between pct start and pct shutdown. Container started all the time after shutdown. In end I started it one time manually and did few things inside - again without issue...

Andreas
 
Yes. I would say so... I did it in bash script 20 times with 10 seconds sleep between pct start and pct shutdown. Container started all the time after shutdown. In end I started it one time manually and did few things inside - again without issue...
Then it's probably yet a separate issue :/ - Thanks for the update!
 
After last week being on holidays I got some info from my collegue about what he did and thought to have solved everything. Well actually what he did helps a bit but does not solve the issue entirely.

The "solution" was to set all containers to unconfined apparmor profile. Doing a reboot and find out that everything regarding start stop works for at least 10 times. But well - after several days (something like 6) - the issue was found to occur again.

Maybe it gives you some more clue about the issue...

Best regards,

Andreas
 
Hi Stoiko,

some time passed by already - I've updated now to latest pve-manager/5.2-11/13c2da63 (running kernel: 4.15.18-9-pve). The reported issue is unfortunately still the same. If there is more info needed please tell me what you need and I will try to prepare it for you.

Thanks

Andreas
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!