[SOLVED] LXC container reboot fails - LXC becomes unusable

Discussion in 'Proxmox VE: Installation and configuration' started by denos, Feb 7, 2018.

  1. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,119
    Likes Received:
    91
    I just updated the bug report https://bugzilla.proxmox.com/show_bug.cgi?id=1943 - sadly I could still not reproduce the issue locally, despite sending out a fair amount of (fragmented) IPv6 traffic and restarting containers.
    If possible, please provide the requested information in the bug report.
    Thanks!
     
  2. foobar73

    foobar73 New Member

    Joined:
    Jan 19, 2016
    Messages:
    8
    Likes Received:
    0
    I noted this on the bug as well.

    We (@seneca214 and I) were unfortunately able to reproduce the bug even with the ip6tables block in place. This time the spinlock showed up in a kernel stack involving the IPv4 counterpart of the same inet_frags_exit_net code.

    @seneca214 noted that there was a lot of mDNS broadcast traffic hitting this machine, so maybe that is what triggers it.

    To test, I enabled the firewall at the cluster level, added the MDNS macro as a DROP rule, and set the default input policy to ACCEPT so that we didn't lose access to anything else.
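    For reference, a sketch of what that cluster-level setup might look like in /etc/pve/firewall/cluster.fw. This only illustrates the workaround described above, it is not an official fix; verify the MDNS macro name and the option syntax against the pve-firewall documentation for your version before applying it.
    Code:
    [OPTIONS]
    # enable the firewall at datacenter/cluster level
    enable: 1
    # keep the default input policy permissive so nothing else gets locked out while testing
    policy_in: ACCEPT

    [RULES]
    # drop inbound mDNS (UDP 5353) using the built-in MDNS macro
    IN MDNS(DROP)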
     
  3. alexskysilk

    alexskysilk Active Member

    Joined:
    Oct 16, 2015
    Messages:
    538
    Likes Received:
    58
    +1 me too.

    I am not using the Proxmox firewall at all (it is disabled), and up to this point I had not seen this behavior. Some nodes work fine, others are hitting this issue. pveversion for all nodes:
    Code:
    proxmox-ve: 5.3-1 (running kernel: 4.15.18-11-pve)
    pve-manager: 5.3-9 (running version: 5.3-9/ba817b29)
    pve-kernel-4.15: 5.3-2
    pve-kernel-4.15.18-11-pve: 4.15.18-33
    pve-kernel-4.15.18-9-pve: 4.15.18-30
    pve-kernel-4.15.18-7-pve: 4.15.18-27
    ceph: 12.2.11-pve1
    corosync: 2.4.4-pve1
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.1-3
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-46
    libpve-guest-common-perl: 2.0-20
    libpve-http-server-perl: 2.0-11
    libpve-storage-perl: 5.0-38
    libqb0: 1.0.3-1~bpo9
    lvm2: 2.02.168-pve6
    lxc-pve: 3.1.0-3
    lxcfs: 3.0.3-pve1
    novnc-pve: 1.0.0-2
    openvswitch-switch: 2.7.0-3
    proxmox-widget-toolkit: 1.0-22
    pve-cluster: 5.0-33
    pve-container: 2.0-34
    pve-docs: 5.3-2
    pve-edk2-firmware: 1.20181023-1
    pve-firewall: 3.0-17
    pve-firmware: 2.0-6
    pve-ha-manager: 2.0-6
    pve-i18n: 1.0-9
    pve-libspice-server1: 0.14.1-2
    pve-qemu-kvm: 2.12.1-1
    pve-xtermjs: 3.10.1-1
    qemu-server: 5.0-46
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.12-pve1~bpo1
     
  4. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,119
    Likes Received:
    91
    @alexskysilk :
    * Please provide the perf data, workqueue trace and other information as requested in:
    https://bugzilla.proxmox.com/show_bug.cgi?id=1943#c4

    I just updated the issue's summary and added a comment to clarify what exact problem the issue describes (a kworker spinning in inet_frags_exit_net), given that we have had quite a few reports of other issues with the same symptom (a kworker using 100% CPU, only fixable by a node reset).
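    For anyone wondering how to gather that data, the sketch below shows one way to capture a kernel call graph of the spinning kworker and a workqueue trace on the affected node. The exact commands the developers want are listed in the bug-report comment linked above; treat this only as an illustration (perf is available via Debian's linux-perf package, and <PID> stands for the ID of the kworker at 100% CPU).
    Code:
    # find the kworker pinned at 100% CPU and note its PID
    top -b -n 1 -o %CPU | head -n 15

    # record its kernel call graph for ~30 seconds, then summarize
    perf record -g -p <PID> -- sleep 30
    perf report --stdio | head -n 60

    # optionally enable workqueue trace events and capture them for a while
    echo 1 > /sys/kernel/debug/tracing/events/workqueue/enable
    timeout 30 cat /sys/kernel/debug/tracing/trace_pipe > /tmp/workqueue.trace
    echo 0 > /sys/kernel/debug/tracing/events/workqueue/enable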
     
  5. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,119
    Likes Received:
    91
    Does your iptables workaround still work and prevent the issue from occurring (for those users who tried to mitigate the issue with it)?

    As written in the bug report (https://bugzilla.proxmox.com/show_bug.cgi?id=1943#c20), I was still not able to reproduce the issue locally, despite additionally introducing mDNS traffic into the test setup.
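    For hosts where the Proxmox firewall is disabled, the mDNS drop described earlier in the thread can also be expressed with plain iptables/ip6tables. Note that foobar73's report above shows the issue can still reproduce with only the IPv6 rule in place, so this is at best a mitigation, not a fix; the rules below are an illustrative sketch and are not persistent across reboots.
    Code:
    # drop inbound mDNS (UDP port 5353) on the host, IPv4 and IPv6
    iptables  -A INPUT -p udp --dport 5353 -j DROP
    ip6tables -A INPUT -p udp --dport 5353 -j DROP

    # verify the rules and watch their packet counters
    iptables  -L INPUT -n -v | grep 5353
    ip6tables -L INPUT -n -v | grep 5353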
     
  6. seneca214

    seneca214 New Member

    Joined:
    Dec 3, 2012
    Messages:
    23
    Likes Received:
    3
    So far, we've been unable to reproduce the issue with any server that's been rebooted with the firewall rules in place.

    If nothing else, this seems to greatly mitigate the issue.
     
  7. alexskysilk

    alexskysilk Active Member

    Joined:
    Oct 16, 2015
    Messages:
    538
    Likes Received:
    58
  8. sQuote.de Thorsten

    sQuote.de Thorsten New Member
    Proxmox Subscriber

    Joined:
    Dec 3, 2018
    Messages:
    29
    Likes Received:
    0
    Hey,

    is there any bug fix for this yet? After a reboot of an LXC container, the host goes offline: one CPU sits at 100% and the panel always shows a question mark. Only a host reboot fixes it, and only until the next container reboot.

    I hope someone can help me!

    Regards, Thorsten
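    One quick way to check whether symptoms like these match the kworker issue discussed above (a sketch, assuming root access on the affected node; <PID> is the ID of the busy kworker) is to look at the kernel stack of the thread that is stuck at 100% CPU and see whether it sits in inet_frags_exit_net:
    Code:
    # look for a kworker thread stuck at 100% CPU
    ps -eo pid,comm,%cpu --sort=-%cpu | head

    # inspect its kernel stack; replace <PID> with the kworker's PID
    cat /proc/<PID>/stack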
     
  9. fireon

    fireon Well-Known Member
    Proxmox Subscriber

    Joined:
    Oct 25, 2010
    Messages:
    2,924
    Likes Received:
    168
    Since the last update: in my latest tests with my CTs I have seen that if I reboot or shut down a CT from inside, everything hangs. You have to kill the LXC process manually and reboot the host. But if I shut down the CT from the PVE web interface, everything works fine. Tested this 20 times.
    Code:
    pve-manager/5.4-3/0a6eaa62 (running kernel: 4.15.18-12-pve)
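    For a hang like the one described above, the manual cleanup usually amounts to something along these lines (a sketch only; <VMID> is the container ID, and forcibly killing the container's lxc process ends it uncleanly, so expect to restart the container or the node afterwards):
    Code:
    # try a forced stop through the PVE tooling first
    pct stop <VMID>

    # if that also hangs, find the container's lxc-start / "[lxc monitor]" process and kill it
    ps aux | grep lxc | grep <VMID>
    kill -9 <PID-of-lxc-process>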
     
  10. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,119
    Likes Received:
    91
  11. seneca214

    seneca214 New Member

    Joined:
    Dec 3, 2012
    Messages:
    23
    Likes Received:
    3
    When the kworker issue is present, we do see the web console show grey icons on all containers. This does sound like the same issue.
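    Grey/unknown icons in the web UI generally mean the node's status daemon (pvestatd) has stopped reporting, which fits a node-level hang. A quick way to confirm that side of the symptom (a sketch, not something requested in the bug report) is:
    Code:
    # check whether the status daemon is still alive and logging
    systemctl status pvestatd
    journalctl -u pvestatd -n 50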
     