[SOLVED] LXC container reboot fails - LXC becomes unusable

Discussion in 'Proxmox VE: Installation and configuration' started by denos, Feb 7, 2018.

  1. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,119
    Likes Received:
    91
    I just updated the bug report https://bugzilla.proxmox.com/show_bug.cgi?id=1943 - sadly I could still not reproduce the issue locally, despite sending out a fair amount of (fragmented) IPv6 traffic and restarting containers.
    If possible, please provide the requested information in the bug report.
    Thanks!
     
  2. foobar73

    foobar73 New Member

    Joined:
    Jan 19, 2016
    Messages:
    8
    Likes Received:
    0
    I noted this on the bug as well.

    We (@seneca214 and I) were unfortunately able to reproduce the bug even with the ip6tables block in place. This time the spinlock showed up in a kernel stack involving the IPv4 counterpart of the same inet_frags_exit_net code.

    @seneca214 noted that there was a lot of mDNS broadcast traffic hitting this machine, so maybe that is what triggers it.

    To test, I enabled the firewall at the cluster level, added the MDNS macro as a DROP rule, and set the default input policy to ACCEPT so that we didn't lose access to anything else.
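    For reference, a sketch of what that cluster-level setup might look like in /etc/pve/firewall/cluster.fw. This only illustrates the workaround described above, it is not an official fix; verify the MDNS macro name and the option syntax against the pve-firewall documentation for your version before applying it.
    Code:
    [OPTIONS]
    # enable the firewall at datacenter/cluster level
    enable: 1
    # keep the default input policy permissive so nothing else gets locked out while testing
    policy_in: ACCEPT

    [RULES]
    # drop inbound mDNS (UDP 5353) using the built-in MDNS macro
    IN MDNS(DROP)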
     
  3. alexskysilk

    alexskysilk Active Member

    Joined:
    Oct 16, 2015
    Messages:
    538
    Likes Received:
    58
    +1 me too.

    I am not using the Proxmox firewall at all (it is disabled), and up to this point I had not seen this behavior. Some nodes work fine, others are hitting this issue. pveversion for all nodes:
    Code:
    proxmox-ve: 5.3-1 (running kernel: 4.15.18-11-pve)
    pve-manager: 5.3-9 (running version: 5.3-9/ba817b29)
    pve-kernel-4.15: 5.3-2
    pve-kernel-4.15.18-11-pve: 4.15.18-33
    pve-kernel-4.15.18-9-pve: 4.15.18-30
    pve-kernel-4.15.18-7-pve: 4.15.18-27
    ceph: 12.2.11-pve1
    corosync: 2.4.4-pve1
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.1-3
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-46
    libpve-guest-common-perl: 2.0-20
    libpve-http-server-perl: 2.0-11
    libpve-storage-perl: 5.0-38
    libqb0: 1.0.3-1~bpo9
    lvm2: 2.02.168-pve6
    lxc-pve: 3.1.0-3
    lxcfs: 3.0.3-pve1
    novnc-pve: 1.0.0-2
    openvswitch-switch: 2.7.0-3
    proxmox-widget-toolkit: 1.0-22
    pve-cluster: 5.0-33
    pve-container: 2.0-34
    pve-docs: 5.3-2
    pve-edk2-firmware: 1.20181023-1
    pve-firewall: 3.0-17
    pve-firmware: 2.0-6
    pve-ha-manager: 2.0-6
    pve-i18n: 1.0-9
    pve-libspice-server1: 0.14.1-2
    pve-qemu-kvm: 2.12.1-1
    pve-xtermjs: 3.10.1-1
    qemu-server: 5.0-46
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.12-pve1~bpo1
     
  4. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,119
    Likes Received:
    91
    @alexskysilk :
    * Please provide the perf data, workqueue trace and other information as requested in:
    https://bugzilla.proxmox.com/show_bug.cgi?id=1943#c4

    I just updated the issue's summary and added a comment to clarify what exact problem the issue describes (a kworker spinning in inet_frags_exit_net), given that we have had quite a few reports of other issues with the same symptom (a kworker using 100% CPU, only fixable by a node reset).
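    For anyone wondering how to gather that data, the sketch below shows one way to capture a kernel call graph of the spinning kworker and a workqueue trace on the affected node. The exact commands the developers want are listed in the bug-report comment linked above; treat this only as an illustration (perf is available via Debian's linux-perf package, and <PID> stands for the ID of the kworker at 100% CPU).
    Code:
    # find the kworker pinned at 100% CPU and note its PID
    top -b -n 1 -o %CPU | head -n 15

    # record its kernel call graph for ~30 seconds, then summarize
    perf record -g -p <PID> -- sleep 30
    perf report --stdio | head -n 60

    # optionally enable workqueue trace events and capture them for a while
    echo 1 > /sys/kernel/debug/tracing/events/workqueue/enable
    timeout 30 cat /sys/kernel/debug/tracing/trace_pipe > /tmp/workqueue.trace
    echo 0 > /sys/kernel/debug/tracing/events/workqueue/enable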
     
  5. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,119
    Likes Received:
    91
    Does your iptables workaround still work and prevent the issue from occurring (for those users who tried to mitigate the issue with it)?

    As written in the bug report (https://bugzilla.proxmox.com/show_bug.cgi?id=1943#c20), I was still not able to reproduce the issue locally, despite additionally introducing mDNS traffic into the test setup.
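    For hosts where the Proxmox firewall is disabled, the mDNS drop described earlier in the thread can also be expressed with plain iptables/ip6tables. Note that foobar73's report above shows the issue can still reproduce with only the IPv6 rule in place, so this is at best a mitigation, not a fix; the rules below are an illustrative sketch and are not persistent across reboots.
    Code:
    # drop inbound mDNS (UDP port 5353) on the host, IPv4 and IPv6
    iptables  -A INPUT -p udp --dport 5353 -j DROP
    ip6tables -A INPUT -p udp --dport 5353 -j DROP

    # verify the rules and watch their packet counters
    iptables  -L INPUT -n -v | grep 5353
    ip6tables -L INPUT -n -v | grep 5353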
     
  6. seneca214

    seneca214 New Member

    Joined:
    Dec 3, 2012
    Messages:
    23
    Likes Received:
    3
    So far, we've been unable to reproduce the issue with any server that's been rebooted with the firewall rules in place.

    If nothing else, this seems to greatly mitigate the issue.
     
  7. alexskysilk

    alexskysilk Active Member

    Joined:
    Oct 16, 2015
    Messages:
    538
    Likes Received:
    58
  8. sQuote.de Thorsten

    sQuote.de Thorsten New Member
    Proxmox Subscriber

    Joined:
    Dec 3, 2018
    Messages:
    29
    Likes Received:
    0
    Hey,

    is there any bug fix for this yet? After a reboot of an LXC container, the host goes offline: one CPU sits at 100% and the panel always shows a question mark. Only a host reboot fixes it, and only until the next container reboot.

    I hope someone can help me!

    Regards, Thorsten
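    One quick way to check whether symptoms like these match the kworker issue discussed above (a sketch, assuming root access on the affected node; <PID> is the ID of the busy kworker) is to look at the kernel stack of the thread that is stuck at 100% CPU and see whether it sits in inet_frags_exit_net:
    Code:
    # look for a kworker thread stuck at 100% CPU
    ps -eo pid,comm,%cpu --sort=-%cpu | head

    # inspect its kernel stack; replace <PID> with the kworker's PID
    cat /proc/<PID>/stack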
     
  9. fireon

    fireon Well-Known Member
    Proxmox Subscriber

    Joined:
    Oct 25, 2010
    Messages:
    2,924
    Likes Received:
    168
    Since the last update: in my latest tests with my CTs I have seen that if I reboot or shut down a CT from inside, everything hangs. You have to kill the LXC process manually and reboot the host. But if I shut down the CT from the PVE web interface, everything works fine. Tested this 20 times.
    Code:
    pve-manager/5.4-3/0a6eaa62 (running kernel: 4.15.18-12-pve)
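    For a hang like the one described above, the manual cleanup usually amounts to something along these lines (a sketch only; <VMID> is the container ID, and forcibly killing the container's lxc process ends it uncleanly, so expect to restart the container or the node afterwards):
    Code:
    # try a forced stop through the PVE tooling first
    pct stop <VMID>

    # if that also hangs, find the container's lxc-start / "[lxc monitor]" process and kill it
    ps aux | grep lxc | grep <VMID>
    kill -9 <PID-of-lxc-process>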
     
  10. Stoiko Ivanov

    Stoiko Ivanov Proxmox Staff Member
    Staff Member

    Joined:
    May 2, 2018
    Messages:
    1,119
    Likes Received:
    91
  11. seneca214

    seneca214 New Member

    Joined:
    Dec 3, 2012
    Messages:
    23
    Likes Received:
    3
    When the kworker issue is present, we do see the web console show grey icons on all containers. This does sound like the same issue.
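    Grey/unknown icons in the web UI generally mean the node's status daemon (pvestatd) has stopped reporting, which fits a node-level hang. A quick way to confirm that side of the symptom (a sketch, not something requested in the bug report) is:
    Code:
    # check whether the status daemon is still alive and logging
    systemctl status pvestatd
    journalctl -u pvestatd -n 50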
     