LXC restart creates kworker CPU 100%

Jean-Pierre
Hi

We have a consistent issue: when rebooting LXC containers, a kworker process eventually locks up the system and we have to hard reset the server.

Below is a line from `top` for the kworker process that spawns:
31043 root 20 0 0 0 0 R 100.0 0.0 94:20.90 kworker/u24:3
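For what it is worth, a generic way (not specific to our setup) to see what a runaway kworker is actually doing is to dump its kernel stack and grab backtraces of the active CPUs; the PID below is the one from the top output above:
Code:
# dump the kernel stack of the spinning kworker (PID taken from top above)
cat /proc/31043/stack

# log a backtrace of all active CPUs to the kernel log (needs sysrq enabled)
echo l > /proc/sysrq-trigger
dmesg | tail -n 60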

The spec of this server, which is a standalone server:
Supermicro 1018R-WC0R with a X10SRW-F main board and a E5-1650 v4 CPU.
It boots off 2x internal SSDs in RAID 1, and storage is an internal RAID 10 array of 6x 1 TB SSDs, all software/mdadm RAID.
There is NFS storage attached for backup images.

The server was fully updated and rebooted about 9 days ago. It was initially installed with Proxmox 5.1 in November 2017. I do see there is another kernel available; however, rebooting this production server is not a simple process.
pveversion -v
proxmox-ve: 5.1-35 (running kernel: 4.13.13-4-pve)
pve-manager: 5.1-42 (running version: 5.1-42/724a6cb3)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-1-pve: 4.13.13-31
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-19
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9

We noticed this issue around 12 December 2017 and confirmed that it happens 99.9% of the time any LXC container is restarted/rebooted.
The kworker process will appear (at 100% CPU) and, after a few hours, slowly grind the server to an almost complete standstill. I cannot kill this process or even get the server to reboot/halt gracefully; a hard reset is required.
This happens for any LXC container on the server, even a brand new one.
I did notice that 9 days ago, immediately after the last update and reboot, I could restart LXC containers; however, 10 hours later the next container restart caused the same issue.

The posts below mention what could be the same issue, but it does not seem to have been addressed at all:
https://forum.proxmox.com/threads/proxmox-ve-5-1-released.37650/page-3#post-187137
https://forum.proxmox.com/threads/kworker-100-cpu.37795/


I should also mention we have a few servers with similar hardware running Proxmox 4.4 that do not have this issue.

On a side note, I have a feeling it might have something to do with ACPI and friends, with LXC maybe even triggering the old kworker bug somehow:
bugs.launchpad.net/ubuntu/+source/linux/+bug/887793
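If it is that old ACPI bug, the usual check (just a guess on my side, not confirmed for this board) is to look for a GPE counter that keeps climbing and mask it; the gpe13 below is only an example number:
Code:
# show which ACPI GPEs are firing; a runaway one will have a huge, growing count
grep . /sys/firmware/acpi/interrupts/gpe* | sort -t: -k2 -n | tail
# example only: mask a misbehaving GPE at runtime (gpe number is hypothetical)
echo "disable" > /sys/firmware/acpi/interrupts/gpe13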

Any help would be appreciated.
 
I can confirm we have had the same issue ever since we upgraded to the new version of Proxmox. This is very annoying and renders a host almost useless, as we are unable to kill the kworker process. Instead of restarting the LXC, we've tried shutting it down and then starting it up again, which appeared to work or be safer in most cases, but even then we've had an LXC cause the same situation during a shutdown. This needs attention, please.
 
I have also been experiencing the same since my upgrade to Proxmox 5.1.

I restarted a container, which resulted in kworker using 100% CPU in iowait. However, none of the disk devices had high utilization percentages; they were normal. VMs and containers that were already running also experienced no performance problems.

The Proxmox interface stopped updating all VM and container data. Restarting some of the Proxmox services will briefly start updating the KVMs again, but after about 30 seconds it stops again.
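(For reference: on PVE 5.x the data shown in the GUI is collected by pvestatd, so restarting it together with the API daemons is the usual attempt, roughly as below, though as noted it only helped for about 30 seconds.)
Code:
systemctl restart pvestatd pvedaemon pveproxy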

I was able to safely shut down the KVMs on the same box using the qm shutdown commands via SSH. Using pct commands to attempt the same with my containers simply hung; I was also unable to do a pct list.
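(The commands in question, for anyone following along; <vmid> is a placeholder for the guest ID:)
Code:
qm shutdown <vmid>    # clean shutdown of a KVM guest - this still worked
pct shutdown <vmid>   # same for a container - this simply hung
pct list              # listing containers also hung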

Hard reset via IPMI was my only option to restore the server as it would not reboot via SSH.

My server is using mdadm + LVM (not thin) for VM and container storage. My motherboard is an ASUS X99 board with IPMI and a Xeon E5-1650 v4. An NFS share is mounted for backups, ISOs, and container templates.

root@vm1:~# pveversion -v
proxmox-ve: 5.1-38 (running kernel: 4.13.13-5-pve)
pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
pve-kernel-4.4.40-1-pve: 4.4.40-82
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.24-1-pve: 4.4.24-72
pve-kernel-4.4.62-1-pve: 4.4.62-88
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.21-1-pve: 4.4.21-71
pve-kernel-4.4.95-1-pve: 4.4.95-99
pve-kernel-4.4.44-1-pve: 4.4.44-84
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.13.13-1-pve: 4.13.13-31
pve-kernel-4.4.59-1-pve: 4.4.59-87
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-20
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.4-pve2~bpo9
 
Hi! Same problem here :(

root@pve1:~# pveversion -v
proxmox-ve: 5.1-38 (running kernel: 4.13.13-5-pve)
pve-manager: 5.1-43 (running version: 5.1-43/bdb08029)
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.13.13-5-pve: 4.13.13-38
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-20
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-6
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.4-pve2~bpo9
 
I also see the same issue after I upgraded to 5.1.

Waiting for an update.

kworker/u48:1 at 100%

top - 23:02:35 up 3 days, 23:57, 1 user, load average: 69.34, 64.66, 63.87
Tasks: 720 total, 17 running, 703 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.4 us, 62.9 sy, 0.0 ni, 26.5 id, 9.0 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 49443632 total, 11305480 free, 36524072 used, 1614080 buff/cache
KiB Swap: 8388604 total, 8388604 free, 0 used. 12136912 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32333 root 20 0 0 0 0 R 100.0 0.0 102:07.86 kworker/u48:1
405 root 20 0 0 0 0 R 99.7 0.0 3988:29 arc_reclaim
8728 root 20 0 0 0 0 R 55.9 0.0 2:24.85 arc_prune
8901 root 20 0 0 0 0 S 55.6 0.0 2:23.63 arc_prune
1593 root 20 0 0 0 0 S 54.9 0.0 4:01.49 arc_prune
17397 root 20 0 0 0 0 S 54.9 0.0 0:06.30 arc_prune
4254 root 20 0 0 0 0 R 54.6 0.0 3:14.35 arc_prune
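(Side note: the arc_reclaim/arc_prune threads in that output are ZFS ARC reclaim activity, so on top of the kworker problem there is memory pressure from the ARC. Assuming ZFS is in use for guest storage here, a common mitigation is to cap the ARC size; the 8 GiB value below is only an example:)
Code:
# runtime change, takes effect immediately (8 GiB, example value)
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
# make it persistent across reboots
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
update-initramfs -u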
 
You're not alone. I got the same issue after my upgrade from 4.x to 5.x on one of my hosts. The kworker process spawns from time to time. The problem occurs about every two weeks, without starting or stopping any LXC or doing anything on the host system. After a reset via IPMI, the system keeps running without any problems until the kworker process spawns again.

Code:
root@node003:~# pveversion -v
proxmox-ve: 5.1-41 (running kernel: 4.13.13-6-pve)
pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
pve-kernel-4.13.13-6-pve: 4.13.13-41
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-3-pve: 4.13.13-34
pve-kernel-4.13.13-2-pve: 4.13.13-33
pve-kernel-4.13.13-1-pve: 4.13.13-31
pve-kernel-4.13.8-3-pve: 4.13.8-30
pve-kernel-4.13.8-1-pve: 4.13.8-27
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.79-1-pve: 4.4.79-95
pve-kernel-4.4.76-1-pve: 4.4.76-94
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.62-1-pve: 4.4.62-88
pve-kernel-4.4.59-1-pve: 4.4.59-87
pve-kernel-4.4.49-1-pve: 4.4.49-86
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-common-perl: 5.0-28
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-17
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 2.1.1-3
lxcfs: 2.0.8-2
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-11
pve-cluster: 5.0-20
pve-container: 2.0-19
pve-docs: 5.1-16
pve-firewall: 3.0-5
pve-firmware: 2.0-3
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.9.1-9
pve-xtermjs: 1.0-2
qemu-server: 5.0-22
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.6-pve1~bpo9
 
I read this in another topic.

I had same issue running kernel 4.13 with 25 live nodes, Kworker using 100% CPU.

Now I moved all 25 live nodes to 4.15 kernel, no issues so far.

And I tried it too. So far, it looks very good: after 10 days, no kworker has been spawned, although I have done many actions with the containers, like start/stop/restart/dump & restore.
 
I've had the same issue for months, and then I came across this thread. I upgraded to the 4.15 kernel, but I am still seeing issues. I'm hoping I made a mistake somewhere and kernel 4.15 did solve the issue, but I'm not sure what to fix.

I'll be happy to provide any troubleshooting info, but this is what I have right now -
Code:
root@T30:~# pveversion -v
proxmox-ve: 5.1-42 (running kernel: 4.15.15-1-pve)
pve-manager: 5.1-51 (running version: 5.1-51/96be5354)
pve-kernel-4.13: 5.1-44
pve-kernel-4.15: 5.1-3
pve-kernel-4.15.15-1-pve: 4.15.15-6
pve-kernel-4.13.16-2-pve: 4.13.16-47
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
corosync: 2.4.2-pve4
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-30
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-18
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-2
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-15
pve-cluster: 5.0-25
pve-container: 2.0-21
pve-docs: 5.1-17
pve-firewall: 3.0-8
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-2
qemu-server: 5.0-25
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.7-pve1~bpo9

Code:
root@T30:~# uname -a
Linux T30 4.15.15-1-pve #1 SMP PVE 4.15.15-6 (Mon, 9 Apr 2018 12:24:42 +0200) x86_64 GNU/Linux

Code:
top - 09:56:35 up 1 day,  9:37,  2 users,  load average: 2.30, 2.54, 2.54
Tasks: 575 total,   2 running, 445 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.1 us, 27.6 sy,  0.0 ni, 67.4 id,  1.7 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 65834576 total, 17571348 free,  9703844 used, 38559384 buff/cache
KiB Swap:  7340028 total,  4993276 free,  2346752 used. 55229064 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
18550 root      20   0       0      0      0 R 100.0  0.0   1474:07 kworker/u8:4

Thank you in advance !
 
I can confirm kernel 4.15 solved this issue for me; we have had zero issues for a few weeks now. I also had no luck with paid Proxmox support when this was happening, and I was actually surprised at how bad it was.
 

Hi Shankar

I do not have a solid answer for you; however, I am running slightly older versions, namely:
kernel: 4.15.10-1-pve
lxc-pve: 2.1.1-3
lxcfs: 2.0.8-2

Did you try the older kernel version I have, or only the very new one you currently have? There was a source change after 4.15.10-4.

Lastly, if you would like to send me a copy of your pvereport privately, I can see if I have a node with the same versions as yours and test.
 
I just updated the kernel to 4.15, based on the instructions here. In short, I ran this:
Code:
apt update
apt install pve-kernel-4.15
Then I rebooted. It did pick up the latest kernel, I think, based on your versions. If I want to match your versions for the kernel, lxc-pve, and lxcfs, how do I do it?

Thank you for your time on this ! I'll send you the relevant portions of my pvereport, via PM.
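(For anyone wondering the same thing: matching specific package versions is normally done with apt's pkg=version syntax, roughly as below, using the versions quoted earlier in the thread; whether those exact versions are still available in the repository is not guaranteed.)
Code:
apt update
apt install pve-kernel-4.15.10-1-pve lxc-pve=2.1.1-3 lxcfs=2.0.8-2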
 
Same problem here.

I upgraded to the 4.15 kernel, but nothing changed.

Ideas?

#pveversion
pve-manager/5.2-2/b1d1c7f4 (running kernel: 4.15.17-3-pve)
 
Hi

Sorry about my delayed responses; by now, version 5.2 should not have this issue. If you wish to try the exact same kernel I am still running, you should be able to run:

apt-get install pve-kernel-4.15.10-1-pve

When you reboot, make sure it is booting off this kernel, as it may not be your default.

If you run apt-cache search pve-kernel-4.15, it will list all available 4.15 kernels.
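(If the older kernel does not come up as the default, one way to pin it is via GRUB_DEFAULT; the menu entry title below is an assumption and should be checked against the actual grub.cfg on your system:)
Code:
# list the boot entries to find the exact title of the 4.15.10-1-pve entry
grep menuentry /boot/grub/grub.cfg
# then in /etc/default/grub set, for example (title is hypothetical, verify it):
#   GRUB_DEFAULT="Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 4.15.10-1-pve"
update-grub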
 
I too can confirm the latest version of Proxmox 5.2 still has this issue; I have had at least two separate nodes go down with the same kworker issue. I can also confirm that downgrading the kernel to pve-kernel-4.15.10-1-pve still fixes the issue. I will take this up with Proxmox support again and report back.
 
I last updated my kernel on Apr 22-ish and have not had a problem since. There were at least 3-4 instances where I was almost sure the entire node would go down because of the kworker issue, but it never did. For me, my current setup is pretty stable.
# pveversion
pve-manager/5.2-5/eb24855a (running kernel: 4.15.15-1-pve)
 
