Hi
We have a consistent issue: when rebooting LXC containers, a kworker process eventually locks up the system and we have to hard reset the server.
Below is the line from `top` for the process that spawns:
31043 root 20 0 0 0 0 R 100.0 0.0 94:20.90 kworker/u24:3
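For completeness, this is roughly what I can run the next time it happens to see what the kworker is actually doing (a rough sketch only; 31043 is the PID from the top line above, perf would need to be installed, and SysRq must be enabled):

cat /proc/31043/stack                        # kernel stack of the spinning thread
echo l > /proc/sysrq-trigger                 # dump backtraces of all active CPUs to dmesg
perf top -p 31043                            # live view of where that thread spends its time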
The spec of this server (a standalone server):
Supermicro 1018R-WC0R with an X10SRW-F mainboard and an E5-1650 v4 CPU.
It boots off 2x internal SSDs in RAID 1, and storage is an internal RAID 10 array of 6x 1 TB SSDs, all software/mdadm RAID.
There is NFS storage attached for backup images.
The server was fully updated and rebooted about 9 days ago; it was initially installed with Proxmox 5.1 in November 2017. I do see there is another kernel available, but rebooting this production server is not a simple process.
pveversion -v
proxmox-ve: 5.1-35 (running kernel: 4.13.13-4-pve)
pve-manager: 5.1-42 (running version: 5.1-42/724a6cb3)
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.13.13-4-pve: 4.13.13-35
pve-kernel-4.13.13-1-pve: 4.13.13-31
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-19
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-16
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9
We noticed this issue around 12 December 2017 and confirmed that it happens 99.9% of the time any LXC container is restarted/rebooted.
The kworker process will appear (at 100% CPU) and, over a few hours, slowly grind the server to an almost complete standstill. I cannot kill this process or even get the server to reboot/halt gracefully; a hard reset is required.
This happens for any LXC container on the server, even a brand-new one.
I did notice that 9 days ago, immediately after the last update and reboot, I could restart LXC containers, but about 10 hours later the next container restart caused the same issue.
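In case it helps anyone else hitting this: since the node has to be reset anyway, at least syncing disks first via magic SysRq seems safer than a raw power cycle. A sketch, assuming SysRq is enabled on the PVE kernel:

echo 1 > /proc/sys/kernel/sysrq     # enable all SysRq functions
echo s > /proc/sysrq-trigger        # emergency sync of all filesystems
echo u > /proc/sysrq-trigger        # remount filesystems read-only
echo b > /proc/sysrq-trigger        # immediate reboot, no clean shutdown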
The posts below mention what could be the same issue, but it does not seem to have been addressed at all:
https://forum.proxmox.com/threads/proxmox-ve-5-1-released.37650/page-3#post-187137
https://forum.proxmox.com/threads/kworker-100-cpu.37795/
I should also mention that we have a few servers with similar hardware running Proxmox 4.4 that do not have this issue.
On a side note, I have a feeling it might have something to do with ACPI and friends, with LXC maybe somehow triggering the old kworker bug:
bugs.launchpad.net/ubuntu/+source/linux/+bug/887793
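If it is that class of bug, one quick check would be whether any ACPI GPE interrupt counter climbs rapidly while the kworker spins (a sketch; gpeXX below is a placeholder for whichever GPE turns out to be storming):

grep . /sys/firmware/acpi/interrupts/gpe*              # run twice a few seconds apart and compare counts
echo disable > /sys/firmware/acpi/interrupts/gpeXX     # temporarily mask the storming GPE (hypothetical gpeXX)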
Any help would be appreciated.