CPU Lockup on 5.4 Kernel

So, reporting back with more data: 5.3.18-3 is currently at 1d 10h 42m with no issues. Going to reboot into 5.4.30 and test that for a while...
 
Dear all,

With no VM running, I am seeing no issues with uptime: 12:42:39 up 3 days

The issue is only present for me when running a VM in Proxmox.

This is on the 5.4 kernel.

Kindly
 
Well, I lasted about 2 days and whoa - the machine froze again: the first GPU shut off (Windows VM), the second (Kali) was still displaying but totally unresponsive, and all network connectivity to the host was lost.

And of course - not a single thing in the logs, nada, zip...

Is there any way of forcing logging when the machine freezes, so at least I'd know what caused it?
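
A couple of things that sometimes help capture evidence of a hang like this: make the journal persistent so kernel messages survive the reset, and tell the kernel to panic (with a stack trace) on soft lockups and hung tasks. A rough sketch, assuming systemd-journald on the host - no guarantee anything gets written if the hang is hard enough:
Code:
# journald's default Storage=auto goes persistent once this directory exists
mkdir -p /var/log/journal
systemctl restart systemd-journald

# turn soft lockups / hung tasks into a panic with a stack trace, reboot 10s later
cat > /etc/sysctl.d/90-lockup-debug.conf <<'EOF'
kernel.softlockup_panic = 1
kernel.hung_task_panic = 1
kernel.panic = 10
EOF
sysctl --system

If nothing ever makes it to disk, netconsole (streaming kernel messages to another box over UDP) or a serial console is the usual next step, assuming the NIC is still alive when it happens.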
 
Dear all,

I have upgraded to the latest Proxmox 6.2-4 with kernel Linux 5.4.34-1-pve #1 SMP PVE 5.4.34-2 (Thu, 07 May 2020 10:02:02 +0200).

Right now it is looking promising and I will let you all know if the issue persists. Uptime is 3 hours with VMs running, which is good.

Yours
 
So far, after reseating the RAM modules, no freeze (yet):
Linux proxmox 5.4.30-1-pve #1 SMP PVE 5.4.30-1 (Fri, 10 Apr 2020 09:12:42 +0200) x86_64 GNU/Linux
22:21:47 up 3 days, 10:28, 1 user, load average: 3.87, 4.20, 4.28
 
Is the network card in your system a Realtek r8169-based one?
No, it is Intel.

Code:
23:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
24:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
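
In case it's useful for comparison, adding -k to lspci shows which driver is actually bound to the card - for these I210s it should be igb rather than r8169:
Code:
# look for the "Kernel driver in use:" line in the output
lspci -nnk -s 23:00.0
lspci -nnk -s 24:00.0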
 
Annoyingly, I got another lockup today while on kernel:
Code:
Linux proxmox 5.4.41-1-pve #1 SMP PVE 5.4.41-1 (Fri, 15 May 2020 15:06:08 +0200) x86_64

Text extracted from the photo:
Code:
Welcome to the Proxmox Virtual Environment. Please use your web browser to configure this server - connect to:

https://10.1.1.1:8006/

proxmox login: [ 474981.752276 ] INFO: task btrfs-transacti:20126 blocked for more than 120 seconds.

[ 474981.752293 ]       Tainted: P           OE     5.4.41-1-pve #1
[ 474981.752299 ] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 648490.718698 ] watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [pvesr:12782]
[ 648518.718124 ] watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [pvesr:12782]
[ 648546.717556 ] watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [pvesr:12782]
[ 648574.716982 ] watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [pvesr:12782]
[ 648602.716414 ] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [pvesr:12782]
[ 648630.715840 ] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [pvesr:12782]

EDIT: So I'm trying to debug this further - as I know Fedora and CentOS haven't failed me - but I'm looking at it from a hardware perspective first...

I pulled down 'zenstates.py' and inspected my setup, and saw the following output:
Code:
root@proxmox:~# zenstates.py -l 
P0 - Enabled - FID = 88 - DID = 8 - VID = 20 - Ratio = 34.00 - vCore = 1.35000
P1 - Enabled - FID = 78 - DID = 8 - VID = 2C - Ratio = 30.00 - vCore = 1.27500
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled 
P4 - Disabled 
P5 - Disabled 
P6 - Disabled 
P7 - Disabled 
C6 State - Package - Enabled
C6 State - Core - Enabled

All good - but I'm pretty sure "C6 State - Package" is what the PSU workaround disables for power supplies that need a non-zero minimum load on the 12v rail.
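
As an aside, for anyone who can't (or doesn't want to) change that in the BIOS: if I remember right, the same zenstates.py script can toggle C6 from the OS. It needs the msr kernel module, does not persist across reboots, and I believe it disables core C6 as well as package C6, unlike the BIOS option:
Code:
modprobe msr
zenstates.py --c6-disable    # should disable C6 (package and core, I believe)
zenstates.py -l              # check that the C6 lines now read Disabled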

I did a factory reset of the BIOS, then went in and set "Power Supply Idle Control" to "Typical Idle Current" - and the output in zenstates changed. I reapplied an overclock I hadn't used for ages (it's low usage that kills things, not high usage!), and now I get:
Code:
root@proxmox:~# zenstates.py -l 
P0 - Enabled - FID = 98 - DID = 8 - VID = 20 - Ratio = 38.00 - vCore = 1.35000 
P1 - Enabled - FID = 88 - DID = 8 - VID = 20 - Ratio = 34.00 - vCore = 1.35000 
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000 
P3 - Disabled 
P4 - Disabled 
P5 - Disabled 
P6 - Disabled 
P7 - Disabled 
C6 State - Package - Disabled 
C6 State - Core - Enabled

This is what I'd expect to see, so I assume whatever had changed in the BIOS is now set the way I expect.

I'm back to letting things run for a while now to see what happens. Unless I get further info, I'm going to assume it's a hardware issue for now.
 
I actually got rid of the freezes/lockups I was experiencing - right now 14 days uptime on kernel 5.4.34-1-pve with 4 VMs running (1 Windows VM that I use as a "daily driver", including gaming, 1 Linux VM for pentests, and 2 Linux (Ubuntu) VMs running web servers etc.). What I changed (roughly the commands sketched below):
- Disabled KSM
- Set kernel.hung_task_timeout_secs = 30
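
In case it helps anyone, this is roughly what those two changes look like on a Proxmox host (assuming the stock ksmtuned service is what drives KSM, which it is on a default install as far as I know):
Code:
# stop the KSM tuning daemon and switch KSM off (2 = stop and unmerge all pages)
systemctl disable --now ksmtuned
echo 2 > /sys/kernel/mm/ksm/run

# make the shorter hung task timeout persistent across reboots
echo 'kernel.hung_task_timeout_secs = 30' > /etc/sysctl.d/80-hung-task.conf
sysctl --system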

I still see an occasional nvme0 queue timeout, but no freezes (yet). I'm almost "afraid" to update to the latest kernel :)
 
I just thought that I'd give some feedback since my last post.

Code:
# uptime
 02:43:41 up 7 days,  9:29,  2 users,  load average: 3.86, 4.37, 11.04

This is a good thing.

A question for the Proxmox folk if any happen to be watching - I noticed kernel.org is at 5.4.44, but the latest Proxmox kernel is 5.4.41.

What is your timeline for updates? For the kernels I build for Xen, I monitor kernel.org and rebuild new packages automatically within 6 hours of a release, and normally have the kernel packages in the mirrors within an hour, for CentOS 6, 7 and 8.

What type of integration is there for Proxmox?
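
For reference, the monitoring half of that doesn't take much - kernel.org publishes a releases.json feed, so a cron'd script along these lines is enough to spot a new 5.4.x within minutes (the build script at the end is just a placeholder for whatever actually does the packaging):
Code:
#!/bin/sh
# check kernel.org for the newest 5.4.x and kick off a rebuild when it changes
latest=$(curl -s https://www.kernel.org/releases.json \
    | python3 -c 'import json,sys; r=json.load(sys.stdin)["releases"]; print(next((x["version"] for x in r if x["version"].startswith("5.4.")), ""))')
[ -n "$latest" ] || exit 1                     # feed unreachable or no 5.4.x entry
[ "$latest" = "$(cat /var/tmp/last-5.4-build 2>/dev/null)" ] && exit 0   # nothing new
echo "$latest" > /var/tmp/last-5.4-build
# /usr/local/bin/build-xen-kernel.sh "$latest"  # placeholder build/publish step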
 
Still good news:
Code:
# uptime
 05:17:47 up 16 days, 12:04,  4 users,  load average: 0.25, 0.24, 0.48
# cat /proc/version 
Linux version 5.4.41-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.41-1 (Fri, 15 May 2020 15:06:08 +0200)
 
A question for the Proxmox folk if any happen to be watching - I noticed kernel.org is at 5.4.44, but the latest Proxmox kernel is 5.4.41.

What is your timeline for updates? For the kernels I build for Xen, I monitor kernel.org and rebuild new packages automatically within 6 hours of a release, and normally have the kernel packages in the mirrors within an hour, for CentOS 6, 7 and 8.

What type of integration is there for Proxmox?

Our kernels are not directly based on kernel.org, but on the Ubuntu kernel series (usually the latest LTS, with some patches / config changes on top). We monitor both kernel.org and Ubuntu upstreams, but don't automatically rebuild for each upstream release.
 