CPU Lockup on 5.4 Kernel

NessageHostsINC · Apr 29, 2020

Dear,

We are tragically encountering an issue with the new Proxmox 5.4 kernel.

After a few hours of running a KVM VM, the hypervisor locks up and freezes entirely. Sometimes we can still access the Proxmox Web GUI but not SSH, however after even longer that also stops working.

There is some information in syslog. https://pastebin.com/raw/J2Mn28QT

We are using ZFS and AMD Ryzen 3900X with 128GB non eec memory.

Code:

# pveversion
pve-manager/6.1-8/806edfe1 (running kernel: 5.4.30-1-pve)

I hope the information we have provided can help diagnose the problem and/or fix a potential issue with the kernel. I am unaware of what the issue is with.

Kindly

morph027 · Apr 30, 2020

Does it work with the previous kernel?

Probably related to C-States: https://forum.proxmox.com/threads/proxmox-ve-5-0-ryzen-7-1700x-crashes-daily.36123/

Got a EPYC 7351P w/ 128GB Memory running the kernel since 22 days without any issues.

NessageHostsINC · May 1, 2020

Dear,

I have disabled C-States within the BIOS, though despite the efforts the issue is persistent.

Here is IPMI screenshot:

Code:

Last login: Thu Apr 30 22:12:00 BST 2020 on pts/1
Linux host-01 5.4.30-1-pve #1 SMP PVE 5.4.30-1 (Fri, 10 Apr 2020 09:12:42 +0200) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.


Message from syslogd@host-01 at May  1 00:42:14 ...
kernel:[38500.250297] watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [kworker/3:0:22370]

Message from syslogd@host-01 at May  1 00:42:14 ...
kernel:[38500.270297] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kvm:32646]

Message from syslogd@host-01 at May  1 00:42:18 ...
kernel:[38504.274311] watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [kworker/14:1:29989]

Message from syslogd@host-01 at May  1 00:42:18 ...
kernel:[38504.282310] watchdog: BUG: soft lockup - CPU#17 stuck for 22s! [sshd:18760]

Message from syslogd@host-01 at May  1 00:42:26 ...
kernel:[38512.258337] watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [pvesr:18961]
root@host-01:~#
root@host-01:~#
Message from syslogd@host-01 at May  1 00:42:42 ...
kernel:[38528.250391] watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [kworker/3:0:22370]

Message from syslogd@host-01 at May  1 00:42:42 ...
kernel:[38528.270391] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kvm:32646]

Message from syslogd@host-01 at May  1 00:42:46 ...
kernel:[38532.274405] watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [kworker/14:1:29989]

Message from syslogd@host-01 at May  1 00:42:46 ...
kernel:[38532.282404] watchdog: BUG: soft lockup - CPU#17 stuck for 22s! [sshd:18760]

Here is syslog. https://pastebin.com/raw/CBjWLDi6

I am struggling to find a solution to the issue, unfortunately. This also happened with the previous kernel.

EDIT: Load average super-high but only one tiny VM is running.

:

Yours

NessageHostsINC · May 1, 2020

Dear All,

There is an update on the situation. I am using ZFS for this Proxmox virtualisation machine server and kworker process is reaching 100% which is why load average is so high.

Code:

top - 02:52:51 up 12:52,  3 users,  load average: 146.02, 145.70, 143.44
Tasks: 588 total,  25 running, 561 sleeping,   0 stopped,   2 zombie
%Cpu(s):  0.6 us, 25.2 sy,  0.0 ni, 74.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 128824.5 total,  57761.5 free,  69960.2 used,   1102.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  57614.8 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                       
29989 root      20   0       0      0      0 R 100.0   0.0 260:00.80 kworker/14:1+events           
32646 root      20   0 4730776 510284   6808 R 106.7   0.4 273:49.19 kvm                           
18760 sshd      20   0   15848   2532   1680 R 100.0   0.0 274:01.23 sshd                         
18961 root      20   0  275536  83004  14640 R 100.0   0.1 273:51.58 pvesr                         
22370 root      20   0       0      0      0 R 100.0   0.0 198:28.71 kworker/3:0+rcu_par_gp       
    1 root      20   0  171436   8736   5368 D   0.0   0.0   0:04.81 systemd                       
    2 root      20   0       0      0      0 S   0.0   0.0   0:53.03 kthreadd                     
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp                       
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp                   
    6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-kblockd         
    9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq                 
   10 root      20   0       0      0      0 S   0.0   0.0   0:00.11 ksoftirqd/0                   
   11 root      20   0       0      0      0 R   0.0   0.0  28:50.22 rcu_sched                     
   12 root      rt   0       0      0      0 S   0.0   0.0   0:00.04 migration/0                   
   13 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/0                 
   14 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0                       
   15 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/1                       
   16 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/1                 
   17 root      rt   0       0      0      0 S   0.0   0.0   0:00.16 migration/1

Does anyone have any suggestions?

Thanks kindly

djsami · May 1, 2020

NessageHostsINC said:

Dear All,

There is an update on the situation. I am using ZFS for this Proxmox virtualisation machine server and kworker process is reaching 100% which is why load average is so high.

Code:

top - 02:52:51 up 12:52,  3 users,  load average: 146.02, 145.70, 143.44
Tasks: 588 total,  25 running, 561 sleeping,   0 stopped,   2 zombie
%Cpu(s):  0.6 us, 25.2 sy,  0.0 ni, 74.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 128824.5 total,  57761.5 free,  69960.2 used,   1102.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  57614.8 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                     
29989 root      20   0       0      0      0 R 100.0   0.0 260:00.80 kworker/14:1+events         
32646 root      20   0 4730776 510284   6808 R 106.7   0.4 273:49.19 kvm                         
18760 sshd      20   0   15848   2532   1680 R 100.0   0.0 274:01.23 sshd                       
18961 root      20   0  275536  83004  14640 R 100.0   0.1 273:51.58 pvesr                       
22370 root      20   0       0      0      0 R 100.0   0.0 198:28.71 kworker/3:0+rcu_par_gp     
    1 root      20   0  171436   8736   5368 D   0.0   0.0   0:04.81 systemd                     
    2 root      20   0       0      0      0 S   0.0   0.0   0:53.03 kthreadd                   
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp                     
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp                 
    6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-kblockd       
    9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq               
   10 root      20   0       0      0      0 S   0.0   0.0   0:00.11 ksoftirqd/0                 
   11 root      20   0       0      0      0 R   0.0   0.0  28:50.22 rcu_sched                   
   12 root      rt   0       0      0      0 S   0.0   0.0   0:00.04 migration/0                 
   13 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/0               
   14 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0                     
   15 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/1                     
   16 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/1               
   17 root      rt   0       0      0      0 S   0.0   0.0   0:00.16 migration/1

Does anyone have any suggestions?

Thanks kindly

Hello

ZFS Promlem cpu

xfs use the

30 vmm working

NessageHostsINC · May 1, 2020

djsami said:
Hello

ZFS Promlem cpu

xfs use the

View attachment 16803

30 vmm working

View attachment 16804

Dear,

Thank you for letting me know that ZFS is incompatible here.

So I must change from ZFS to XFS? In the cluster, the other nodes are running on ZFS. Is it possible to migrate it to XFS when migrating to this server?

EDIT: I cannot find XFS in Proxmox storage datacenter button.

Kindly

djsami · May 1, 2020

NessageHostsINC said:
Dear,

Thank you for letting me know that ZFS is incompatible here.

So I must change from ZFS to XFS? In the cluster, the other nodes are running on ZFS. Is it possible to migrate it to XFS when migrating to this server?

EDIT: I cannot find XFS in Proxmox storage datacenter button.

Kindly

vmm Image backup and reinstallation

EDIT: I cannot find XFS in Proxmox storage datacenter button.
Proxmox XFS install format

xfs storage 1
xfs storage 2

as vmware

how many vm are on the machine?
contact me if you want, I will help you

XFS install

NessageHostsINC · May 2, 2020

Dear,

Thank you for the time for providing the detail. Only 1 VM is running on the server currently but I want to migrate 8 to this server.

Do you think the ZFS issue crashing kworker is due to kernel?

Kindly

morph027 · May 2, 2020

I don't think so, my servers are all running on ZFS with multiple VM's on this kernel.

Is your ZFS running on hard disks or SSDs?

NessageHostsINC · May 2, 2020

Dear,

We have 2 Samsung Enterprise SSDs with ZFS Raid 1 from PVE installer. https://www.samsung.com/semiconductor/ssd/enterprise-ssd/MZ7LM960HMJP/

Kindly

tom · May 2, 2020

NessageHostsINC said:
Do you think the ZFS issue crashing kworker is due to kernel?

Difficult to say. You run desktop hardware, AMD mainboards got recently issues with incompatible ram, Bios bugs, etc.

Could be a ZFS issue and kernel (but I doubt that it is the kernel)

NessageHostsINC · May 2, 2020

Dear,

Are you aware whether ASROCKRACK version is designed for server or same as desktop motherboards? https://www.asrockrack.com/general/productdetail.asp?Model=1U4LW-X470#Specifications

Thank you very much for the assistance everyone also!

I am running the latest BIOS but the issue is persistent. What I don't understand is why after 6-12 hours of running perfectly fine that load average is becoming 150 and kworker is using 100% of CPU, and syslog with those kernel errors which I do not understand how to read.

Kindly

onepamopa · May 2, 2020

tom said:
ay. You run desktop hardware, AMD mainboards got recently issues with inco

I'm also running a desktop hardware (TR 3960x, ASUS TRX40-PRO w.latest bios), will test later if 5.4 kernel has any issues on my hw.
Actually something weird happened today after I updated the bios to last version - the nvme drive disappeared for some reason, few reboots later it appeared & works ok (so far).

Dark26 · May 3, 2020

So far so good.

Run fine on my 3 servers celeron N3450 on cluster with ceph

Run fine on my 2 workstation; AMD FX6300, and xeon E1246 v3

Juste install on my 2 HP N54L, use for storage only, we will see if they boot the next time; but i sure they will .

Yes, proxmox everywhere.

NessageHostsINC · May 5, 2020

Dear all,

I found out kworker is a kernel processing thread which backs up my idea that the issue present is with AMD Ryzen using EEC memory with ZFS on Proxmox kernel.

This is syslog with kernel errors, but I cannot understand how to fix or program kernels. https://pastebin.com/raw/CBjWLDi6

This is very confusing time for me. I am aware that on another AMD Ryzen 3700X system BUT with 64GB NON EEC RAM it is functioning all good.

So I think the EEC is the issue with support in the kernel.

Thanks

onepamopa said:
I'm also running a desktop hardware (TR 3960x, ASUS TRX40-PRO w.latest bios), will test later if 5.4 kernel has any issues on my hw.
Actually something weird happened today after I updated the bios to last version - the nvme drive disappeared for some reason, few reboots later it appeared & works ok (so far).

I am excited about your findings.

Dark26 said:
So far so good.

Run fine on my 3 servers celeron N3450 on cluster with ceph

Run fine on my 2 workstation; AMD FX6300, and xeon E1246 v3

Juste install on my 2 HP N54L, use for storage only, we will see if they boot the next time; but i sure they will .

Yes, proxmox everywhere.

Are using ZFS?

Dark26 · May 5, 2020

I am not using zfs, not enough memory for that on my N3450 ( only 8 giga de ram each )

CRCinAU · May 6, 2020

I noticed that today with kernel 5.4 - everything hung overnight at some point... Sadly, nothing got logged to /var/log/messages or /var/log/kern.log.

I don't believe its the hardware platform, as this machine has run Fedora + Xen for months of uptime... I only recently moved everything to proxmox.

CPU model name : AMD Ryzen 7 1700X Eight-Core Processor

No ECC RAM, just plain DDR4 2400Mhz RAM - 32Gb worth...

I don't run any ZFS, just plain mdadm raid and LVM...

To add another data point, I've rebooted back into kernel 5.3.18-3-pve - and we'll see if it dies from there...

onepamopa · May 6, 2020

Well, I didn't reach the kernel testing stage... After today's PVE update one of my VM's stopped working (stuck on uefi shell). I"ll open a new topic about this and link it here -- https://forum.proxmox.com/threads/problem-vm-stuck-on-uefi-shell-after-todays-pve-update.69393/

onepamopa · May 7, 2020

So, after experiencing complete machine freeze several times on the 5.3 kernels, I've installed the 5.4 kernel hoping this gets resolved..

Hardware is:
ASUS TRX40-PRO (latest bios)
Threadripper 3960x
8x8G ddr4 3200MHz (no ECC)
320G hdd (pve installed on it) -- NO zfs
1T XPG GAMMIX S5 nvme (VM disk storage) -- NO zfs
1x nVidia GTX 1070 (allocated to Windows VM)
1x nVidia GTX 1050ti (allocated to FreeBSD VM)

Until now, when the system freezes up - I don't see anything in the logs at all.

Let's see if the freezes persist on the 5.4 kernel.

CRCinAU · May 7, 2020

I rebooted back into 5.3.18-3 after the updates of today to test to see if I get the cpu soft lock issue I saw with 5.4.17.

At this point, uptime is: 14h 32m 5s...

This is more than I had before 5.4.17 died. I haven't tried 5.4.30(?) that was pushed with the update yet - as its going to take time to gather more data...

CPU Lockup on 5.4 Kernel

Member

Renowned Member

Member

Member

Renowned Member

Member

Renowned Member

Member

Renowned Member

Member

Proxmox Staff Member

Member

Well-Known Member

Renowned Member

Member

Renowned Member

Renowned Member

Well-Known Member

Well-Known Member

Renowned Member

We value your privacy