spurious kernel messages since upgrade to 7.2

Bruno Félix · May 24, 2022

Hi,

I recently upgraded our cluster to 7.2.4 (coming from the last 7.1 release).
Afterwards, /var/log/messages was filling up with kernel crash reports:

Code:

May 19 17:00:05 pve01 kernel: [16498.218023]  <TASK>
May 19 17:00:05 pve01 kernel: [16498.266312]  kthread+0x127/0x150
May 19 17:00:05 pve01 kernel: [16498.271642]  ? set_kthread_struct+0x50/0x50
May 19 17:00:05 pve01 kernel: [16498.278014]  ? throttle_active_work+0xe2/0x1f0
May 19 17:00:05 pve01 kernel: [16498.304283]  kthread+0x127/0x150
May 19 17:00:06 pve01 kernel: [16499.239798]  process_one_work+0x228/0x3d0
May 19 17:00:06 pve01 kernel: [16499.337357]  <TASK>
May 19 17:00:06 pve01 kernel: [16499.341011]  ? throttle_active_work+0xe2/0x1f0
May 19 17:00:06 pve01 kernel: [16499.350852]  ? process_one_work+0x3d0/0x3d0
May 19 17:00:07 pve01 kernel: [16500.325987] Call Trace:
May 19 17:00:07 pve01 kernel: [16500.329407]  kthread+0x127/0x150
May 19 17:00:07 pve01 kernel: [16500.330798]  ? process_one_work+0x3d0/0x3d0
May 19 17:00:08 pve01 kernel: [16501.350924]  <TASK>
May 19 17:00:08 pve01 kernel: [16501.354426]  ? set_kthread_struct+0x50/0x50
May 19 17:00:08 pve01 kernel: [16501.354885]  <TASK>
May 19 17:00:09 pve01 kernel: [16502.376740]  worker_thread+0x53/0x410
May 19 17:00:10 pve01 kernel: [16503.337078]  ? set_kthread_struct+0x50/0x50
May 19 17:00:10 pve01 kernel: [16503.462301]  <TASK>
May 19 17:00:11 pve01 kernel: [16504.359709]  worker_thread+0x53/0x410
May 19 17:00:12 pve01 kernel: [16505.516017]  ? process_one_work+0x3d0/0x3d0
May 19 17:00:13 pve01 kernel: [16506.471940]  <TASK>
May 19 17:00:13 pve01 kernel: [16506.477785]  ? process_one_work+0x3d0/0x3d0
May 19 17:00:13 pve01 kernel: [16506.480187]  ? set_kthread_struct+0x50/0x50
May 19 17:00:14 pve01 kernel: [16507.436723]  ? set_kthread_struct+0x50/0x50
May 19 17:00:14 pve01 kernel: [16507.437515]  </TASK>
May 19 17:00:15 pve01 kernel: [16508.588336]  ? set_kthread_struct+0x50/0x50
May 19 17:00:16 pve01 kernel: [16509.484090]  kthread+0x127/0x150
May 19 17:00:16 pve01 kernel: [16509.543859]  worker_thread+0x53/0x410
May 19 17:00:16 pve01 kernel: [16509.613053]  </TASK>
May 19 17:00:17 pve01 kernel: [16510.629877] Call Trace:
May 19 17:00:18 pve01 kernel: [16511.591705]  worker_thread+0x53/0x410
May 19 17:00:18 pve01 kernel: [16511.660203]  worker_thread+0x53/0x410
May 19 17:00:19 pve01 kernel: [16512.615358]  ? throttle_active_work+0xe2/0x1f0
May 19 17:00:19 pve01 kernel: [16512.678853]  ? throttle_active_work+0xe2/0x1f0
May 19 17:00:20 pve01 kernel: [16513.637848] Call Trace:
May 19 17:00:20 pve01 kernel: [16513.640556]  worker_thread+0x53/0x410
May 19 17:00:21 pve01 kernel: [16514.600594]  worker_thread+0x53/0x410
May 19 17:00:21 pve01 kernel: [16514.668270]  worker_thread+0x53/0x410
May 19 17:00:22 pve01 kernel: [16515.688179]  worker_thread+0x53/0x410

We use a Ceph pool, ZFS replication, and cgroup v1.
The bug was present only on the compute hosts.

Using kernel 5.13.19-6-pve instead of 5.15.35-1-pve solved the issue.

Bruno Félix · Jul 6, 2022

The issue is still present with kernel 5.15.35-2-pve.

kyriazis · Aug 3, 2022

I have this happening too, and on more than one host. Any idea when it's going to get fixed?

Thanks!

Bruno Félix · Aug 9, 2022

I had hopes for 5.15.39-1-pve...
It took a while longer, but then it happened again: the messages log now grows by 250 MB per day.
Going back to 5.13.19-6-pve again.

Any ideas on how to identify the root cause?
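
To put a number on the log growth and see on which day the traces start, a minimal sketch along these lines may help. It assumes classic syslog timestamps as in the excerpts in this thread, and `count_trace_lines_per_day` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Count log lines mentioning a symbol, grouped by syslog date ("May 19").
# Assumption: each line starts with "<month> <day> <time> ...", as in
# the /var/log/messages excerpts quoted in this thread.
# count_trace_lines_per_day is a hypothetical helper name.
count_trace_lines_per_day() {
    logfile=$1
    pattern=${2:-throttle_active_work}   # default: symbol seen in the traces
    grep -F "$pattern" "$logfile" \
        | awk '{ counts[$1 " " $2]++ } END { for (d in counts) print d, counts[d] }' \
        | sort
}
```

For example, `count_trace_lines_per_day /var/log/messages` prints one line per day with the number of matching trace lines. Comparing affected and unaffected hosts this way can help narrow down what they have in common.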

kyriazis · Oct 3, 2022

Any updates on this? Still happening with 5.15.60-1-pve

Thank you!

kyriazis · Oct 5, 2022

@Bruno Félix, if you are still seeing this, can you please verify whether the nodes that have this problem also have an Intel Omni-Path HFI card installed? If not that, maybe some other fabric card?

We see this only on machines that have HFI cards.

zima · Oct 6, 2022

Hi,
exactly the same problem here. The servers are otherwise identical: those with BCM57412 NetXtreme-E 10Gb cards have the problem, but those with Intel cards seem OK.
In the logs on all nodes, the problem starts with:

Code:

Oct  5 14:11:45 kernel: [135939.389881] unchecked MSR access error: WRMSR to 0x19c (tried to write 0x0000000000002a80) at rIP: 0xffffffff99495074 (native_write_msr+0x4/0x30)
Oct  5 14:11:45 kernel: [135939.389895] Call Trace:
Oct  5 14:11:45 kernel: [135939.389896]  <TASK>
Oct  5 14:11:45 kernel: [135939.389897]  ? throttle_active_work+0xe2/0x1f0
Oct  5 14:11:45 kernel: [135939.389905]  process_one_work+0x228/0x3d0
Oct  5 14:11:45 kernel: [135939.389909]  worker_thread+0x53/0x420
Oct  5 14:11:45 kernel: [135939.389911]  ? process_one_work+0x3d0/0x3d0
Oct  5 14:11:45 kernel: [135939.389913]  kthread+0x127/0x150
Oct  5 14:11:45 kernel: [135939.389917]  ? set_kthread_struct+0x50/0x50
Oct  5 14:11:45 kernel: [135939.389920]  ret_from_fork+0x1f/0x30
Oct  5 14:11:45 kernel: [135939.389925]  </TASK>
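
To check whether other hosts start with the same MSR write, a small sketch like the following pulls the register and value out of the first such line in a log. It assumes the message format shown above and GNU grep/sed; `first_msr_error` is a hypothetical helper name.

```shell
#!/bin/sh
# Print "<msr> <value>" from the first "unchecked MSR access error: WRMSR"
# line in a log file. Assumes the kernel message format quoted above.
# first_msr_error is a hypothetical helper name.
first_msr_error() {
    grep -m1 'unchecked MSR access error: WRMSR' "$1" \
        | sed -E 's/.*WRMSR to (0x[0-9a-fA-F]+) \(tried to write (0x[0-9a-fA-F]+)\).*/\1 \2/'
}
```

Run as `first_msr_error /var/log/messages`. On the log above it would report register 0x19c and the value 0x0000000000002a80.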

kyriazis · Oct 6, 2022

Did some digging and found out that these are spurious logs for thermal prochot throttling.

Please take a look at the following link:
https://www.spinics.net/lists/kernel/msg4380894.html

I can verify that the following command:
# wrmsr -a 0x19c 0x0a80

indeed silences the spurious messages. Hopefully it doesn't affect system prochot behavior, and only affects logging.

This is on Cascade Lake 8260 processors.

This is a temporary fix. Presumably there will be an updated kernel/microcode "soon" that fixes this.
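
For anyone wondering how the value the kernel tried to write (0x2a80, from the log earlier in the thread) differs from the one that works (0x0a80), here is a small sketch that lists the differing bit positions. `msr_diff_bits` is a hypothetical helper name. As far as I can tell from the Intel SDM, bit 13 of IA32_THERM_STATUS (0x19c) is the "current limit log" bit, but treat that reading as an assumption.

```shell
#!/bin/sh
# Print the bit positions (0 = least significant) in which two values differ.
# Arguments may be hex (0x...) or decimal.
# msr_diff_bits is a hypothetical helper name.
msr_diff_bits() {
    diff=$(( $1 ^ $2 ))       # XOR leaves only the differing bits set
    bit=0
    while [ "$diff" -ne 0 ]; do
        if [ $(( diff & 1 )) -eq 1 ]; then
            echo "$bit"
        fi
        diff=$(( diff >> 1 ))
        bit=$(( bit + 1 ))
    done
}
```

`msr_diff_bits 0x2a80 0x0a80` prints 13, so the two values differ only in that one bit. Before writing anything, the current register contents can be inspected with `rdmsr -a 0x19c` from msr-tools, which needs the msr kernel module loaded.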

jt-socal · Apr 8, 2023

I solved this by installing the latest microcode with "apt install intel-microcode". That required enabling non-free in sources, and a dependency in turn required enabling contrib as well.

Previously from /proc/cpuinfo:
microcode : 0x2006a08

Now:
microcode : 0x2006d05
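
To check other nodes against the revision reported above, a minimal sketch could look like this. The helper names are hypothetical, and 0x2006d05 is simply the revision that worked in this post; whether it is the lowest revision that fixes the messages is not known from this thread.

```shell
#!/bin/sh
# Print the first "microcode" value from a cpuinfo-style file
# (defaults to /proc/cpuinfo). Hypothetical helper name.
microcode_revision() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Succeed if the revision found in the file ($2, optional) is at least $1.
# Hypothetical helper name.
microcode_at_least() {
    current=$(microcode_revision "$2")
    [ -n "$current" ] && [ $(( $current )) -ge $(( $1 )) ]
}
```

Usage on a node: `microcode_at_least 0x2006d05 && echo "microcode looks new enough"`. The new revision only shows up after the microcode has actually been loaded, which normally means after a reboot.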
