spurious kernel messages since upgrade to 7.2

Bruno Félix

Member
Jan 24, 2017
Hi,

I recently upgraded our cluster to 7.2.4 (coming from the latest 7.1), and since then
/var/log/messages has been filling up with kernel crash reports:

Code:
May 19 17:00:05 pve01 kernel: [16498.218023]  <TASK>
May 19 17:00:05 pve01 kernel: [16498.266312]  kthread+0x127/0x150
May 19 17:00:05 pve01 kernel: [16498.271642]  ? set_kthread_struct+0x50/0x50
May 19 17:00:05 pve01 kernel: [16498.278014]  ? throttle_active_work+0xe2/0x1f0
May 19 17:00:05 pve01 kernel: [16498.304283]  kthread+0x127/0x150
May 19 17:00:06 pve01 kernel: [16499.239798]  process_one_work+0x228/0x3d0
May 19 17:00:06 pve01 kernel: [16499.337357]  <TASK>
May 19 17:00:06 pve01 kernel: [16499.341011]  ? throttle_active_work+0xe2/0x1f0
May 19 17:00:06 pve01 kernel: [16499.350852]  ? process_one_work+0x3d0/0x3d0
May 19 17:00:07 pve01 kernel: [16500.325987] Call Trace:
May 19 17:00:07 pve01 kernel: [16500.329407]  kthread+0x127/0x150
May 19 17:00:07 pve01 kernel: [16500.330798]  ? process_one_work+0x3d0/0x3d0
May 19 17:00:08 pve01 kernel: [16501.350924]  <TASK>
May 19 17:00:08 pve01 kernel: [16501.354426]  ? set_kthread_struct+0x50/0x50
May 19 17:00:08 pve01 kernel: [16501.354885]  <TASK>
May 19 17:00:09 pve01 kernel: [16502.376740]  worker_thread+0x53/0x410
May 19 17:00:10 pve01 kernel: [16503.337078]  ? set_kthread_struct+0x50/0x50
May 19 17:00:10 pve01 kernel: [16503.462301]  <TASK>
May 19 17:00:11 pve01 kernel: [16504.359709]  worker_thread+0x53/0x410
May 19 17:00:12 pve01 kernel: [16505.516017]  ? process_one_work+0x3d0/0x3d0
May 19 17:00:13 pve01 kernel: [16506.471940]  <TASK>
May 19 17:00:13 pve01 kernel: [16506.477785]  ? process_one_work+0x3d0/0x3d0
May 19 17:00:13 pve01 kernel: [16506.480187]  ? set_kthread_struct+0x50/0x50
May 19 17:00:14 pve01 kernel: [16507.436723]  ? set_kthread_struct+0x50/0x50
May 19 17:00:14 pve01 kernel: [16507.437515]  </TASK>
May 19 17:00:15 pve01 kernel: [16508.588336]  ? set_kthread_struct+0x50/0x50
May 19 17:00:16 pve01 kernel: [16509.484090]  kthread+0x127/0x150
May 19 17:00:16 pve01 kernel: [16509.543859]  worker_thread+0x53/0x410
May 19 17:00:16 pve01 kernel: [16509.613053]  </TASK>
May 19 17:00:17 pve01 kernel: [16510.629877] Call Trace:
May 19 17:00:18 pve01 kernel: [16511.591705]  worker_thread+0x53/0x410
May 19 17:00:18 pve01 kernel: [16511.660203]  worker_thread+0x53/0x410
May 19 17:00:19 pve01 kernel: [16512.615358]  ? throttle_active_work+0xe2/0x1f0
May 19 17:00:19 pve01 kernel: [16512.678853]  ? throttle_active_work+0xe2/0x1f0
May 19 17:00:20 pve01 kernel: [16513.637848] Call Trace:
May 19 17:00:20 pve01 kernel: [16513.640556]  worker_thread+0x53/0x410
May 19 17:00:21 pve01 kernel: [16514.600594]  worker_thread+0x53/0x410
May 19 17:00:21 pve01 kernel: [16514.668270]  worker_thread+0x53/0x410
May 19 17:00:22 pve01 kernel: [16515.688179]  worker_thread+0x53/0x410

We use a Ceph pool, ZFS replication, and cgroup v1.
The bug was present only on the compute hosts.

Using kernel 5.13.19-6-pve instead of 5.15.35-1-pve solved the issue.
 
I had hopes for 5.15.39-1-pve...
It took a while longer, but then it happened again: the messages log now grows by 250 MB per day.
Going back to 5.13.19-6-pve again.

Any ideas on how to identify the root cause?
 
@Bruno Félix, if you are still seeing this, can you please check whether the nodes that have this problem also have an Intel Omni-Path HFI card installed? If not, maybe some other fabric card?

We see this only on machines that have HFI cards.
 
Hi,
Exactly the same problem here. The servers are identical; those with a BCM57412 NetXtreme-E 10Gb card have the problem, but those with Intel cards seem OK.
In the logs on all affected nodes, the problem starts with:

Code:
Oct  5 14:11:45 kernel: [135939.389881] unchecked MSR access error: WRMSR to 0x19c (tried to write 0x0000000000002a80) at rIP: 0xffffffff99495074 (native_write_msr+0x4/0x30)
Oct  5 14:11:45 kernel: [135939.389895] Call Trace:
Oct  5 14:11:45 kernel: [135939.389896]  <TASK>
Oct  5 14:11:45 kernel: [135939.389897]  ? throttle_active_work+0xe2/0x1f0
Oct  5 14:11:45 kernel: [135939.389905]  process_one_work+0x228/0x3d0
Oct  5 14:11:45 kernel: [135939.389909]  worker_thread+0x53/0x420
Oct  5 14:11:45 kernel: [135939.389911]  ? process_one_work+0x3d0/0x3d0
Oct  5 14:11:45 kernel: [135939.389913]  kthread+0x127/0x150
Oct  5 14:11:45 kernel: [135939.389917]  ? set_kthread_struct+0x50/0x50
Oct  5 14:11:45 kernel: [135939.389920]  ret_from_fork+0x1f/0x30
Oct  5 14:11:45 kernel: [135939.389925]  </TASK>
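
In case it helps others compare nodes, here is a rough sketch for pulling the MSR address and the attempted value out of these messages. The function name extract_msr_errors is made up for illustration, and the pattern assumes the exact message layout shown in the excerpt above; other kernel versions may word it differently.

```shell
#!/bin/sh
# Print "<msr> <value>" for every "unchecked MSR access error" line on stdin.
# Assumption: the message looks like
#   ... unchecked MSR access error: WRMSR to 0x19c (tried to write 0x...) at rIP: ...
extract_msr_errors() {
    sed -n 's/^.*unchecked MSR access error: WRMSR to \(0x[0-9a-fA-F]*\) (tried to write \(0x[0-9a-fA-F]*\)).*$/\1 \2/p'
}

# Example call (not executed here):
#   extract_msr_errors < /var/log/messages | sort | uniq -c
```

Run against the log, this should give one line per reported write, so identical register/value pairs across nodes are easy to spot.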
 
Did some digging and found out that these are spurious logs for thermal prochot throttling.

Please take a look at the following link:
https://www.spinics.net/lists/kernel/msg4380894.html

I can verify that the following command:
# wrmsr -a 0x19c 0x0a80

indeed silences the spurious messages. Hopefully it doesn't affect system prochot behavior, and only affects logging. :-)
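
For anyone wondering how the value in the workaround relates to the one in the error message (0x2a80), a small sketch that lists the bit positions in which two values differ. The helper name diff_bits is hypothetical, and this only does arithmetic; it does not touch any MSR. What the differing bit means on a given CPU would have to be looked up in Intel's documentation, which I have not done here.

```shell
#!/bin/sh
# List the bit positions (0 = least significant) that differ between two values.
# Accepts anything POSIX shell arithmetic accepts, including 0x-prefixed hex.
diff_bits() {
    x=$(( $1 ^ $2 ))   # XOR leaves a 1 wherever the two values differ
    bit=0
    out=""
    while [ "$x" -ne 0 ]; do
        if [ $(( x & 1 )) -eq 1 ]; then
            out="${out:+$out }$bit"
        fi
        x=$(( x >> 1 ))
        bit=$(( bit + 1 ))
    done
    echo "$out"
}

# Example call (not executed here):
#   diff_bits 0x2a80 0x0a80
```

Comparing against 0 lists every bit that is set in a value, which can help when reading the register layout side by side with the log.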

This is on Cascade Lake 8260 processors.

This is a temporary fix. Presumably there will be an updated kernel/microcode "soon" that fixes this.
 
I solved this by installing the latest microcode with "apt install intel-microcode". That required enabling non-free in the APT sources, and one of its dependencies required enabling contrib as well.

Previously from /proc/cpuinfo:
microcode : 0x2006a08

Now:
microcode : 0x2006d05
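
To check which revision a node is actually running, something along these lines should work. It is only a sketch: microcode_revisions is a name I made up, and it assumes the usual "microcode : 0x..." lines in /proc/cpuinfo as quoted above. I would expect the new revision to show up only after a reboot, but that is an assumption on my part.

```shell
#!/bin/sh
# Print the distinct microcode revisions found in cpuinfo-formatted text on stdin.
# More than one line of output would mean the cores report different revisions.
microcode_revisions() {
    awk -F': *' '/^microcode[ \t]*:/ { print $2 }' | sort -u
}

# Example call (not executed here):
#   microcode_revisions < /proc/cpuinfo
```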
 
