rcu_sched self-detected stall on CPU

signalcodec

New Member
Apr 13, 2020
I'm getting "rcu_sched self-detected stall on CPU" errors in every VM at startup, and if the VMs are under heavy load, the Proxmox host machine will also lock up and start printing the same message.

I'm not even sure how to figure out the root cause.

Numerous Google searches have turned up similar complaints, but from much older versions of the kernel.

help?
 
Hi,

What kernel do you use, and which CPU?
Is this a NUMA machine?
 
5.3.10-1-pve, 2x Intel Xeon E5-2670.
I'm not sure if it's NUMA, and I'm not sure how to check. It's a Dell R720 if that helps.

Thank you for replying, I appreciate it.
 
Yes, you have a NUMA (Non-Uniform Memory Access) machine.
All multi-socket systems are NUMA.
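
For anyone else unsure how to check, here is a minimal sketch that counts the NUMA node directories the Linux kernel exposes under sysfs. The function name `count_numa_nodes` and its optional directory argument are made up for illustration. A result greater than 1 suggests a NUMA machine; a result of 0 only means the kernel exposes no node directories at that path.

```shell
#!/bin/sh
# Count NUMA nodes by counting node<N> directories under sysfs.
# The optional argument (an alternative root directory) is a hypothetical
# convenience so the function can be tried against a test directory.
count_numa_nodes() {
    root="${1:-/sys/devices/system/node}"
    count=0
    for d in "$root"/node[0-9]*; do
        # If the glob matches nothing, $d is the literal pattern and -d fails.
        [ -d "$d" ] && count=$((count + 1))
    done
    echo "$count"
}

nodes=$(count_numa_nodes)
echo "NUMA nodes: $nodes"
```

`lscpu` reports the same information in its "NUMA node(s)" line, and `numactl --hardware` gives more detail if the numactl package is installed.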

I cannot find any bug in the log that matches your description.
Try the following:
Ensure that all power-saving features are disabled in the BIOS.
Update to kernel 5.4.
Code:
apt install pve-kernel-5.4
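
A newly installed kernel is only used after a reboot, so it is worth confirming which kernel is actually running. The sketch below compares the output of `uname -r` against a minimum version. The helper name `version_at_least` is made up for illustration, and it assumes GNU `sort` with the `-V` option, which Debian-based systems ship.

```shell
#!/bin/sh
# Succeed if version $1 is at least version $2.
# Assumes GNU sort, whose -V option orders version strings.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

running=$(uname -r)
if version_at_least "$running" "5.4"; then
    echo "Running kernel $running is 5.4 or newer"
else
    echo "Running kernel $running is older than 5.4 (reboot pending?)"
fi
```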
 
So I'm now on 5.4.27.1-pve. I checked, and I have no power savings enabled. The problem hasn't gone away :(

Is there anything I can do on my end to help find the cause of this?
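
One thing that may help when reporting this is the full stall message from the kernel log rather than a screenshot. A rough sketch, assuming the messages contain both "rcu" and "stall" as in the error quoted above; the function name `filter_rcu_stalls` is made up for illustration.

```shell
#!/bin/sh
# Keep only kernel log lines that mention an RCU stall (read from stdin).
# "|| true" keeps the exit status clean when nothing matches.
filter_rcu_stalls() {
    grep -i 'rcu.*stall' || true
}

# Prefer the journal's kernel messages; fall back to dmesg.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k --no-pager 2>/dev/null | filter_rcu_stalls
else
    dmesg 2>/dev/null | filter_rcu_stalls
fi
```

Running this inside an affected VM and on the host shows which side logs the stall and on which CPU.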
 
What do you use as the vCPU type?
 
Please try vCPU Type "host".
Or set it to the oldest model in your cluster if it is a cluster with different Host CPUs.
This is necessary on a mixed Cluster to allow live migration.
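
The vCPU type can also be changed from the host shell with `qm set`. The sketch below only prints the commands so they can be reviewed before running them. The VM IDs 100 and 101 are placeholders, and `cpu_host_cmds` is a made-up helper name.

```shell
#!/bin/sh
# Print (do not run) the qm command that sets the vCPU type to "host"
# for each VM ID given as an argument.
cpu_host_cmds() {
    for vmid in "$@"; do
        echo "qm set $vmid --cpu host"
    done
}

# 100 and 101 are placeholder VM IDs; substitute your own.
cpu_host_cmds 100 101
```

The change should take effect the next time the VM is fully stopped and started. `qm config <vmid>` shows the current `cpu:` setting.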
 
Sorry for the delay; it took me a few days to get a chance to change them.

So, I set them all to 'host' like you suggested, but I'm still getting the same issue:
[Attachment: 1588261705195.png]

I also want to clarify I only have 1 host, so I don't have a cluster setup. Just 1 host with 2 CPUs.
 
Same issue here with 6.1-8 on pve-kernel-5.3.18-3-pve and pve-kernel-5.4.27-1-pve on an i7-3720 CPU.
I tried vCPU types kvm64, qemu64 and host. All had the same lock-up issue :-(
I do have a mixed cluster, with AMD and Intel CPUs, but this also happens when spinning up a new VM on the i7-3720 host.
 
Update: I reinstalled the machine (this time not with root on ZFS) and the issue seems to have disappeared.
 
This just started happening to me this week on servers that have never had a problem. Today four of my VMs are console-locked with "rcu_sched self-detected stalls on...". Important note, however: they are all RUNNING just fine; I just no longer have console access. It roughly coincides with upgrading my 2-server cluster to the latest Proxmox, but that's a stretch.
So @wolfgang if you're looking for clues, this seems to be one.
 
This is an old thread, but I ran into this problem today too.

Booting into recovery mode from GRUB, I saw a controller problem. Switching the SCSI controller from the default (LSI 53C895A) to VirtIO SCSI solved the problem.

It was the last VM still running with the LSI 53C895A controller, and the problem appeared at the same time as a kernel upgrade.
 
Hi,
Glad you found a workaround! This is a regression in our QEMU 7.2 package with the LSI SCSI controller.
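
For others hitting this, a rough sketch for spotting VMs that still use an LSI controller. It assumes the VM configs live in /etc/pve/qemu-server/ and that a config with SCSI disks but no `scsihw:` line falls back to the LSI default. The helper name `uses_lsi` is made up for illustration, and the script only prints a candidate command rather than changing anything.

```shell
#!/bin/sh
# Succeed if the VM config file $1 has at least one SCSI disk and either
# no scsihw line (assumed to mean the LSI default) or an lsi* controller.
uses_lsi() {
    awk '
        /^\[/          { exit }        # stop at the first snapshot section
        /^scsi[0-9]+:/ { disk = 1 }    # a SCSI disk is attached
        /^scsihw:/     { hw = $2 }     # explicit controller type
        END { exit !(disk && (hw == "" || hw ~ /^lsi/)) }
    ' "$1"
}

for conf in /etc/pve/qemu-server/*.conf; do
    [ -f "$conf" ] || continue
    if uses_lsi "$conf"; then
        vmid=$(basename "$conf" .conf)
        echo "VM $vmid uses an LSI controller; candidate: qm set $vmid --scsihw virtio-scsi-pci"
    fi
done
```

Before switching, make sure the guest has VirtIO SCSI drivers available. Most current Linux guests include them, while Windows guests usually need them installed first.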
 
