rcu_sched self-detected stall on CPU

Hugues-ST

New Member
Aug 5, 2021
Hi,

I have a cluster of 5 servers, a mix of AMD and Intel, and until now they have been rock stable. CPU and memory usage are nowhere near the maximum. For storage I use ZFS, which has worked well so far.

I just had several VM crashes, with the message in the title, on my 2 new servers. Unlike my old servers, the new ones have a RAID card, which I don't really use: I don't build the RAID on the card, I build it from the OS with ZFS.

I went through the logs to find out where the problem could have come from, but I don't see anything specific on the servers. The monitoring didn't show anything unusual when the VMs crashed either.

To be safe, I updated the servers.

Any help? :)
 
I have mitigated the problem with the help of this thread:
https://bugzilla.kernel.org/show_bug.cgi?id=199727#c18

"fully gone after setting VirtIO SCSI Single / iothread=1 / aio=threads on all our KVM guests"

With these settings, the VMs were less unstable.
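For anyone who wants to apply the same settings from the CLI, here is a minimal sketch of how the quoted options map onto `qm set` calls. The Python wrapper only builds and prints the commands, it does not run them. The VM ID `100`, the disk slot `scsi0` and the volume name `local-zfs:vm-100-disk-0` are placeholders I made up, not values from my cluster. Note that re-setting a disk this way replaces its option string, so any other options the disk already has (cache, discard, ...) would need to be repeated.

```python
import shlex


def build_qm_disk_commands(vmid, disk_slot, volume):
    """Return the `qm set` argument lists for VirtIO SCSI single + iothread + aio=threads.

    vmid, disk_slot and volume are hypothetical inputs; adapt them to your VM.
    """
    if not disk_slot.startswith("scsi"):
        # This sketch only handles disks attached to the SCSI controller.
        raise ValueError("expected a scsiN disk slot, got %r" % disk_slot)

    # 1) Switch the SCSI controller type to "VirtIO SCSI single".
    controller_cmd = ["qm", "set", str(vmid), "--scsihw", "virtio-scsi-single"]

    # 2) Re-declare the disk with iothread=1 and aio=threads.
    disk_cmd = [
        "qm", "set", str(vmid),
        "--" + disk_slot, volume + ",iothread=1,aio=threads",
    ]
    return [controller_cmd, disk_cmd]


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    for cmd in build_qm_disk_commands(100, "scsi0", "local-zfs:vm-100-disk-0"):
        print(shlex.join(cmd))
```

As far as I know the VM has to be fully stopped and started again (not just rebooted from inside the guest) for the controller change to take effect.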

And this thread:
https://forum.proxmox.com/threads/p...anic-on-linux-freeze-on-windows.109645/page-2

I rolled back to pve-kernel-5.15.30-2-pve; after that I was "stable" again.

If I do massive migrations (several VMs at the same time), the problem still occurs. So I have limited the migration bandwidth, and while waiting for an official fix I no longer do live migrations, only migrations of powered-off VMs.
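To illustrate the migration workaround, here is a small sketch that builds a `qm migrate` command with an optional bandwidth cap. It only assembles the argument list and does not execute anything. The VM ID, the target node name `pve2` and the limit of 51200 KiB/s are made-up example values, not the ones I actually use. Leaving out `--online` gives an offline migration, which is what I do for now.

```python
import shlex


def build_migrate_command(vmid, target_node, online=False, bwlimit_kib=None):
    """Build a `qm migrate` argument list.

    online=False means the VM is migrated while powered off.
    bwlimit_kib is the bandwidth limit in KiB/s (None = no explicit limit).
    All values passed in are hypothetical examples.
    """
    cmd = ["qm", "migrate", str(vmid), target_node]
    if online:
        cmd.append("--online")
    if bwlimit_kib is not None:
        if bwlimit_kib < 0:
            raise ValueError("bwlimit must be >= 0")
        cmd += ["--bwlimit", str(bwlimit_kib)]
    return cmd


if __name__ == "__main__":
    # Offline migration of a stopped VM, capped at 51200 KiB/s (example value).
    print(shlex.join(build_migrate_command(100, "pve2", bwlimit_kib=51200)))
```

Running the migrations one VM at a time, instead of several in parallel, is the other half of the workaround.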
 
