rcu_sched self-detected stall on CPU

Hugues-ST

New Member
Aug 5, 2021
Hi,

I have a cluster of 5 servers, a mix of AMD and Intel, and so far they have been rock stable. Neither CPU nor memory usage is anywhere near the maximum. For storage I use ZFS, which has also worked well so far.

I just had several VM crashes, with the message in the title, on my 2 new servers. Unlike my old servers, these have a RAID card, but I don't really use it: I don't build the RAID on the card, the disks are handled from the OS with ZFS.

I went through the logs to find out where the problem could have come from, but I don't see anything in particular on the servers. The monitoring didn't show anything unusual when the VMs crashed either.

To be safe, I updated the servers.

Any help? :)
 
I have mitigated the problem with the help of this thread:
https://bugzilla.kernel.org/show_bug.cgi?id=199727#c18

"fully gone after setting VirtIO SCSI Single / iothread=1 / aio=threads on all our KVM guests"

With these settings, the VMs were less unstable.
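For anyone who wants to apply the settings quoted above from the command line, here is a rough sketch in Python that only builds the `qm set` command and prints it, without running anything. The VM ID `100`, the disk key `scsi0` and the volume name `local-zfs:vm-100-disk-0` are made-up examples, not values from my cluster. Any other options already set on the drive line would have to be repeated, and as far as I know the VM needs a full stop/start for the change to take effect.

```python
import shlex


def qm_set_iothread_cmd(vmid, disk_key, volume):
    """Build (but do not run) a `qm set` command for the quoted settings.

    vmid, disk_key and volume are hypothetical placeholders; take the real
    values from `qm config <vmid>` on your own node.
    """
    # iothread=1 and aio=threads are per-drive options on the disk line.
    drive = f"{volume},iothread=1,aio=threads"
    return [
        "qm", "set", str(vmid),
        "--scsihw", "virtio-scsi-single",  # "VirtIO SCSI Single" controller
        f"--{disk_key}", drive,
    ]


cmd = qm_set_iothread_cmd(100, "scsi0", "local-zfs:vm-100-disk-0")
print(shlex.join(cmd))
```

Printing the command first makes it easy to compare it with the existing VM config before changing anything.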

And with this thread:
https://forum.proxmox.com/threads/p...anic-on-linux-freeze-on-windows.109645/page-2

I rolled back to pve-kernel-5.15.30-2-pve; after that I was "stable" again.

If I do massive migrations (several VMs at the same time), the problem still occurs. So I have limited the migration bandwidth, and while waiting for an official fix I no longer do live migrations, only migrations of powered-off VMs.
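The workaround above (one VM at a time, bandwidth capped, no live migration) could be scripted roughly like this. It is only a sketch that builds the `qm migrate` commands and prints them. The node name `pve2`, the VM IDs and the 51200 KiB/s limit are invented for the example, and the VMs are assumed to be shut down already.

```python
import shlex


def qm_migrate_cmd(vmid, target_node, online=False, bwlimit_kib=None):
    """Build (but do not run) a `qm migrate` command.

    online=False means the VM is expected to be powered off.
    bwlimit_kib is the bandwidth cap in KiB/s (hypothetical value here).
    """
    cmd = ["qm", "migrate", str(vmid), target_node]
    if online:
        cmd.append("--online")
    if bwlimit_kib is not None:
        cmd += ["--bwlimit", str(bwlimit_kib)]
    return cmd


# Hypothetical VM IDs and target node: one offline migration after another,
# never several at the same time.
plan = [qm_migrate_cmd(vmid, "pve2", bwlimit_kib=51200) for vmid in (101, 102)]
for cmd in plan:
    print(shlex.join(cmd))
```

Running the printed commands sequentially, for example from a small wrapper that waits for each one to finish, avoids the parallel migrations that still trigger the stall for me.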