Proxmox 8.1 - kernel 6.5.11-4 - rcu_sched stall CPU

We had this problem on hosts with Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz.
kernel 6.5.11-5 fixed it.
Thanks for your (and others) feedback!
when will it hit the enterprise repository?
If nothing comes up, it should be moved to the enterprise repository later this week (but not today; that would be rushing it, even though it's a small, very targeted fix).
 
We've also just applied 6.5.11-5 to various Intel-based hypervisors (differing Xeon models) and it looks to have resolved the issue, thanks
 
when will it hit the enterprise repository?
FYI: Thanks to the feedback here, and the minimal regression potential compared to the previous version, we moved that kernel to the enterprise repository today.
 
If a VM is live-migrated from the affected node (6.5.11-4) to the repaired node (6.5.11-6), will the VM survive? Or must both nodes run 6.5.11-6 for a live migration to succeed?

Thanks.
 
Hi,
The issue is specific to the target of the migration, so migrating to a target with the fixed kernel is fine.
 
FYI, a preliminary fix for the issue was applied in git and will be included in the next kernel build. There is no package available yet, but there should be one soon if no issues pop up during internal testing.

EDIT: build is currently available on the pvetest repository. You can temporarily enable it (e.g. via the Repositories window in the UI, select your node, it's a sub-entry of Updates), run apt update, pull in the updated kernel with apt install proxmox-kernel-6.5, disable the repository and run apt update again.

EDIT2: To be specific, the first version with the fix is 6.5.11-5. The package has also been available on the no-subscription repository for a while, and it likely won't be long until it's available on the enterprise repository too.
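
For anyone who wants to script the version check across several nodes, here is a minimal Python sketch, not an official tool. It assumes release strings in the `uname -r` style quoted in this thread (e.g. 6.5.11-4-pve) and only does a plain ordering comparison against 6.5.11-5. Whether older series such as 6.2 are affected by the migration issue at all is not established here. The function names and node names are hypothetical.

```python
import re

# First kernel build reported in this thread to contain the fix.
FIRST_FIXED = (6, 5, 11, 5)


def parse_release(release: str) -> tuple:
    """Turn a release string such as '6.5.11-4-pve' into (6, 5, 11, 4)."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def is_at_least_fixed_build(release: str) -> bool:
    """True if the release sorts at or after 6.5.11-5 (ordering check only)."""
    return parse_release(release) >= FIRST_FIXED


# Hypothetical cluster: node name -> running kernel release.
nodes = {"pve1": "6.5.11-4-pve", "pve2": "6.5.11-6-pve"}

# Per the thread, the issue depends on the migration *target*,
# so only nodes on a fixed build are listed as targets.
safe_targets = [name for name, rel in nodes.items() if is_at_least_fixed_build(rel)]
print(safe_targets)  # ['pve2']
```

The release string for a node can be read with uname -r on that node.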


If I didn't miss anything, nobody with AMD CPUs reported the issue yet and the fix from upstream also talks about Intel CPUs. So you should be fine, but it also shouldn't be too long until a fixed kernel is available.
I am running an AMD Ryzen 7 2700 and I get RCU errors and crashes on 8.0. [Attachment: rcu-error.png]
 
Hi,
What kernel version? Note that this thread is about issues within the guest after migration. If this is on the host, please open a new thread and make sure that you have the latest BIOS update and the amd64-microcode package installed. In both cases, please provide more details: VM or host configuration, installed package versions (i.e. pveversion -v), workload, etc.
 
Hi,
we are running proxmox (pve-manager/8.1.3/b46aac3b42da5d15 (running kernel: 6.2.16-19-pve)) on an AMD Ryzen 9 5950x CPU with the ASRock Rack X470d4u Mainboard at the latest available BIOS version. Since last week we are experiencing the same RCU errors as @glorwinger .
We also tried different kernels such as 6.2.17-19-pve and 6.2.16.20-pve, but without luck. Now our server is not booting at all. It always hangs on boot with RCU errors. Could it help to reinstall Proxmox from scratch?

Best regards Florian
 
It should be fixed in the 6.5.x kernel. (It seems that you are still using an old 6.2.x kernel.)
 
Thanks for the heads up. Somehow I missed that. But then again: as my current Proxmox installation is not booting anymore, how do I get the newest kernel installed?
 
Hi,
We also tried different kernels such as 6.2.17-19-pve and 6.2.16.20-pve, but without luck. Now our server is not booting at all. It always hangs on boot with RCU errors. Could it help to reinstall Proxmox from scratch?
What exactly is the error you get? Please try the latest 6.5 kernel, which should've been pulled in automatically by the upgrade to Proxmox VE 8.1. If it wasn't, make sure you have the proxmox-default-kernel package installed.
Thanks for the heads up. Somehow I missed that. But then again: as my current Proxmox installation is not booting anymore, how do I get the newest kernel installed?
You can use a live-CD and chroot. You might also be able to get the full boot log for a failed attempt like that.

It should be fixed in the 6.5.x kernel. (It seems that you are still using an old 6.2.x kernel.)
That is the issue in guests upon live migration. It very much sounds like @glorwinger and @Florian Westphal are experiencing issues on the host.
 
Hi fiona,

Yes, our Proxmox server itself does not boot anymore. I managed to capture the output of the last failed boot attempts (see attached files).

And I have to correct myself again: it seems we already have kernel 6.5.11-7-pve installed. I tried that too, but the server did not boot; it hangs with the errors you can see in the attached pictures.

I am starting to wonder if this is not a software but a hardware issue. I will try to boot some other OS from a live medium and see if it manages to boot. If another OS cannot boot either, that should indicate a hardware problem, right?

[Attachments: vlcsnap-2023-12-15-09h22m35s401_cropped.jpg, vlcsnap-2023-12-15-09h21m36s199_cropped.jpg]
I am starting to wonder if this is not a software but a hardware issue. I will try to boot some other OS from a live medium and see if it manages to boot. If another OS cannot boot either, that should indicate a hardware problem, right?
If you do succeed in booting with a live CD, you could try to chroot into the system and install the latest microcode package with apt install amd64-microcode, for which you'll need to enable the non-free-firmware component of the Debian repositories: https://pve.proxmox.com/wiki/Package_Repositories#sysadmin_debian_firmware_repo
 
Thanks, but sadly neither a Proxmox installer nor an Arch Linux stick could boot on the server. While working on the machine I saw that the mainboard debug LED was showing some error codes. I will check that. But for the time being I think this is not a Proxmox problem but rather a CPU or mainboard issue. Now I just have to figure out which one is the culprit...

Thanks for your help.
 
Same problem here on an older Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz.
As a bonus, filesystems got corrupted, probably because I was migrating guests with 128 GB RAM.
 
I can add that I first swapped the mainboard but the problems persisted. Then I installed a different AM4 CPU and now the Server boots. So in my case it probably was a faulty CPU.
 
Hi,
Same problem here on an older Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz.
As a bonus, filesystems got corrupted, probably because I was migrating guests with 128 GB RAM.
Did you face the issue on the host or in the guest? What version are you running (pveversion -v)? What is the VM configuration (qm config <ID>)?
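
To gather those details in one go before posting, something like the following Python sketch may help. It only shells out to pveversion -v and qm config <ID> as named above, plus uname -r (an addition here, to show the running kernel). The function names and the VM ID 100 are made up for illustration, and this is not an official Proxmox tool.

```python
import subprocess


def run_or_note(argv, timeout=30):
    """Return a command's combined output, or a short note if it cannot be run."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except OSError:
        # Covers a missing binary, e.g. when run on a machine without Proxmox VE.
        return f"[{argv[0]} could not be run on this machine]"
    except subprocess.TimeoutExpired:
        return f"[{argv[0]} timed out after {timeout}s]"
    return (done.stdout + done.stderr).strip()


def build_report(vmid=None):
    """Collect package versions, running kernel and (optionally) one VM config."""
    commands = [["pveversion", "-v"], ["uname", "-r"]]
    if vmid is not None:
        commands.append(["qm", "config", str(vmid)])
    sections = []
    for argv in commands:
        sections.append("$ " + " ".join(argv) + "\n" + run_or_note(argv))
    return "\n\n".join(sections)


if __name__ == "__main__":
    # VM ID 100 is a placeholder; use the ID of the affected guest.
    print(build_report(vmid=100))
```

Run it as root on the affected node and paste the output into your post.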
 
