Applying pve-qemu-kvm 10.2.1-1 may cause extremely high “I/O Delay” and extremely high “I/O pressure stalls”. (Patches are in the test repository.)

I am writing this to express my deep frustration regarding the pve-qemu-kvm 10.2.1-1 update. I manage a massive infrastructure of over 1,200 nodes, and this untested update has caused significant distress across my entire operation.

I have spent the last two nights without sleep, monitoring spikes in IO Delay, CPU, and RAM usage that appeared immediately after this update. It is honestly disappointing to see a core package released to the Trixie repository with such a glaring regression that impacts real-world performance, not just "graphs."
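For anyone who wants to confirm the stalls on their own nodes rather than trust the graphs: the "I/O pressure stalls" figures come from the Linux PSI interface, which you can read directly. A minimal sketch (assumes a kernel built with CONFIG_PSI, which recent PVE kernels are):

```shell
# Read the kernel's I/O pressure-stall (PSI) counters directly.
# "some" = share of time at least one task was stalled on I/O;
# "full" = share of time no task could make progress because of I/O.
if [ -r /proc/pressure/io ]; then
    cat /proc/pressure/io
else
    echo "PSI not available on this kernel"
fi
```

Watching the avg10/avg60 columns before and after the package upgrade makes the regression visible without the GUI.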

After extensive testing and stress, I confirmed that downgrading to 10.1.2-7 resolved the issue on my clusters. In an enterprise-grade environment of this scale, we rely on the stability of these updates. Having to manually intervene across such a large fleet due to an avoidable bug is unacceptable.
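For reference, the rollback that worked for me is a one-liner per node. This is only a sketch (the version string is the one quoted above; the hold is optional, but it stops apt from pulling the broken build back in on the next upgrade):

```shell
# Pin each node back to the known-good build until a fix ships.
apt install pve-qemu-kvm=10.1.2-7
apt-mark hold pve-qemu-kvm   # remember to "apt-mark unhold pve-qemu-kvm" once a fix lands
```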

I hope this serves as a wake-up call for more rigorous QA before pushing updates that handle core hypervisor functions. I am still recovering from the stress and lack of sleep this has caused.

Looking forward to a stable, properly tested fix soon.

Sorry, I'm out. All of the above is on you. Who told you to use the "test" repository in "an enterprise-grade environment of this scale"?

It was not published in either the pve-no-subscription or pve-enterprise repositories.

I don't dispute that you experienced an issue - however, remember that one workaround was clearly documented by @uzumo (thanks to that user!) and further diagnostic suggestions were made by @fiona.
I mean, the results for MY environment were supplied - subsequent to which @fiona confirmed replicating an issue. It's being looked at.

You might - just might - have an argument had this reached a production repo, but it's a test repo. To be clear, I had my own issues with a bug report recently, but you cannot blame a TEST repo for your production issues.

Happy Proxmox in future.
 

I cannot imagine a manager who said "let's use the test repo" and then blames everyone else for that decision.

Let’s be clear: I have been managing this 1,200+ node infrastructure for over 2 years using this exact same repository structure without a single critical regression—until now.

While you point out it's a "test" repo, in a professional development cycle, a package should only reach a public test repository after passing initial Alpha, Beta, and RC (Release Candidate) stages. As evidenced by https://download.qemu.org/, QEMU itself follows a strict RC process (RC0, RC1, etc.) precisely to avoid these kinds of fundamental reporting bugs before wider testing.

When a core component like pve-qemu-kvm is pushed to a repository accessible to thousands of nodes—even a test one—there is an expectation that basic IO reporting hasn't been broken. This wasn't a minor edge-case bug; it was a global regression affecting CPU, RAM, and IO Delay metrics across an entire data center rack.

I am thankful for the workarounds provided by @uzumo and @fiona, but my frustration stems from the fact that a 2-year streak of stability was broken by a package that clearly lacked basic stress testing before its public "test" debut.

I've stabilized my environment by downgrading. Now, let’s focus on the fix so others don't have to spend two sleepless nights fixing "test" code that behaved like an early Alpha.

Happy Proxmox to you too.
 
I repeat: don't use the test repo for production.
The developers state it is only for testing purposes and not for production use, as evidenced by my screenshot from their own wiki.
Your ignoring this is purely on you and no one else.
You could have run the test repo in a lab (as intended), found the regression, and reported it without impacting your production environment.

Let this be a lesson for the future: switch your setup to the enterprise repo so you don't run into such things again.
 
Those expectations arise precisely because you’ve actually used it and experienced its shortcomings firsthand.

I understand your concerns, but ultimately, the use of the test repository is at your own risk.
 
It's like using Debian sid and then complaining about issues with it.
According to his logic, Debian sid should be QC-tested simply because the repository is available to thousands.
That's not how it works.

test = bleeding-edge packages with a decent chance of defects, but the newest releases
no-subscription = somewhat-tested packages with a lesser chance of defects (but slightly older releases than test)
enterprise = thoroughly tested packages with the least chance of defects (but also older releases than the previous two).
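For context, those three tiers correspond to three repository lines. A sketch in the classic one-line sources format (component names per the Proxmox package-repository wiki; double-check the exact test component spelling for your release):

```
# /etc/apt/sources.list.d/pve.list - enable exactly ONE of these:
# deb https://enterprise.proxmox.com/debian/pve trixie pve-enterprise    # subscription required
deb http://download.proxmox.com/debian/pve trixie pve-no-subscription
# deb http://download.proxmox.com/debian/pve trixie pvetest              # lab machines only
```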
 

I fully understand and accept your point regarding the repository hierarchy. You are right; relying on the test-repo for a production environment of this scale was a risk that ultimately fell on me, regardless of how stable it had been over the past two years.

This experience has been a clear wake-up call. I am currently spending another night without sleep to rectify the situation and ensure that my infrastructure is migrated to a more stable repository structure (Enterprise/No-Subscription) to prevent such regressions in the future.

My intention wasn't to shift blame, but rather to highlight the severity of the bug's impact on a large-scale real-world setup. I appreciate the feedback and the technical insights provided by the community.

Now, back to work to finalize the fixes. Thanks for the "lesson" and the discussion.