Applying pve-qemu-kvm 10.2.1-1 may cause extremely high “I/O Delay” and extremely high “I/O pressure stalls”. (Patches are in the test repository.)

I am writing this to express my deep frustration regarding the pve-qemu-kvm 10.2.1-1 update. I manage a massive infrastructure of over 1,200 nodes, and this untested update has caused significant distress across my entire operation.

I have spent the last two nights without sleep, monitoring spikes in IO Delay, CPU, and RAM usage that appeared immediately after this update. It is honestly disappointing to see a core package released to the Trixie repository with such a glaring regression that impacts real-world performance, not just "graphs."
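For anyone who wants to confirm the stalls on their own nodes rather than trust the graphs: the "I/O pressure stalls" figures come from the Linux PSI interface, which you can read directly. A minimal sketch (assumes a kernel built with CONFIG_PSI, which recent PVE kernels are):

```shell
# Read the kernel's I/O pressure-stall (PSI) counters directly.
# "some" = share of time at least one task was stalled on I/O;
# "full" = share of time no task could make progress because of I/O.
if [ -r /proc/pressure/io ]; then
    cat /proc/pressure/io
else
    echo "PSI not available on this kernel"
fi
```

Watching the avg10/avg60 columns before and after the package upgrade makes the regression visible without the GUI.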

After extensive testing and stress, I confirmed that downgrading to 10.1.2-7 resolved the issue on my clusters. In an enterprise-grade environment of this scale, we rely on the stability of these updates. Having to manually intervene across such a large fleet due to an avoidable bug is unacceptable.
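For reference, the rollback that worked for me is a one-liner per node. This is only a sketch (the version string is the one quoted above; the hold is optional, but it stops apt from pulling the broken build back in on the next upgrade):

```shell
# Pin each node back to the known-good build until a fix ships.
apt install pve-qemu-kvm=10.1.2-7
apt-mark hold pve-qemu-kvm   # remember to "apt-mark unhold pve-qemu-kvm" once a fix lands
```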

I hope this serves as a wake-up call for more rigorous QA before pushing updates that handle core hypervisor functions. I am still recovering from the stress and lack of sleep this has caused.

Looking forward to a stable, properly tested fix soon.

Sorry, I'm out. All of the above is on you. Who told you to use the "test" repository in "an enterprise-grade environment of this scale"?

It was not published in either the pve-no-subscription or pve-enterprise repositories.

I don't dispute that you experienced an issue - however, remember that one workaround was clearly documented by @uzumo (thanks to that user!) and further diagnostic suggestions were made by @fiona.
I mean, the results for MY environment were supplied - subsequent to which @fiona confirmed replicating an issue. It's being looked at.

You might - just might - have an argument had this reached a production repo, but it's a test repo. To be clear, I had my own issues with a bug report recently, but you cannot blame a TEST repo for your production issues.

Happy Proxmox in future.
 

I cannot imagine a manager who said "let's use the test repo" and then blames everyone else for that decision.

Let’s be clear: I have been managing this 1,200+ node infrastructure for over 2 years using this exact same repository structure without a single critical regression—until now.

While you point out it's a "test" repo, in a professional development cycle, a package should only reach a public test repository after passing initial Alpha, Beta, and RC (Release Candidate) stages. As evidenced by https://download.qemu.org/, QEMU itself follows a strict RC process (RC0, RC1, etc.) precisely to avoid these kinds of fundamental reporting bugs before wider testing.

When a core component like pve-qemu-kvm is pushed to a repository accessible to thousands of nodes—even a test one—there is an expectation that basic IO reporting hasn't been broken. This wasn't a minor edge-case bug; it was a global regression affecting CPU, RAM, and IO Delay metrics across an entire data center rack.

I am thankful for the workarounds provided by @uzumo and @fiona, but my frustration stems from the fact that a 2-year streak of stability was broken by a package that clearly lacked basic stress testing before its public "test" debut.

I've stabilized my environment by downgrading. Now, let’s focus on the fix so others don't have to spend two sleepless nights fixing "test" code that behaved like an early Alpha.

Happy Proxmox to you too.
 
I repeat: don't use the test repo for production.
The developers state it is only for testing purposes and not for production use, as evidenced by my screenshot from their own wiki.
Your ignoring this is purely on you and no one else.
You could have run the test repo in a lab (as intended), found the regression, and reported it without impacting your production environment.

Let this be a lesson for the future: switch your setup to the enterprise repo so you don't run into such things again.
 
Those expectations arise precisely because you’ve actually used it and experienced its shortcomings firsthand.

I understand your concerns, but ultimately, the use of the test repository is at your own risk.
 
It's like using Debian sid and then complaining about issues with it.
According to his logic, Debian sid should be QC-tested simply because the repository is available to thousands.
That's not how it works.

test = bleeding-edge packages with a decent chance of defects, but the newest releases
no-subscription = somewhat-tested packages with a lesser chance of defects (but slightly older releases than test)
enterprise = thoroughly tested packages with the least chance of defects (but also older releases than the previous two).
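For context, those three tiers correspond to three repository lines. A sketch in the classic one-line sources format (component names per the Proxmox package-repository wiki; double-check the exact test component spelling for your release):

```
# /etc/apt/sources.list.d/pve.list - enable exactly ONE of these:
# deb https://enterprise.proxmox.com/debian/pve trixie pve-enterprise    # subscription required
deb http://download.proxmox.com/debian/pve trixie pve-no-subscription
# deb http://download.proxmox.com/debian/pve trixie pvetest              # lab machines only
```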
 

I fully understand and accept your point regarding the repository hierarchy. You are right; relying on the test-repo for a production environment of this scale was a risk that ultimately fell on me, regardless of how stable it had been over the past two years.

This experience has been a clear wake-up call. I am currently spending another night without sleep to rectify the situation and ensure that my infrastructure is migrated to a more stable repository structure (Enterprise/No-Subscription) to prevent such regressions in the future.

My intention wasn't to shift blame, but rather to highlight the severity of the bug's impact on a large-scale real-world setup. I appreciate the feedback and the technical insights provided by the community.

Now, back to work to finalize the fixes. Thanks for the "lesson" and the discussion.