[TTM] Buffer eviction failed

I am also running into this problem on a Proxmox machine running 7.4-17, with a VM running Linux Mint 22 Cinnamon 6.2.9 on kernel version 6.8.0-47-generic. This Proxmox machine has previously had no issues running VMs for weeks at a time, so I'm inclined to think it's something wrong with the VM: when I start VMs with older OSes, they will still run for long periods of time (weeks/months) without error. This issue has persisted on a second Proxmox machine running 8.2.7, to which the first Linux Mint VM was copied. As a test I installed a VM with Pop!_OS 22.04 LTS with kernel version 6.9.3-76060903-generic, and the QXL error has occurred there too.

I'm going to transfer one of the VMs that I have not had a problem with over to the 8.2.7 Proxmox machine and see if I get a QXL error.

If anyone has some recommended tests they would like me to do to help solve this problem I would be more than happy to assist!
 
No joy on any permutation or combination of RAM/VRAM and vgamem settings - for me, the QXL error occurs in all cases and still seems to be random.

Edit:
Assuming I'm reading it right, after reading through the kernel changelog for Ubuntu's 6.8.0-48 kernel (covering e.g. Ubuntu 24.04, Linux Mint 22, and others if the most up-to-date kernel is installed), it seems that the following occurred with regard to the alleged QXL driver bug fix, the discussion of which I previously linked to:

14 Jun 2024
Reverted "drm/qxl: simplify qxl_fence_wait" in upstream kernel 6.8.7, which was pulled into Ubuntu 6.8.0-1008.8-22.04.1 [6.8.0-38.38]

19 Jul 2024
Reapplied "drm/qxl: simplify qxl_fence_wait" in upstream kernel 6.8.10, which was pulled into Ubuntu 6.8.0-1010.10-22.04.1 [6.8.0-40.40]

What's not clear is whether the bug had actually been fixed when the code was reapplied (in 6.8.0-40) or whether it was reapplied in original (buggy) form awaiting a future fix.

To answer my own question, from kernel.org's changelog for upstream kernel 6.8.10:
commit 3dfe35d8683daf9ba69278643efbabe40000bbf6
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon May 6 13:28:59 2024 -0700

Reapply "drm/qxl: simplify qxl_fence_wait"

commit 3628e0383dd349f02f882e612ab6184e4bb3dc10 upstream.

This reverts commit 07ed11afb68d94eadd4ffc082b97c2331307c5ea.

Stephen Rostedt reports:
"I went to run my tests on my VMs and the tests hung on boot up.
Unfortunately, the most I ever got out was:

[ 93.607888] Testing event system initcall: OK
[ 93.667730] Running tests on all trace events:
[ 93.669757] Testing all events: OK
[ 95.631064] ------------[ cut here ]------------
Timed out after 60 seconds"

and further debugging points to a possible circular locking dependency
between the console_owner locking and the worker pool locking.

Reverting the commit allows Steve's VM to boot to completion again.

[ This may obviously result in the "[TTM] Buffer eviction failed"
messages again, which was the reason for that original revert. But at
this point this seems preferable to a non-booting system... ]

Reported-and-bisected-by: Steven Rostedt <rostedt@goodmis.org>
Link: https://lore.kernel.org/all/20240502081641.457aa25f@gandalf.local.home/

So, any downstream (distro) kernel that pulls from an upstream Linux kernel <6.8.7 or >=6.8.10 will have the buggy QXL code. That's for the 6.8 series; other kernel series probably also carry the buggy code (e.g. 5.15).
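To make that version rule concrete, here is a small, purely illustrative shell helper (not from the thread; the function name is made up) encoding the boundaries from the changelog entries quoted above, i.e. that the revert shipped only in upstream 6.8.7 through 6.8.9:

```shell
# Illustrative helper: for an upstream 6.8.x kernel, report whether it
# carries the buggy simplified qxl_fence_wait code. The revert was in
# 6.8.7 (pulled into Ubuntu 6.8.0-38) and undone again in 6.8.10.
qxl_68_status() {
  patch="$1"   # the x in 6.8.x
  if [ "$patch" -ge 7 ] && [ "$patch" -lt 10 ]; then
    echo "reverted (working)"
  else
    echo "buggy"
  fi
}

qxl_68_status 6    # buggy
qxl_68_status 8    # reverted (working)
qxl_68_status 10   # buggy
```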

For Ubuntu and derivatives, it looks like kernels 6.8.0-38 and 6.8.0-39 have the reverted code, so I'll see if I can test those.
 

I have seen this in Debian bullseye, bookworm, trixie, and sid. It has been around for a lot of different kernel versions.
Yes, originally some 3-4 years ago when the QXL driver was first simplified. It's probably in most kernel series since then. But I'm only testing the 6.8 series at the moment.
 
Yes, originally some 3-4 years ago when the QXL driver was first simplified. It's probably in most kernel series since then. But I'm only testing the 6.8 series at the moment.
Hi THX1138 - do you have any updates on your testing of the specific kernels? Did -38 or -39 fix the issue?
 
Hi THX1138 - do you have any updates on your testing of the specific kernels? Did -38 or -39 fix the issue?
Yes, the 6.8.0-38 and -39 kernels both work perfectly: over 200 hours of testing without the bug reappearing. In the middle of that, I re-tested the 6.8.0-49 and -50 kernels and they both failed within a few hours.

Just to be clear (for the benefit of people just finding this): it's the guest kernel we're talking about (the host kernel doesn't seem to matter at all), and the problem is much broader than the 6.8 kernel series. The bug was introduced in the upstream Linux kernel over 3 years ago through a simplification of the kvm/qemu QXL guest video driver and has propagated from there. The bug was briefly removed by reverting the prior change (in the upstream kernel, and pulled into the kernels of many distros, e.g. 6.8.0-38 and -39 in Ubuntu and derivatives) before being re-introduced, because the developer said the reverted, unsimplified code caused crashes in their testing environments. I have seen no crashes or other issues whatsoever.

It seems as though we're stuck with it, since the developer expressed the opinion that people would just shift to virtio or some other guest video driver (virtio works, but is extremely slow - so much so as to be essentially unusable for me).

Hope that helps.
 
Thank you for testing and sharing the results!

I will switch to one of those kernels and continue testing. My VMs last about half a day before crashing, unfortunately.
For anyone reading: if I don't report back, the older kernels also worked for me :)
 
First of all: thank you for the intensive test!
I've been following this thread for quite some time now, as I'm facing the same issue. I'm not using Proxmox but Debian with QEMU for work. I've had this issue on Debian and Kali guests. The issue is in the QXL kernel driver; not much has changed in the source code since the working version. Would anything prevent me from just compiling the driver in a working state and using it with a new kernel? If I just clone the Linux kernel repo, roll back the QXL driver to the working files, and compile them, would that work? I'm not that deep into how the kernel works, so maybe this is a stupid question.
 
I bumped into this issue a few weeks ago, and recently it has been appearing more frequently. I'm using QEMU/KVM on Ubuntu 22.04, kernel version 6.8.0-51-generic, and the guest OS is also Ubuntu 22.04. The graphics console froze initially and then stopped responding to the keyboard and mouse. Since I can still access the VM via SSH, I can see errors in dmesg like the following:
Code:
[Tue Jan 14 16:27:55 2025] [TTM] Buffer eviction failed
[Tue Jan 14 16:27:55 2025] qxl 0000:00:01.0: object_init failed for (262144, 0x00000001)
[Tue Jan 14 16:27:55 2025] [drm:qxl_gem_object_create [qxl]] *ERROR* Failed to allocate GEM object (260772, 1, 4096, -12)
[Tue Jan 14 16:27:55 2025] [drm:qxl_alloc_ioctl [qxl]] *ERROR* qxl_alloc_ioctl: failed to create gem ret=-12
...
[Tue Jan 14 16:28:10 2025] [TTM] Buffer eviction failed
[Tue Jan 14 16:28:10 2025] qxl 0000:00:01.0: object_init failed for (3149824, 0x00000001)
[Tue Jan 14 16:28:10 2025] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
...
[Tue Jan 14 16:28:11 2025] p4v.bin[201907]: segfault at 7463ab25ea30 ip 00007463ab25ea30 sp 00007fff42f591d8 error 15 in libQt6Core.so.6[7463ab247000+206000] likely on CPU 0 (core 0, socket 0)
...

I was using p4v, but at the moment of the crash there were no operations in p4v at all. I have several crash logs, and every time the trigger seemed to come from p4v. This is another one:
Code:
[Fri Jan  3 13:35:50 2025] [TTM] Buffer eviction failed
[Fri Jan  3 13:35:50 2025] qxl 0000:00:01.0: object_init failed for (258048, 0x00000001)
[Fri Jan  3 13:35:50 2025] [drm:qxl_gem_object_create [qxl]] *ERROR* Failed to allocate GEM object (256020, 1, 4096, -12)
[Fri Jan  3 13:35:50 2025] [drm:qxl_alloc_ioctl [qxl]] *ERROR* qxl_alloc_ioctl: failed to create gem ret=-12
[Fri Jan  3 13:36:05 2025] [TTM] Buffer eviction failed
[Fri Jan  3 13:36:05 2025] qxl 0000:00:01.0: object_init failed for (3149824, 0x00000001)
[Fri Jan  3 13:36:05 2025] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
[Fri Jan  3 13:36:06 2025] p4v.bin[312116]: segfault at 711c6e85ea30 ip 0000711c6e85ea30 sp 00007ffe1fba31f8 error 15 in libQt6Core.so.6[711c6e847000+206000] likely on CPU 2 (core 0, socket 2)
[Fri Jan  3 13:36:06 2025] Code: 65 64 28 51 4f 62 6a 65 63 74 20 2a 29 00 32 64 65 73 74 72 6f 79 65 64 28 51 4f 62 6a 65 63 74 20 2a 29 00 00 00 00 00 00 00 <32> 31 51 4f 62 6a 65 63 74 43 6c 65 61 6e 75 70 48 61 6e 64 6c 65

I'm posting here to ask: is there a simple fix other than downgrading the kernel? Thanks.
 
I was using p4v, but at the moment of the crash there were no operations in p4v at all. I have several crash logs, and every time the trigger seemed to come from p4v.
If you look at your logs, the p4v segfault occurs several seconds after the QXL driver (TTM Buffer eviction failed) error. The p4v binary may have its own bug that is triggered by the QXL driver crash but, since the majority of systems experiencing this problem don't have p4v (Helix visual client) installed, I'd be fairly certain that p4v is not the cause.

I'm posting here to ask: is there a simple fix other than downgrading the kernel? Thanks.
The problem is in the simplified version of the QXL guest video driver (used with qemu/kvm). You can use an alternate guest video driver if that works for you.
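For anyone who wants to try the alternate-driver route, this is roughly what switching away from QXL looks like; the VM ID and domain name below are placeholders, and exact option values should be checked against your own setup:

```shell
# Proxmox host: switch the guest's display device from qxl to virtio
qm set <vmid> --vga virtio

# Plain QEMU/libvirt host: edit the domain XML and change the video
# model, e.g.  <model type='qxl' .../>  ->  <model type='virtio'/>
virsh edit <domain>
```

As noted above, virtio may be noticeably slower for some workloads, so test before committing to it.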
 
First of all: thank you for the intensive test!
You're welcome. I had some spare time while the cricket was on. :)
Would anything prevent me from just compiling the driver in a working state and using it with a new kernel? If I just clone the Linux kernel repo, roll back the QXL driver to the working files, and compile them, would that work? I'm not that deep into how the kernel works, so maybe this is a stupid question.
I don't know, but theoretically I suppose it should. I've never done that myself (patching the kernel and recompiling, that is); I'm not that deep into kernel development either. Maybe a better solution would be for a lot of us to petition the developer (via kernel bug reports) to fix the bug?
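For what it's worth, here is a rough, untested sketch of what that workflow might look like. The commit ID is the "Reapply" commit quoted earlier in this thread; the tag, paths, and build steps are assumptions to verify against your own distro (and Secure Boot module signing, if enabled, adds further steps):

```shell
# UNTESTED sketch: revert the qxl change in a kernel tree matching your
# running kernel, then rebuild and install just the qxl module.
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
git checkout v6.8.12                  # pick the tag matching your kernel
git revert 3628e0383dd3               # Reapply "drm/qxl: simplify qxl_fence_wait"
cp /boot/config-$(uname -r) .config
make olddefconfig && make modules_prepare
make M=drivers/gpu/drm/qxl modules
sudo cp drivers/gpu/drm/qxl/qxl.ko \
  /lib/modules/$(uname -r)/kernel/drivers/gpu/drm/qxl/qxl.ko
sudo depmod -a && sudo update-initramfs -u
```

A module built this way has to match the running kernel's version magic exactly, and many distros ship modules compressed (e.g. qxl.ko.zst), so treat this purely as a starting point, not a recipe.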
 
Maybe a better solution would be if a lot of us petition the developer (via kernel bug reports) to fix the bug?
Yes, I like that solution way more. I just thought this could be a temporary fix for the issue; it's really bugging me at work.
 