Troubleshooting host-side virtio-gpu/virgl not using host GPU

pipe2null

Member
Feb 26, 2023
The host is using software rendering rather than the host-owned GPU for guests configured with virtio-gpu/virtio-gl. This might just be an ordinary graphics driver issue, but I've exhausted my minimal knowledge of the subject, and from all my searching I have the impression that basic virtio-gpu/virgl should "just work". So far I've been unsuccessful: even though the guest shows the correct setup, the host side uses software rendering and never touches the host GPU. The problem is the same whether I use virtio-gpu or virtio-gl.

I recently upgraded to PVE 9.1.6 from the latest PVE 8 point release (I forget the exact version), and I have the exact same problem on my homelab primary workstation.

Basics of config:
- Host-owned GPU: NVIDIA RTX A400 (at the moment)
- Other GPUs passed through to guests: Tesla P4, RTX 4090, RTX 5090
- virtio-gpu/virtio-gl only for VMs without dedicated GPUs, both Linux and Windows guests
- For a long time I've run a "hydra" configuration: a quad-controller USB PCIe card plus multiple passed-through GPUs, giving up to 4 "physical" workstations with dedicated monitors/keyboards/etc., but I'm currently paring that back to 1 or 2 seats. The goal is a reasonably performant guest desktop environment for most VMs, without passthrough and with minimal futzing with client config. I'm not pursuing vGPU at all until I have hardware that doesn't require a license server.
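For reference, the display setting on the virtio VMs is just the stock option; in the VM config it amounts to the following (the VMID path here is hypothetical, and the excerpt shows only the display line):

```
# /etc/pve/qemu-server/<vmid>.conf (excerpt)
vga: virtio-gl
# or, for the non-GL variant:
# vga: virtio
```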


> lspci -nnk | grep -iE "vga|3d controller" -A 3
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA107GL [RTX A400] [10de:25b2] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1879]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
--
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GB202 [GeForce RTX 5090] [10de:2b85] (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. Device [19da:1761]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
--
09:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 30)
DeviceName: ASPEED Video AST2400
Subsystem: Super Micro Computer Inc Device [15d9:0852]
Kernel modules: ast
--
84:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:5104]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
--
85:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d8]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

In all cases so far, for every NVIDIA driver I've tried, "nvidia-smi -l" shows the same thing (differing only by version):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.142 Driver Version: 580.142 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A400 Off | 00000000:01:00.0 Off | N/A |
| 30% 43C P8 N/A / 50W | 0MiB / 4094MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Drivers I've tried, installed directly from NVIDIA:
- 570.211
- 580.142
Older versions tend to have build problems against newer kernels (I tried 535 and 550.163.01 plus a couple of older ones while still on PVE 8).

/etc/kernel# cat cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs console=tty0 intel_iommu=on iommu=pt nomodeset

/etc/modprobe.d# cat gpu.conf
blacklist nouveau
options nouveau modeset=0

/etc/modprobe.d# cat vfio-pci.conf
options vfio-pci ids=10de:2b85,10de:22e8,10de:2684,10de:22ba,10de:1bb3

/etc/modprobe.d# cat pve-blacklist.conf
# This file contains a list of modules which are not supported by Proxmox VE
# nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
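One thing worth showing alongside the modprobe configs: whether nvidia_drm actually ended up loaded with modesetting enabled, since (as I understand it) virgl needs a working DRM/KMS path on the host. A quick check, assuming the standard sysfs parameter path:

```shell
# Report nvidia_drm's modeset parameter if the module is loaded;
# the sysfs path only exists while the module is in the kernel.
if [ -r /sys/module/nvidia_drm/parameters/modeset ]; then
    cat /sys/module/nvidia_drm/parameters/modeset
else
    echo "nvidia_drm not loaded"
fi
```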



In a Linux guest, "glxinfo -B" shows virgl as the renderer; a Windows guest shows virtio-gpu in use.
Wiggling a guest window around to force a basic rendering workload, while watching both the Proxmox VM summary and the guest's own system monitor, shows CPU usage spike on the host side only: the guest's CPU monitor stays at idle while the Proxmox UI shows the guest's CPU maxing out. A looping "nvidia-smi -l" shows zero VRAM used and zero host GPU utilization.
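In case it helps anyone reproduce the check: my understanding is that virgl needs a DRM render node on the host to do hardware rendering, so a minimal sanity check (a sketch, nothing nvidia-specific) is just:

```shell
# Sketch: does the host expose any DRM render node that EGL/virgl
# could open? renderD* nodes live under /dev/dri on Linux.
if ls /dev/dri/renderD* >/dev/null 2>&1; then
    echo "render node present:"
    ls /dev/dri/renderD*
else
    echo "no render node"
fi
```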


Would appreciate any direction you might have to offer, thanks