pve-qemu-kvm 6.2 breaks NVIDIA GPU passthrough after PVE 7.2 upgrade

Apr 1, 2022
Any ideas what might be wrong with QEMU 6.2? After the PVE version upgrade, passthrough of an NVIDIA A100 stopped working:

Code:
root@gpu-test-vm ~ # nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

root@gpu-test-vm ~ # dmesg
[  113.406581] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
[  113.406586] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:20b0)
               NVRM: installed in this system is not supported by the
               NVRM: NVIDIA 470.103.01 driver release.
               NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
               NVRM: in this release's README, available on the operating system
               NVRM: specific graphics driver download page at www.nvidia.com.
[  113.409224] nvidia: probe of 0000:01:00.0 failed with error -1
[  113.409245] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  113.409246] NVRM: None of the NVIDIA devices were initialized.
[  113.409952] nvidia-nvlink: Unregistered the Nvlink Core, major device number 244

If we roll back with apt install pve-qemu-kvm=6.1.1-2 and reboot the VM, passthrough works again.

In dmesg we see address conflicts:

Code:
root@gpu-test-vm ~ # diff -u dmesg_working dmesg_not_working | grep BAR
 pci 0000:00:01.0: BAR 0: assigned to efifb
+pci 0000:00:1a.0: can't claim BAR 4 [io  0xd300-0xd31f]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1a.1: can't claim BAR 4 [io  0xd2e0-0xd2ff]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1a.2: can't claim BAR 4 [io  0xd2c0-0xd2df]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1d.0: can't claim BAR 4 [io  0xd2a0-0xd2bf]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1d.1: can't claim BAR 4 [io  0xd280-0xd29f]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1d.2: can't claim BAR 4 [io  0xd260-0xd27f]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1f.2: can't claim BAR 4 [io  0xd240-0xd25f]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1f.3: can't claim BAR 4 [io  0xd200-0xd23f]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
 pci 0000:01:00.0: can't claim BAR 0 [mem 0xff000000-0xffffffff]: no compatible bridge window
 pci 0000:01:00.0: can't claim BAR 1 [mem 0xfffffff000000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
 pci 0000:01:00.0: can't claim BAR 3 [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
 pci 0000:00:01.0: can't claim BAR 6 [mem 0xffff0000-0xffffffff pref]: no compatible bridge window
 pci 0000:06:12.0: can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window
 pci 0000:00:1c.0: BAR 15: assigned [mem 0x1000000000-0x27ffffffff 64bit pref]
-pci 0000:00:1c.0: BAR 14: assigned [mem 0x80000000-0x80ffffff]
-pci 0000:00:01.0: BAR 6: assigned [mem 0x81000000-0x8100ffff pref]
+pci 0000:00:1c.1: BAR 15: assigned [mem 0x800200000-0x8003fffff 64bit pref]
+pci 0000:00:1c.2: BAR 15: assigned [mem 0x800400000-0x8005fffff 64bit pref]
+pci 0000:00:1c.3: BAR 15: assigned [mem 0x800600000-0x8007fffff 64bit pref]
+pci 0000:00:01.0: BAR 6: assigned [mem 0x80000000-0x8000ffff pref]
+pci 0000:00:1f.3: BAR 4: assigned [io  0x1000-0x103f]
+pci 0000:00:1a.0: BAR 4: assigned [io  0x1040-0x105f]
+pci 0000:00:1a.1: BAR 4: assigned [io  0x1060-0x107f]
+pci 0000:00:1a.2: BAR 4: assigned [io  0x1080-0x109f]
+pci 0000:00:1d.0: BAR 4: assigned [io  0x10a0-0x10bf]
+pci 0000:00:1d.1: BAR 4: assigned [io  0x10c0-0x10df]
+pci 0000:00:1d.2: BAR 4: assigned [io  0x10e0-0x10ff]
+pci 0000:00:1f.2: BAR 4: assigned [io  0x1400-0x141f]
 pci 0000:01:00.0: BAR 1: assigned [mem 0x1000000000-0x1fffffffff 64bit pref]
 pci 0000:01:00.0: BAR 3: assigned [mem 0x2000000000-0x2001ffffff 64bit pref]
-pci 0000:01:00.0: BAR 0: assigned [mem 0x80000000-0x80ffffff]
+pci 0000:01:00.0: BAR 0: no space for [mem size 0x01000000]
+pci 0000:01:00.0: BAR 0: trying firmware assignment [mem 0xff000000-0xffffffff]
+pci 0000:01:00.0: BAR 0: assigned [mem 0xff000000-0xffffffff]
 pci 0000:06:12.0: BAR 6: assigned [mem 0xc1640000-0xc167ffff pref]
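
To compare the two boots without reading the whole diff, the allocation failures can be pulled out of each saved log. This is a minimal sketch, assuming the logs were saved as dmesg_working and dmesg_not_working as in the diff above; bar_problems is a hypothetical helper name, not part of any tool.

```shell
# Print only the PCI resource-allocation failures from a saved dmesg log.
# bar_problems is a hypothetical helper name used for illustration.
bar_problems() {
    # Matches "can't claim BAR N ..." and "BAR N: no space for ..." lines.
    grep -E "can't claim BAR|BAR [0-9]+: no space for" "$1"
}

# Example usage inside the guest (file names assumed from the diff above):
#   dmesg > dmesg_not_working
#   bar_problems dmesg_not_working
```

Running it on both logs makes it easier to see that the extra "address conflict" and "no space for" lines only appear under the newer QEMU.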

Virtual machine configuration:

Code:
agent: 1
args: -global q35-pcihost.pci-hole64-size=2048G
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 16
cpu: host
efidisk0: ceph-vm:vm-107-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
hostpci0: 0000:01:00,pcie=1
ide2: none,media=cdrom
machine: q35
memory: 16384
meta: creation-qemu=6.1.0,ctime=1642774395
name: gpu-test
net0: virtio=8E:C7:92:8F:2D:4B,bridge=vmbr0,firewall=1,tag=1234
numa: 1
ostype: l26
scsi0: ceph-vm:vm-107-disk-1,discard=on,iothread=1,size=50G
scsihw: virtio-scsi-single
smbios1: uuid=3e37cffc-6534-4a81-a0c5-bec22ac4b228
snaptime: 1652715711
sockets: 1
vmgenid: 1ee48706-9eca-4477-b98a-2313901a473e

EDIT: Forcing the machine type to pc-q35-6.1 also seems to fix the issue, and the BAR address conflicts disappear. Any ideas why pc-q35-6.2 is not working?
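
For anyone who wants to try the same workaround, the machine type can be pinned from the host shell with qm set. This is a sketch, assuming the VM ID is 107 (inferred from the disk names in the configuration above); pin_machine is a hypothetical wrapper name. The VM has to be fully stopped and started again for the new machine type to take effect.

```shell
# Pin a VM to a specific QEMU machine version on the Proxmox host.
# pin_machine is a hypothetical wrapper; qm set --machine is the actual call.
pin_machine() {
    vmid="$1"      # e.g. 107 (assumed from the disk names vm-107-disk-*)
    machine="$2"   # e.g. pc-q35-6.1
    qm set "$vmid" --machine "$machine"
}

# Example usage on the host:
#   pin_machine 107 pc-q35-6.1
#   qm config 107 | grep '^machine'   # should now show pc-q35-6.1
```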
 
The new machine version was introduced because of changes to the PCI(e) layout of the virtual machines. Maybe the NVIDIA driver cannot handle the change made in QEMU 6.2?
 
