pve-qemu-kvm 6.2 breaks NVIDIA GPU passthrough after PVE 7.2 upgrade

Apr 1, 2022
Any ideas what might be wrong with QEMU 6.2? After the PVE version upgrade, passthrough of an NVIDIA A100 stopped working:

Code:
root@gpu-test-vm ~ # nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

root@gpu-test-vm ~ # dmesg
[  113.406581] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
[  113.406586] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:20b0)
               NVRM: installed in this system is not supported by the
               NVRM: NVIDIA 470.103.01 driver release.
               NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
               NVRM: in this release's README, available on the operating system
               NVRM: specific graphics driver download page at www.nvidia.com.
[  113.409224] nvidia: probe of 0000:01:00.0 failed with error -1
[  113.409245] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  113.409246] NVRM: None of the NVIDIA devices were initialized.
[  113.409952] nvidia-nvlink: Unregistered the Nvlink Core, major device number 244

...if we roll back with apt install pve-qemu-kvm=6.1.1-2 and reboot the VM, passthrough works again.

Comparing dmesg from the working and the non-working boot, we see new address conflicts:

Code:
root@gpu-test-vm ~ # diff -u dmesg_working dmesg_not_working | grep BAR
 pci 0000:00:01.0: BAR 0: assigned to efifb
+pci 0000:00:1a.0: can't claim BAR 4 [io  0xd300-0xd31f]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1a.1: can't claim BAR 4 [io  0xd2e0-0xd2ff]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1a.2: can't claim BAR 4 [io  0xd2c0-0xd2df]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1d.0: can't claim BAR 4 [io  0xd2a0-0xd2bf]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1d.1: can't claim BAR 4 [io  0xd280-0xd29f]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1d.2: can't claim BAR 4 [io  0xd260-0xd27f]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1f.2: can't claim BAR 4 [io  0xd240-0xd25f]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
+pci 0000:00:1f.3: can't claim BAR 4 [io  0xd200-0xd23f]: address conflict with PCI Bus 0000:01 [io  0xd000-0xdfff]
 pci 0000:01:00.0: can't claim BAR 0 [mem 0xff000000-0xffffffff]: no compatible bridge window
 pci 0000:01:00.0: can't claim BAR 1 [mem 0xfffffff000000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
 pci 0000:01:00.0: can't claim BAR 3 [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
 pci 0000:00:01.0: can't claim BAR 6 [mem 0xffff0000-0xffffffff pref]: no compatible bridge window
 pci 0000:06:12.0: can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window
 pci 0000:00:1c.0: BAR 15: assigned [mem 0x1000000000-0x27ffffffff 64bit pref]
-pci 0000:00:1c.0: BAR 14: assigned [mem 0x80000000-0x80ffffff]
-pci 0000:00:01.0: BAR 6: assigned [mem 0x81000000-0x8100ffff pref]
+pci 0000:00:1c.1: BAR 15: assigned [mem 0x800200000-0x8003fffff 64bit pref]
+pci 0000:00:1c.2: BAR 15: assigned [mem 0x800400000-0x8005fffff 64bit pref]
+pci 0000:00:1c.3: BAR 15: assigned [mem 0x800600000-0x8007fffff 64bit pref]
+pci 0000:00:01.0: BAR 6: assigned [mem 0x80000000-0x8000ffff pref]
+pci 0000:00:1f.3: BAR 4: assigned [io  0x1000-0x103f]
+pci 0000:00:1a.0: BAR 4: assigned [io  0x1040-0x105f]
+pci 0000:00:1a.1: BAR 4: assigned [io  0x1060-0x107f]
+pci 0000:00:1a.2: BAR 4: assigned [io  0x1080-0x109f]
+pci 0000:00:1d.0: BAR 4: assigned [io  0x10a0-0x10bf]
+pci 0000:00:1d.1: BAR 4: assigned [io  0x10c0-0x10df]
+pci 0000:00:1d.2: BAR 4: assigned [io  0x10e0-0x10ff]
+pci 0000:00:1f.2: BAR 4: assigned [io  0x1400-0x141f]
 pci 0000:01:00.0: BAR 1: assigned [mem 0x1000000000-0x1fffffffff 64bit pref]
 pci 0000:01:00.0: BAR 3: assigned [mem 0x2000000000-0x2001ffffff 64bit pref]
-pci 0000:01:00.0: BAR 0: assigned [mem 0x80000000-0x80ffffff]
+pci 0000:01:00.0: BAR 0: no space for [mem size 0x01000000]
+pci 0000:01:00.0: BAR 0: trying firmware assignment [mem 0xff000000-0xffffffff]
+pci 0000:01:00.0: BAR 0: assigned [mem 0xff000000-0xffffffff]
 pci 0000:06:12.0: BAR 6: assigned [mem 0xc1640000-0xc167ffff pref]
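To make the diff above easier to read, here is a minimal sketch that lists only the devices that gained a "can't claim BAR ... address conflict" message in the second boot. The function name new_bar_conflicts is made up for illustration, and it assumes the input is a unified diff (diff -u) of the two dmesg dumps, as in the command above.

```shell
# Hypothetical helper: read a unified diff of two dmesg dumps on stdin and
# print the PCI address of each device whose "address conflict" line was
# added ("+" prefix) in the second file.
new_bar_conflicts() {
    grep -E "^\+pci [0-9a-f:.]+: can't claim BAR [0-9]+ .*address conflict" \
        | awk '{ sub(/:$/, "", $2); print $2 }' \
        | sort -u
}

# Example call (file names taken from the post):
#   diff -u dmesg_working dmesg_not_working | new_bar_conflicts
```

Lines that are unchanged between the two boots (leading space), such as the "no compatible bridge window" messages for 0000:01:00.0, are deliberately ignored, since they also appear when passthrough works.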
Virtual machine configuration:
Code:
agent: 1
args: -global q35-pcihost.pci-hole64-size=2048G
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 16
cpu: host
efidisk0: ceph-vm:vm-107-disk-0,efitype=4m,pre-enrolled-keys=1,size=528K
hostpci0: 0000:01:00,pcie=1
ide2: none,media=cdrom
machine: q35
memory: 16384
meta: creation-qemu=6.1.0,ctime=1642774395
name: gpu-test
net0: virtio=8E:C7:92:8F:2D:4B,bridge=vmbr0,firewall=1,tag=1234
numa: 1
ostype: l26
scsi0: ceph-vm:vm-107-disk-1,discard=on,iothread=1,size=50G
scsihw: virtio-scsi-single
smbios1: uuid=3e37cffc-6534-4a81-a0c5-bec22ac4b228
snaptime: 1652715711
sockets: 1
vmgenid: 1ee48706-9eca-4477-b98a-2313901a473e

EDIT: It seems that forcing the machine type to pc-q35-6.1 also fixes the issue, and the BAR address conflicts disappear. Any ideas why pc-q35-6.2 is not working?
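
For anyone wanting to try the same workaround, a rough sketch follows. It assumes the VM ID is 107 (inferred from the disk names in the config above, not stated in the post) and that a plain "machine: q35" line is not pinned to a specific version, so the VM follows whatever the installed QEMU offers. The function name machine_pin_status is hypothetical.

```shell
# Hypothetical helper: read a VM config on stdin (for example the output of
# "qm config <vmid>") and report whether the machine type is pinned to a
# specific version such as pc-q35-6.1.
machine_pin_status() {
    awk -F': *' '
        $1 == "machine" { m = $2 }
        END {
            if (m ~ /^pc-(q35|i440fx)-[0-9]+\.[0-9]+/) print "pinned: " m
            else if (m != "")                          print "unpinned: " m
            else                                       print "unpinned: (no machine line)"
        }'
}

# Example calls on the PVE host (VM ID 107 is an assumption):
#   qm config 107 | machine_pin_status
#   qm set 107 --machine pc-q35-6.1    # pin to the 6.1 layout, then restart the VM
```

With the config posted above, the check should report the machine as unpinned, which would fit the observation that the guest picked up the pc-q35-6.2 layout after the upgrade.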
The machine type version number was introduced because of changes to the PCI(e) layout of the virtual machines. Maybe the NVIDIA driver cannot handle the change made in QEMU 6.2?