A6000 vGPU + PVE 8: VM crashes during driver install

hundlos

Member
Jul 23, 2021
Hi,

I am trying to get a new Dell server with Sapphire Rapids CPUs and an NVIDIA A6000 to work in vGPU mode.

I have tried every combination of drivers and Proxmox VE versions I could think of: 7.4 or 8.0, with the newest or slightly older drivers.

Once I had installed the host and enabled SR-IOV, I could assign the newly created devices to the Proxmox VM.

I can start the VM just fine. Once the VM is up (I also tried the newer Windows Server 2022 and Windows 11), it crashes as soon as I install the newest GRID/KVM driver in the VM. There is no error message; the VM just reboots.

During startup, the log looks like this on version 8 with the newest vGPU drivers:

Code:
Aug 01 10:56:14 ber1proxencoder09 systemd[1]: Created slice qemu.slice - Slice /qemu.
Aug 01 10:56:14 ber1proxencoder09 systemd[1]: Started 101.scope.
Aug 01 10:56:14 ber1proxencoder09 kernel: device tap101i0 entered promiscuous mode
Aug 01 10:56:14 ber1proxencoder09 kernel: vmbr0: port 2(tap101i0) entered blocking state
Aug 01 10:56:14 ber1proxencoder09 kernel: vmbr0: port 2(tap101i0) entered disabled state
Aug 01 10:56:14 ber1proxencoder09 kernel: vmbr0: port 2(tap101i0) entered blocking state
Aug 01 10:56:14 ber1proxencoder09 kernel: vmbr0: port 2(tap101i0) entered forwarding state
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 00000000-0000-0000-0000-000000000101 GPU PCI id 00:0a:00.4 config params vgpu_type_id=530
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=530
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_env_log: Successfully updated env symbols!
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): detected a VF at 0:a:0.4
Aug 01 10:56:20 ber1proxencoder09 kernel: NVRM: Software scheduler timeslice set to 2083uS.
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): gpu-pci-id : 0xa00
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): Framebuffer: 0x2cc000000
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): Virtual Device Id: 0x2230:0x1502
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: ######## vGPU Manager Information: ########
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: Driver Version: 535.54.06
Aug 01 10:56:21 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x120001)
Aug 01 10:56:21 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Aug 01 10:56:21 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): vGPU migration enabled
Aug 01 10:56:21 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): vGPU manager is running in SRIOV mode.
Aug 01 10:56:21 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: display_init inst: 0 successful
Aug 01 10:56:21 ber1proxencoder09 kernel: [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000101: vGPU migration enabled with upstream V2 migration protocol
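To compare these nvidia-vgpu-mgr lines across attempts, or to turn the hex framebuffer size into something readable, a small Python sketch like the one below could help. It assumes the lines look exactly like the ones posted here ("vmiop_log: [(0x0): ]key : value"); the function name parse_vmiop_fields is made up for illustration and is not part of any NVIDIA tool.

Code:
```python
import re

# Sample lines copied from the nvidia-vgpu-mgr journal output in this post.
LOG = """\
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): gpu-pci-id : 0xa00
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): Framebuffer: 0x2cc000000
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): Virtual Device Id: 0x2230:0x1502
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
Aug 01 10:56:20 ber1proxencoder09 nvidia-vgpu-mgr[3205]: notice: vmiop_log: Driver Version: 535.54.06
"""

# "vmiop_log: ", an optional "(0x0): " instance tag, then "key : value".
# The key may not contain ':' and the separator is ':' plus whitespace,
# so values such as "0x2230:0x1502" stay intact.
FIELD_RE = re.compile(r"vmiop_log: (?:\(0x[0-9a-f]+\): )?([^:]+?)\s*:\s+(.+)$")


def parse_vmiop_fields(text):
    """Hypothetical helper: map key -> value for every 'key : value' vmiop_log line."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.search(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


fields = parse_vmiop_fields(LOG)
# The framebuffer size is logged in bytes as hex; convert it to GiB.
fb_gib = int(fields["Framebuffer"], 16) / 2**30
print(fields["Driver Version"], fields["vgpu_type"], f"{fb_gib:.4f} GiB")
# prints: 535.54.06 Quadro 11.1875 GiB
```

Lines without a "key : value" shape (for example "detected a VF at 0:a:0.4") are simply skipped, because the colon there is not followed by whitespace.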


I'm wondering if the IOMMU is the issue:

Code:
[    0.008403] ACPI: DMAR 0x0000000077501000 000438 (v01 DELL   PE_SC3   00000001 INTL 20091013)
[    0.008425] ACPI: Reserving DMAR table memory at [mem 0x77501000-0x77501437]
[    0.316192] DMAR: IOMMU enabled
[    0.640952] DMAR: Host address width 46
[    0.640953] DMAR: DRHD base: 0x000000d97fc000 flags: 0x0
[    0.640960] DMAR: dmar0: reg_base_addr d97fc000 ver 6:0 cap 9ed008c40780466 ecap 3ee9e86f050df
[    0.640963] DMAR: DRHD base: 0x000000e17fc000 flags: 0x0
[    0.640966] DMAR: dmar1: reg_base_addr e17fc000 ver 6:0 cap 9ed008c40780466 ecap 3ee9e86f050df
[    0.640968] DMAR: DRHD base: 0x000000e97fc000 flags: 0x0
[    0.640971] DMAR: dmar2: reg_base_addr e97fc000 ver 6:0 cap 9ed008c40780466 ecap 3ee9e86f050df
[    0.640973] DMAR: DRHD base: 0x000000f17fc000 flags: 0x0
[    0.640977] DMAR: dmar3: reg_base_addr f17fc000 ver 6:0 cap 9ed008c40780466 ecap 3ee9e86f050df
[    0.640979] DMAR: DRHD base: 0x000000f97fc000 flags: 0x0
[    0.640988] DMAR: dmar4: reg_base_addr f97fc000 ver 6:0 cap 9ed008c40780466 ecap 3ee9e86f050df
[    0.640989] DMAR: DRHD base: 0x000000d13fc000 flags: 0x0
[    0.640993] DMAR: dmar5: reg_base_addr d13fc000 ver 6:0 cap 9ed008c40780466 ecap 3ee9e86f050df
[    0.640994] DMAR: DRHD base: 0x000000f9ffc000 flags: 0x0
[    0.640998] DMAR: dmar6: reg_base_addr f9ffc000 ver 6:0 cap 9ed008c40780466 ecap 3ef9ea6f050df
[    0.640999] DMAR: DRHD base: 0x000000fa7fc000 flags: 0x0
[    0.641003] DMAR: dmar7: reg_base_addr fa7fc000 ver 6:0 cap 9ed008c40780466 ecap 3ef9ea6f050df
[    0.641004] DMAR: DRHD base: 0x000000faffc000 flags: 0x0
[    0.641007] DMAR: dmar8: reg_base_addr faffc000 ver 6:0 cap 9ed008c40780466 ecap 3ef9ea6f050df
[    0.641009] DMAR: DRHD base: 0x000000fb7fc000 flags: 0x0
[    0.641012] DMAR: dmar9: reg_base_addr fb7fc000 ver 6:0 cap 9ed008c40780466 ecap 3ef9ea6f050df
[    0.641013] DMAR: DRHD base: 0x0000009fbfc000 flags: 0x0
[    0.641017] DMAR: dmar10: reg_base_addr 9fbfc000 ver 6:0 cap 9ed008c40780466 ecap 3ee9e86f050df
[    0.641018] DMAR: DRHD base: 0x000000a97fc000 flags: 0x0
[    0.641022] DMAR: dmar11: reg_base_addr a97fc000 ver 6:0 cap 9ed008c40780466 ecap 3ee9e86f050df
[    0.641024] DMAR: DRHD base: 0x000000b33fc000 flags: 0x0
[    0.641027] DMAR: dmar12: reg_base_addr b33fc000 ver 6:0 cap 9ed008c40780466 ecap 3ee9e86f050df
[    0.641028] DMAR: DRHD base: 0x000000bcffc000 flags: 0x0
[    0.641031] DMAR: dmar13: reg_base_addr bcffc000 ver 6:0 cap 9ed008c40780466 ecap 3ee9e86f050df
[    0.641032] DMAR: DRHD base: 0x000000c67fc000 flags: 0x0
[    0.641035] DMAR: dmar14: reg_base_addr c67fc000 ver 6:0 cap 9ed008c40780466 ecap 3ee9e86f050df
[    0.641037] DMAR: DRHD base: 0x000000c6ffc000 flags: 0x0
[    0.641040] DMAR: dmar15: reg_base_addr c6ffc000 ver 6:0 cap 9ed008c40780466 ecap 3ef9ea6f050df
[    0.641041] DMAR: DRHD base: 0x000000c77fc000 flags: 0x0
[    0.641044] DMAR: dmar16: reg_base_addr c77fc000 ver 6:0 cap 9ed008c40780466 ecap 3ef9ea6f050df
[    0.641045] DMAR: DRHD base: 0x000000c7ffc000 flags: 0x0
[    0.641048] DMAR: dmar17: reg_base_addr c7ffc000 ver 6:0 cap 9ed008c40780466 ecap 3ef9ea6f050df
[    0.641050] DMAR: DRHD base: 0x000000c87fc000 flags: 0x0
[    0.641053] DMAR: dmar18: reg_base_addr c87fc000 ver 6:0 cap 9ed008c40780466 ecap 3ef9ea6f050df
[    0.641054] DMAR: DRHD base: 0x00000095ffc000 flags: 0x1
[    0.641057] DMAR: dmar19: reg_base_addr 95ffc000 ver 6:0 cap 9ed008c40780466 ecap 3ee9e86f050df
[    0.641059] DMAR: RMRR base: 0x0000006e675000 end: 0x0000006e677fff
[    0.641061] DMAR: ATSR flags: 0x0
[    0.641062] DMAR: RHSA base: 0x00000095ffc000 proximity domain: 0x0
[    0.641063] DMAR: RHSA base: 0x0000009fbfc000 proximity domain: 0x0
[    0.641064] DMAR: RHSA base: 0x000000a97fc000 proximity domain: 0x0
[    0.641065] DMAR: RHSA base: 0x000000b33fc000 proximity domain: 0x0
[    0.641065] DMAR: RHSA base: 0x000000bcffc000 proximity domain: 0x0
[    0.641066] DMAR: RHSA base: 0x000000c67fc000 proximity domain: 0x0
[    0.641067] DMAR: RHSA base: 0x000000c6ffc000 proximity domain: 0x0
[    0.641067] DMAR: RHSA base: 0x000000c77fc000 proximity domain: 0x0
[    0.641068] DMAR: RHSA base: 0x000000c7ffc000 proximity domain: 0x0
[    0.641069] DMAR: RHSA base: 0x000000c87fc000 proximity domain: 0x0
[    0.641070] DMAR: RHSA base: 0x000000d97fc000 proximity domain: 0x1
[    0.641070] DMAR: RHSA base: 0x000000e17fc000 proximity domain: 0x1
[    0.641071] DMAR: RHSA base: 0x000000e97fc000 proximity domain: 0x1
[    0.641072] DMAR: RHSA base: 0x000000f17fc000 proximity domain: 0x1
[    0.641072] DMAR: RHSA base: 0x000000f97fc000 proximity domain: 0x1
[    0.641073] DMAR: RHSA base: 0x000000d13fc000 proximity domain: 0x1
[    0.641074] DMAR: RHSA base: 0x000000f9ffc000 proximity domain: 0x1
[    0.641075] DMAR: RHSA base: 0x000000fa7fc000 proximity domain: 0x1
[    0.641075] DMAR: RHSA base: 0x000000faffc000 proximity domain: 0x1
[    0.641076] DMAR: RHSA base: 0x000000fb7fc000 proximity domain: 0x1
[    0.641077] DMAR: SATC flags: 0x0
[    0.641079] DMAR-IR: IOAPIC id 8 under DRHD base  0x95ffc000 IOMMU 19
[    0.641081] DMAR-IR: HPET id 0 under DRHD base 0x95ffc000
[    0.641082] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.647743] DMAR-IR: Enabled IRQ remapping in x2apic mode
[   14.112934] DMAR: IOMMU feature pasid inconsistent
[   14.112936] DMAR: IOMMU feature prs inconsistent
[   14.112937] DMAR: IOMMU feature pasid inconsistent
[   14.112938] DMAR: IOMMU feature prs inconsistent
[   14.112939] DMAR: IOMMU feature pasid inconsistent
[   14.112940] DMAR: IOMMU feature prs inconsistent
[   14.112940] DMAR: IOMMU feature pasid inconsistent
[   14.112941] DMAR: IOMMU feature prs inconsistent
[   14.112942] DMAR: IOMMU feature pasid inconsistent
[   14.112942] DMAR: IOMMU feature prs inconsistent
[   14.112944] DMAR: IOMMU feature pasid inconsistent
[   14.112944] DMAR: IOMMU feature prs inconsistent
[   14.112945] DMAR: IOMMU feature pasid inconsistent
[   14.112946] DMAR: IOMMU feature prs inconsistent
[   14.112947] DMAR: IOMMU feature pasid inconsistent
[   14.112947] DMAR: IOMMU feature prs inconsistent
[   14.112948] DMAR: IOMMU feature pasid inconsistent
[   14.112948] DMAR: IOMMU feature prs inconsistent
[   14.112949] DMAR: IOMMU feature pasid inconsistent
[   14.112950] DMAR: IOMMU feature prs inconsistent
[   14.112951] DMAR: IOMMU feature pasid inconsistent
[   14.112951] DMAR: IOMMU feature prs inconsistent
[   14.112952] DMAR: IOMMU feature pasid inconsistent
[   14.112952] DMAR: IOMMU feature prs inconsistent
[   14.112953] DMAR: dmar18: Using Queued invalidation
[   14.112961] DMAR: Translation was enabled for dmar18 but we are not in kdump mode
[   14.112963] DMAR: dmar17: Using Queued invalidation
[   14.112965] DMAR: Translation was enabled for dmar17 but we are not in kdump mode
[   14.112966] DMAR: dmar16: Using Queued invalidation
[   14.112968] DMAR: Translation was enabled for dmar16 but we are not in kdump mode
[   14.112969] DMAR: dmar15: Using Queued invalidation
[   14.112971] DMAR: Translation was enabled for dmar15 but we are not in kdump mode
[   14.112972] DMAR: dmar14: Using Queued invalidation
[   14.112981] DMAR: Translation was enabled for dmar14 but we are not in kdump mode
[   14.112982] DMAR: dmar13: Using Queued invalidation
[   14.112984] DMAR: Translation was enabled for dmar13 but we are not in kdump mode
[   14.112985] DMAR: dmar12: Using Queued invalidation
[   14.112987] DMAR: Translation was enabled for dmar12 but we are not in kdump mode
[   14.112988] DMAR: dmar11: Using Queued invalidation
[   14.112990] DMAR: Translation was enabled for dmar11 but we are not in kdump mode
[   14.112991] DMAR: dmar10: Using Queued invalidation
[   14.112998] DMAR: Translation was enabled for dmar10 but we are not in kdump mode
[   14.112999] DMAR: dmar9: Using Queued invalidation
[   14.113001] DMAR: Translation was enabled for dmar9 but we are not in kdump mode
[   14.113003] DMAR: dmar8: Using Queued invalidation
[   14.113005] DMAR: Translation was enabled for dmar8 but we are not in kdump mode
[   14.113006] DMAR: dmar7: Using Queued invalidation
[   14.113009] DMAR: Translation was enabled for dmar7 but we are not in kdump mode
[   14.113010] DMAR: dmar6: Using Queued invalidation
[   14.113018] DMAR: Translation was enabled for dmar6 but we are not in kdump mode
[   14.113019] DMAR: dmar5: Using Queued invalidation
[   14.113022] DMAR: Translation was enabled for dmar5 but we are not in kdump mode
[   14.113023] DMAR: dmar4: Using Queued invalidation
[   14.113025] DMAR: Translation was enabled for dmar4 but we are not in kdump mode
[   14.113026] DMAR: dmar3: Using Queued invalidation
[   14.113029] DMAR: Translation was enabled for dmar3 but we are not in kdump mode
[   14.113030] DMAR: dmar2: Using Queued invalidation
[   14.113036] DMAR: Translation was enabled for dmar2 but we are not in kdump mode
[   14.113038] DMAR: dmar1: Using Queued invalidation
[   14.113040] DMAR: Translation was enabled for dmar1 but we are not in kdump mode
[   14.113041] DMAR: dmar0: Using Queued invalidation
[   14.113043] DMAR: Translation was enabled for dmar0 but we are not in kdump mode
[   14.113046] DMAR: dmar19: Using Queued invalidation
[   14.113049] DMAR: Translation was enabled for dmar19 but we are not in kdump mode
[   14.131672] DMAR: Intel(R) Virtualization Technology for Directed I/O
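The repeated "IOMMU feature pasid/prs inconsistent" lines can be cross-checked against the ecap values printed earlier in the same log, where two different values appear (3ee9e86f050df and 3ef9ea6f050df). The sketch below assumes the bit positions used by the Linux intel-iommu driver's ecap_prs() and ecap_pasid() macros (bit 29 for Page Request Support, bit 40 for PASID support, as defined in the Intel VT-d specification) and shows that these are exactly the two bits in which the DMAR units differ. It does not say whether the mismatch matters for vGPU.

Code:
```python
# ecap value per DMAR unit, copied from the dmesg output above.
ECAP = {
    **{f"dmar{i}": 0x3EE9E86F050DF for i in (0, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 19)},
    **{f"dmar{i}": 0x3EF9EA6F050DF for i in (6, 7, 8, 9, 15, 16, 17, 18)},
}

# Bit positions as used by the Linux intel-iommu driver (ecap_prs, ecap_pasid);
# assumed from the Intel VT-d Extended Capability Register layout.
FEATURE_BITS = {"PRS": 29, "PASID": 40}


def has_feature(ecap, bit):
    """True if the given extended-capability bit is set."""
    return bool((ecap >> bit) & 1)


def inconsistent_features(ecaps):
    """Names of features that are set on some units but not on others."""
    result = []
    for name, bit in FEATURE_BITS.items():
        states = {has_feature(value, bit) for value in ecaps.values()}
        if len(states) > 1:
            result.append(name)
    return result


# XOR of the two distinct ecap values exposes every bit in which they differ.
a, b = sorted(set(ECAP.values()))
diff_bits = [bit for bit in range(64) if ((a ^ b) >> bit) & 1]

print("differing bits:", diff_bits)                 # [29, 40]
print("inconsistent:", inconsistent_features(ECAP))  # ['PRS', 'PASID']
for name, bit in FEATURE_BITS.items():
    count = sum(has_feature(value, bit) for value in ECAP.values())
    print(f"{name} supported on {count} of {len(ECAP)} units")  # 8 of 20
```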


Any ideas?
 
Which guest driver did you try to install, exactly?
 
The latest drivers are:
535.54.06 on the host
536.25 GRID for Windows 10, 11, Server 2019 and Server 2022

I got it working now with Windows 11 and this driver, but only after unticking PCI-Express, ROM-Bar and Primary GPU in the PCI device's Edit dialog.


Now Windows 11 comes up on every second start.

I'll keep trying :)
 
I'd leave PCI-Express, ROM-Bar and Primary GPU at their defaults; these are mostly relevant for passing through a physical card rather than for vGPUs.
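For reference, the three checkboxes correspond to options of the hostpci entry in the VM config: PCI-Express is pcie, ROM-Bar is rombar, and Primary GPU is x-vga. According to the qm.conf documentation their defaults are pcie=0, rombar=1 and x-vga=0. The helper render_hostpci below is hypothetical and does not call Proxmox; it only illustrates how a checkbox state turns into the option string, assuming options left at their default are not written (which matches the config posted in this thread, where only rombar=0 appears).

Code:
```python
# Documented defaults for the options behind the GUI checkboxes
# (assumed from the qm.conf documentation for hostpci[n]):
#   PCI-Express -> pcie   (default 0)
#   ROM-Bar     -> rombar (default 1)
#   Primary GPU -> x-vga  (default 0)
DEFAULTS = {"pcie": 0, "rombar": 1, "x-vga": 0}


def render_hostpci(host, mdev=None, pcie=False, rombar=True, primary_gpu=False):
    """Hypothetical helper: turn checkbox states into a hostpci value string.

    Options that match their default are omitted.
    """
    parts = [host]
    if mdev:
        parts.append(f"mdev={mdev}")
    chosen = {"pcie": int(pcie), "rombar": int(rombar), "x-vga": int(primary_gpu)}
    for key, value in chosen.items():
        if value != DEFAULTS[key]:
            parts.append(f"{key}={value}")
    return ",".join(parts)


# All checkboxes at their defaults, as recommended in the reply above.
print(render_hostpci("0000:0a:00.5", mdev="nvidia-527"))
# prints: 0000:0a:00.5,mdev=nvidia-527

# The working setup from this thread, with all three boxes unticked.
print(render_hostpci("0000:0a:00.5", mdev="nvidia-527",
                     pcie=False, rombar=False, primary_gpu=False))
# prints: 0000:0a:00.5,mdev=nvidia-527,rombar=0
```

Note that unticking PCI-Express and Primary GPU produces no visible change, since both are already off by default; only the ROM-Bar change ends up in the config.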

Can you post the VM config (qm config ID)?
 
So I got it working now with ROM-Bar, PCI-Express and Primary GPU turned off.

My config is:

Code:
agent: 1
bios: ovmf
boot: order=ide0;ide2;net0
cores: 16
cpu: x86-64-v2-AES
efidisk0: local:300/vm-300-disk-0.raw,efitype=4m,pre-enrolled-keys=1,size=528K
hostpci0: 0000:0a:00.5,mdev=nvidia-527,rombar=0
ide0: local:300/vm-300-disk-1.raw,size=42G
ide2: local:iso/Win11_22H2_German_x64v2.iso,media=cdrom,size=5702262K
machine: pc-q35-8.0
memory: 32768
meta: creation-qemu=8.0.2,ctime=1690876337
name: windows11
net0: e1000=36:51:FC:79:07:6F,bridge=vmbr0
numa: 1
ostype: win11
scsihw: virtio-scsi-single
smbios1: uuid=9c8ba66d-660d-493e-816c-3395ae2d86a5
sockets: 2
tpmstate0: local:300/vm-300-disk-2.raw,size=4M,version=v2.0
vga: none
vmgenid: aae57529-60ac-409a-be98-2946b75a40d6
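When comparing configs between attempts, the qm config output can also be checked programmatically. The sketch below assumes the flat "key: value" line format shown above (snapshot sections are not handled) and the comma-separated "host,key=value" format of hostpci entries; parse_qm_config and split_property_string are hypothetical names. It summarizes the guest topology and the hostpci0 options, and checks for the q35 machine type, which the qm documentation lists as a requirement for pcie=1.

Code:
```python
# A subset of the config posted above.
CONFIG = """\
bios: ovmf
cores: 16
cpu: x86-64-v2-AES
hostpci0: 0000:0a:00.5,mdev=nvidia-527,rombar=0
machine: pc-q35-8.0
memory: 32768
numa: 1
sockets: 2
vga: none
"""


def parse_qm_config(text):
    """Hypothetical helper: parse flat 'key: value' lines into a dict."""
    config = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and description comments
        key, _, value = line.partition(": ")
        config[key] = value
    return config


def split_property_string(value):
    """Split 'host,key=value,...' into the leading item and an options dict."""
    first, *rest = value.split(",")
    options = dict(item.split("=", 1) for item in rest)
    return first, options


config = parse_qm_config(CONFIG)
host, options = split_property_string(config["hostpci0"])
vcpus = int(config["sockets"]) * int(config["cores"])
memory_gib = int(config["memory"]) / 1024  # memory is given in MiB
is_q35 = "q35" in config["machine"]

print(host, options)
# prints: 0000:0a:00.5 {'mdev': 'nvidia-527', 'rombar': '0'}
print(f"{vcpus} vCPUs, {memory_gib:.0f} GiB RAM, q35={is_q35}")
# prints: 32 vCPUs, 32 GiB RAM, q35=True
```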
 
