[SOLVED] vGPU just stopped working randomly (solution includes 6.14, Pascal fixes for 17.5, changing mock P4 to A5500 thanks to GreenDam)

zenowl77

Stopped one VM and started another; it says it cannot allocate memory on the GPU. Rebooted, reinstalled drivers, etc., nothing is working. It was working fine before I closed the last VM (which now also will not start back up).

error message:
Code:
error writing '00000001-0000-0000-0000-000000000128' to '/sys/bus/pci/devices/0000:17:00.0/mdev_supported_types/nvidia-70/create': Cannot allocate memory
could not create 'type' for pci devices '0000:17:00.0'
TASK ERROR: could not create mediated device

system log:
Code:
Mar 29 00:50:01 prox nvidia-vgpu-mgr[3016]: cmd: 0xc03 failed.
Mar 29 00:50:01 prox kernel: [nvidia-vgpu-vfio] Failed to get instances, 0x40
Mar 29 00:50:01 prox kernel: [nvidia-vgpu-vfio] 00000001-0000-0000-0000-000000000128: vGPU creation failed on device 0x1700. -5
Mar 29 00:50:01 prox kernel: [nvidia-vgpu-vfio] 00000001-0000-0000-0000-000000000128: Failed to create mdev device
Mar 29 00:50:01 prox kernel: [nvidia-vgpu-vfio] Failed to allocate vGPU device
Mar 29 00:50:01 prox kernel: nvidia-vgpu-vfio: probe of 00000001-0000-0000-0000-000000000128 failed with error -12
Mar 29 00:50:01 prox nvidia-vgpu-mgr[3016]: error: vmiop_env_log: Failed to get instances for GPU 0x1700, 0x40
Mar 29 00:50:01 prox nvidia-vgpu-mgr[3016]: cmd: 0xc02 failed.
Mar 29 00:50:01 prox nvidia-vgpu-mgr[3016]: error: vmiop_env_log: Failed to create device on GPU 0x1700 0x66
Mar 29 00:50:01 prox pvedaemon[10973]: error writing '00000001-0000-0000-0000-000000000128' to '/sys/bus/pci/devices/0000:17:00.0/mdev_supported_types/nvidia-70/create': Cannot allocate memory
Mar 29 00:50:01 prox pvedaemon[10973]: could not create 'type' for pci devices '0000:17:00.0'
Mar 29 00:50:01 prox pvedaemon[10973]: could not create mediated device
Mar 29 00:50:01 prox pvedaemon[3384]: <root@pam> end task UPID:prox:00002ADD:00009F52:67E743B8:qmstart:128:root@pam: could not create mediated device
Mar 29 00:50:06 prox nvidia-vgpu-mgr[3016]: cmd: 0xc03 failed.
Mar 29 00:50:06 prox kernel: [nvidia-vgpu-vfio] Failed to get instances, 0x40
Mar 29 00:50:06 prox kernel: [nvidia-vgpu-vfio] 00000001-0000-0000-0000-000000000128: vGPU creation failed on device 0x1700. -5
Mar 29 00:50:06 prox kernel: [nvidia-vgpu-vfio] 00000001-0000-0000-0000-000000000128: Failed to create mdev device
Mar 29 00:50:06 prox kernel: [nvidia-vgpu-vfio] Failed to allocate vGPU device
Mar 29 00:50:06 prox kernel: nvidia-vgpu-vfio: probe of 00000001-0000-0000-0000-000000000128 failed with error -12
Mar 29 00:50:06 prox nvidia-vgpu-mgr[3016]: error: vmiop_env_log: Failed to get instances for GPU 0x1700, 0x40
Mar 29 00:50:06 prox nvidia-vgpu-mgr[3016]: cmd: 0xc02 failed.
Mar 29 00:50:06 prox nvidia-vgpu-mgr[3016]: error: vmiop_env_log: Failed to create device on GPU 0x1700 0x66
Mar 29 00:50:06 prox pvedaemon[11053]: error writing '00000001-0000-0000-0000-000000000128' to '/sys/bus/pci/devices/0000:17:00.0/mdev_supported_types/nvidia-70/create': Cannot allocate memory
Mar 29 00:50:06 prox pvedaemon[11053]: could not create 'type' for pci devices '0000:17:00.0'
Mar 29 00:50:06 prox pvedaemon[11053]: could not create mediated device
Mar 29 00:50:06 prox pvedaemon[3385]: <root@pam> end task UPID:prox:00002B2D:0000A175:67E743BE:qmstart:128:root@pam: could not create mediated device
Mar 29 00:50:09 prox nvidia-vgpu-mgr[3016]: cmd: 0xc03 failed.
Mar 29 00:50:09 prox kernel: [nvidia-vgpu-vfio] Failed to get instances, 0x40
Mar 29 00:50:09 prox kernel: [nvidia-vgpu-vfio] 00000001-0000-0000-0000-000000000128: vGPU creation failed on device 0x1700. -5
Mar 29 00:50:09 prox kernel: [nvidia-vgpu-vfio] 00000001-0000-0000-0000-000000000128: Failed to create mdev device
Mar 29 00:50:09 prox kernel: [nvidia-vgpu-vfio] Failed to allocate vGPU device
Mar 29 00:50:09 prox kernel: nvidia-vgpu-vfio: probe of 00000001-0000-0000-0000-000000000128 failed with error -12
Mar 29 00:50:09 prox nvidia-vgpu-mgr[3016]: error: vmiop_env_log: Failed to get instances for GPU 0x1700, 0x40
Mar 29 00:50:09 prox nvidia-vgpu-mgr[3016]: cmd: 0xc02 failed.
Mar 29 00:50:09 prox nvidia-vgpu-mgr[3016]: error: vmiop_env_log: Failed to create device on GPU 0x1700 0x66
Mar 29 00:50:09 prox pvedaemon[11101]: error writing '00000001-0000-0000-0000-000000000128' to '/sys/bus/pci/devices/0000:17:00.0/mdev_supported_types/nvidia-70/create': Cannot allocate memory
Mar 29 00:50:09 prox pvedaemon[11101]: could not create 'type' for pci devices '0000:17:00.0'
Mar 29 00:50:09 prox pvedaemon[11101]: could not create mediated device
Mar 29 00:50:09 prox pvedaemon[3384]: <root@pam> end task UPID:prox:00002B5D:0000A2AC:67E743C1:qmstart:128:root@pam: could not create mediated device
 
I recently tried the pve test repo hoping to see some updates/fixes for a few other things. Most recently installed packages:

Code:
2025-03-28 18:52:42 upgrade pve-edk2-firmware-ovmf:all 4.2025.02-1 4.2025.02-2
2025-03-28 18:52:43 upgrade pve-edk2-firmware-legacy:all 4.2025.02-1 4.2025.02-2
2025-03-28 18:52:43 upgrade pve-edk2-firmware:all 4.2025.02-1 4.2025.02-2
2025-03-28 18:52:43 upgrade pve-firmware:all 3.15-1 3.15-2

The pve-edk2-firmware packages are the only recently installed packages that seem like they could potentially break the VM booting with vGPU. (I am pretty sure these were updated while the last VM was running.)
 
For anyone else having this problem: it isn't exactly a fix, but I found that running systemctl restart nvidia-vgpu-mgr.service after boot, to restart the vGPU manager, restores vGPU functionality.
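In other words, once the host is back up (the PCI address here is from my system, adjust it for yours):

Code:
# restart the vGPU manager once after boot (workaround, not a fix)
systemctl restart nvidia-vgpu-mgr.service
# sanity check that the mdev types are back before starting VMs
ls /sys/bus/pci/devices/0000:17:00.0/mdev_supported_types/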
 
Slight threadjack:

Which driver, and what version, are you running now?

With kernel 6.14 and the patched 16.9 drivers, when I start a VM that has my P4 passed through (not using the A5500 patch yet), I see this on the host:

Note: I do not get this with kernel 6.11 on the host (just 6.14)
Code:
[  141.872136] ------------[ cut here ]------------
[  141.872163] WARNING: CPU: 19 PID: 7908 at ./include/linux/rwsem.h:85 remap_pfn_range_internal+0x4af/0x5a0
[  141.872192] Modules linked in: ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables nvme_fabrics nvme_keyring nfnetlink_cttimeout softdog sunrpc binfmt_misc bonding tls openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 psample nfnetlink_log nfnetlink nvidia_vgpu_vfio(OE) xfs nvidia(POE) amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel ipmi_ssif crypto_simd cryptd mdev ast rapl pcspkr kvm acpi_ipmi ccp k10temp ipmi_si ptdma ipmi_devintf ipmi_msghandler joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq hid_generic usbkbd usbmouse mlx4_ib ib_uverbs usbhid ses hid enclosure ib_core mlx4_en mpt3sas igb xhci_pci nvme raid_class i2c_algo_bit ahci
[  141.872278]  dca mlx4_core scsi_transport_sas libahci nvme_core xhci_hcd i2c_piix4 i2c_smbus nvme_auth
[  141.872414] CPU: 19 UID: 0 PID: 7908 Comm: CPU 1/KVM Tainted: P           OE      6.14.0-2-pve #1
[  141.872432] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[  141.872445] Hardware name: Supermicro Super Server/H11SSL-NC, BIOS 3.0 07/01/2024
[  141.872460] RIP: 0010:remap_pfn_range_internal+0x4af/0x5a0
[  141.872473] Code: 31 db c3 cc cc cc cc 48 8b 7d a8 4c 89 fa 4c 89 ce 4c 89 4d c0 e8 81 e2 ff ff 85 c0 75 9c 4c 8b 4d c0 4d 8b 01 e9 aa fd ff ff <0f> 0b e9 d7 fb ff ff 0f 0b 48 8b 7d a8 4c 89 fa 48 89 de 4c 89 45
[  141.873338] RSP: 0018:ffffae58873774b0 EFLAGS: 00010246
[  141.873726] RAX: 00000000280200fb RBX: ffff8d25d7b6a0b8 RCX: 0000000000001000
[  141.874111] RDX: 0000000000000000 RSI: 00007bb81fe00000 RDI: ffff8d25d7b6a0b8
[  141.874520] RBP: ffffae5887377568 R08: 8000000000000037 R09: 0000000000000000
[  141.874903] R10: 0000000000000000 R11: ffff8d254bc47380 R12: 000000002000fdf1
[  141.875297] R13: 00007bb81fe01000 R14: 00007bb81fe00000 R15: 8000000000000037
[  141.875683] FS:  00007bbc4affd6c0(0000) GS:ffff8d638e980000(0000) knlGS:0000000000000000
[  141.876074] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  141.876469] CR2: 0000000000b50b10 CR3: 00000001ce5b8000 CR4: 0000000000350ef0
[  141.876859] Call Trace:
[  141.877245]  <TASK>
[  141.877620]  ? show_regs+0x6c/0x80
[  141.877994]  ? __warn+0x8d/0x150
[  141.878367]  ? remap_pfn_range_internal+0x4af/0x5a0
[  141.878731]  ? report_bug+0x182/0x1b0
[  141.879091]  ? handle_bug+0x6e/0xb0
[  141.879453]  ? exc_invalid_op+0x18/0x80
[  141.879809]  ? asm_exc_invalid_op+0x1b/0x20
[  141.880172]  ? remap_pfn_range_internal+0x4af/0x5a0
[  141.880530]  ? pat_pagerange_is_ram+0x7a/0xa0
[  141.880886]  ? memtype_lookup+0x3b/0x70
[  141.881244]  ? lookup_memtype+0xd1/0xf0
[  141.881595]  remap_pfn_range+0x5c/0xb0
[  141.881946]  ? up+0x58/0xa0
[  141.882302]  vgpu_mmio_fault_wrapper+0x1fa/0x340 [nvidia_vgpu_vfio]
[  141.882661]  __do_fault+0x3a/0x180
[  141.883016]  do_fault+0xca/0x4f0
[  141.883373]  __handle_mm_fault+0x840/0x10b0
[  141.883717]  handle_mm_fault+0x1a5/0x360
[  141.884056]  __get_user_pages+0x1f2/0x15d0
[  141.884402]  get_user_pages_unlocked+0xe7/0x370
[  141.884732]  hva_to_pfn+0x380/0x4c0 [kvm]
[  141.885127]  ? __perf_event_task_sched_out+0x5a/0x4a0
[  141.885447]  kvm_follow_pfn+0x97/0x100 [kvm]
[  141.885825]  __kvm_faultin_pfn+0x5c/0x90 [kvm]
[  141.886194]  kvm_mmu_faultin_pfn+0x19d/0x6e0 [kvm]
[  141.886576]  kvm_tdp_page_fault+0x8e/0xe0 [kvm]
[  141.886938]  kvm_mmu_do_page_fault+0x243/0x290 [kvm]
[  141.887301]  kvm_mmu_page_fault+0x8e/0x6d0 [kvm]
[  141.887646]  ? nv_vgpu_vfio_access+0x2d4/0x450 [nvidia_vgpu_vfio]
[  141.887915]  npf_interception+0xba/0x190 [kvm_amd]
[  141.888181]  svm_invoke_exit_handler+0x182/0x1b0 [kvm_amd]
[  141.888448]  svm_handle_exit+0xa2/0x200 [kvm_amd]
[  141.888705]  vcpu_enter_guest+0x4e8/0x1640 [kvm]
[  141.889033]  ? kvm_arch_vcpu_load+0xac/0x290 [kvm]
[  141.889359]  ? restore_fpregs_from_fpstate+0x3d/0xd0
[  141.889599]  kvm_arch_vcpu_ioctl_run+0x35d/0x750 [kvm]
[  141.889903]  kvm_vcpu_ioctl+0x2c2/0xaa0 [kvm]
[  141.890195]  ? kvm_vcpu_ioctl+0x23e/0xaa0 [kvm]
[  141.890488]  ? nv_vfio_mdev_read+0x23/0x70 [nvidia_vgpu_vfio]
[  141.890713]  __x64_sys_ioctl+0xa4/0xe0
[  141.890933]  x64_sys_call+0xb45/0x2540
[  141.891148]  do_syscall_64+0x7e/0x170
[  141.891364]  ? syscall_exit_to_user_mode+0x38/0x1d0
[  141.891575]  ? do_syscall_64+0x8a/0x170
[  141.891781]  ? arch_exit_to_user_mode_prepare.constprop.0+0xc8/0xd0
[  141.891989]  ? syscall_exit_to_user_mode+0x38/0x1d0
[  141.892193]  ? do_syscall_64+0x8a/0x170
[  141.892405]  ? syscall_exit_to_user_mode+0x38/0x1d0
[  141.892613]  ? do_syscall_64+0x8a/0x170
[  141.892820]  ? sysvec_apic_timer_interrupt+0x57/0xc0
[  141.893029]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  141.893245] RIP: 0033:0x7bbc53e81d1b
[  141.893496] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[  141.893983] RSP: 002b:00007bbc4aff7ee0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  141.894270] RAX: ffffffffffffffda RBX: 00005da48d103680 RCX: 00007bbc53e81d1b
[  141.894518] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000030
[  141.894764] RBP: 000000000000ae80 R08: 0000000000000000 R09: 0000000000000000
[  141.895011] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  141.895264] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[  141.895508]  </TASK>
[  141.895744] ---[ end trace 0000000000000000 ]---

In addition, I get this in the VM (with 6.11 and 6.14 running on the host)
Code:
[    3.407489] ------------[ cut here ]------------
[    3.407491] WARNING: CPU: 1 PID: 560 at drivers/pci/msi/msi.c:888 __pci_enable_msi_range+0x1b3/0x1d0
[    3.407500] Modules linked in: overlay lz4 lz4_compress zram zsmalloc binfmt_misc nls_ascii nls_cp437 vfat fat nvidia_drm(POE) nvidia_modeset(POE) intel_rapl_msr intel_rapl_common nvidia(POE) kvm_amd ccp kvm nouveau irqbypass ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core mxm_wmi video snd_hwdep wmi snd_pcm iTCO_wdt drm_display_helper aesni_intel cec snd_timer rc_core intel_pmc_bxt crypto_simd iTCO_vendor_support snd cryptd pcspkr hid_generic watchdog i2c_algo_bit virtio_console soundcore button joydev evdev sg serio_raw fuse loop efi_pstore dm_mod configfs efivarfs qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic usbhid hid sd_mod t10_pi crc64_rocksoft crc64 crc_t10dif crct10dif_generic virtio_net virtio_scsi net_failover failover virtio_pci ahci virtio_pci_legacy_dev libahci ehci_pci virtio_pci_modern_dev virtio crct10dif_pclmul libata crct10dif_common
[    3.407572]  virtio_ring bochs crc32_pclmul drm_vram_helper uhci_hcd crc32c_intel drm_kms_helper scsi_mod psmouse drm_ttm_helper ttm i2c_i801 scsi_common i2c_smbus lpc_ich ehci_hcd drm usbcore usb_common
[    3.407585] CPU: 1 PID: 560 Comm: nvidia-gridd Tainted: P        W  OE      6.1.0-33-amd64 #1  Debian 6.1.133-1
[    3.407589] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 4.2025.02-3 04/03/2025
[    3.407590] RIP: 0010:__pci_enable_msi_range+0x1b3/0x1d0
[    3.407593] Code: 4c 89 ef e8 df fb ff ff 89 c6 85 c0 0f 84 68 ff ff ff 78 0e 39 c5 7f 8c 4d 85 e4 75 cc 41 89 f6 eb d8 41 89 c6 e9 50 ff ff ff <0f> 0b 41 be ea ff ff ff e9 43 ff ff ff 41 be de ff ff ff e9 38 ff
[    3.407595] RSP: 0018:ffffa973c0813998 EFLAGS: 00010202
[    3.407597] RAX: 0000000000000010 RBX: 0000000000000001 RCX: 0000000000000000
[    3.407598] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9b1c40fb7000
[    3.407599] RBP: 0000000000000001 R08: 0000000000000001 R09: ffff9b1c4b759708
[    3.407600] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[    3.407601] R13: ffff9b1c40fb7000 R14: ffff9b1c4586d3e0 R15: ffff9b1c4586d000
[    3.407604] FS:  00007fb7dd066040(0000) GS:ffff9b1fafc80000(0000) knlGS:0000000000000000
[    3.407605] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.407607] CR2: 0000000000b50b10 CR3: 0000000108f32000 CR4: 0000000000350ee0
[    3.407610] Call Trace:
[    3.407612]  <TASK>
[    3.407615]  ? __warn+0x7d/0xc0
[    3.407618]  ? __pci_enable_msi_range+0x1b3/0x1d0
[    3.407620]  ? report_bug+0xe2/0x150
[    3.407623]  ? handle_bug+0x41/0x70
[    3.407627]  ? exc_invalid_op+0x13/0x60
[    3.407629]  ? asm_exc_invalid_op+0x16/0x20
[    3.407634]  ? __pci_enable_msi_range+0x1b3/0x1d0
[    3.407636]  pci_enable_msi+0x16/0x30
[    3.407638]  nv_init_msi+0x1a/0xe0 [nvidia]
[    3.408007]  nv_open_device+0x843/0x940 [nvidia]
[    3.408364]  nvidia_open+0x361/0x610 [nvidia]
[    3.408721]  ? kobj_lookup+0xf1/0x160
[    3.408725]  nvidia_frontend_open+0x50/0xa0 [nvidia]
[    3.409109]  chrdev_open+0xc1/0x250
[    3.409113]  ? __unregister_chrdev+0x50/0x50
[    3.409116]  do_dentry_open+0x1e2/0x410
[    3.409119]  path_openat+0xb7d/0x1260
[    3.409122]  do_filp_open+0xaf/0x160
[    3.409126]  do_sys_openat2+0xaf/0x170
[    3.409128]  __x64_sys_openat+0x6a/0xa0
[    3.409131]  do_syscall_64+0x55/0xb0
[    3.409134]  ? call_rcu+0xde/0x630
[    3.409137]  ? mntput_no_expire+0x4a/0x250
[    3.409141]  ? kmem_cache_free+0x15/0x310
[    3.409144]  ? do_unlinkat+0xb8/0x320
[    3.409146]  ? exit_to_user_mode_prepare+0x40/0x1e0
[    3.409149]  ? syscall_exit_to_user_mode+0x1e/0x40
[    3.409150]  ? do_syscall_64+0x61/0xb0
[    3.409153]  ? exit_to_user_mode_prepare+0x40/0x1e0
[    3.409155]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[    3.409157] RIP: 0033:0x7fb7dcc36fc1
[    3.409159] Code: 75 57 89 f0 25 00 00 41 00 3d 00 00 41 00 74 49 80 3d 2a 26 0e 00 00 74 6d 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 93 00 00 00 48 8b 54 24 28 64 48 2b 14 25
[    3.409161] RSP: 002b:00007fffd8dede30 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[    3.409163] RAX: ffffffffffffffda RBX: 0000000000080002 RCX: 00007fb7dcc36fc1
[    3.409164] RDX: 0000000000080002 RSI: 00007fffd8dedec0 RDI: 00000000ffffff9c
[    3.409165] RBP: 00007fffd8dedec0 R08: 0000000000000000 R09: 0000000000000064
[    3.409166] R10: 0000000000000000 R11: 0000000000000202 R12: 00007fffd8dedfec
[    3.409167] R13: 0000000000c22560 R14: 0000000000c22560 R15: 0000000000c22560
[    3.409169]  </TASK>
[    3.409170] ---[ end trace 0000000000000000 ]---
[    3.409171] NVRM: GPU 0000:01:00.0: Failed to enable MSI; falling back to PCIe virtual-wire interrupts.

Everything seems to be working despite these warnings. I just don't remember seeing them before I started changing kernels and updating the drivers.

Is there any real benefit to using the 17.x drivers and mocking an A5500? To apply the A5500 patch, I guess I need to remove and reinstall vgpu_unlock-rs, but using the GreenDamTan repo and instructions?
 
No worries, I am running 17.5 / 550.144.02

Not sure why that would show up for you, but if everything works it should be okay, I'm guessing. It just looks like one newer mechanism isn't working and it's falling back to an alternative, which should be fine; the P4 might not be compatible with some newer functionality. I could be wrong, though.

I checked my log, though, and I am not seeing the same error message, so that leads me to believe this is not related to the P4 or the driver, unless it was something in 16.9.

The only benefit is newer drivers with more fixes, optimizations, etc. There's a feature to set the system fallback on or off that isn't present in 535, and they seem to work slightly better. I am certainly happier being able to use the newer drivers. One thing, too: once your drivers are old enough, new software stops supporting them and functionality stops working with tools, so the newer you can get working with your old card, the better. It will be a lot longer before 550/553 gets dropped for being too old, and a lot of software is probably more likely to support an A5500 than a P4/P40.

Mostly all you need to do is replace vgpu_unlock-rs (I just downloaded the files and dropped them into the folder on the Proxmox host manually), then build it.

After that just go through with installing the patched 17.5/550.144.02 driver.

It should be fairly effortless. It's made to fit right into everything from the original vGPU guide; GreenDam did a great job.
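Roughly, something like this (the /opt path follows the original vGPU guide layout, so treat it as an assumption; the repo location is in GreenDamTan's instructions):

Code:
# back up the existing vgpu_unlock-rs, drop in GreenDamTan's sources, then rebuild
cd /opt
mv vgpu_unlock-rs vgpu_unlock-rs.bak
# (copy or clone GreenDamTan's vgpu_unlock-rs into /opt/vgpu_unlock-rs here)
cd vgpu_unlock-rs
cargo build --release
# then install the patched 17.5 / 550.144.02 host driver as per the guide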
 
Did you also need to copy the vgpuConfig.xml over from the v16 drivers? I'm going through the process now to try out v17 with everything.
 
I did do that yes, forgot that step. Thank you for bringing it up for anyone who needs this in the future too.

(You need that, or the P4 won't show up as mdev-capable.)
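For reference, a rough sketch of that copy step (the /usr/share/nvidia/vgpu/ location and the 535 file name are assumptions from the usual setup; adjust for your v16 host driver version):

Code:
# unpack the v16 (535.x) vGPU host driver package without installing it (file name is a placeholder)
./NVIDIA-Linux-x86_64-535.xx-vgpu-kvm.run -x
# copy its Pascal-aware vgpuConfig.xml over the one the 17.5 driver installed
cp NVIDIA-Linux-x86_64-535.xx-vgpu-kvm/vgpuConfig.xml /usr/share/nvidia/vgpu/vgpuConfig.xml
# restart the vGPU services so the config is re-read
systemctl restart nvidia-vgpud.service nvidia-vgpu-mgr.service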
 
I've spent some time trying to figure this out, but I might be missing something.

I updated my vgpu_unlock-rs to be GreenDam's, did the cargo build step and rebooted.
Installed the 17.5 host drivers and copied the vgpuConfig.xml over.

I am able to see my P4 with nvidia-smi and I can see the mdev types.
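Concretely, the checks I mean are roughly these (the PCI address is just an example, use your card's):

Code:
nvidia-smi                                                   # host driver sees the P4
ls /sys/bus/pci/devices/0000:17:00.0/mdev_supported_types/  # vGPU mdev types are listed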

I set the vendor/subvendor to 0x10de / 0x10de
I set the device-id/sub-device-id to 0x2233 / 0x165a

When I boot up the VM (Debian 12) I see
Code:
01:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [RTX A5500] [10de:2233] (rev a1)

Ok, so far, so good. I then try to install the client 17.5 drivers (NVIDIA-Linux-x86_64-550.144.03-grid.run), but end up getting

Code:
[  218.236040] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:2233)
               NVRM: installed in this system is not supported by the
               NVRM: NVIDIA 550.144.03 driver release.
               NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
               NVRM: in this release's README, available on the operating system
               NVRM: specific graphics driver download page at www.nvidia.com.
[  218.237225] nvidia: probe of 0000:01:00.0 failed with error -1

I'm not sure what I'm overlooking.

Edit: Looks like I need to spoof a V100 when presenting to the VM and not an A5500. I changed it to:
01:00.0 VGA compatible controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
0x10de/0x10de
0x1db6/0x12bf

And I was able to install the v17 guest driver
 
I believe you don't need to change the hw id in the profile_override file; just set the unlock option in the config file to true, as if you have an unsupported card that needs unlocking. It will then appear as an A5500 instead of a P40 to the VM, and when you spoof via an override for a specific VM you lose certain features of the card.
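Something like this in the vgpu_unlock-rs config, if I remember the file right (path and key name are from the standard vgpu_unlock-rs setup, so treat them as an assumption):

Code:
# /etc/vgpu_unlock/config.toml
unlock = true   # let the unlock present the P4 as an A5500 instead of overriding hw ids per VM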
 
Indeed, that is what I was missing.

I then restarted the nvidia services but couldn't get my VM to start. I just restarted the host and when I went to start the VM it complained about my mdev selection. I didn't realize it now presents the A5500 profiles to MDEV (if I read more on the unlocking stuff I probably would have known this) but changing it to a 4Q profile allowed the VM to boot up and I could install the v17.5 client drivers in the VM, all without having to override the device id.
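For anyone else who hits the mdev-selection complaint, the available profiles and their names can be listed from sysfs, something like this (the PCI address is just an example):

Code:
# print each mdev type with its human-readable name, e.g. an "...-4Q" profile
for t in /sys/bus/pci/devices/0000:17:00.0/mdev_supported_types/*; do
  echo "$(basename "$t"): $(cat "$t/name") - $(cat "$t/available_instances") available"
done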

For my use case (just needing CUDA in the VM), I'm not sure which way is "better", but I'm glad to get it back running.

Thanks for your help.
 
I think the hw id spoofing via override breaks CUDA. It works differently from the unlock itself doing the spoofing.

You're welcome, happy to help. Besides, this just adds to the usefulness of this thread for anyone in the future.
 
It seemed like it was working with hw spoofing. I was using CodeProject.AI with CUDA 12 for image processing and the times matched what they were back when I was using the older v16 drivers. Either way, still good to figure out how to bypass everything.
 
Huh, interesting; when I tried it with older versions CUDA always broke. Glad to hear it's all working well for you.

That is mostly what I use my P4 for too, haha. Always looking for new AI tools, but I mostly just use Ollama, EasyDiffusion/ComfyUI and random things like STT/TTS, etc. Other than that, just gaming and having 3D acceleration so VMs are nicer to use when I am on Windows via RDP/Moonlight.

I have been thinking about whether it might be possible to combine the datacenter drivers with the KVM drivers, similar to how one would merge the KVM and GRID drivers and get both vGPU and LXC working. It would be nice if we could get more of these tools working in a way that shares with the host and only uses resources when needed; many of these things would work quite well in an LXC and do not need to be held in memory all the time when not in use.

In my experience the KVM+GRID merge doesn't allow LXC access to function, so I am hoping datacenter+KVM might do it.
 
I never tried merging or dealing with LXC; it felt too fragile. I like the ability to live migrate things. Now that we can do that with these mdevs (I tried it and it is working), I might try passing one to my VM that runs Emby and see if I can get transcoding working.
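For reference, the migration itself was just the normal online migration (the VMID and node name below are placeholders):

Code:
qm migrate 100 pve2 --online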
 
So far all my trials with merging and LXC have mostly failed. I managed to get it kind of working once, but then it caused problems, and after a driver update I couldn't get anything working on the newer drivers. LXCs are a real pain with GPU access, especially since the KVM driver lacks CUDA, etc., so by default it will never work.

Transcoding definitely works great in VMs; I use it in Windows VMs for video encoding all the time. I used it for streaming and screen recording in one VM, switching between just a 512 MB and a 1 GB profile based on what was needed. It should work great with Emby.

I mainly want it in an LXC for Jellyfin; I just haven't gotten around to making it a VM and going through all the hassle of setting up SSHFS or something. With LXCs, file access is so much easier and performs better. But I think it would be a lot better for AI stuff too, since some tools run better on different OSes, so it's a pain to switch VMs around based on what I want to use at the moment, since I only have the one P4 to assign and 8GB isn't a lot for AI.
 
I originally had a bunch of LXCs a while back but went with a single VM and now run Docker inside the VM. Maintaining the apps by pulling new images via Docker was a lot easier than whatever installation file/script was needed for each individual app.

What's nice is that in my Docker Compose I have volumes backed by NFS to pass into the containers. So beyond installing Docker in the VM, I don't really need to do much of anything else.

It took a while to get past the whole docker-in-a-VM, gotta-go-deeper type thing, but I did, and I find it a lot easier to manage.
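A minimal sketch of what I mean by NFS-backed volumes in Compose (the image, server address and export path are made up):

Code:
services:
  emby:
    image: emby/embyserver            # example service
    volumes:
      - media:/media
volumes:
  media:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.1.10,nfsvers=4  # NFS server address (example)
      device: ":/export/media"        # export path (example)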
 
I wonder if it showing up as an A5500 might cause some issues with transcoding. The P4 has two 6th-gen NVENC chips, but the A5500 has one 7th-gen NVENC chip. Will that mean the drivers won't use the second chip, or anything like that?
 
I originally had a bunch of LXCs a while back but went with a single VM and now run Docker inside the VM. Maintaining the apps by pulling new images via Docker was a lot easier than whatever installation file/script was needed for each individual app.

What's nice is that in my Docker Compose I have volumes backed by NFS to pass into the containers. So beyond installing Docker in the VM, I don't really need to do much of anything else.

It took a while to get past the whole docker-in-a-VM, gotta-go-deeper type thing, but I did, and I find it a lot easier to manage.
That is what I have been doing lately: running Docker inside a VM (Ubuntu) and just using that VM for Ollama, Kokoro and other things. I haven't tried NFS yet, I keep meaning to; I got started with SSHFS and it has worked well for all my VMs and devices, even Windows devices with SSHFS-Win and the SSHFS-Win Manager software. It's nice being able to mount the Proxmox drives on my laptop, etc., and so far with the performance of SSH I just haven't bothered to try anything else. But I keep meaning to because of the CPU usage and the slight lag sometimes, which makes it not great for lots of small files. Still, I have managed to host games on my server and play them from my laptop and VMs, so it works well enough.
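The SSHFS side is nothing fancy, roughly something like this (host name and paths are examples):

Code:
# mount a directory from the Proxmox host on a client over SSH; reconnect keeps the mount alive
sshfs root@prox:/mnt/storage /mnt/prox-storage -o reconnect,ServerAliveInterval=15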

I wonder if it showing up as an A5500 might cause some issues with transcoding. The P4 has two 6th-gen NVENC chips, but the A5500 has one 7th-gen NVENC chip. Will that mean the drivers won't use the second chip, or anything like that?

That is a great question. I am currently using it in a Windows VM and it shows Video Decode and Video Decode 1 in Task Manager, so I think it is still using both? I don't think spoofing the ID changes the HW features reported to the OS; I could be wrong, but I'm pretty sure it just looks like an A5500 with P4 features.

But I also cannot find software that actually detects the number of NVENC chips to tell for sure. GPU-Z, HWiNFO, etc. don't say.
 