Failed to destroy vGPU device.

Hi Guys,
I believe that after applying the workaround, whenever a VM is destroyed, Proxmox does not release the vGPU properly and therefore does not allow it to be used again. Rebooting the system does bring the vGPUs back, but that is something I would like to avoid.

[6045009.681314] nvidia-vgpu-vfio 00000000-0000-0000-0000-000000000175: Removing from iommu group 349
[6045009.681673] nvidia-vgpu-vfio 00000000-0000-0000-0000-000000000175: MDEV: detaching iommu
[6045009.682153] nvidia 0000:d0:00.5: MDEV: Unregistering
[6045009.682591] nvidia 0000:d0:00.6: MDEV: Unregistering
[6045009.683099] nvidia 0000:d0:00.7: MDEV: Unregistering
[6045009.683469] nvidia 0000:d0:01.0: MDEV: Unregistering
[6045009.683893] nvidia 0000:d0:01.1: MDEV: Unregistering
[6045009.684236] nvidia 0000:d0:01.2: MDEV: Unregistering
[6045009.684583] nvidia 0000:d0:01.3: MDEV: Unregistering
[6045009.684926] nvidia 0000:d0:01.4: MDEV: Unregistering
[6045009.685359] nvidia 0000:d0:01.5: MDEV: Unregistering
[6045009.685799] nvidia 0000:d0:01.6: MDEV: Unregistering
[6045009.686284] nvidia 0000:d0:01.7: MDEV: Unregistering
[6045009.686720] nvidia 0000:d0:02.0: MDEV: Unregistering
[6045009.687070] nvidia 0000:d0:02.1: MDEV: Unregistering
[6045009.687404] nvidia 0000:d0:02.2: MDEV: Unregistering
[6045009.687773] nvidia 0000:d0:02.3: MDEV: Unregistering
[6045009.782248] nvidia 0000:d0:00.0: driver left SR-IOV enabled after remove
[6045009.784155] vfio-pci 0000:d0:00.0: Cannot bind to PF with SR-IOV enabled
[6045009.784433] vfio-pci: probe of 0000:d0:00.0 failed with error -16
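In case it's useful, this is roughly what I check on the host after a failed destroy. The PCI address is the PF from the dmesg above, and sriov-manage is the helper installed by NVIDIA's vGPU host driver (adjust the path/address if yours differ):

# any mediated devices still registered? (mdevctl may not be installed everywhere)
mdevctl list
ls /sys/bus/mdev/devices/

# SR-IOV state and driver binding of the PF that vfio-pci refused to bind to
cat /sys/bus/pci/devices/0000:d0:00.0/sriov_numvfs
ls -l /sys/bus/pci/devices/0000:d0:00.0/driver

# try disabling and re-enabling the VFs without a reboot
/usr/lib/nvidia/sriov-manage -d 0000:d0:00.0
/usr/lib/nvidia/sriov-manage -e 0000:d0:00.0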

Right now, I can only see "half" my GPU on the host:

Wed Apr 26 14:10:02 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.07    Driver Version: 525.85.07    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A16           On  | 00000000:CE:00.0 Off |                    0 |
|  0%   45C    P8    17W /  62W |  10944MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A16           On  | 00000000:CF:00.0 Off |                    0 |
|  0%   54C    P0    32W /  62W |   7296MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3463262    C+G   vgpu                             3648MiB |
|    0   N/A  N/A   4055010    C+G   vgpu                             7296MiB |
|    1   N/A  N/A   2820335    C+G   vgpu                             3648MiB |
|    1   N/A  N/A   3153685    C+G   vgpu                             3648MiB |
+-----------------------------------------------------------------------------+
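To see whether the missing half is still visible on the PCI bus at all, and which driver (if any) each function is bound to, I run something like this (10de is just NVIDIA's PCI vendor ID):

# all NVIDIA functions with the kernel driver currently bound to each
lspci -d 10de: -k

# what the vGPU manager itself still thinks is running
nvidia-smi vgpu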

Any ideas?

Best Regards
is there anything else in the logs/dmesg?

the lines with 'MDEV: Unregistering' can happen when the nvidia-vgpu-mgr stops or the driver has a problem; I never saw them when just removing an mdev ...
 
Hi Dcsapak,
I ended up rebooting the node and then recovered full access to the GPU. I believe the issue will happen again, so what info should I get from the syslog/dmesg? Both logs were pretty full of information and I'm not sure what could be interesting/useful and what not.
Best Regards
 
when in doubt, post the whole log, I also don't always know what's interesting/relevant before seeing it ;)
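e.g. something like this captures everything from the current boot into files you can attach:

journalctl -b > /tmp/journal-current-boot.txt
dmesg > /tmp/dmesg-current.txt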
 
Hi all,
I recently needed to reboot my Windows 10 VM for updates and have run into issues where I can no longer boot the VM. The initial boot results in a Windows blue screen ("video memory management error"), and on the next reboot the VM hangs on the Proxmox boot splash screen.

I thought I might need to amend lines in my QemuServer.pm script. However, from this thread I noticed that the amendments are already included in my QemuServer.pm (as I am running Proxmox 8.1.4), so I am wondering what else could be the problem/fix.

Included are my logs of the blue screen event and the reboot and hang. Please let me know if you see a simple fix or have any suggestions. Thanks for your time and expertise.

Best Regards,
-Jo

included the syslog inline, as the .txt attachment seems to butcher the readability:

Mar 12 16:04:26 systemd[1]: Started 103.scope.
Mar 12 16:04:27 kernel: tap103i0: entered promiscuous mode
Mar 12 16:04:27 kernel: vmbr0: port 4(fwpr103p0) entered blocking state
Mar 12 16:04:27 kernel: vmbr0: port 4(fwpr103p0) entered disabled state
Mar 12 16:04:27 kernel: fwpr103p0: entered allmulticast mode
Mar 12 16:04:27 kernel: fwpr103p0: entered promiscuous mode
Mar 12 16:04:27 kernel: vmbr0: port 4(fwpr103p0) entered blocking state
Mar 12 16:04:27 kernel: vmbr0: port 4(fwpr103p0) entered forwarding state
Mar 12 16:04:27 kernel: fwbr103i0: port 1(fwln103i0) entered blocking state
Mar 12 16:04:27 kernel: fwbr103i0: port 1(fwln103i0) entered disabled state
Mar 12 16:04:27 kernel: fwln103i0: entered allmulticast mode
Mar 12 16:04:27 kernel: fwln103i0: entered promiscuous mode
Mar 12 16:04:27 kernel: fwbr103i0: port 1(fwln103i0) entered blocking state
Mar 12 16:04:27 kernel: fwbr103i0: port 1(fwln103i0) entered forwarding state
Mar 12 16:04:27 kernel: fwbr103i0: port 2(tap103i0) entered blocking state
Mar 12 16:04:27 kernel: fwbr103i0: port 2(tap103i0) entered disabled state
Mar 12 16:04:27 kernel: tap103i0: entered allmulticast mode
Mar 12 16:04:27 kernel: fwbr103i0: port 2(tap103i0) entered blocking state
Mar 12 16:04:27 kernel: fwbr103i0: port 2(tap103i0) entered forwarding state
Mar 12 16:04:27 nvidia-vgpu-mgr[1220]: Nv0000CtrlVgpuGetStartDataParams {
    mdev_uuid: {00000000-0000-0000-0000-000000000103},
    config_params: "vgpu_type_id=65",
    qemu_pid: 10156,
    gpu_pci_id: 0x300,
    vgpu_id: 1,
    gpu_pci_bdf: 768,
}
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 00000000-0000-0000-0000-000000000103 GPU PCI id 00:03:00.0 config params vgpu_type_id=65
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=65
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_env_log: Successfully updated env symbols!
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: NvA081CtrlVgpuConfigGetVgpuTypeInfoParams {
    vgpu_type: 65,
    vgpu_type_info: NvA081CtrlVgpuInfo {
        vgpu_type: 65,
        vgpu_name: "GRID P4-4Q",
        vgpu_class: "Quadro",
        vgpu_signature: [],
        license: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
        max_instance: 2,
        num_heads: 4,
        max_resolution_x: 7680,
        max_resolution_y: 4320,
        max_pixels: 58982400,
        frl_config: 60,
        cuda_enabled: 1,
        ecc_supported: 1,
        gpu_instance_size: 0,
        multi_vgpu_supported: 0,
        vdev_id: 0x1bb31206,
        pdev_id: 0x1bb3,
        profile_size: 0x100000000,
        fb_length: 0xec000000,
        gsp_heap_size: 0x0,
        fb_reservation: 0x14000000,
        mappable_video_size: 0x400000,
        encoder_capacity: 0x64,
        bar1_length: 0x100,
        frl_enable: 1,
        adapter_name: "GRID P4-4Q",
        adapter_name_unicode: "GRID P4-4Q",
        short_gpu_name_string: "GP104GL-A",
        licensed_product_name: "NVIDIA RTX Virtual Workstation",
        vgpu_extra_params: "",
        ftrace_enable: 0,
        gpu_direct_supported: 0,
        nvlink_p2p_supported: 0,
        multi_vgpu_exclusive: 0,
        exclusive_type: 0,
        exclusive_size: 1,
        gpu_instance_profile_id: 4294967295,
    },
}
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: Applying profile nvidia-65 overrides
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: Patching nvidia-65/num_heads: 4 -> 1
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: Patching nvidia-65/max_resolution_x: 7680 -> 1920
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: Patching nvidia-65/max_resolution_y: 4320 -> 1080
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: Patching nvidia-65/max_pixels: 58982400 -> 2073600
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: Patching nvidia-65/cuda_enabled: 1 -> 1
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: Patching nvidia-65/fb_length: 3959422976 -> 3959422976
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: Patching nvidia-65/fb_reservation: 335544320 -> 335544320
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: Patching nvidia-65/frl_enable: 1 -> 1
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: cmd: 0xa0810115 failed.
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: (0x0): gpu-pci-id : 0x300
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: (0x0): vgpu_type : Quadro
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: (0x0): Framebuffer: 0xec000000
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1bb3:0x1206
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: ######## vGPU Manager Information: ########
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: Driver Version: 535.129.03
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: (0x0): Detected ECC enabled on physical GPU.
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: (0x0): Guest usable FB size is reduced due to ECC.
Mar 12 16:04:27 kernel: NVRM: Software scheduler timeslice set to 2083uS.
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x120001)
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: (0x0): vGPU migration enabled
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: (0x0): vGPU manager is running in non-SRIOV mode.
Mar 12 16:04:27 nvidia-vgpu-mgr[10268]: notice: vmiop_log: display_init inst: 0 successful
Mar 12 16:04:27 kernel: [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000103: vGPU migration enabled with upstream V2 migration protocol
Mar 12 16:04:27 pvedaemon[1571]: <root@pam> end task UPID:murphy:00002786:0003B495:65F0C353:qmstart:103:root@pam: OK
Mar 12 16:04:27 pvedaemon[1570]: <root@pam> starting task UPID:murphy:00002822:0003B80E:65F0C35B:vncproxy:103:root@pam:
Mar 12 16:04:27 pvedaemon[10274]: starting vnc proxy UPID:murphy:00002822:0003B80E:65F0C35B:vncproxy:103:root@pam:
Mar 12 16:04:39 nvidia-vgpu-mgr[10268]: error: vmiop_log: (0x0): RPC RINGs are not valid
Mar 12 16:04:46 pvedaemon[1570]: <root@pam> end task UPID:murphy:00002822:0003B80E:65F0C35B:vncproxy:103:root@pam: OK
Mar 12 16:06:03 nvidia-vgpu-mgr[10268]: notice: vmiop_log: (0x0): Detected ECC enabled by guest.
Mar 12 16:06:03 nvidia-vgpu-mgr[10268]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
Mar 12 16:06:03 nvidia-vgpu-mgr[10268]: notice: vmiop_log: Driver Version: 537.70
Mar 12 16:06:03 nvidia-vgpu-mgr[10268]: notice: vmiop_log: vGPU version: 0x120001
Mar 12 16:06:03 nvidia-vgpu-mgr[10268]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
Mar 12 16:06:33 pvedaemon[10795]: starting vnc proxy UPID:murphy:00002A2B:0003E905:65F0C3D9:vncproxy:103:root@pam:
Mar 12 16:06:33 pvedaemon[1571]: <root@pam> starting task UPID:murphy:00002A2B:0003E905:65F0C3D9:vncproxy:103:root@pam:
Mar 12 16:06:37 pvedaemon[1571]: <root@pam> end task UPID:murphy:00002A2B:0003E905:65F0C3D9:vncproxy:103:root@pam: OK
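If it helps, I can also post the VM config and the package versions; I would pull them like this (VM ID 103 as in the log above):

# on the proxmox host
qm config 103
pveversion -v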
 
do you have the correct driver installed in the guest? did you try with a different vm? from the log there is nothing obviously wrong (to me at least), and a bluescreen etc. seems like a guest issue (e.g. corrupted driver etc.) rather than a host one...
 
Thanks for your response! I think I have the correct guest driver installed. However, I did notice a difference in version number between the host and guest drivers. For example, when I download the drivers from NVIDIA's website using a trial, I get the following set of guest and host drivers. Note in this photo that win10/win11/win-server have a 537.XX version, whereas the host and the rest of the guest operating systems have 535.XX. Perhaps this is the issue? I could jump to 16.4 or newer GRID drivers.

[Attached screenshot: Screenshot 2024-03-18 at 7.48.11 PM.png]
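For reference, this is how I double-check which versions are actually installed on each side; these are standard nvidia-smi queries, and nvidia-smi.exe ships with the guest driver on Windows:

# on the proxmox host (vGPU manager / host driver)
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# inside the windows guest, from an admin command prompt
nvidia-smi.exe -q | findstr /C:"Driver Version"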

For now, as a band-aid solution, I remove the PCIe device and start the VM using just the CPU. Then I DDU my GPU drivers and start over, adding the PCIe device back and reinstalling the guest driver, the idea being that I never shut the VM down. It works but is not ideal.
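In qm terms the band-aid looks roughly like this; the VM ID, PCI address and vGPU type are the ones from my syslog above, and hostpci0 is just the slot I happen to use:

# detach the vGPU so the VM boots on emulated graphics only
qm set 103 --delete hostpci0
qm start 103

# ... DDU the guest driver inside Windows ...

# re-attach the mediated device, then reinstall the guest driver
qm set 103 --hostpci0 0000:03:00.0,mdev=nvidia-65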

Thanks again for your input,
-Jo
 
I could jump to 16.4 or newer grid drivers.
using the latest version of a driver branch is a good idea, there are always some bugfixes in them (no idea if your issue is fixed, though)

what gpu do you have?
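e.g. the output of one of these is enough:

nvidia-smi -L
lspci -nn -d 10de: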
 
