vGPU just stopped working randomly for no apparent reason.

zenowl77

I stopped one VM and started another, and now it says it cannot allocate memory on the GPU. I rebooted, reinstalled the drivers, etc., and nothing works. It was working fine before I shut down the last VM (which now also will not start back up).

error message:
Code:
error writing '00000001-0000-0000-0000-000000000128' to '/sys/bus/pci/devices/0000:17:00.0/mdev_supported_types/nvidia-70/create': Cannot allocate memory
could not create 'type' for pci devices '0000:17:00.0'
TASK ERROR: could not create mediated device

system log:
Code:
Mar 29 00:50:01 prox nvidia-vgpu-mgr[3016]: cmd: 0xc03 failed.
Mar 29 00:50:01 prox kernel: [nvidia-vgpu-vfio] Failed to get instances, 0x40
Mar 29 00:50:01 prox kernel: [nvidia-vgpu-vfio] 00000001-0000-0000-0000-000000000128: vGPU creation failed on device 0x1700. -5
Mar 29 00:50:01 prox kernel: [nvidia-vgpu-vfio] 00000001-0000-0000-0000-000000000128: Failed to create mdev device
Mar 29 00:50:01 prox kernel: [nvidia-vgpu-vfio] Failed to allocate vGPU device
Mar 29 00:50:01 prox kernel: nvidia-vgpu-vfio: probe of 00000001-0000-0000-0000-000000000128 failed with error -12
Mar 29 00:50:01 prox nvidia-vgpu-mgr[3016]: error: vmiop_env_log: Failed to get instances for GPU 0x1700, 0x40
Mar 29 00:50:01 prox nvidia-vgpu-mgr[3016]: cmd: 0xc02 failed.
Mar 29 00:50:01 prox nvidia-vgpu-mgr[3016]: error: vmiop_env_log: Failed to create device on GPU 0x1700 0x66
Mar 29 00:50:01 prox pvedaemon[10973]: error writing '00000001-0000-0000-0000-000000000128' to '/sys/bus/pci/devices/0000:17:00.0/mdev_supported_types/nvidia-70/create': Cannot allocate memory
Mar 29 00:50:01 prox pvedaemon[10973]: could not create 'type' for pci devices '0000:17:00.0'
Mar 29 00:50:01 prox pvedaemon[10973]: could not create mediated device
Mar 29 00:50:01 prox pvedaemon[3384]: <root@pam> end task UPID:prox:00002ADD:00009F52:67E743B8:qmstart:128:root@pam: could not create mediated device
Mar 29 00:50:06 prox nvidia-vgpu-mgr[3016]: cmd: 0xc03 failed.
Mar 29 00:50:06 prox kernel: [nvidia-vgpu-vfio] Failed to get instances, 0x40
Mar 29 00:50:06 prox kernel: [nvidia-vgpu-vfio] 00000001-0000-0000-0000-000000000128: vGPU creation failed on device 0x1700. -5
Mar 29 00:50:06 prox kernel: [nvidia-vgpu-vfio] 00000001-0000-0000-0000-000000000128: Failed to create mdev device
Mar 29 00:50:06 prox kernel: [nvidia-vgpu-vfio] Failed to allocate vGPU device
Mar 29 00:50:06 prox kernel: nvidia-vgpu-vfio: probe of 00000001-0000-0000-0000-000000000128 failed with error -12
Mar 29 00:50:06 prox nvidia-vgpu-mgr[3016]: error: vmiop_env_log: Failed to get instances for GPU 0x1700, 0x40
Mar 29 00:50:06 prox nvidia-vgpu-mgr[3016]: cmd: 0xc02 failed.
Mar 29 00:50:06 prox nvidia-vgpu-mgr[3016]: error: vmiop_env_log: Failed to create device on GPU 0x1700 0x66
Mar 29 00:50:06 prox pvedaemon[11053]: error writing '00000001-0000-0000-0000-000000000128' to '/sys/bus/pci/devices/0000:17:00.0/mdev_supported_types/nvidia-70/create': Cannot allocate memory
Mar 29 00:50:06 prox pvedaemon[11053]: could not create 'type' for pci devices '0000:17:00.0'
Mar 29 00:50:06 prox pvedaemon[11053]: could not create mediated device
Mar 29 00:50:06 prox pvedaemon[3385]: <root@pam> end task UPID:prox:00002B2D:0000A175:67E743BE:qmstart:128:root@pam: could not create mediated device
Mar 29 00:50:09 prox nvidia-vgpu-mgr[3016]: cmd: 0xc03 failed.
Mar 29 00:50:09 prox kernel: [nvidia-vgpu-vfio] Failed to get instances, 0x40
Mar 29 00:50:09 prox kernel: [nvidia-vgpu-vfio] 00000001-0000-0000-0000-000000000128: vGPU creation failed on device 0x1700. -5
Mar 29 00:50:09 prox kernel: [nvidia-vgpu-vfio] 00000001-0000-0000-0000-000000000128: Failed to create mdev device
Mar 29 00:50:09 prox kernel: [nvidia-vgpu-vfio] Failed to allocate vGPU device
Mar 29 00:50:09 prox kernel: nvidia-vgpu-vfio: probe of 00000001-0000-0000-0000-000000000128 failed with error -12
Mar 29 00:50:09 prox nvidia-vgpu-mgr[3016]: error: vmiop_env_log: Failed to get instances for GPU 0x1700, 0x40
Mar 29 00:50:09 prox nvidia-vgpu-mgr[3016]: cmd: 0xc02 failed.
Mar 29 00:50:09 prox nvidia-vgpu-mgr[3016]: error: vmiop_env_log: Failed to create device on GPU 0x1700 0x66
Mar 29 00:50:09 prox pvedaemon[11101]: error writing '00000001-0000-0000-0000-000000000128' to '/sys/bus/pci/devices/0000:17:00.0/mdev_supported_types/nvidia-70/create': Cannot allocate memory
Mar 29 00:50:09 prox pvedaemon[11101]: could not create 'type' for pci devices '0000:17:00.0'
Mar 29 00:50:09 prox pvedaemon[11101]: could not create mediated device
Mar 29 00:50:09 prox pvedaemon[3384]: <root@pam> end task UPID:prox:00002B5D:0000A2AC:67E743C1:qmstart:128:root@pam: could not create mediated device
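The `create` file named in the error sits next to an `available_instances` attribute in the kernel's mdev sysfs layout. A minimal sketch for listing that value per vGPU type follows; it may help show whether the host thinks any instances are free. The function name `list_mdev_instances` is made up for illustration, and the default path is simply the PCI address from the error above.

```shell
# Print "<type> <available_instances>" for each mdev type under a PCI device.
# list_mdev_instances is a hypothetical helper name; the default device
# directory is taken from the error message and can be overridden via $1.
list_mdev_instances() {
    dev_dir="${1:-/sys/bus/pci/devices/0000:17:00.0}"
    for type_dir in "$dev_dir"/mdev_supported_types/*; do
        # Skip anything without a readable attribute (including an unmatched glob).
        [ -r "$type_dir/available_instances" ] || continue
        printf '%s %s\n' "$(basename "$type_dir")" "$(cat "$type_dir/available_instances")"
    done
}
```

Calling `list_mdev_instances` with no argument checks `0000:17:00.0`; pass another device directory to check a different card. Whether the value actually drops to 0 in this failure state is not confirmed by the logs above.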
 
I recently tried the PVE test repo, hoping to see some updates/fixes for a few other things. Most recently installed packages:

Code:
2025-03-28 18:52:42 upgrade pve-edk2-firmware-ovmf:all 4.2025.02-1 4.2025.02-2
2025-03-28 18:52:43 upgrade pve-edk2-firmware-legacy:all 4.2025.02-1 4.2025.02-2
2025-03-28 18:52:43 upgrade pve-edk2-firmware:all 4.2025.02-1 4.2025.02-2
2025-03-28 18:52:43 upgrade pve-firmware:all 3.15-1 3.15-2

The pve-edk2-firmware packages are the only recently installed packages that seem like they could break VM booting with vGPU. (I am pretty sure they were updated while the last VM was running.)
 
For anyone else having this problem: it isn't exactly a fix, but I found that running systemctl restart nvidia-vgpu-mgr.service after boot-up, to restart the vGPU manager, restores vGPU functionality.
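A minimal sketch of that workaround as a shell function, which restarts the service and then checks that it is active. The function name `restart_vgpu_mgr` and the `SYSTEMCTL` override variable are assumptions added here so the commands can be dry-run; only the service name comes from the post.

```shell
# Restart the NVIDIA vGPU manager and report whether it came back up.
# restart_vgpu_mgr and SYSTEMCTL are hypothetical names for this sketch;
# SYSTEMCTL defaults to the real systemctl when unset.
restart_vgpu_mgr() {
    sc="${SYSTEMCTL:-systemctl}"
    # Stop here if the restart itself fails.
    "$sc" restart nvidia-vgpu-mgr.service || return 1
    # Exit status 0 only if the unit is active after the restart.
    "$sc" is-active --quiet nvidia-vgpu-mgr.service
}
```

Run as root, for example `restart_vgpu_mgr && echo "vGPU manager is active"`, and then try starting the VM again. This only automates the manual restart and does not explain why the manager is in a bad state after boot.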