After a vGPU is allocated to a VM on PVE with an A16, it runs fine for a while, but the VM no longer starts after a restart.

yiyang5188

New Member
May 19, 2025
I'm running PVE with an A16, driver version 18.0, installed following https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE#cite_note-driver-versions-10. I allocate 2G of video memory to each virtual machine, and at first all virtual machines start normally, but after a few days restarting a virtual machine fails to bring it back up, with the messages: [nvidia-vgpu-vfio] No vGPU device found for VF 0000:8b:01.7 [228095.276939] [nvidia-vgpu-vfio] current_vgpu_type of VF not configured
Please help me figure out what is going on.
 
hi,

can you post the complete task log output from a failed start, and maybe from a restart too?

also the output of e.g.

Code:
lspci
nvidia-smi vgpu 
nvidia-smi vgpu -s
nvidia-smi vgpu -c

would be helpful
 
After it has been running for a few hours, restarting a virtual machine always gets stuck; but if I reboot the server, the virtual machines restart fine afterwards, which is really strange.
 
first, please in the future, post text as text and not as images (it becomes much harder to see, especially since the images are so large ;) )

also can you post your resource mappings + vm configs and the full output of `lspci -nnk` (as text please)

as far as i can see it tries to use the 8a:00 card, but cannot find a free slot there (which makes sense since that card is fully used with 2 8Q profiles), it then seemingly has trouble using the 8b:00 card, but i need more info to see what's going on there
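
(for context: each physical GPU on an A16 board has 16 GiB of framebuffer, so an 8Q profile at 8 GiB fits 16/8 = 2 times per GPU, hence "fully used" with two of them)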

also I'd need the full task log of such a failed start.


EDIT:

i just noticed that you use the v17 of the vgpu drivers, which is not really supported on Proxmox VE, just fyi ;) (the driver 550.144.02 of the nvidia-smi output is for v17, v18 would be 570)
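
(a quick way to double-check which host driver branch is installed, for reference:)

Code:
nvidia-smi | head -n 4   # header shows e.g. "Driver Version: 550.144.02" (v17) vs 570.x (v18)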
 
The same problem occurred with v18, which I used before; yesterday I switched to v17 and the problem still exists. The VM should be using 8b directly. I also got that message when it started normally: 8a's memory was already used up, so that prompt makes sense. But 8b's memory is still available, and I don't know why it can't be used.
 

Attachments

  • e9c16b6bb7dda1dd81c2968f268daccb.png (140.4 KB)
  • 1b58b5c568ae9f72a6bcfaa6f70c9c06.png (241.2 KB)
  • nvidia.txt (76.3 KB)
can you post the full vm config and resource config in text format?
Code:
qm config ID
cat /etc/pve/mapping/pci.cfg

also I need the full output of a failed task log please

maybe also a journal output since boot, otherwise it's hard to diagnose what actually happens
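
If it's easier, a sketch for capturing those as text (the output file names are placeholders):

Code:
qm config <VMID> > vm-config.txt
cat /etc/pve/mapping/pci.cfg > pci-mapping.txt
journalctl -b > journal-since-boot.txt
# task logs of failed starts are kept under /var/log/pve/tasks/ on the node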
 

I found that only four virtual machines with vGPUs can be started; the fifth one always gets stuck. If I shut down one of the running virtual machines, the one that couldn't start can then be started. Why can only four virtual machines with vGPUs be started? The video memory isn't all used up, is it?
 

Attachments

  • 4444.png (39.2 KB)
Can you help me figure out how to solve this problem?
 
please write your posts in english, otherwise not many here will be able to help you.

i'm still waiting for the full task log output of a failed start.
Also the full journal output during such a failed start would be good

 

I am seeing the same issue as described above. I'm guessing this is an IOMMU limitation of the hardware (AMD Ryzen Threadripper 1950X 16-Core Processor / ASUS MB). After four virtual machines with vGPU are powered on, the fifth VM fails to start with the following in the log:

Code:
kvm: -device vfio-pci,host=0000:43:01.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: warning: vfio 0000:43:01.0: Could not enable error recovery for the device
error writing '0' to '/sys/bus/pci/devices/0000:43:01.0/nvidia/current_vgpu_type': Operation not permitted
error writing '0' to '/sys/bus/pci/devices/0000:43:01.0/nvidia/current_vgpu_type': Operation not permitted
error writing '0' to '/sys/bus/pci/devices/0000:43:01.0/nvidia/current_vgpu_type': Operation not permitted
error writing '0' to '/sys/bus/pci/devices/0000:43:01.0/nvidia/current_vgpu_type': Operation not permitted
error writing '0' to '/sys/bus/pci/devices/0000:43:01.0/nvidia/current_vgpu_type': Operation not permitted
error writing '0' to '/sys/bus/pci/devices/0000:43:01.0/nvidia/current_vgpu_type': Operation not permitted
error writing '0' to '/sys/bus/pci/devices/0000:43:01.0/nvidia/current_vgpu_type': Operation not permitted
error writing '0' to '/sys/bus/pci/devices/0000:43:01.0/nvidia/current_vgpu_type': Operation not permitted
error writing '0' to '/sys/bus/pci/devices/0000:43:01.0/nvidia/current_vgpu_type': Operation not permitted
error writing '0' to '/sys/bus/pci/devices/0000:43:01.0/nvidia/current_vgpu_type': Operation not permitted
could not cleanup nvidia vgpu for '0000:43:01.0'
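
(side note: that state can be inspected by hand; a sketch using the VF from the log above, since the sysfs nodes are the ones the errors mention:)

Code:
# 0 means no vGPU type is currently allocated on this VF
cat /sys/bus/pci/devices/0000:43:01.0/nvidia/current_vgpu_type
# the failing cleanup is equivalent to this write; presumably it returns
# 'Operation not permitted' while the driver still considers the vGPU in use
echo 0 > /sys/bus/pci/devices/0000:43:01.0/nvidia/current_vgpu_type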

pci.cfg:
Code:
RTX-A5000
        map id=10de:2231,iommugroup=46,node=cube-001,path=0000:43:00.4,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=47,node=cube-001,path=0000:43:00.5,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=48,node=cube-001,path=0000:43:00.6,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=49,node=cube-001,path=0000:43:00.7,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=50,node=cube-001,path=0000:43:01.0,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=51,node=cube-001,path=0000:43:01.1,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=52,node=cube-001,path=0000:43:01.2,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=53,node=cube-001,path=0000:43:01.3,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=54,node=cube-001,path=0000:43:01.4,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=55,node=cube-001,path=0000:43:01.5,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=56,node=cube-001,path=0000:43:01.6,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=57,node=cube-001,path=0000:43:01.7,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=58,node=cube-001,path=0000:43:02.0,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=59,node=cube-001,path=0000:43:02.1,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=60,node=cube-001,path=0000:43:02.2,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=61,node=cube-001,path=0000:43:02.3,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=62,node=cube-001,path=0000:43:02.4,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=63,node=cube-001,path=0000:43:02.5,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=64,node=cube-001,path=0000:43:02.6,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=65,node=cube-001,path=0000:43:02.7,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=66,node=cube-001,path=0000:43:03.0,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=67,node=cube-001,path=0000:43:03.1,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=68,node=cube-001,path=0000:43:03.2,subsystem-id=103c:0000
        map id=10de:2231,iommugroup=69,node=cube-001,path=0000:43:03.3,subsystem-id=103c:0000
        mdev 1
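
(to cross-check the iommugroup numbers above against the host, a quick sketch:)

Code:
find /sys/kernel/iommu_groups/ -type l | sort -V   # one symlink per device under its group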

Interestingly, the fifth vGPU allocation shows up on the host:

nvidia-smi vgpu -q
Code:
GPU 00000000:43:00.0
    Active vGPUs                          : 5
    vGPU ID                               : 3251634204
        VM UUID                           : f70928dd-e62b-40d6-8749-a401141a10c1
        VM Name                           : xxxxxxxx,debug-threads=on
        vGPU Name                         : NVIDIA RTXA5000-4Q
        vGPU Type                         : 662
        vGPU UUID                         : 8e0b96e2-4bf3-11f0-9d37-29f0e9407c20
        Guest Driver Version              : 573.07
        License Status                    : Licensed (Expiry: 2025-6-19 3:23:44 GMT)
        GPU Instance ID                   : N/A
        Placement ID                      : N/A
        Accounting Mode                   : Disabled
        ECC Mode                          : Enabled
        Accounting Buffer Size            : 4000
        Frame Rate Limit                  : 60 FPS
        PCI
            Bus Id                        : 00000000:01:00.0
        FB Memory Usage
            Total                         : 4096 MiB
            Used                          : 525 MiB
            Free                          : 3571 MiB
        Utilization
            GPU                           : 0 %
            Memory                        : 0 %
            Encoder                       : 0 %
            Decoder                       : 0 %
            Jpeg                          : 0 %
            Ofa                           : 0 %
        Encoder Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0
        FBC Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0
    vGPU ID                               : 3251634264
        VM UUID                           : a53aaa3a-2ae3-468a-ae06-d49b44d52966
        VM Name                           : xxxxxxxx,debug-threads=on
        vGPU Name                         : NVIDIA RTXA5000-4Q
        vGPU Type                         : 662
        vGPU UUID                         : 14301db2-4bf4-11f0-bd8b-29f0e9407c20
        Guest Driver Version              : 570.148.08
        License Status                    : Licensed (Expiry: 2025-6-18 23:27:11 GMT)
        GPU Instance ID                   : N/A
        Placement ID                      : N/A
        Accounting Mode                   : Disabled
        ECC Mode                          : Enabled
        Accounting Buffer Size            : 4000
        Frame Rate Limit                  : 60 FPS
        PCI
            Bus Id                        : 00000000:01:00.0
        FB Memory Usage
            Total                         : 4096 MiB
            Used                          : 384 MiB
            Free                          : 3712 MiB
        Utilization
            GPU                           : 0 %
            Memory                        : 0 %
            Encoder                       : 0 %
            Decoder                       : 0 %
            Jpeg                          : 0 %
            Ofa                           : 0 %
        Encoder Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0
        FBC Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0
    vGPU ID                               : 3251634295
        VM UUID                           : f8251634-1443-4544-a677-7de901bb556b
        VM Name                           : xxxxxxxx,debug-threads=on
        vGPU Name                         : NVIDIA RTXA5000-4Q
        vGPU Type                         : 662
        vGPU UUID                         : 58d2fce0-4bf4-11f0-aa1a-891f29f0e940
        Guest Driver Version              : 570.148.08
        License Status                    : Licensed (Expiry: 2025-6-18 23:29:0 GMT)
        GPU Instance ID                   : N/A
        Placement ID                      : N/A
        Accounting Mode                   : Disabled
        ECC Mode                          : Enabled
        Accounting Buffer Size            : 4000
        Frame Rate Limit                  : 60 FPS
        PCI
            Bus Id                        : 00000000:01:00.0
        FB Memory Usage
            Total                         : 4096 MiB
            Used                          : 384 MiB
            Free                          : 3712 MiB
        Utilization
            GPU                           : 0 %
            Memory                        : 0 %
            Encoder                       : 0 %
            Decoder                       : 0 %
            Jpeg                          : 0 %
            Ofa                           : 0 %
        Encoder Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0
        FBC Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0
    vGPU ID                               : 3251634326
        VM UUID                           : a243c185-2e29-4e40-ab45-5dbc05da0bb5
        VM Name                           : xxxxxxxx,debug-threads=on
        vGPU Name                         : NVIDIA RTXA5000-4Q
        vGPU Type                         : 662
        vGPU UUID                         : 8c26e3db-4bf4-11f0-a7e6-047ec3ef0d89
        Guest Driver Version              : N/A
        License Status                    : N/A (Expiry: N/A)
        GPU Instance ID                   : N/A
        Placement ID                      : N/A
        Accounting Mode                   : N/A
        ECC Mode                          : Disabled
        Accounting Buffer Size            : 4000
        Frame Rate Limit                  : N/A
        PCI
            Bus Id                        : 00000000:00:00.0
        FB Memory Usage
            Total                         : 4096 MiB
            Used                          : 0 MiB
            Free                          : 4096 MiB
        Utilization
            GPU                           : 0 %
            Memory                        : 0 %
            Encoder                       : 0 %
            Decoder                       : 0 %
            Jpeg                          : 0 %
            Ofa                           : 0 %
        Encoder Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0
        FBC Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0
    vGPU ID                               : 3251634336
        VM UUID                           : 27903eea-da22-429b-b6f7-a2654b7cf120
        VM Name                           : xxxxxxxx,debug-threads=on
        vGPU Name                         : NVIDIA RTXA5000-4Q
        vGPU Type                         : 662
        vGPU UUID                         : e403d3cd-4bf4-11f0-a775-c3ef0d891f29
        Guest Driver Version              : N/A
        License Status                    : N/A (Expiry: N/A)
        GPU Instance ID                   : N/A
        Placement ID                      : N/A
        Accounting Mode                   : N/A
        ECC Mode                          : Disabled
        Accounting Buffer Size            : 4000
        Frame Rate Limit                  : N/A
        PCI
            Bus Id                        : 00000000:00:00.0
        FB Memory Usage
            Total                         : 4096 MiB
            Used                          : 0 MiB
            Free                          : 4096 MiB
        Utilization
            GPU                           : 0 %
            Memory                        : 0 %
            Encoder                       : 0 %
            Decoder                       : 0 %
            Jpeg                          : 0 %
            Ofa                           : 0 %
        Encoder Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0
        FBC Stats
            Active Sessions               : 0
            Average FPS                   : 0
            Average Latency               : 0

lspci -nnk attached
 


After more testing, only the first 4 mediated devices work (43:00.*). All the rest (in my case, 43:01.* through 43:03.*) fail with the "operation not permitted" message after first failing to enable error recovery (the error-recovery warning happens on every VM using any of the vGPU resource mappings, and is not an issue). So it is not specifically four VMs being on; it just appears that way since most people select all mediated devices as one resource mapping, and the first 4 are the only ones that work.
 
can you post the output of

Code:
dmesg
nvidia-smi vgpu
cat /sys/bus/pci/devices/0000:43:01.2/nvidia/current_vgpu_type
cat /sys/bus/pci/devices/0000:43:01.2/nvidia/creatable_vgpu_types
?

is there anything sticking out in the journal/syslog on boot with the nvidia cards?

if you don't include the first 4 devices in the mapping and reboot, is it working then?

it seems very strange to me that only the first 4 virtual functions allow creating/cleaning up the devices

also in your nvidia-smi vgpu -q output, it shows 5 devices, but only 3 seem to work (not 4?)
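
(a loop like this, a sketch assuming your 43:xx bus addresses, would dump the state of every VF in one go:)

Code:
for d in /sys/bus/pci/devices/0000:43:0?.?/nvidia; do
    echo "$d: $(cat "$d"/current_vgpu_type 2>/dev/null)"
done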
 
can you post the output of
dmesg: attached

nvidia-smi vgpu
Code:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.06             Driver Version: 570.148.06                |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA RTX A5000           | 00000000:43:00.0             |   0%       |
|      3251634199  NVIDIA RTXA... | f709...  vmxxxxxxxxxxx0,d... |      0%    |
|      3251634258  NVIDIA RTXA... | a53a...  vmxxxxxxxxxxx1,d... |      0%    |
|      3251634289  NVIDIA RTXA... | a243...  vmxxxxxx2,debug-... |      0%    |
|      3251634314  NVIDIA RTXA... | f825...  vmxxxxxx3,debug-... |      0%    |
+---------------------------------+------------------------------+------------+

cat /sys/bus/pci/devices/0000:43:01.2/nvidia/current_vgpu_type
Code:
0

cat /sys/bus/pci/devices/0000:43:01.2/nvidia/creatable_vgpu_types
Code:
ID    : vGPU Name
662   : NVIDIA RTXA5000-4Q
670   : NVIDIA RTXA5000-4A

is there anything sticking out in the journal/syslog on boot with the nvidia cards?
The only thing that seems interesting is that only the 43:00.* devices are "enabled" during boot, and these are the only ones that work:
Code:
[   12.158230] nvidia 0000:43:00.4: enabling device (0000 -> 0002)
[   12.158501] nvidia 0000:43:00.5: enabling device (0000 -> 0002)
[   12.158701] nvidia 0000:43:00.6: enabling device (0000 -> 0002)
[   12.158873] nvidia 0000:43:00.7: enabling device (0000 -> 0002)

if you don't include the first 4 devices in the mapping and reboot, is it working then?
I tried this; the first VM to start hangs, and I get the original "operation not permitted" errors resetting current_vgpu_type to 0. I then inverted the selection to be only those first four, and successfully booted the same VM and 3 others (the attached dmesg is from this boot, so you'll see the failure at 782-914, and the four successes at 1224, 1302, 1428, and 5406).

it seems very strange to me that only the first 4 virtual functions allow creating/cleaning up the devices

also in your nvidia-smi vgpu -q output, it shows 5 devices, but only 3 seem to work (not 4?)
It is bizarre, but I think we're closing in on the solution... I just don't have enough experience with this to see it yet. Also, the reason only 3 seemed to be working in the first post was because one of the VMs didn't have the guest drivers installed yet.

Thank you!
 



Here, when the server has just started, all virtual machines with vGPU can be started normally, but it no longer works if a VM is restarted, or started after a period of time. If other virtual machines with vGPU are turned off so that only three vGPU virtual machines are kept running, the next one can be started. Attached is the system log since the server started; VMID 114 has been unable to start.
 


I think my issue is different from @yiyang5188's, after finding something in the syslog that I had overlooked previously. I noticed it after using /usr/lib/nvidia/sriov-manage to disable and re-enable the virtual functions on the GPU:

Code:
[Sat Jun 21 12:09:13 2025] NVRM: GPU 0000:43:00.0: UnbindLock acquired
[Sat Jun 21 12:09:13 2025] pci-pf-stub 0000:43:00.0: claimed by pci-pf-stub
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.4: [10de:2231] type 00 class 0x030200 PCIe Endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.4: enabling Extended Tags
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.4: Enabling HDA controller
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.4: Adding to iommu group 46
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.5: [10de:2231] type 00 class 0x030200 PCIe Endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.5: enabling Extended Tags
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.5: Enabling HDA controller
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.5: Adding to iommu group 47
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.6: [10de:2231] type 00 class 0x030200 PCIe Endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.6: enabling Extended Tags
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.6: Enabling HDA controller
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.6: Adding to iommu group 48
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.7: [10de:2231] type 00 class 0x030200 PCIe Endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.7: enabling Extended Tags
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.7: Enabling HDA controller
[Sat Jun 21 12:09:14 2025] pci 0000:43:00.7: Adding to iommu group 49
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.0: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.0: Adding to iommu group 50
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.1: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.1: Adding to iommu group 51
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.2: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.2: Adding to iommu group 52
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.3: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.3: Adding to iommu group 53
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.4: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.4: Adding to iommu group 54
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.5: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.5: Adding to iommu group 55
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.6: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.6: Adding to iommu group 56
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.7: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:01.7: Adding to iommu group 57
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.0: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.0: Adding to iommu group 58
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.1: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.1: Adding to iommu group 59
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.2: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.2: Adding to iommu group 60
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.3: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.3: Adding to iommu group 61
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.4: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.4: Adding to iommu group 62
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.5: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.5: Adding to iommu group 63
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.6: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.6: Adding to iommu group 64
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.7: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:02.7: Adding to iommu group 65
[Sat Jun 21 12:09:14 2025] pci 0000:43:03.0: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:03.0: Adding to iommu group 66
[Sat Jun 21 12:09:14 2025] pci 0000:43:03.1: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:03.1: Adding to iommu group 67
[Sat Jun 21 12:09:14 2025] pci 0000:43:03.2: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:03.2: Adding to iommu group 68
[Sat Jun 21 12:09:14 2025] pci 0000:43:03.3: [10de:2231] type 00 class 0x030200 conventional PCI endpoint
[Sat Jun 21 12:09:14 2025] pci 0000:43:03.3: Adding to iommu group 69
[Sat Jun 21 12:09:14 2025] pci-pf-stub 0000:43:00.0: driver left SR-IOV enabled after remove
[Sat Jun 21 12:09:14 2025] NVRM: GPU at 0000:43:00.0 has software scheduler ENABLED with policy BEST_EFFORT.
[Sat Jun 21 12:09:18 2025] nvidia 0000:43:00.4: enabling device (0000 -> 0002)
[Sat Jun 21 12:09:18 2025] nvidia 0000:43:00.5: enabling device (0000 -> 0002)
[Sat Jun 21 12:09:18 2025] nvidia 0000:43:00.6: enabling device (0000 -> 0002)
[Sat Jun 21 12:09:18 2025] nvidia 0000:43:00.7: enabling device (0000 -> 0002)
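
For reference, the calls used here were of this form (sriov-manage ships with the vGPU host driver; -d tears the VFs down, -e re-creates them):

Code:
/usr/lib/nvidia/sriov-manage -d 0000:43:00.0
/usr/lib/nvidia/sriov-manage -e 0000:43:00.0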

Only the first four mdevs are initialized as PCIe Endpoints, and those are the only ones that work. It could be a coincidence; I am still researching, but can only guess this is because of the consumer-grade hardware the rest of the system is built on. If anyone has any input on this, let me know!

Thanks!
 
The "error writing 0" means that your vGPU cannot be cleaned up after it stops, so you then indeed cannot re-enable it, because the VF is still in use. The question is why THAT happens; typically it is because the GPU is still "in use", or indeed your CPU may not be able to deallocate the memory assigned to the VM. I'm presuming you're not using different-sized GPU profiles (they're all identical nvidia-xxx). I see in one of your configs that you have ballooning enabled; make sure it is off (balloon=0) for ALL VMs with vGPU. That's where I would start.
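
A sketch of applying that (hypothetical VMID, repeat for every vGPU VM):

Code:
qm set <VMID> --balloon 0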

I also see 4x A16 (64G cards). Do you have 512G of RAM in your machine? Because you'll need a contiguous block of memory for the PCIe mappings, plus working memory for the VMs themselves.
 