start vm times out, but qm start fails to run

jwsl224

i'm trying to start a vm with a pcie passthrough device, specifically an nvidia vGPU device. starting the vm in the gui times out, and when i try
Code:
qm start vmid --timeout 300
i get
Code:
kvm: -device vfio-pci,sysfsdev=sys/bus/pci/devices/0000:16:01.3: vfio sys/bus/pci/devices/0000:16:01.3: no such host device: No such file or directory
start failed: QEMU exited with code 1
this seems like a pve issue. troubleshooting steps i tried:
shut down an already running vGPU vm and start it again with qm start: i get the same error. start the same previously running vGPU vm via the gui: the start works.
there are already 8 vms with the 2B profile running on this rtx a5000; the nvidia docs say you can run 12. the host itself has about 100GB of ram left to start this vm, but somehow still has a problem allocating the 16GB of memory for it. i'm honestly by now not sure if this is an nvidia driver issue or a pve issue.
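for reference, the active instance count can be checked on the host with the vgpu host driver's nvidia-smi:
Code:
# lists the active vGPU instances on each physical GPU
nvidia-smi vgpu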
 
Hi,

can you post your vm config and output of `lspci` (and mapping config, if any) ?

kvm: -device vfio-pci,sysfsdev=sys/bus/pci/devices/0000:16:01.3: vfio sys/bus/pci/devices/0000:16:01.3: no such host device: No such file or directory
this would mean that the pci device under that path does not exist
 
can you post your vm config and output of `lspci` (and mapping config, if any) ?
sure no problem:
Code:
root@pveg1:~# lspci
...
16:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:00.4 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:00.5 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:00.6 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:00.7 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.1 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.2 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.3 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.4 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.5 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.6 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.7 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.1 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.2 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.3 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.4 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.5 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.6 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.7 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:03.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:03.1 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:03.2 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:03.3 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
...

this would mean that the pci device under that path does not exist
that's the funny thing. if the pci device really did not exist, then starting the vm via the gui should likewise fail. but it doesn't.

as is the procedure with vgpu, here is the part of the vm config that maps the virtual function into the vm:
Code:
agent: 1
args: -device vfio-pci,sysfsdev=sys/bus/pci/devices/0000\:16\:01.4 -uuid 889f5a88-fe20-41bb-8ee0-1126e464e9eb
 
that's the funny thing. if the pci device indeed does not exist, then starting the vm via gui should likewise fail. but it doesn't.
interesting, can you post the whole vm config and the task start log ?

also is there a reason why you pass the card through with the 'args' parameter instead of using the builtin method?

you should be able to create a pci resource mapping for all virtual functions, and then use that in the vm with the vgpu model you want, it will then auto allocate (and deallocate) on vm start. see https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE
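as a rough cli sketch (the mapping id 'a5000-vgpu', the node name 'pveg1' and the bracketed placeholders are just illustrative; creating the mapping is usually easier in the gui under Datacenter -> Resource Mappings):
Code:
# create a cluster-wide PCI resource mapping (one --map entry per VF,
# vendor/device id as reported by 'lspci -n')
pvesh create /cluster/mapping/pci --id a5000-vgpu \
    --map node=pveg1,path=0000:16:01.4,id=<vendor>:<device>

# attach the mapping to the vm and pick the vgpu model via 'mdev'
qm set <vmid> --hostpci0 mapping=a5000-vgpu,mdev=<vgpu-type>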


EDIT:

also, i might know why it does not work: you have to use the full (absolute) path.

you used
Code:
vfio-pci,sysfsdev=sys/bus/pci/devices/0000\:16\:01.4

but the correct form would probably be
Code:
vfio-pci,sysfsdev=/sys/bus/pci/devices/0000\:16\:01.4

(note the slash at the start)
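a quick way to check which of the two actually exists:
Code:
# absolute path: resolves from any working directory
ls -l /sys/bus/pci/devices/0000:16:01.4

# relative path: only resolves if the current working directory happens to be /
ls -l sys/bus/pci/devices/0000:16:01.4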

nonetheless it would probably make more sense to use the built in features instead of using args
 
also, i might know why it does not work: you have to use the full (absolute) path.
you're telling me i blasted a day away because of a typo? :rolleyes::rolleyes: oh well. what else is new :p
so yes, that cured the error. why it only showed up the other day, when there was already a whole bunch of vms running that way, i'm not sure. i guess perhaps because i started those with the gui, which for some reason works.

also is there a reason why you pass the card through with the 'args' parameter instead of using the builtin method?
simply because i am a noob at this and was following the nvidia docs to the letter. this server has been running vgpu for many months now, since before official support from you guys. maybe the built-in method hadn't been published back then? anyway, this is what i was following in the nvidia docs:
[attached screenshot of the relevant passage from the NVIDIA vGPU documentation]

interesting, can you post the whole vm config and the task start log ?
this is the only thing left. should i be concerned about the warning?
Code:
swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:16:01.4: warning: vfio 0000:16:01.4: Could not enable error recovery for the device
TASK OK

you should be able to create a pci resource mapping for all virtual functions, and then use that in the vm with the vgpu model you want, it will then auto allocate (and deallocate) on vm start. see https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE
i cannot find instructions on how to specify the vgpu type here. will that still have to be done by editing the "creatable_vgpu_types" file at the virtual function's sysfs path, as per the nvidia docs?
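for reference, the manual procedure from the nvidia docs that i mean is roughly this (using the VF from above; '<type-id>' is a placeholder for one of the listed ids):
Code:
# list the vGPU types that can currently be created on this virtual function
cat /sys/bus/pci/devices/0000:16:01.4/nvidia/creatable_vgpu_types

# select one of them by writing its type id
echo <type-id> > /sys/bus/pci/devices/0000:16:01.4/nvidia/current_vgpu_type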

nonetheless it would probably make more sense to use the built in features instead of using args
i will look into this now that it's officially rolled out; thanks for dropping by
 
so yes, that cured the error. why it only showed up the other day, when there was already a whole bunch of vms running that way, i'm not sure. i guess perhaps because i started those with the gui, which for some reason works.
mhmm, i doubt it would simply work over the gui, since that does exactly the same thing as starting the guest via the cli, but who knows what exactly happened there...

simply because i am a noob at this and was following the nvidia docs to the letter. this server has been running vgpu for many months now, since before official support from you guys. maybe the built-in method hadn't been published back then? anyway, this is what i was following in the nvidia docs:
well, it's absolutely no bad thing to follow the documentation; in this case there was simply more documentation you didn't know about ;)
the feature itself had already been there for quite some time, but it was not officially supported by nvidia until this year.

this is the only thing left. should i be concerned about the warning?
no, that warning is harmless and quite common; it has to do with PCI AER (here is a bit more detail: https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE#Troubleshooting)
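if you're curious, you can check whether the function exposes the AER capability at all with plain lspci (run as root; no output simply means the capability is absent):
Code:
lspci -vv -s 16:01.4 | grep -i 'Advanced Error'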

i cannot find instructions on how to specify the vgpu type here. will that still have to be done by editing the "creatable_vgpu_types" file at the virtual function's sysfs path, as per the nvidia docs?
the gpu type itself is specified when adding the mapping to the vm (see this screenshot: https://pve.proxmox.com/wiki/File:PVE_select_a_vgpu_with_mapping.png). the only quirk currently is that when editing it, you can only see the currently creatable models (we're working on changing that, but it's not done yet)


i will look into this now that it's officially rolled out; thanks for dropping by
no problem, just ask if there are further questions/issues
 
but who knows what exactly happened there...
lol. sometimes we are running on nothing but luck

the gpu type itself is specified when adding the mapping to the vm (see this screenshot: https://pve.proxmox.com/wiki/File:PVE_select_a_vgpu_with_mapping.png). the only quirk currently is that when editing it, you can only see the currently creatable models (we're working on changing that, but it's not done yet)
thanks! i'll be all gui'ed up here shortly
 
lol. sometimes we are running on nothing but luck
actually, after thinking about it a few seconds more, i have a theory:

it may be the case that the api daemons are started with '/' as their current working directory, so every command they execute also has that as its cwd. in that case, the relative path 'sys/bus/...' can still be resolved
if you want to try it, it should also work on the cli if you do a 'cd /' before calling 'qm'
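e.g. a quick demonstration of how the cwd decides whether the relative path resolves:
Code:
cd /
ls sys/bus/pci/devices/0000:16:01.4   # works, cwd is /
cd /root
ls sys/bus/pci/devices/0000:16:01.4   # fails: No such file or directory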