start vm times out, but qm start fails to run

jwsl224

i'm trying to start a vm with a pcie passthrough device, specifically an nvidia vGPU device. starting the vm in the gui times out, and when i try
Code:
qm start vmid --timeout 300
i get
Code:
kvm: -device vfio-pci,sysfsdev=sys/bus/pci/devices/0000:16:01.3: vfio sys/bus/pci/devices/0000:16:01.3: no such host device: No such file or directory
start failed: QEMU exited with code 1
this seems like a pve issue. troubleshooting steps i tried:
shut down an already running vGPU vm and start it again with qm start: i get the same error. start the same previously running vGPU vm via the gui: the start works.
there are already 8 vms with the 2B profile running on this rtx a5000; the nvidia docs say you can run 12. the host itself has about 100GB of ram left to start this vm, but somehow still has a problem allocating the 16GB of memory for it. i'm honestly by now not sure if this is an nvidia driver issue or a pve issue.
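for reference, the active instance count can be checked on the host with the vgpu host driver's nvidia-smi:
Code:
# lists the active vGPU instances on each physical GPU
nvidia-smi vgpu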
 
Hi,

can you post your vm config and output of `lspci` (and mapping config, if any) ?

kvm: -device vfio-pci,sysfsdev=sys/bus/pci/devices/0000:16:01.3: vfio sys/bus/pci/devices/0000:16:01.3: no such host device: No such file or directory
this would mean that the pci device under that path does not exist
 
can you post your vm config and output of `lspci` (and mapping config, if any) ?
sure no problem:
Code:
root@pveg1:~# lspci
...
16:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:00.4 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:00.5 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:00.6 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:00.7 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.1 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.2 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.3 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.4 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.5 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.6 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:01.7 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.1 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.2 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.3 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.4 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.5 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.6 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:02.7 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:03.0 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:03.1 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:03.2 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
16:03.3 3D controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1)
...

this would mean that the pci device under that path does not exist
that's the funny thing. if the pci device really did not exist, then starting the vm via the gui should likewise fail. but it doesn't.

as is the procedure with vgpu, here is the part of the vm config that maps the virtual function into the vm:
Code:
agent: 1
args: -device vfio-pci,sysfsdev=sys/bus/pci/devices/0000\:16\:01.4 -uuid 889f5a88-fe20-41bb-8ee0-1126e464e9eb
 
that's the funny thing. if the pci device indeed does not exist, then starting the vm via gui should likewise fail. but it doesn't.
interesting, can you post the whole vm config and the task start log ?

also is there a reason why you pass the card through with the 'args' parameter instead of using the builtin method?

you should be able to create a pci resource mapping for all virtual functions, and then use that in the vm with the vgpu model you want, it will then auto allocate (and deallocate) on vm start. see https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE
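as a rough cli sketch (the mapping id 'a5000-vgpu', the node name 'pveg1' and the bracketed placeholders are just illustrative; creating the mapping is usually easier in the gui under Datacenter -> Resource Mappings):
Code:
# create a cluster-wide PCI resource mapping (one --map entry per VF,
# vendor/device id as reported by 'lspci -n')
pvesh create /cluster/mapping/pci --id a5000-vgpu \
    --map node=pveg1,path=0000:16:01.4,id=<vendor>:<device>

# attach the mapping to the vm and pick the vgpu model via 'mdev'
qm set <vmid> --hostpci0 mapping=a5000-vgpu,mdev=<vgpu-type>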


EDIT:

also, i might know why it does not work: you have to use the full (absolute) path.

you used
Code:
vfio-pci,sysfsdev=sys/bus/pci/devices/0000\:16\:01.4

but the correct form would probably be
Code:
vfio-pci,sysfsdev=/sys/bus/pci/devices/0000\:16\:01.4

(note the slash at the start)
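a quick way to check which of the two actually exists:
Code:
# absolute path: resolves from any working directory
ls -l /sys/bus/pci/devices/0000:16:01.4

# relative path: only resolves if the current working directory happens to be /
ls -l sys/bus/pci/devices/0000:16:01.4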

nonetheless it would probably make more sense to use the built in features instead of using args
 
also, i might know why it does not work: you have to use the full (absolute) path.
you're telling me i blasted a day away because of a typo? :rolleyes::rolleyes: oh well. what else is new :p
so yes, that cured the error. why it only showed up the other day, when there was already a whole bunch of vms running that way, i'm not sure. i guess perhaps because i started those with the gui, which for some reason works.

also is there a reason why you pass the card through with the 'args' parameter instead of using the builtin method?
simply because i am a noob at this and was following the nvidia docs to the letter. this server has been running vgpu for many months now, since before official support from you guys. maybe the built-in method hadn't been published back then? anyway, this is what i was following in the nvidia docs:
[attached screenshot of the relevant passage from the NVIDIA vGPU documentation]

interesting, can you post the whole vm config and the task start log ?
this is the only thing left. should i be concerned about the warning?
Code:
swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:16:01.4: warning: vfio 0000:16:01.4: Could not enable error recovery for the device
TASK OK

you should be able to create a pci resource mapping for all virtual functions, and then use that in the vm with the vgpu model you want, it will then auto allocate (and deallocate) on vm start. see https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE
i cannot find instructions on how to specify the vgpu type here. will that still have to be done by editing the "creatable_vgpu_types" file at the virtual function's sysfs path, as per the nvidia docs?
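for reference, the manual procedure from the nvidia docs that i mean is roughly this (using the VF from above; '<type-id>' is a placeholder for one of the listed ids):
Code:
# list the vGPU types that can currently be created on this virtual function
cat /sys/bus/pci/devices/0000:16:01.4/nvidia/creatable_vgpu_types

# select one of them by writing its type id
echo <type-id> > /sys/bus/pci/devices/0000:16:01.4/nvidia/current_vgpu_type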

nonetheless it would probably make more sense to use the built in features instead of using args
i will look into this now that it's officially rolled out; thanks for dropping by
 
so yes, that cured the error. why it only showed up the other day, when there was already a whole bunch of vms running that way, i'm not sure. i guess perhaps because i started those with the gui, which for some reason works.
mhmm, i doubt it would simply work over the gui, since that does exactly the same thing as starting the guest via the cli, but who knows what exactly happened there...

simply because i am a noob at this and was following the nvidia docs to the letter. this server has been running vgpu for many months now, since before official support from you guys. maybe the built-in method hadn't been published back then? anyway, this is what i was following in the nvidia docs:
well, it's absolutely no bad thing to follow the documentation; in this case there was simply more documentation you didn't know about ;)
the feature itself had already been there for quite some time, but it was not officially supported by nvidia until this year.

this is the only thing left. should i be concerned about the warning?
no, that warning is harmless and quite common; it has to do with PCI AER (here is a bit more detail: https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE#Troubleshooting)
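if you're curious, you can check whether the function exposes the AER capability at all with plain lspci (run as root; no output simply means the capability is absent):
Code:
lspci -vv -s 16:01.4 | grep -i 'Advanced Error'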

i cannot find instructions on how to specify the vgpu type here. will that still have to be done by editing the "creatable_vgpu_types" file at the virtual function's sysfs path, as per the nvidia docs?
the gpu type itself is specified when adding the mapping to the vm (see this screenshot: https://pve.proxmox.com/wiki/File:PVE_select_a_vgpu_with_mapping.png). the only quirk currently is that when editing it, you can only see the currently creatable models (we're working on changing that, but it's not done yet)


i will look into this now that it's officially rolled out; thanks for dropping by
no problem, just ask if there are further questions/issues
 
but who knows what exactly happened there...
lol. sometimes we are running on nothing but luck

the gpu type itself is specified when adding the mapping to the vm (see this screenshot: https://pve.proxmox.com/wiki/File:PVE_select_a_vgpu_with_mapping.png). the only quirk currently is that when editing it, you can only see the currently creatable models (we're working on changing that, but it's not done yet)
thanks! i'll be all gui'ed up here shortly
 
lol. sometimes we are running on nothing but luck
actually, after thinking about it a few seconds more, i have a theory:

it may be the case that the api daemons are started with '/' as their current working directory, so every command they execute also has that as its cwd. in that case, the relative path 'sys/bus/...' can still be resolved
if you want to try it, it should also work on the cli if you do a 'cd /' before calling 'qm'
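e.g. a quick demonstration of how the cwd decides whether the relative path resolves:
Code:
cd /
ls sys/bus/pci/devices/0000:16:01.4   # works, cwd is /
cd /root
ls sys/bus/pci/devices/0000:16:01.4   # fails: No such file or directory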