Help Understanding Error Message

NOiSEA

Can someone please help me understand this error message? What should I be looking for in this to troubleshoot my issue? The VM worked fine until I shut down the host. The next time I booted up the host I got this message when I started the VM, and now I get it every time.
TASK ERROR: start failed: command '/usr/bin/kvm -id 181 -name 'Plex,debug-threads=on' -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/181.qmp,server=on,wait=off' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/181.pid -daemonize -smbios 'type=1,uuid=5797d198-77e9-4973-b56b-6b94c553e812' -drive 'if=pflash,unit=0,format=raw,readonly=on,file=/usr/share/pve-edk2-firmware//OVMF_CODE_4M.secboot.fd' -drive 'if=pflash,unit=1,id=drive-efidisk0,format=qcow2,file=/mnt/pve/VMadmins/images/181/vm-181-disk-0.qcow2' -smp '4,sockets=1,cores=4,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc 'unix:/var/run/qemu-server/181.vnc,password=on' -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 8192 -S -object 'iothread,id=iothread-virtioscsi0' -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=32d1cff5-e0d2-47c4-bc57-1bb045a1e69d' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:00:02.0,id=hostpci0,bus=pci.0,addr=0x10' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -chardev 'socket,path=/var/run/qemu-server/181.qga,server=on,wait=off,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:f28623ce676e' -drive 'if=none,id=drive-ide2,media=cdrom,aio=io_uring' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=101' -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' -drive 'file=/mnt/pve/VMadmins/images/181/vm-181-disk-1.qcow2,if=none,id=drive-scsi0,format=qcow2,cache=none,aio=io_uring,detect-zeroes=on' -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,rotation_rate=1,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap181i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=8A:CE:5D:EA:A0:5E,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=1024,bootindex=102' -machine 'type=q35+pve0'' failed: got timeout
Additionally, once I get this message, I can no longer use the host until I press the physical power button on the machine to shut it down.
 
It's a long message, but it does not contain any information about the actual problem (only the VM configuration in an unreadable way). If there are no error messages in the system logs and all you got is the timeout, then it's most likely that the Proxmox host does not have enough (contiguous) memory available for the VM.
PCI(e) passthrough requires all VM memory to be pinned into actual host memory, so it needs 8GiB available. Are you sure you are passing through the correct device? PCI IDs can change due to adding or removing hardware, and the host loses connection to all devices in the same IOMMU group when starting the VM.
Check the Syslog in the Proxmox GUI or journalctl on the console and see if there are error messages around the time of starting VM 181.
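For example, something along these lines on the host console should show the relevant messages and how much memory is actually free (the timestamps are just placeholders, adjust them to when you started the VM):

# follow the host log live while starting the VM
journalctl -f
# or show a window around the start attempt
journalctl --since "2023-01-15 15:25:00" --until "2023-01-15 15:30:00"
# check how much memory (and how many hugepages) are actually free
free -h
grep Huge /proc/meminfo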
 
Thank you for the reply. I am pretty confident that I am passing through the correct device, and it seems to be the only device in its IOMMU group. It is more likely an issue with the memory. I have tried a few things that I saw in other posts for similar issues. For example, I have tried setting hugepages to 1024, 2, and any - none of which worked. I also used commands to clear the memory cache and tried running the VM with a qm command that was supposed to bypass the lockout, as sketched below.
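For reference, this is roughly what I tried; the exact values and commands are from memory, so they may not be exactly what I ran:

# hugepages variants in /etc/pve/qemu-server/181.conf (tried one at a time)
hugepages: 1024
hugepages: 2
hugepages: any

# clearing the page cache to free up memory before starting the VM
sync && echo 3 > /proc/sys/vm/drop_caches

# starting the VM from the CLI with a longer timeout instead of the GUI
qm start 181 --timeout 300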

The Proxmox shell repeats a message like this, starting right after I try to start the VM:

kernel:[ 528.399006] watchdog: BUG: soft lockup - CPU#9 stuck for 205s! [kworker/9:3:314]
This is what the syslog says from the point where I started it.
Jan 15 15:25:12 blackwhole pvedaemon[2793]: start VM 181: UPID:blackwhole:00000AE9:00005188:63C48B58:qmstart:181:root@pam:
Jan 15 15:25:12 blackwhole kernel: vfio-pci 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
Jan 15 15:25:12 blackwhole kernel: vfio-pci 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
Jan 15 15:25:12 blackwhole kernel: vfio-pci 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
Jan 15 15:25:13 blackwhole pvestatd[2215]: storage 'Common-NFS-NAS' is not online
Jan 15 15:25:13 blackwhole pvestatd[2215]: status update time (12.437 seconds)
Jan 15 15:25:14 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 1023ms after FLR; waiting
Jan 15 15:25:15 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 2047ms after FLR; waiting
Jan 15 15:25:17 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 4095ms after FLR; waiting
Jan 15 15:25:19 blackwhole pvestatd[2215]: storage 'Legendary-NFS-NAS' is not online
Jan 15 15:25:21 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 8191ms after FLR; waiting
Jan 15 15:25:26 blackwhole pvestatd[2215]: storage 'Common-NFS-NAS' is not online
Jan 15 15:25:26 blackwhole pvestatd[2215]: status update time (12.394 seconds)
Jan 15 15:25:30 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 16383ms after FLR; waiting
Jan 15 15:25:32 blackwhole pvestatd[2215]: storage 'Legendary-NFS-NAS' is not online
Jan 15 15:25:38 blackwhole pvestatd[2215]: storage 'Common-NFS-NAS' is not online
Jan 15 15:25:38 blackwhole pvestatd[2215]: status update time (12.421 seconds)
Jan 15 15:25:44 blackwhole pvestatd[2215]: storage 'Common-NFS-NAS' is not online
Jan 15 15:25:46 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 32767ms after FLR; waiting
Jan 15 15:25:50 blackwhole pvestatd[2215]: storage 'Legendary-NFS-NAS' is not online
Jan 15 15:25:50 blackwhole pvestatd[2215]: status update time (12.404 seconds)
Jan 15 15:25:57 blackwhole pvestatd[2215]: storage 'Legendary-NFS-NAS' is not online
Jan 15 15:26:03 blackwhole pvestatd[2215]: storage 'Common-NFS-NAS' is not online
Jan 15 15:26:03 blackwhole pvestatd[2215]: status update time (12.428 seconds)
Jan 15 15:26:09 blackwhole pvestatd[2215]: storage 'Legendary-NFS-NAS' is not online
Jan 15 15:26:15 blackwhole pvestatd[2215]: storage 'Common-NFS-NAS' is not online
Jan 15 15:26:15 blackwhole pvestatd[2215]: status update time (12.416 seconds)
Jan 15 15:26:21 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 65535ms after FLR; giving up
Jan 15 15:26:21 blackwhole systemd[1]: Created slice qemu.slice.
Jan 15 15:26:21 blackwhole systemd[1]: Started 181.scope.
Jan 15 15:26:21 blackwhole systemd-udevd[2983]: Using default interface naming scheme 'v247'.
Jan 15 15:26:21 blackwhole systemd-udevd[2983]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jan 15 15:26:21 blackwhole pvestatd[2215]: storage 'Legendary-NFS-NAS' is not online
Jan 15 15:26:22 blackwhole kernel: device tap181i0 entered promiscuous mode
Jan 15 15:26:22 blackwhole systemd-udevd[2983]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jan 15 15:26:22 blackwhole systemd-udevd[2983]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jan 15 15:26:22 blackwhole systemd-udevd[2986]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jan 15 15:26:22 blackwhole systemd-udevd[2986]: Using default interface naming scheme 'v247'.
Jan 15 15:26:22 blackwhole kernel: vmbr0: port 2(fwpr181p0) entered blocking state
Jan 15 15:26:22 blackwhole kernel: vmbr0: port 2(fwpr181p0) entered disabled state
Jan 15 15:26:22 blackwhole kernel: device fwpr181p0 entered promiscuous mode
Jan 15 15:26:22 blackwhole kernel: vmbr0: port 2(fwpr181p0) entered blocking state
Jan 15 15:26:22 blackwhole kernel: vmbr0: port 2(fwpr181p0) entered forwarding state
Jan 15 15:26:22 blackwhole kernel: fwbr181i0: port 1(fwln181i0) entered blocking state
Jan 15 15:26:22 blackwhole kernel: fwbr181i0: port 1(fwln181i0) entered disabled state
Jan 15 15:26:22 blackwhole kernel: device fwln181i0 entered promiscuous mode
Jan 15 15:26:22 blackwhole kernel: fwbr181i0: port 1(fwln181i0) entered blocking state
Jan 15 15:26:22 blackwhole kernel: fwbr181i0: port 1(fwln181i0) entered forwarding state
Jan 15 15:26:22 blackwhole kernel: fwbr181i0: port 2(tap181i0) entered blocking state
Jan 15 15:26:22 blackwhole kernel: fwbr181i0: port 2(tap181i0) entered disabled state
Jan 15 15:26:22 blackwhole kernel: fwbr181i0: port 2(tap181i0) entered blocking state
Jan 15 15:26:22 blackwhole kernel: fwbr181i0: port 2(tap181i0) entered forwarding state
Jan 15 15:26:32 blackwhole pvestatd[2215]: storage 'Common-NFS-NAS' is not online
Jan 15 15:26:32 blackwhole pvestatd[2215]: status update time (16.322 seconds)
Jan 15 15:26:34 blackwhole pvedaemon[2240]: VM 181 qmp command failed - VM 181 qmp command 'query-proxmox-support' failed - got timeout
Jan 15 15:26:40 blackwhole pvestatd[2215]: VM 181 qmp command failed - VM 181 qmp command 'query-proxmox-support' failed - unable to connect to VM 181 qmp socket - timeout after 51 retries
Jan 15 15:26:43 blackwhole pvedaemon[2240]: VM 181 qmp command failed - VM 181 qmp command 'query-proxmox-support' failed - unable to connect to VM 181 qmp socket - timeout after 51 retries
Jan 15 15:26:43 blackwhole pvedaemon[2240]: <root@pam> starting task UPID:blackwhole:00000C1B:000074E6:63C48BB3:vncproxy:181:root@pam:
Jan 15 15:26:43 blackwhole pvedaemon[3099]: starting vnc proxy UPID:blackwhole:00000C1B:000074E6:63C48BB3:vncproxy:181:root@pam:
Jan 15 15:26:50 blackwhole pvestatd[2215]: storage 'Common-NFS-NAS' is not online
EDIT: I may have it working now after playing some settings roulette in the PVE GUI. I have had a lot of false successes on this, though, so I am not confident this will last.
EDIT2: Never mind, it failed again. This is what happens: it works once, and then after verifying that the VM is correctly passing through the PCI device and has the i915 driver, I make a backup. Once that is done I shut it down and check if it survives a reboot, but I can never get the second boot to work.

This is what the syslog said this time:
Jan 15 16:11:49 blackwhole pvedaemon[7185]: start VM 181: UPID:blackwhole:00001C11:00029AEF:63C49645:qmstart:181:root@pam:
Jan 15 16:11:49 blackwhole pvedaemon[2239]: <root@pam> starting task UPID:blackwhole:00001C11:00029AEF:63C49645:qmstart:181:root@pam:
Jan 15 16:11:50 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 1023ms after FLR; waiting
Jan 15 16:11:51 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 2047ms after FLR; waiting
Jan 15 16:11:53 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 4095ms after FLR; waiting
Jan 15 16:11:58 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 8191ms after FLR; waiting
Jan 15 16:12:06 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 16383ms after FLR; waiting
Jan 15 16:12:24 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 32767ms after FLR; waiting
Jan 15 16:12:46 blackwhole pvedaemon[2239]: <root@pam> starting task UPID:blackwhole:00001CCB:0002B170:63C4967E:vncshell::root@pam:
Jan 15 16:12:46 blackwhole pvedaemon[7371]: starting termproxy UPID:blackwhole:00001CCB:0002B170:63C4967E:vncshell::root@pam:
Jan 15 16:12:47 blackwhole pvedaemon[2241]: <root@pam> successful auth for user 'root@pam'
Jan 15 16:12:47 blackwhole login[7376]: pam_unix(login:session): session opened for user root(uid=0) by (uid=0)
Jan 15 16:12:47 blackwhole systemd[1]: Created slice User Slice of UID 0.
Jan 15 16:12:47 blackwhole systemd[1]: Starting User Runtime Directory /run/user/0...
Jan 15 16:12:47 blackwhole systemd-logind[1897]: New session 1 of user root.
Jan 15 16:12:47 blackwhole systemd[1]: Finished User Runtime Directory /run/user/0.
Jan 15 16:12:47 blackwhole systemd[1]: Starting User Manager for UID 0...
Jan 15 16:12:47 blackwhole systemd[7382]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0)
Jan 15 16:12:47 blackwhole systemd[7382]: Queued start job for default target Main User Target.
Jan 15 16:12:47 blackwhole systemd[7382]: Created slice User Application Slice.
Jan 15 16:12:47 blackwhole systemd[7382]: Reached target Paths.
Jan 15 16:12:47 blackwhole systemd[7382]: Reached target Timers.
Jan 15 16:12:47 blackwhole systemd[7382]: Listening on GnuPG network certificate management daemon.
Jan 15 16:12:47 blackwhole systemd[7382]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jan 15 16:12:47 blackwhole systemd[7382]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Jan 15 16:12:47 blackwhole systemd[7382]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Jan 15 16:12:47 blackwhole systemd[7382]: Listening on GnuPG cryptographic agent and passphrase cache.
Jan 15 16:12:47 blackwhole systemd[7382]: Reached target Sockets.
Jan 15 16:12:47 blackwhole systemd[7382]: Reached target Basic System.
Jan 15 16:12:47 blackwhole systemd[7382]: Reached target Main User Target.
Jan 15 16:12:47 blackwhole systemd[7382]: Startup finished in 94ms.
Jan 15 16:12:47 blackwhole systemd[1]: Started User Manager for UID 0.
Jan 15 16:12:47 blackwhole systemd[1]: Started Session 1 of user root.
Jan 15 16:12:47 blackwhole login[7397]: ROOT LOGIN on '/dev/pts/0'
Jan 15 16:12:58 blackwhole kernel: vfio-pci 0000:00:02.0: not ready 65535ms after FLR; giving up
Jan 15 16:12:58 blackwhole systemd[1]: Started 181.scope.
Jan 15 16:12:58 blackwhole systemd-udevd[7447]: Using default interface naming scheme 'v247'.
Jan 15 16:12:58 blackwhole systemd-udevd[7447]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jan 15 16:12:59 blackwhole kernel: device tap181i0 entered promiscuous mode
Jan 15 16:12:59 blackwhole systemd-udevd[7447]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jan 15 16:12:59 blackwhole systemd-udevd[7447]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jan 15 16:12:59 blackwhole systemd-udevd[7450]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jan 15 16:12:59 blackwhole systemd-udevd[7450]: Using default interface naming scheme 'v247'.
Jan 15 16:12:59 blackwhole kernel: vmbr0: port 2(fwpr181p0) entered blocking state
Jan 15 16:12:59 blackwhole kernel: vmbr0: port 2(fwpr181p0) entered disabled state
Jan 15 16:12:59 blackwhole kernel: device fwpr181p0 entered promiscuous mode
Jan 15 16:12:59 blackwhole kernel: vmbr0: port 2(fwpr181p0) entered blocking state
Jan 15 16:12:59 blackwhole kernel: vmbr0: port 2(fwpr181p0) entered forwarding state
Jan 15 16:12:59 blackwhole kernel: fwbr181i0: port 1(fwln181i0) entered blocking state
Jan 15 16:12:59 blackwhole kernel: fwbr181i0: port 1(fwln181i0) entered disabled state
Jan 15 16:12:59 blackwhole kernel: device fwln181i0 entered promiscuous mode
Jan 15 16:12:59 blackwhole kernel: fwbr181i0: port 1(fwln181i0) entered blocking state
Jan 15 16:12:59 blackwhole kernel: fwbr181i0: port 1(fwln181i0) entered forwarding state
Jan 15 16:12:59 blackwhole kernel: fwbr181i0: port 2(tap181i0) entered blocking state
Jan 15 16:12:59 blackwhole kernel: fwbr181i0: port 2(tap181i0) entered disabled state
Jan 15 16:12:59 blackwhole kernel: fwbr181i0: port 2(tap181i0) entered blocking state
Jan 15 16:12:59 blackwhole kernel: fwbr181i0: port 2(tap181i0) entered forwarding state
Jan 15 16:13:07 blackwhole pvestatd[2211]: VM 181 qmp command failed - VM 181 qmp command 'query-proxmox-support' failed - got timeout
 
It's not a memory issue because there is an error message while starting the VM:
kernel: vfio-pci 0000:00:02.0: not ready 1023ms after FLR; waiting
The device is not resetting properly and appears unresponsive after a function level reset (FLR).
This usually indicates that the hardware (maybe in combination with drivers leaving it in a bad state) is not well suited for passthrough.
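If you want to see what reset mechanisms the device advertises and which one the kernel will use, something like this should work (device address taken from your log; the reset_method attribute only exists on recent kernels):

lspci -vvs 00:02.0 | grep -iE "flr|reset"
cat /sys/bus/pci/devices/0000:00:02.0/reset_method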

Does this only happen after stopping and then starting the VM, or does it also happen when starting the VM for the first time after a reboot of the Proxmox host?
The VM worked fine until I shutdown the host. The next time I booted up the host I got this message when I started the VM and now I get it every time.
Did you update Proxmox and switch to a newer kernel version? Did you perhaps do a BIOS update? What changed between before the Proxmox host reboot and after? How did you prepare for the PCIe passthrough? Maybe some changes weren't needed and only came into effect after a reboot.

Maybe you can find someone who has the exact same GPU (which one is it?) working with passthrough and find out how they did it. Maybe this device needs some kind of work-around.
 
Does this only happen after stopping and then starting the VM, or does it also happen when starting the VM for the first time after a reboot of the Proxmox host?
After much trial and error, I have determined that I can start the VM with the PCI passthrough only if I restore a backup that I took of the VM from right before I added the PCI device and then re-add the PCI device (making sure that I do not select "All Functions"). If I do that process, the VM will boot up once, it will have the PCI passthrough applied, and everything seems to be working fine.

However, the problem occurs on the next boot of the VM, and all future boots of the VM, until I repeat the process above (restore backup & add PCI device); the CLI equivalent is sketched below.
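For clarity, the CLI equivalent of what I do in the GUI is roughly this (the backup file name is just an example):

# restore the pre-passthrough backup over VM 181
qmrestore /mnt/pve/VMadmins/dump/vzdump-qemu-181-<date>.vma.zst 181 --force

# re-add only the iGPU function, without "All Functions"
qm set 181 -hostpci0 0000:00:02.0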

Did you update Proxmox and switch to a newer kernel version? Did you perhaps do a BIOS update? What changed between before the Proxmox host reboot and after?
I installed Proxmox a few weeks ago and it was on 7.3-4. I have checked for updates periodically, but I don't remember any since the first day of install. My kernel version is 5.15.83-1, and I am not changing it during the process (that I know of, anyway). I probably updated my motherboard's BIOS when I installed it over a year ago, but not recently. I think the best way to confirm that nothing is changing with the Proxmox host before and after is that I can replicate the process above (restore backup & add PCI device) without changing anything - and the same thing still happens.
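To confirm the versions stay the same between a working and a failing attempt, I compare something like:

pveversion -v | head -n 5
uname -r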

Note: when the VM fails to start, the host freezes up too and cannot do anything. The only way to shut down the host at that point is to hold the physical power button on the machine until it shuts down. It has occurred to me that that might change something (I'm not sure), but it happens after the VM won't start, so it's probably not what's causing the problem.

How did you prepare for the PCIe passthrough?
I have been doing PCI passthrough, not PCIe; however, after this post, I may try and see if it works with PCI-Express enabled. I wasn't using the PCI-Express option because it's an iGPU, but maybe that's flawed logic on a VM...

Here is how I prepared:

First I tried split iGPU (GVT-g) passthrough, because my processor, an Intel Comet Lake i5-10400, supposedly supports it. For that I used:
This post from the Proxmox forums.
It didn't work for me. The only thing I did differently was that I didn't pin the older kernel (5.13.19-5) from that post.
Since that didn't work, I went back in and tried to revert all of the steps in the guide (perhaps I missed something that is causing this?); the host-side changes I reverted are sketched below.
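As far as I remember, the host-side changes from that kind of GVT-g guide that I made and then tried to revert were roughly these (from memory, so the exact lines may have differed):

# /etc/default/grub, followed by update-grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on i915.enable_gvt=1"

# /etc/modules, followed by update-initramfs -u -k all
vfio
vfio_iommu_type1
vfio_pci
kvmgt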

Next I decided that full iGPU passthrough might be a better option, so I used:
this guide.
... I just remembered something as I was writing this - I wasn't able to perform the last step, checking whether the GPU's driver initialization was working in the Ubuntu Server VM. I think I rebooted, and since the VM wouldn't start, that became the problem I focused on. Based on what you said:
The device is not resetting properly and appears unresponsive after a function level reset (FLR).
This usually indicates that the hardware (maybe in combination with drivers leaving it in a bad state) is not well suited for passthrough.
I think that I need to rerun:

cd /dev/dri && ls -la
on the VM while it's in a working state and then figure out if the GPU's driver initialization is working. If not, then that might be what is causing the VM to not boot?
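Inside the VM, while it is in the working state, I plan to check something like the following (the exact device names I expect to see are an assumption on my part):

ls -la /dev/dri            # expecting card0 and renderD128 if the iGPU is usable
dmesg | grep -i i915       # the i915 driver should initialize without errors
lspci -nnk | grep -A3 VGA  # shows which driver is bound to the passed-through GPU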

Maybe you can find someone that has the exact same GPU (which is?) working with passthrough and find out how they did it. Maybe this device needs some kind of work-around.
Intel Comet Lake i5-10400 iGPU. I will look again for other working passthrough posts for this iGPU. I did look around a few times early on in my research, but it was slim pickings - I found a few posts but nothing with any specific workarounds to apply for this specific device. Maybe I missed something, though.
 
After much trial and error, I have determined that I can start the VM with the PCI passthrough only if I restore a backup that I took of the VM from right before I added the PCI device and then re-add the PCI device (making sure that I do not select "All Functions"). If I do that process, the VM will boot up once, it will have the PCI passthrough applied, and everything seems to be working fine.

However, the problem occurs on the next boot of the VM, and all future boots of the VM, until I repeat the process above (restore backup & add PCI device).
That's really weird; make sure to keep that backup. You are getting the "not ready after FLR; waiting" error because of something inside the VM? Even when you reboot the Proxmox host before starting the VM again? I can only imagine that the VM configuration from the backup contains a configuration change that is only applied after a shutdown. Can you share the VM configuration from the VM backup?
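You can get the current configuration with, for example:

qm config 181
# or read the file directly
cat /etc/pve/qemu-server/181.conf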
I installed Proxmox a few weeks ago and it was on 7.3-4. I have checked for updates periodically, but I don't remember any since the first day of install. My kernel version is 5.15.83-1, and I am not changing it during the process (that I know of, anyway). I probably updated my motherboard's BIOS when I installed it over a year ago, but not recently. I think the best way to confirm that nothing is changing with the Proxmox host before and after is that I can replicate the process above (restore backup & add PCI device) without changing anything - and the same thing still happens.
Okay, then it's most likely not because of Proxmox updates or kernel updates.
Note: when the VM fails to start, the host freezes up too and cannot do anything. The only way to shut down the host at that point is to hold the physical power button on the machine until it shuts down. It has occurred to me that that might change something (I'm not sure), but it happens after the VM won't start, so it's probably not what's causing the problem.
The device no longer responding might become an issue for the PCIe bus and/or the CPU, which might cause system-wide problems. That's one of the risks with passthrough to VMs: it can influence other VMs and the host because of the physical hardware access (even though it shouldn't, because of the IOMMU).
I have been doing PCI passthrough, not PCIe; however, after this post, I may try and see if it works with PCI-Express enabled. I wasn't using the PCI-Express option because it's an iGPU, but maybe that's flawed logic on a VM...
It usually does not matter. I did not mean to suggest that you should use PCIe instead of PCI; I just wanted to point you to the manual so you can double-check your system changes against it. But drivers inside a VM sometimes make assumptions about the PCI(e) layout, in which case it does matter.
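In the VM configuration the difference is just a flag on the hostpci line, for example (using your device address; pcie=1 requires the q35 machine type, which VM 181 already uses):

# plain PCI passthrough (what you have now):
hostpci0: 0000:00:02.0
# PCIe passthrough:
hostpci0: 0000:00:02.0,pcie=1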
Here is how I prepared:

First I tried split iGPU (GVT-g) passthrough, because my processor, an Intel Comet Lake i5-10400, supposedly supports it. For that I used:
This post from the Proxmox forums.
It didn't work for me. The only thing I did differently was that I didn't pin the older kernel (5.13.19-5) from that post.
Since that didn't work, I went back in and tried to revert all of the steps in the guide (perhaps I missed something that is causing this?).

Next I decided that full iGPU passthrough might be a better option, so I used:
this guide.
... I just remembered something as I was writing this - I wasn't able to perform the last step, checking whether the GPU's driver initialization was working in the Ubuntu Server VM. I think I rebooted, and since the VM wouldn't start, that became the problem I focused on. Based on what you said:

I think that I need to rerun:

cd /dev/dri && ls -la
on the VM while it's in a working state and then figure out if the GPU's driver initialization is working. If not, then that might be what is causing the VM to not boot?
I don't know those guides and I have no experience with iGPU passthrough, but I doubt that such a command (ls -la /dev/dri) would mess things up.
Intel Comet Lake i5-10400 iGPU. I will look again for other working passthrough posts for this iGPU. I did look around a few times early on in my research, but it was slim pickings - I found a few posts but nothing with any specific workarounds to apply for this specific device. Maybe I missed something, though.
iGPU passthrough, especially if the iGPU is used during system boot, is always more involved than passing through discrete PCIe devices (which you can simply bind early to vfio-pci, so that nothing but the VM touches them).
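For a discrete device, that early binding typically looks something like this (the vendor:device IDs are placeholders; use the ones shown by lspci -nn, and replace i915 with whatever driver would otherwise claim the card):

# /etc/modprobe.d/vfio.conf
options vfio-pci ids=1234:abcd disable_vga=1
softdep i915 pre: vfio-pci

# then rebuild the initramfs and reboot
update-initramfs -u -k all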
 
