Hey, I am brand new to Proxmox and Linux as a whole. I am a software developer and have worked with Windows my entire life, so while this is all new territory, I am quickly learning Linux and getting around the terminal. However, no matter what I do, I just can't seem to fix this issue.
To get straight into it: on Proxmox VE 9.0.3 I have a Windows 11 VM and a Pop!_OS VM (the latest NVIDIA build from their website). The host has an NVIDIA RTX 4090 and a GTX 970. The terminal/Proxmox output is set to the 970 using the proprietary NVIDIA driver (it didn't matter whether it was nouveau; I installed the NVIDIA driver hoping that detaching the 4090 from vfio-pci, attaching it to nvidia, and then switching back would solve the problem). I can boot into the Windows VM, shut it down, and start it up again as many times as I want with no issues. I can also shut down the Windows VM and boot into the Pop!_OS VM without a problem. But once I shut down the Pop!_OS VM, whether from inside Pop!_OS, by stopping the VM immediately in Proxmox, or by using the shutdown button in Proxmox, the 4090 will not show up in any VM until I restart the host. I tried Fedora KDE, I tried Mint, and I tried different driver versions on Pop!_OS. The end goal is to be able to swap between Windows and Linux faster than a dual boot with a full restart would take. I will be running LLM models and occasionally playing video games, so I ideally need full GPU passthrough on both OSes. The relevant VM config lines are below for reference.
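The passthrough part of both VM configs looks roughly like this (typed from memory, so the exact options may be slightly off):

# /etc/pve/qemu-server/<vmid>.conf (relevant lines only)
bios: ovmf
machine: q35
cpu: host
hostpci0: 0000:01:00,pcie=1,x-vga=1

Here pcie=1 and x-vga=1 correspond to the PCI-Express and Primary GPU toggles I mention trying further down, and rombar is left at its default (on) unless I toggle it off.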
So summary: Once a Linux VM is started in Proxmox, and then shut down, the 4090 is rendered useless until a host restart. This does NOT happen with the Windows VM.
I have been trying to fix this for two days and I can't seem to get past this issue. I tried this after shutting down the Linux VM:
echo "0000:01:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind
echo "0000:01:00.1" > /sys/bus/pci/drivers/vfio-pci/unbind
echo "0000:01:00.0" > /sys/bus/pci/drivers/nvidia/bind
echo "0000:01:00.1" > /sys/bus/pci/drivers/snd_hda_intel/bind
echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:01:00.1/remove
echo 1 > /sys/bus/pci/rescan
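After running those I check what each function is bound to with something like:

lspci -nnk -s 0000:01:00.0
lspci -nnk -s 0000:01:00.1

but no matter what I rebind, the 4090 never comes back for any VM until the host reboots.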
And I tried everything I could think of. I went into /sys/bus/pci/devices/0000:01:00.0 (I think) and ran cat on every file in there. Compared with the state before starting and shutting down the Linux VM, the only differences were that the IRQ had changed and an msi_irqs (or similarly named) directory had been created; every other file was identical. The power state stayed on the whole time, so nothing looked obviously wrong there. The 4090 automatically binds to vfio-pci at boot. Other things I have tried:
- the pcie_aspm=off kernel parameter (I think that was the one)
- the paid ChatGPT and Grok plans, plus as much research on the Proxmox forums as I could do before posting
- every GRUB kernel command line change I could find; I have intel_iommu=on and iommu=pt set
- SR-IOV on and off in the BIOS, and Re-Size BAR on and off in the BIOS
- ROM-Bar on and off in the passthrough settings, SeaBIOS and UEFI (OVMF), the Primary GPU toggle on and off, and the PCI-Express toggle on and off
- a different driver on Pop!_OS, including an older one (570, I think), and a fresh Pop!_OS install
I didn't do anything special in Windows; I just installed the latest driver and booted up, and it was all good.
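For completeness, my kernel command line and the early vfio binding currently look roughly like this (paraphrasing from memory, since I've changed them so many times; the IDs are just what lspci -nn reports for the 4090 and its audio function):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

# /etc/modprobe.d/vfio.conf (or wherever the early binding lives)
options vfio-pci ids=10de:2684,10de:22ba

with update-grub and update-initramfs -u run after every change.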
When I start a VM while the 4090 is in this 'locked' state, the backlights on the primary and secondary monitors attached to the GPU come on, but nothing shows on screen, and the third monitor doesn't wake up at all (it doesn't during Proxmox's UEFI boot menu either, so I'm not reading much into that; it might just be how that monitor handles detection).
One thing I did notice: after installing the NVIDIA driver on the host, if I unbound the 4090 from vfio-pci and bound it to nvidia, nvidia-smi wouldn't show the 4090. lspci -nnk -d 10de: showed the active driver as nvidia for both the 4090 and the 970, but nvidia-smi only listed the 970. dmesg | grep -i nvidia showed an error about failing to allocate an NvKmsKapiDevice. So I went searching through /etc/modprobe.d, opened pve-blacklist.conf, and commented out the blacklist nvidiafb line, which killed the 970's screen. After a reboot I launched Pop!_OS, shut it down, unbound the 4090 from vfio-pci and bound it to nvidia, and this time there was no error, but nvidia-smi still did not show the 4090. So I uncommented the line; the 970's screen hasn't returned since, but that's the least of my worries (yes, I have been running update-grub and update-initramfs -u accordingly).
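In case it matters, the file I was editing is the one Proxmox ships by default, which (as far as I remember) looks like this:

# /etc/modprobe.d/pve-blacklist.conf
# This file contains a list of modules which are not supported by Proxmox VE
blacklist nvidiafb

I only toggled that single blacklist nvidiafb line, running update-initramfs -u each time.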
Please, if there is anything that can help me, I am desperate; I REALLY want to get this working. Thank you for reading!