GPU passthrough to a virtual machine suddenly started crashing the host

ctkghost

Jun 22, 2024
Hello everyone. I have had a problem for a week that I cannot solve. I have been using Proxmox for two years and have kept adding hardware along the way. The CPU is an AMD 5600G and the motherboard is a Maxsun B550M. Two 2.5-inch SSDs form a ZFS mirror as the system disk, and a 500GB Samsung NVMe SSD serves as cache. Because the onboard SATA controller has only four ports, I added a PCIe-to-SATA card when I expanded to a fifth hard drive, and that card is passed through to a virtual machine. The motherboard has only one PCIe x16 slot, and for convenience I installed the PCIe-to-SATA card in it.

This year, to learn AI, I bought an NVIDIA 4060 Ti 16GB and installed it in the PCIe x4 slot. It was also passed through to a virtual machine, and everything worked normally.

That lasted until this Tuesday, when Proxmox crashed while I was running Stable Diffusion in a Windows 11 virtual machine. Since then, PVE crashes as soon as the graphics card is passed through to the virtual machine. Checking the IOMMU groups showed that the onboard SATA controller and the graphics card are in the same group, so passing through the graphics card probably also hands the system SATA controller to the virtual machine, which would take the system disks away from the host and crash it.
The attachments show the output that appears when the GPU is passed through.
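
For readers who want the grouping as text rather than a screenshot, the following is a minimal sketch that reads the IOMMU groups from sysfs. It assumes the standard Linux layout `/sys/kernel/iommu_groups/<group>/devices/<PCI address>`; the function names `list_iommu_groups` and `shared_groups` are hypothetical helpers chosen for this illustration, not part of any Proxmox tool.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [PCI addresses]} read from sysfs."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups  # IOMMU disabled or not exposed by the kernel
    for group_dir in base.iterdir():
        if not group_dir.name.isdigit():
            continue
        devices_dir = group_dir / "devices"
        if devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return dict(sorted(groups.items()))


def shared_groups(groups):
    """Keep only the groups that span more than one PCI slot."""
    shared = {}
    for number, devices in groups.items():
        # "0000:01:00.0" and "0000:01:00.1" are functions of the same slot,
        # so strip the function suffix before comparing.
        slots = {addr.rsplit(".", 1)[0] for addr in devices}
        if len(slots) > 1:
            shared[number] = devices
    return shared


if __name__ == "__main__":
    for number, devices in list_iommu_groups().items():
        print(f"group {number}: {' '.join(devices)}")
```

A group reported by `shared_groups` that contains both the GPU and the onboard SATA controller would match the situation described above, since every device in a group has to be handed to the guest together.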

What puzzles me is that the graphics card ran stably for more than half a year after installation, until this Tuesday. The only hardware change in that period was a memory upgrade from 32GB to 64GB last Saturday, and the system ran normally after that upgrade as well.

I have enabled the ACS override in GRUB to force the IOMMU groups to split:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction"
GRUB_CMDLINE_LINUX=""
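
One thing worth ruling out is whether these parameters actually reached the running kernel. The sketch below compares `/proc/cmdline` against the values from the GRUB configuration above; `parse_cmdline`, `missing_params`, and `EXPECTED` are hypothetical names used only for this illustration. As an assumption to verify on the system in question: Proxmox installs that boot through systemd-boot (often the case with a ZFS root on UEFI) are documented to take the command line from `/etc/kernel/cmdline` rather than from the GRUB file, in which case edits to GRUB alone would not take effect.

```python
import shlex

# Parameters copied from GRUB_CMDLINE_LINUX_DEFAULT in the post.
EXPECTED = {
    "amd_iommu": "on",
    "iommu": "pt",
    "pcie_acs_override": "downstream,multifunction",
}


def parse_cmdline(text):
    """Split a kernel command line into {parameter: value or None}."""
    params = {}
    for token in shlex.split(text):
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(text, expected=EXPECTED):
    """List expected parameters that are absent or set to another value."""
    active = parse_cmdline(text)
    return [f"{k}={v}" for k, v in expected.items() if active.get(k) != v]


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        missing = missing_params(f.read())
    print(missing or "all expected parameters are active")
```

An empty result only shows that the parameters were passed to the kernel; `pcie_acs_override` additionally requires a kernel that carries the ACS override patch.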

When that did not split the group, I compiled a kernel with the forced-split patch 0004-pci-Enable-overrides-for-missing-ACS-capabilities-4..patch:
+	/* Never override ACS for legacy devices or devices with ACS caps */
+	if (!pci_is_pcie(dev))
+		return -ENOTTY;
This had no effect either.

Please forgive my English, as it is not my native language. This problem has troubled me for a long time, and I would be very grateful for any help.
 

Attachments

  • iommu.jpg
  • iommu2.jpg
  • IMG20240620201054.jpg
