Good morning/afternoon/evening, everyone. I have a problem that has been bothering me for a week. I have been running Proxmox for two years and have kept adding hardware along the way. The CPU is an AMD 5600G and the motherboard is a Maxsun B550M. Two 2.5-inch SSDs form a ZFS mirror as the system disk, and a 500GB Samsung NVMe SSD serves as cache. The onboard SATA controller has only four ports, so when I added a fifth hard drive I installed a PCIe-to-SATA card and passed it through to a virtual machine. Since the motherboard has only one PCIe x16 slot, for convenience I put the PCIe-to-SATA card directly into that x16 slot.
This year, in order to learn AI, I bought an NVIDIA 4060Ti 16G and installed it in the PCIe x4 slot. It was also passed through to the virtual machine, and everything worked normally.
That lasted until this Tuesday, when Proxmox crashed while I was running Stable Diffusion in a Windows 11 virtual machine. On inspection, I found that PVE now crashes as soon as the graphics card is passed through to the virtual machine. After checking the IOMMU groups, I found that the onboard SATA controller and the graphics card are in the same group, so passing through the GPU probably also passes the system's SATA controller to the virtual machine, which leads to the crash.
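For anyone who wants to reproduce the group check, here is a minimal sketch of one way to list the groups. It assumes the usual Linux sysfs layout, where each group appears as `/sys/kernel/iommu_groups/<number>/devices/<PCI address>`. The function names `list_iommu_groups` and `shared_group` are made up for this example and are not part of any Proxmox tool.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [PCI addresses]} read from a sysfs-style tree."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # No such directory: IOMMU is probably disabled, or this is not Linux.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if not group_dir.name.isdigit() or not devices_dir.is_dir():
            continue
        groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    # Sort by group number so the output is stable.
    return dict(sorted(groups.items()))


def shared_group(groups, addr_a, addr_b):
    """Return the group number that contains both addresses, or None."""
    for number, devices in groups.items():
        if addr_a in devices and addr_b in devices:
            return number
    return None


if __name__ == "__main__":
    for number, devices in list_iommu_groups().items():
        print(f"group {number}: {', '.join(devices)}")
```

Calling `shared_group` with the PCI addresses of the GPU and the SATA controller (as reported by `lspci`) shows whether the two devices still share a group after a configuration change.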
The attachment shows the output that appears when the GPU is passed through.
What puzzles me is that after I installed the graphics card, the system ran stably for more than half a year, until this Tuesday. The only hardware change during that period was a memory upgrade from 32GB to 64GB last Saturday, and the system also ran normally after that upgrade.
I have enabled the ACS override in GRUB to force the IOMMU groups apart:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction"
GRUB_CMDLINE_LINUX=""
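Changes to `GRUB_CMDLINE_LINUX_DEFAULT` only take effect after the bootloader configuration is regenerated and the host is rebooted, so it may be worth confirming that the running kernel actually received these parameters. The sketch below compares `/proc/cmdline` against the three options from the config above. The helper name `missing_kernel_params` is hypothetical and exists only for this example.

```python
def missing_kernel_params(cmdline, expected):
    """Return the expected parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in expected if param not in present]


# The options taken from GRUB_CMDLINE_LINUX_DEFAULT above.
EXPECTED = [
    "amd_iommu=on",
    "iommu=pt",
    "pcie_acs_override=downstream,multifunction",
]

if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as handle:
        running = handle.read()
    missing = missing_kernel_params(running, EXPECTED)
    print("missing:", ", ".join(missing) if missing else "none")
```

If any option is reported as missing, the running kernel was booted without it, and the grouping result would be the same as with no override at all.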
Since the groups still did not split, I compiled a kernel with the ACS override patch 0004-pci-Enable-overrides-for-missing-ACS-capabilities-4..patch:
+	/* Never override ACS for legacy devices or devices with ACS caps */
+	if (!pci_is_pcie(dev))
+		return -ENOTTY;
This did not help either.
Please forgive my English, as it is not my native language. This problem has troubled me for a long time, and I am very grateful for any help.