A few days ago I got a new machine with an AMD 7950X CPU and a 7900 XTX GPU and set it up with Proxmox 8.0.2. After successfully passing the dedicated GPU through to a Windows 11 guest VM and an Ubuntu 22.04 guest VM, I tried to install the latest AMD ROCm support with PyTorch in the Ubuntu VM.
Although the installation went smoothly, I got stuck on an error from the HIP kernel complaining that it can't continue with the present state. After searching on Google, I learned that the PCIe DevCaps "AtomicOpsCap: Routing+ 32bit+ 64bit+" must be enabled all the way through the PCIe tree down to the GPU device. However, the default pci-root-port implementation in KVM doesn't enable them, which causes the problem.
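For anyone who wants to check this in their own guest, here is a rough Python sketch that reads `lspci -vv` style output and reports the AtomicOpsCap flags per device. The function names and the sample text are made up for illustration (the sample is not output from my machine), and the parsing assumes the flags appear on a line containing "AtomicOpsCap:" as lspci prints them.

```python
import re


def parse_atomic_ops(lspci_text):
    """Map each PCI address to its AtomicOpsCap flags (name -> True/False)."""
    result = {}
    device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Device header lines are not indented, e.g. "01:00.0 VGA ..."
            device = line.split()[0]
        elif "AtomicOpsCap:" in line and device is not None:
            flags = line.split("AtomicOpsCap:", 1)[1]
            # Each flag is a name followed by "+" (set) or "-" (not set)
            result[device] = {
                name: sign == "+"
                for name, sign in re.findall(r"(\w+)([+-])", flags)
            }
    return result


def supports_atomics(flags, need=("32bit", "64bit")):
    """True if every flag listed in `need` is present and set."""
    return all(flags.get(name, False) for name in need)


# Hypothetical sample: a root port without atomics and a GPU with them.
SAMPLE = """\
00:1c.0 PCI bridge: Example root port
\t\tDevCap2: Completion Timeout: Not Supported
\t\t\t AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
01:00.0 VGA compatible controller: Example GPU
\t\tDevCap2: Completion Timeout: Not Supported
\t\t\t AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
"""

if __name__ == "__main__":
    for address, flags in parse_atomic_ops(SAMPLE).items():
        print(address, flags, "OK" if supports_atomics(flags) else "missing")
```

To try it on a real system, replace `SAMPLE` with the output of `sudo lspci -vv` from inside the guest. For a root port you would also want "Routing" in `need`, since that is the flag the root port has to advertise for atomics to reach the GPU.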
I found that a patch enabling these features already exists in the QEMU repo. I applied it and recompiled pve-qemu-kvm 8.0.2-7 to force-enable the DevCaps as required. Now PyTorch with ROCm runs successfully.
I'm wondering whether this patch will be included in a future pve-qemu-kvm release.
Best wishes to the Proxmox community.