GPU and HBA Passthrough Issues on Z370 Asus Prime-A

dizzydre21

New Member
Apr 10, 2023
Hello,

I am having an issue running an Nvidia RTX 2070 Super and an LSI 9211-8i at the same time in separate VMs. I can run them independently without issue, but the moment I try to run both VMs at the same time I get QEMU exit code 1 and the second VM fails to start. The devices are not in the same IOMMU group, and my motherboard manual states that PCIE16_1 can run at x16, or at x8 with PCIE16_2 also running at x8. Am I misunderstanding the info below?

Copied from Asus manual:

2 x PCIe 3.0/2.0 x16 slots (supports x16, x8/x8, x8/x4+x4*, x8+x4+x4/x0**) - These are PCIE16_1 and PCIE16_2
1 x PCI Express 3.0/2.0 x16 slot (max. at x4 mode, compatible with PCIe x1, x2 and x4 devices) - This is PCIE16_3
4 x PCI Express 3.0/2.0 x1 slots
* For 2 Intel® SSD on CPU support, install a Hyper M.2 X16 card (sold separately) into the PCIeX16_2 slot, then enable this card under BIOS settings.
** For 3 Intel® SSD on CPU support, install a Hyper M.2 X16 card (sold separately) into the PCIeX16_1 slot, then enable this card under BIOS settings.



My VMs are the current stable versions of TrueNAS Scale and Ubuntu Server (only running Jellyfin with GPU hardware acceleration). Currently, I am able to run both VMs with the HBA card in PCIE16_3, but it is running at x4. The BIOS states that the GPU is still running at x16 in slot 1 with this setup, which I do not totally understand unless the 4 lanes for PCIE16_3 are just chipset lanes.
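Whether a slot is really negotiating x16 or x4 can be checked from the running host rather than the BIOS screen: `lspci -vv` prints a `LnkSta:` line per device. A small sketch of pulling the negotiated width out of that line with plain shell (the PCI address in the comment is an example, not the poster's actual address):

```shell
# On a live host you would capture the line with something like:
#   lnksta=$(lspci -vvs 01:00.0 | grep LnkSta:)
# Here we use a sample line in the format lspci prints it:
lnksta='LnkSta: Speed 8GT/s (ok), Width x16 (ok)'

width=${lnksta#*Width }   # strip everything up to and including "Width "
width=${width%% *}        # keep only the first word, e.g. "x16"
echo "negotiated width: $width"
```

If the GPU still reports x16 while the HBA sits in PCIE16_3, that is consistent with PCIE16_3 being wired to the chipset rather than sharing CPU lanes with slot 1.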
 
If they are not in the same IOMMU group, you are not using pcie_acs_override, and you don't see any error message but only a time-out, then it's often a lack of memory.
Try booting both VMs with half (or less) of their memory to test. PCIe passthrough requires that all of the VM memory is pinned into actual RAM (because of possible device initiated DMA).
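The arithmetic behind that advice can be sketched in a few lines: the VM sizes (16 GiB and 4 GiB, as mentioned later in this thread) must fit entirely inside the host's available memory, since pinned pages cannot be swapped. The `mem_available` figure below is a placeholder; on a real host you would read it from /proc/meminfo:

```shell
# Read the live value with:
#   mem_available=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
mem_available=24000          # example value in MiB, not a measurement

truenas_vm=16384             # 16 GiB VM - must be fully pinned for passthrough
jellyfin_vm=4096             # 4 GiB VM - must be fully pinned for passthrough
needed=$((truenas_vm + jellyfin_vm))

if [ "$needed" -gt "$mem_available" ]; then
    echo "not enough free RAM to pin both VMs ($needed MiB > $mem_available MiB)"
else
    echo "headroom OK: $needed MiB needed, $mem_available MiB available"
fi
```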

EDIT: Sharing the output of cat /proc/cmdline; for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done, the VM configuration files, and how much (free) memory your Proxmox host has could help in troubleshooting.
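For readability, here is the same one-liner unrolled into a script. The parameter expansions cut the sysfs path down to the group number and the bare PCI address; the existence guard is an addition so the loop is a no-op on hosts without IOMMU enabled:

```shell
#!/bin/sh
# List every PCI device together with its IOMMU group number
for d in /sys/kernel/iommu_groups/*/devices/*; do
    [ -e "$d" ] || continue      # skip if the glob matched nothing
    n=${d#*/iommu_groups/}       # drop the prefix up to the group number
    n=${n%%/*}                   # keep only the group number itself
    printf 'IOMMU group %s ' "$n"
    lspci -nns "${d##*/}"        # ${d##*/} is the PCI address, e.g. 0000:01:00.0
done
```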
 
I saw similar comments about memory on Sunday, probably in this forum, and tried reducing both. They still would not boot. I have 32 GB total and originally had 16 GB allotted to TrueNAS and 4 GB allotted to Ubuntu Server.

I do have some new information, however. I recreated both of my VMs yesterday and used q35 and OVMF so that the PCI-Express checkbox would appear during VM creation. I checked this box and both VMs will now boot and run together. Is there a known conflict when not booting with the q35 machine type? I will try out your command when I get the chance, but I also had an issue with TrueNAS yesterday, so I pulled the boot drive for Proxmox and installed TrueNAS on another disk for troubleshooting on bare metal.
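For anyone landing here later: the "PCI-Express" checkbox in the GUI corresponds to the pcie=1 flag on the hostpci line of the VM config, which is only valid with the q35 machine type. A sketch of the relevant lines (the VM ID and PCI addresses below are examples, not taken from this thread):

```
# /etc/pve/qemu-server/101.conf (excerpt; ID and addresses are examples)
machine: q35
bios: ovmf
hostpci0: 0000:01:00,pcie=1,x-vga=1    # GPU, all functions passed through

# and in the HBA VM's config, e.g. 102.conf:
hostpci0: 0000:03:00.0,pcie=1          # LSI 9211-8i
```

With the default i440fx machine type there is no PCIe bus, so devices are attached as conventional PCI, which some guests and drivers tolerate less well.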
 
Sometimes ZFS also takes up to 50% of memory and it only appears to be solved after a reboot.
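The ARC's default cap of roughly half of RAM can be lowered so it does not compete with VM memory pinning. A sketch of the usual Proxmox-side setting (the 8 GiB cap is an arbitrary example value):

```
# /etc/modprobe.d/zfs.conf -- cap the ZFS ARC at 8 GiB (example value)
options zfs zfs_arc_max=8589934592

# then apply and reboot:
#   update-initramfs -u -k all
```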
Sorry, I can't explain that. Really no error messages in Proxmox Syslog (journalctl) before?
 
I just ran it and all the entries were from April 1st. They didn't look like errors though.

Edit - Oops, didn't realize I could scroll. Looking through it now.
 
Here are some of the yellow and red entries from around the time I was getting the exit code:

Apr 09 21:19:38 Proxmox kernel: ucsi_ccg: probe of 0-0008 failed with error -110
Apr 09 21:19:38 Proxmox kernel: ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
Apr 09 21:19:38 Proxmox kernel: ucsi_ccg 0-0008: i2c_transfer failed -110
Apr 09 21:19:38 Proxmox kernel: nvidia-gpu 0000:01:00.3: i2c timeout error e0000000
Apr 09 21:19:37 Proxmox kernel: MXM: GUID detected in BIOS
Apr 09 21:19:36 Proxmox systemd-modules-load[452]: Failed to find module 'vfio_virqfd'
Apr 09 21:19:36 Proxmox kernel: Disabling lock debugging due to kernel taint
Apr 09 21:19:36 Proxmox kernel: znvpair: module license 'CDDL' taints kernel.
Apr 09 21:19:36 Proxmox kernel: spl: loading out-of-tree module taints kernel.
Apr 09 21:19:36 Proxmox kernel: mpt2sas_cm0: overriding NVDATA EEDPTagMode setting
Apr 09 21:19:36 Proxmox kernel: mpt3sas 0000:03:00.0: can't disable ASPM; OS doesn't have ASPM control

I think this is the correct IOMMU groups
Apr 09 21:19:36 Proxmox kernel: xhci_hcd 0000:05:00.0: [12] Timeout
Apr 09 21:19:36 Proxmox kernel: xhci_hcd 0000:05:00.0: device [1b21:2142] error status/mask=00001000/00002000
Apr 09 21:19:36 Proxmox kernel: xhci_hcd 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Apr 09 21:19:36 Proxmox kernel: pcieport 0000:00:1c.4: AER: Error of this Agent is reported first
Apr 09 21:19:36 Proxmox kernel: pcieport 0000:00:1c.4: [12] Timeout
Apr 09 21:19:36 Proxmox kernel: pcieport 0000:00:1c.4: [ 0] RxErr
Apr 09 21:19:36 Proxmox kernel: pcieport 0000:00:1c.4: device [8086:a294] error status/mask=00001001/00002000
Apr 09 21:19:36 Proxmox kernel: pcieport 0000:00:1c.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Transmitter ID)

Looks like the right spot
Apr 09 20:55:11 Proxmox pvedaemon[1461]: <root@pam> end task UPID:Proxmox:00000C58:0001067F:64337A8C:qmstart:101:root@pam: start failed: QEMU exited with code 1
Apr 09 20:55:11 Proxmox pvedaemon[3160]: start failed: QEMU exited with code 1

Apr 09 20:44:14 Proxmox kernel: ucsi_ccg: probe of 0-0008 failed with error -110
Apr 09 20:44:14 Proxmox kernel: ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
Apr 09 20:44:14 Proxmox kernel: ucsi_ccg 0-0008: i2c_transfer failed -110
Apr 09 20:44:14 Proxmox kernel: nvidia-gpu 0000:01:00.3: i2c timeout error e0000000


Apr 09 20:09:23 Proxmox qm[19189]: <root@pam> end task UPID:Proxmox:00004AF6:00096F91:64336FCF:qmstart:101:root@pam: start failed: QEMU exited with code 1
Apr 09 20:09:23 Proxmox qm[19190]: start failed: QEMU exited with code 1

Apr 09 19:57:15 Proxmox pvedaemon[11258]: <root@pam> end task UPID:Proxmox:00004179:0008534E:64336CF8:qmstart:102:root@pam: start failed: QEMU exited with code 1
Apr 09 19:57:15 Proxmox pvedaemon[16761]: start failed: QEMU exited with code 1
Apr 09 19:57:15 Proxmox pvedaemon[11258]: VM 102 qmp command failed - VM 102 not running
Apr 09 19:57:15 Proxmox kernel: vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x1e@0x1e0
Apr 09 19:57:15 Proxmox kernel: vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x1f@0x1f0
Apr 09 19:57:15 Proxmox kernel: pcieport 0000:00:1c.7: Intel SPT PCH root port ACS workaround enabled
 
I have the same issue on a C246 chipset (similar to Z370) SuperMicro X11SCA-F. I can run the VM that passes through the Nvidia card, or I can run the VM passing through the LSI 9211-8i... not both at the same time. I reduced the RAM for each VM to 8 GB (the server has 64 GB)... no dice.

Moved the LSI to a different PCI slot and now all is well.
 
