GPU and HBA Passthrough Issues on Z370 Asus Prime-A

dizzydre21

New Member
Apr 10, 2023
Hello,

I am having an issue running an Nvidia RTX 2070 Super and an LSI 9211-8i at the same time in separate VMs. I can run them independently without issue, but the moment I try to run both VMs at the same time I get QEMU exit code 1 and the second VM fails to start. The devices are not in the same IOMMU group, and my motherboard manual states that PCIE16_1 can run at x16, or at x8 with PCIE16_2 also running at x8. Am I misunderstanding the info below?

Copied from Asus manual:

2 x PCIe 3.0/2.0 x16 slots (supports x16, x8/x8, x8/x4+x4*, x8+x4+x4/x0**) - These are PCIE16_1 and PCIE16_2
1 x PCI Express 3.0/2.0 x16 slot (max. at x4 mode, compatible with PCIe x1, x2 and x4 devices) - This is PCIE16_3
4 x PCI Express 3.0/2.0 x1 slots
* For 2 Intel® SSD on CPU support, install a Hyper M.2 X16 card (sold separately) into the PCIeX16_2 slot, then enable this card under BIOS settings.
** For 3 Intel® SSD on CPU support, install a Hyper M.2 X16 card (sold separately) into the PCIeX16_1 slot, then enable this card under BIOS settings.



My VMs are the current stable versions of TrueNAS Scale and Ubuntu Server (only running Jellyfin with GPU hardware acceleration). Currently, I am able to run both VMs with the HBA card in PCIE16_3, but it is running at x4. The BIOS states that the GPU is still running at x16 in slot 1 with this setup, which I do not totally understand unless the 4 lanes for PCIE16_3 are just chipset lanes.
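Whether a slot is really negotiating x16 or x4 can be checked from the running host rather than the BIOS screen: `lspci -vv` prints a `LnkSta:` line per device. A small sketch of pulling the negotiated width out of that line with plain shell (the PCI address in the comment is an example, not the poster's actual address):

```shell
# On a live host you would capture the line with something like:
#   lnksta=$(lspci -vvs 01:00.0 | grep LnkSta:)
# Here we use a sample line in the format lspci prints it:
lnksta='LnkSta: Speed 8GT/s (ok), Width x16 (ok)'

width=${lnksta#*Width }   # strip everything up to and including "Width "
width=${width%% *}        # keep only the first word, e.g. "x16"
echo "negotiated width: $width"
```

If the GPU still reports x16 while the HBA sits in PCIE16_3, that is consistent with PCIE16_3 being wired to the chipset rather than sharing CPU lanes with slot 1.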
 
If they are not in the same IOMMU group, you are not using pcie_acs_override, and you don't see any error message but only a time-out, then it's often a lack of memory.
Try booting both VMs with half (or less) of their memory to test. PCIe passthrough requires that all of the VM memory is pinned into actual RAM (because of possible device initiated DMA).
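The arithmetic behind that advice can be sketched in a few lines: the VM sizes (16 GiB and 4 GiB, as mentioned later in this thread) must fit entirely inside the host's available memory, since pinned pages cannot be swapped. The `mem_available` figure below is a placeholder; on a real host you would read it from /proc/meminfo:

```shell
# Read the live value with:
#   mem_available=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
mem_available=24000          # example value in MiB, not a measurement

truenas_vm=16384             # 16 GiB VM - must be fully pinned for passthrough
jellyfin_vm=4096             # 4 GiB VM - must be fully pinned for passthrough
needed=$((truenas_vm + jellyfin_vm))

if [ "$needed" -gt "$mem_available" ]; then
    echo "not enough free RAM to pin both VMs ($needed MiB > $mem_available MiB)"
else
    echo "headroom OK: $needed MiB needed, $mem_available MiB available"
fi
```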

EDIT: Sharing the output of cat /proc/cmdline; for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done, the VM configuration files, and how much (free) memory your Proxmox host has could help in troubleshooting.
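For readability, here is the same one-liner unrolled into a script. The parameter expansions cut the sysfs path down to the group number and the bare PCI address; the existence guard is an addition so the loop is a no-op on hosts without IOMMU enabled:

```shell
#!/bin/sh
# List every PCI device together with its IOMMU group number
for d in /sys/kernel/iommu_groups/*/devices/*; do
    [ -e "$d" ] || continue      # skip if the glob matched nothing
    n=${d#*/iommu_groups/}       # drop the prefix up to the group number
    n=${n%%/*}                   # keep only the group number itself
    printf 'IOMMU group %s ' "$n"
    lspci -nns "${d##*/}"        # ${d##*/} is the PCI address, e.g. 0000:01:00.0
done
```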
 
I saw similar comments about memory on Sunday, probably in this forum, and tried reducing both. They still would not boot. I have 32 GB total and originally had 16 GB allotted to TrueNAS and 4 GB allotted to Ubuntu Server.

I do have some new information, however. I recreated both of my VMs yesterday and used q35 and OVMF so that the PCI-Express checkbox would appear during VM creation. I checked this box and both VMs will now boot and run together. Is there a known conflict when not booting with the q35 machine type? I will try out your command when I get the chance, but I also had an issue with TrueNAS yesterday, so I pulled the boot drive for Proxmox and installed TrueNAS on another disk for troubleshooting on bare metal.
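For anyone landing here later: the "PCI-Express" checkbox in the GUI corresponds to the pcie=1 flag on the hostpci line of the VM config, which is only valid with the q35 machine type. A sketch of the relevant lines (the VM ID and PCI addresses below are examples, not taken from this thread):

```
# /etc/pve/qemu-server/101.conf (excerpt; ID and addresses are examples)
machine: q35
bios: ovmf
hostpci0: 0000:01:00,pcie=1,x-vga=1    # GPU, all functions passed through

# and in the HBA VM's config, e.g. 102.conf:
hostpci0: 0000:03:00.0,pcie=1          # LSI 9211-8i
```

With the default i440fx machine type there is no PCIe bus, so devices are attached as conventional PCI, which some guests and drivers tolerate less well.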
 
Sometimes ZFS also takes up to 50% of memory and it only appears to be solved after a reboot.
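The ARC's default cap of roughly half of RAM can be lowered so it does not compete with VM memory pinning. A sketch of the usual Proxmox-side setting (the 8 GiB cap is an arbitrary example value):

```
# /etc/modprobe.d/zfs.conf -- cap the ZFS ARC at 8 GiB (example value)
options zfs zfs_arc_max=8589934592

# then apply and reboot:
#   update-initramfs -u -k all
```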
Sorry, I can't explain that. Really no error messages in Proxmox Syslog (journalctl) before?
 
I just ran it and all the entries were from April 1st. They didn't look like errors though.

Edit - Oops, didn't realize I could scroll. Looking through it now.
 
Here are some of the yellow and red entries from around the time I was getting the exit code:

Apr 09 21:19:38 Proxmox kernel: ucsi_ccg: probe of 0-0008 failed with error -110
Apr 09 21:19:38 Proxmox kernel: ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
Apr 09 21:19:38 Proxmox kernel: ucsi_ccg 0-0008: i2c_transfer failed -110
Apr 09 21:19:38 Proxmox kernel: nvidia-gpu 0000:01:00.3: i2c timeout error e0000000
Apr 09 21:19:37 Proxmox kernel: MXM: GUID detected in BIOS
Apr 09 21:19:36 Proxmox systemd-modules-load[452]: Failed to find module 'vfio_virqfd'
Apr 09 21:19:36 Proxmox kernel: Disabling lock debugging due to kernel taint
Apr 09 21:19:36 Proxmox kernel: znvpair: module license 'CDDL' taints kernel.
Apr 09 21:19:36 Proxmox kernel: spl: loading out-of-tree module taints kernel.
Apr 09 21:19:36 Proxmox kernel: mpt2sas_cm0: overriding NVDATA EEDPTagMode setting
Apr 09 21:19:36 Proxmox kernel: mpt3sas 0000:03:00.0: can't disable ASPM; OS doesn't have ASPM control

I think this is the correct IOMMU groups
Apr 09 21:19:36 Proxmox kernel: xhci_hcd 0000:05:00.0: [12] Timeout
Apr 09 21:19:36 Proxmox kernel: xhci_hcd 0000:05:00.0: device [1b21:2142] error status/mask=00001000/00002000
Apr 09 21:19:36 Proxmox kernel: xhci_hcd 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Apr 09 21:19:36 Proxmox kernel: pcieport 0000:00:1c.4: AER: Error of this Agent is reported first
Apr 09 21:19:36 Proxmox kernel: pcieport 0000:00:1c.4: [12] Timeout
Apr 09 21:19:36 Proxmox kernel: pcieport 0000:00:1c.4: [ 0] RxErr
Apr 09 21:19:36 Proxmox kernel: pcieport 0000:00:1c.4: device [8086:a294] error status/mask=00001001/00002000
Apr 09 21:19:36 Proxmox kernel: pcieport 0000:00:1c.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Transmitter ID)

Looks like the right spot
Apr 09 20:55:11 Proxmox pvedaemon[1461]: <root@pam> end task UPID:Proxmox:00000C58:0001067F:64337A8C:qmstart:101:root@pam: start failed: QEMU exited with code 1
Apr 09 20:55:11 Proxmox pvedaemon[3160]: start failed: QEMU exited with code 1

Apr 09 20:44:14 Proxmox kernel: ucsi_ccg: probe of 0-0008 failed with error -110
Apr 09 20:44:14 Proxmox kernel: ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
Apr 09 20:44:14 Proxmox kernel: ucsi_ccg 0-0008: i2c_transfer failed -110
Apr 09 20:44:14 Proxmox kernel: nvidia-gpu 0000:01:00.3: i2c timeout error e0000000


Apr 09 20:09:23 Proxmox qm[19189]: <root@pam> end task UPID:Proxmox:00004AF6:00096F91:64336FCF:qmstart:101:root@pam: start failed: QEMU exited with code 1
Apr 09 20:09:23 Proxmox qm[19190]: start failed: QEMU exited with code 1

Apr 09 19:57:15 Proxmox pvedaemon[11258]: <root@pam> end task UPID:Proxmox:00004179:0008534E:64336CF8:qmstart:102:root@pam: start failed: QEMU exited with code 1
Apr 09 19:57:15 Proxmox pvedaemon[16761]: start failed: QEMU exited with code 1
Apr 09 19:57:15 Proxmox pvedaemon[11258]: VM 102 qmp command failed - VM 102 not running
Apr 09 19:57:15 Proxmox kernel: vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x1e@0x1e0
Apr 09 19:57:15 Proxmox kernel: vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x1f@0x1f0
Apr 09 19:57:15 Proxmox kernel: pcieport 0000:00:1c.7: Intel SPT PCH root port ACS workaround enabled
 
I have the same issue on a C246 chipset (similar to Z370) SuperMicro X11SCA-F. I can run the VM that passes through the Nvidia card, or I can run the VM passing through the LSI 9211-8i... not both at the same time. I reduced the RAM for each VM to 8 GB (the server has 64 GB)... no dice.

Moved the LSI to a different PCI slot and now all is well.
 
